You can’t imitation-learn how to continual-learn
In this post, I’m trying to put forward a narrow, pedagogical point, one that comes up mainly when I’m arguing that LLMs have limitations that human learning does not. (E.g. here, here, here.)
See the bottom of the post for a list of subtexts that you should NOT read into this post, including “…therefore LLMs are dumb”, or “…therefore LLMs can’t possibly scale to superintelligence”.

Some intuitions on how to think about “real” continual learning
Consider an algorithm for training a Reinforcement Learning (RL) agent, like the Atari-playing Deep Q network (2013) or AlphaZero (2017), or think of within-lifetime learning in the human brain, which (I claim) is in the general class of “model-based reinforcement learning”, broadly construed.
These are all real-deal full-fledged learning algorithms: there’s an algorithm for choosing the next action right now, and there’s one or more update rules for permanently changing some adjustable parameters (a.k.a. weights) in the model such that its actions and/or predictions will be better in the future. And indeed, the longer you run them, the more competent they get.
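To make the two pieces concrete—an action-selection rule plus a permanent update rule—here is a minimal tabular Q-learning sketch. This is a toy illustration, not the Deep Q network or brain algorithm discussed above; the environment (`ChainEnv`) and all names are hypothetical, and the exploration rate is set high so the toy converges quickly.

```python
import random

class ChainEnv:
    """Hypothetical toy environment: a 4-state chain, rewarded at the right end."""
    n_states, n_actions = 4, 2

    def step(self, state, action):
        # Action 1 moves right, action 0 moves left; reward on reaching state 3.
        next_state = min(state + 1, 3) if action == 1 else max(state - 1, 0)
        reward = 1.0 if next_state == 3 else 0.0
        return next_state, reward

def q_learning_step(Q, state, env, alpha=0.1, gamma=0.9, eps=0.5):
    # (1) An algorithm for choosing the next action right now:
    #     epsilon-greedy on the current Q values.
    if random.random() < eps:
        action = random.randrange(env.n_actions)
    else:
        action = max(range(env.n_actions), key=lambda a: Q[state][a])
    # (2) An update rule that permanently changes the adjustable parameters
    #     (here, the Q table) so future actions/predictions are better.
    next_state, reward = env.step(state, action)
    td_target = reward + gamma * max(Q[next_state])
    Q[state][action] += alpha * (td_target - Q[state][action])
    return next_state

random.seed(0)
env = ChainEnv()
Q = [[0.0] * env.n_actions for _ in range(env.n_states)]
state = 0
for _ in range(5000):
    state = q_learning_step(Q, state, env)
    if state == 3:   # episode ends at the rewarded state; restart
        state = 0
```

The longer the loop runs, the more competent the policy gets: the learned Q values come to favor moving right toward the reward, with no expert examples to copy.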
When we think of “continual learning”, I suggest that those are good central examples to keep in mind. Here are some aspects to note:
Knowledge vs information: These systems allow for continual acquisition of knowledge, not just information—the “continual learning” can install wholly new ways of conceptualizing and navigating the world, not just keeping track of what’s going on.
Huge capacity for open-ended learning: These examples all have huge capacity for continual learning, indeed enough that they can start from random initialization and “continually learn” all the way to expert-level competence. Likewise, new continual learning can build on previous continual learning, in an ever-growing tower.
Ability to figure things out that aren’t already on display in the environment: For example, an Atari-playing RL agent will get better and better at playing an Atari game, even without having any expert examples to copy. Likewise, billions of humans over thousands of years invented language, math, science, and a whole $100T global economy from scratch, all by ourselves, without angels dropping new training data from the heavens.
I bring these up because I think the LLM-focused discourse sometimes has far too narrow a notion of what problem “continual learning” is supposed to be solving. People in that discourse tend to think the problem is “losing track of information”, not “failing to build new knowledge”, and they propose to solve it with strategies like “make the context [window] longer” (as Dario Amodei recently mused), or better scratchpads with Retrieval-Augmented Generation (RAG), etc.
But real “continual learning” also includes the ways that AlphaZero changes after a million games of self-play, or the ways that a human brain changes after 20 years in a new career. There is no system of scratchpads that you can give to a 15-year-old, such that it would be an adequate substitute for them spending the next 20 years growing into a 35-year-old world expert in some field. Likewise, there is no context window that can turn GPT-2 into GPT-5.
Suppose you took an actual “country of geniuses in a datacenter”, completely sealed them from the outside world, and gave them a virtual reality environment to hang out in for the equivalent of 100 years. What would you find when you unsealed it? There would be whole new ways of thinking about the world and everything in it—entirely new fields of science, schools of philosophy, and so on.
Can a bunch of LLMs do that? Well, consider this thought experiment: suppose you take a whole new field of science, wildly different from anything in the training data, and put a giant textbook for this field purely in an LLM context window, with no weight updates at all. Will this LLM be able to understand, criticize, and build on this field? My opinion is “absolutely not” (see 1, 2). That implies that merely increasing context lengths is definitely not sufficient for a real “country of geniuses in a datacenter” when the datacenter is sealed shut for the equivalent of 100 years (contra Dario, who seems to think it’s at least in the realm of possibility that more context is sufficient by itself to get continual learning at the “country of geniuses” level).
(If we’re talking about what a sealed “country of human geniuses” could do over the course of, like, one minute, rather than over the course of 100 years, then, yeah sure, maybe that could be reproduced with future LLMs! See von Oswald et al. 2022 on how (so-called) “in-context learning” can imitate a small number of steps of actual weight updates.[1])

Why “real” continual learning can’t be copied by an imitation learner
Now, suppose that I take a generic imitation-learning algorithm (e.g. self-supervised learning in a transformer-architecture neural net, just like LLM pretraining), and have it watch a deep Q network play Atari Breakout, as the Q network starts from random initialization and gets better and better over 1M iterations. OK, now we have our trained imitation-learner. We freeze its weights, and use it the same way people traditionally used LLM base models, i.e. have it output the most likely next move, then the most likely move after that, etc.
Question: Is this trained imitation-learner actually a good imitation of the deep Q network? Well, “good” in what respect? I would pull apart a couple topics:

Snapshot imitation: The actual deep Q network, right now, at the moment training is done, would output such-and-such Breakout moves in such-and-such positions. Question: Will the trained imitation-learner output similar moves right now, thus playing at a similar skill level as the teacher? My answer is: plausibly yes.

Imitation of long-term learning: The actual deep Q network, if it kept playing, would keep improving. Will the trained imitation-learner likewise keep improving over the next 10M moves, until it’s doing things wildly better than, and different from, anything that it saw its “teacher” deep Q network ever do? My answer is: no.

Imitation of long-term learning (example 2): The actual deep Q network, if it were suddenly transplanted into a new game environment (say, Atari Space Invaders), would start by making terrible moves, but over 10M iterations it would gradually improve to expert level. Will the trained imitation-learner likewise do 10M iterations and then wind up performing expertly at this game, which it never saw during its training phase? My answer is: no.
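The “snapshot imitation” case can be made concrete with a toy behavior-cloning sketch: fit a frozen policy to the teacher’s logged (state, action) pairs, then replay it. This is a deliberately simplified stand-in (a lookup table rather than a transformer on pixels); the state and action names are made up for illustration.

```python
from collections import Counter, defaultdict

def fit_imitator(teacher_log):
    """Fit a frozen snapshot-imitation policy.

    teacher_log: iterable of (state, action) pairs logged from the
    trained teacher. The result memorizes the teacher's most frequent
    action per seen state, then never changes again.
    """
    counts = defaultdict(Counter)
    for state, action in teacher_log:
        counts[state][action] += 1
    # Frozen policy: the most common teacher action in each seen state.
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}

# Hypothetical logged moves from the trained teacher:
teacher_log = [
    ("ball_left", "move_left"), ("ball_left", "move_left"),
    ("ball_right", "move_right"), ("ball_left", "move_right"),
]
policy = fit_imitator(teacher_log)
```

The resulting policy can mimic the teacher’s current play, but nothing in it improves with further play, and states the teacher never visited have no entry at all.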
Why not? Well, actually, for an ideal imitation learning algorithm, i.e. Solomonoff induction on an imaginary hypercomputer, my answers would all be “yes”! But in the real world, we don’t have hypercomputers!
These days, when people talk about imitation learning, they’re normally talking about transformers, not hypercomputers, and transformers are constrained to a much narrower hypothesis space:
| | Imitation-learning a deep-Q RL agent by Solomonoff induction | Imitation-learning a deep-Q RL agent by training a transformer on next-action prediction |
|---|---|---|
| Hypothesis space | The set of all computable algorithms | A forward pass through T, for the set of all possible trained transformers T |
| Ground truth | The actual deep-Q RL agent, with such-and-such architecture, and Temporal Difference (TD) learning weight updates, etc. | The actual deep-Q RL agent, with such-and-such architecture, and Temporal Difference (TD) learning weight updates, etc. |
| Asymptotic limit | It converges to the actual deep-Q RL agent | It converges to whatever trained transformer forward pass happens to be closest to the actual deep-Q RL agent |
I think we should all be very impressed by the set of things that a transformer forward pass[2] can do. But we should not expect a transformer forward pass to reproduce a full-fledged, entirely different, learning algorithm, with its own particular neural network architecture, its own particular methods of updating and querying weights, etc., as it runs and changes over millions of steps.
Running one large-scale learning algorithm is expensive enough; it’s impractical to run a huge ensemble of different large-scale learning algorithms in parallel, in order to zero in on the right one.[3]
I’m going to harp on this because it’s a point of confusion. There are two learning algorithms under discussion: the imitation-learning algorithm (e.g. a transformer getting updated by gradient descent on next-action prediction), and the target continual learning algorithm (e.g. a deep Q network getting updated by TD learning). When the imitation learning is done, the transformer weights are frozen, and the corresponding trained model is given the impossible task of using only its activations, with fixed weights, to imitate what happens when the target continual learning algorithm changes its weights over millions of steps of (in this case) TD learning. That’s the part I’m skeptical of.
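The distinction can be sketched in a few lines. Both classes below are hypothetical stand-ins (scalar “weights” instead of real networks): one keeps permanently changing its parameters on every step, the other carries fixed parameters and computes only activations at deployment time.

```python
class ContinualLearner:
    """Stand-in for the target continual learner (e.g. a deep Q network):
    its weights change permanently on every step it takes."""
    def __init__(self):
        self.weights = 0.0

    def act_and_learn(self, observation):
        action = self.weights * observation      # act with current weights
        self.weights += 0.01 * observation       # TD-style permanent update
        return action

class FrozenImitator:
    """Stand-in for the trained imitation-learner at deployment time:
    weights are frozen; only activations vary from step to step."""
    def __init__(self, trained_weights):
        self.weights = trained_weights           # fixed forever

    def act(self, observation):
        return self.weights * observation        # no update rule at all

learner = ContinualLearner()
imitator = FrozenImitator(trained_weights=learner.weights)
for obs in [1.0] * 5:
    learner.act_and_learn(obs)
    imitator.act(obs)
```

After the loop, the continual learner’s parameters have drifted away from where they started, while the imitator’s are exactly where they were frozen; the imitator would have to track that drift using activations alone.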
In other words: The only practical way to know what happens after millions of steps of some scaled-up continual learning algorithm is to actually do millions of steps of that same scaled-up continual learning algorithm, with actual weights getting actually changed in specifically-designed ways via PyTorch code. And then that’s the scaled-up learning algorithm you’re running. Which means you’re not doing imitation learning.
So back to the human case: for a typical person (call him “Joe”), I think LLMs are good at imitating “Joe today”, and good at imitating “Joe + 1 month of learning introductory category theory”, but can’t imitate the process by which Joe grows and changes over that 1 month of learning—or at least, can’t imitate it in a way that would generalize to imitating a person spending years building a completely different field of knowledge that’s not in the training data.

Some things that are off-topic for this post
As mentioned at the top, I’m hoping that this post is a narrow pedagogical point. For example:

I’m not commenting on whether it’s possible to modify LLM post-training into a “real” continual learning algorithm (although I happen to believe that it isn’t possible).

I’m not commenting on how an inability to do “real” continual learning cashes out in terms of real-world competencies. (E.g., can a non-“real”-continual-learning AI nevertheless take jobs? Can it kill billions of people? Can it install itself as an eternal global dictator?) (I happen to think that these are tricky questions without obvious answers.)

I’m not commenting on whether we should think of actual frontier LLMs (not just pretrained base models) as predominantly powered by imitation learning, even despite their RL post-training (although I happen to believe that we probably should, more or less (1, 2)).
[1] I guess I also need to mention the “algorithmic distillation” paper (Laskin et al. 2022), but I’m hesitant to take it at face value; see discussion here.

[2] You can replace “a forward pass” with “10,000 forward passes with chain-of-thought reasoning”; it doesn’t change anything in this post.

[3] Outer-loop search over learning algorithms is so expensive that it’s generally only used for adjusting a handful of legible hyperparameters, not for open-ended search where we don’t even vaguely know what we’re looking for. Even comparatively ambitious searches over spaces of learning algorithms in the literature have a search space of, e.g., ≈100 bits, which is tiny compared to the information content of a learning-algorithm source code repository.