LLMs are just predictive text!

Ashley Mills — Fri, 12 Jun 2026 23:00:00 GMT

“LLMs are just predictive text!”

Proselytised with overtones of overconfidence and undertones of existential crisis, the expression is both glib and oddly endearing. Another one I like is “LLMs are just stochastic parrots”, unironically parroted.

I’m not trying to be dismissive or demeaning here (OK maybe I’m indulging just a little), but “LLMs are just predictive text” isn’t an argument; it’s a claim. And claims need to be evidenced, not evangelised.

If an LLM were really just predictive text, I mean really, like that old Nokia you used to have, then it would be something like an n-gram style lookup table but with a huge lookback. A really huge lookback. A REALLY REALLY HUGE lookback. Besides the fact that operationalising a 128k token lookback would require exabytes of storage, the comparison is simply wrong.
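
To put rough numbers on that, here is a back-of-envelope sketch. The corpus size, bytes per token and vocabulary size are illustrative assumptions of mine, not figures from any particular model. It estimates the storage for a naive table that keeps one full-length context per position in the training corpus, and, separately, how many distinct contexts of that length could exist at all.

```python
import math

# All figures below are illustrative assumptions, not measurements.
CORPUS_TOKENS = 10**13     # assumed training-corpus size in tokens
LOOKBACK = 128_000         # lookback length from the text
BYTES_PER_TOKEN = 2        # assumed storage cost of one token id
VOCAB_SIZE = 50_000        # assumed vocabulary size

def naive_table_bytes(corpus_tokens: int, lookback: int, bytes_per_token: int) -> int:
    """Storage for one uncompressed full-length context key per corpus position."""
    return corpus_tokens * lookback * bytes_per_token

def log10_possible_contexts(vocab_size: int, lookback: int) -> float:
    """log10 of the number of distinct contexts of the given length."""
    return lookback * math.log10(vocab_size)

table_exabytes = naive_table_bytes(CORPUS_TOKENS, LOOKBACK, BYTES_PER_TOKEN) / 10**18
print(f"naive table: about {table_exabytes:.2f} exabytes")
print(f"possible contexts: about 10^{log10_possible_contexts(VOCAB_SIZE, LOOKBACK):,.0f}")
```

Under these assumptions the table of observed contexts already lands in exabyte territory, while the space of possible contexts is so much larger that almost any new prompt would miss the table entirely.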

Lookup tables cannot generalise; they can only fill in the most likely statistical continuation given the exact distribution over the training set.

When presented with a sentence such as: “My pet has four legs and woofs, my pet is not a cat it is a” … my Android’s text prediction software suggests simple high-frequency continuations: “good”, “bit”, and “little”. “Dog” is not even one of them. It has no understanding; it simply predicts the most plausible next word. That is not what LLMs do.

The LLM answers “dog”.

Now you might argue that “dog” is a reasonable continuation and that simple predictive text would work if the look-back was longer. But I can make a completely nonsensical sentence that contains words that no predictive text system or LLM has ever seen. If I ask:

“There are two types of pets on Mars, squoobledoops and bonglebloops which have 2 and 3 legs respectively. My pet has 2 legs so is most likely a” …

The Android predicts “lot”, “bit”, and “little”: meaningless continuations with no reference to the content of the sentence.

The LLM answers “squoobledoop”.

This is not retrieval from a lookup table. The model has constructed an internal representation of the relationships expressed in the sentence, tracked the correspondence between the invented categories and their properties, and applied that relationship to a new inference.
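
For contrast, here is a minimal sketch of what a literal lookup table does. It is a toy n-gram counter of my own construction, not how any real keyboard works: it maps the last two words to whatever followed them most often in a tiny made-up corpus.

```python
from collections import Counter, defaultdict

def build_table(corpus, n=2):
    """Map each n-word context to a Counter of the words that followed it."""
    table = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for i in range(len(words) - n):
            table[tuple(words[i:i + n])][words[i + n]] += 1
    return table

def predict(table, prompt, n=2):
    """Return the most frequent continuation of the last n words, or None if unseen."""
    context = tuple(prompt.lower().split()[-n:])
    followers = table.get(context)
    return followers.most_common(1)[0][0] if followers else None

# Tiny invented corpus, for illustration only.
corpus = [
    "my pet is a dog",
    "my pet is a dog",
    "my cat is a bit grumpy",
    "it is a little late",
]
table = build_table(corpus)

# The table only sees "is a", so the "not a dog" earlier in the prompt is ignored.
print(predict(table, "my pet is not a dog it is a"))
# A context that never appeared in the corpus has no entry at all.
print(predict(table, "so is most likely a"))
```

The first call answers from frequency alone, and the second has nothing to retrieve. Neither failure is fixed by a longer lookback; a longer key only makes exact matches rarer.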

Yes, mechanistically LLMs “predict the next token” and pretraining happens via back-propagated gradients from billions of simple next-token error signals. But saying LLMs “just predict the next token” is equivalent to saying that Beethoven’s symphonies are “just pressure fluctuations”, that a computer game is “just voltages moving about”, or that intelligence is “just neurons firing”. These low-level mechanomorphisms offer almost no explanatory power.

In some very real sense humans are also trained to predict “the next token” since time is ordered, operations happen sequentially, and error back-propagates from failure or success through causally chained neural pathways, entirely mechanistically through synaptic plasticity updates. An explanation that reduces human behaviour to “humans just predict what neuron to fire next” is beyond inadequate: it is a category error.

The mechanism of learning does not, by itself, explain the capabilities that arise from it. The mistake isn’t identifying the mechanism; it’s mistaking the mechanism for the explanation.

LLMs really aren’t “just next-token prediction” even if we focus on pretraining itself, where next-token prediction errors drive learning.

We’ve already established that LLMs are not lookup tables; they are not rote-learning continuations. They can’t be, because (among many other reasons) they can perform correct inference on novel relational structures they have never seen.

When an LLM is pretrained it doesn’t just have to learn to predict one sentence (a token at a time), but millions of complete dialogs, novels, essays, etc. When an LLM learns to predict the next token for a single training instance it implicitly has to learn what next token will reduce the future prediction error across an entire space of training inputs.

Viewed through the constraints imposed by the whole corpus, and through network weights that are shared across all inputs, the training objective therefore isn’t really “predict the next token”. It is something like “predict the next token which minimises the sum of the next-token errors across all future continuations”.

Shared weights force compression. Where many examples conflict at the surface level but converge at a structural level, the model can reduce loss only by representing that deeper structure. In that sense, concepts emerge because they are the only solution to predicting many different continuations with the same parameters.
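
A minimal sketch of what “one set of parameters, scored across the whole corpus” looks like as a number. Everything here is a toy of my own: the four-word vocabulary, the two-sentence corpus, and both “models”, which are hand-written functions rather than trained networks. The loss is the usual mean next-token cross-entropy, taken over every position of every document with the same model.

```python
import math

def corpus_loss(model, corpus):
    """Mean next-token cross-entropy over every position of every document."""
    total, count = 0.0, 0
    for tokens in corpus:
        for t in range(1, len(tokens)):
            probs = model(tokens[:t])          # the same model for every document
            total -= math.log(probs[tokens[t]])
            count += 1
    return total / count

VOCAB = ["the", "cat", "dog", "sat"]

def uniform_model(context):
    """Knows nothing: every token is equally likely in every context."""
    return {tok: 1 / len(VOCAB) for tok in VOCAB}

def structured_model(context):
    """Hand-written stand-in for learned structure: 'the' -> animal -> 'sat'."""
    if context[-1] == "the":
        return {"the": 0.0, "cat": 0.5, "dog": 0.5, "sat": 0.0}
    return {"the": 0.0, "cat": 0.0, "dog": 0.0, "sat": 1.0}

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
print(corpus_loss(uniform_model, corpus))
print(corpus_loss(structured_model, corpus))
```

The structured model cannot know whether “cat” or “dog” comes next, so it splits its bet, but it scores better overall because it represents what the two sentences have in common. That is the sense in which a single shared model is pushed towards structure rather than memorised surface strings.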

Human brains are world modellers, and language is how we transmit aspects of our world models to each other in order to coordinate action. Language is thus the symbolic approximation of human world modelling and by extension the world. Language is homomorphic to the bio-computational recursive structures that transmit and receive it, and to the world those structures approximate. Not a perfect map from mind to speech to mind; but a lossy, structure-preserving bridge. The totality of language is thus homomorphic to the totality of world knowledge humans have ever externalised.

When we learn what a cat is, we learn it through seeing cats, hearing cats, smelling cats, being scratched by cats, being bemused by cats, reading about cats, watching cats interact with the world and each other, and often, being woken up by yowling cats at 3AM. When an LLM learns what a cat is it must do so without ever seeing or feeling one, solely through the residual traces that the human model of cat has left in language: everything we have ever said about cats. In neither case is cat platonic; there is no cat in your head: we impute cat from a cat shaped hole circumscribed by everything it relates to. And crucially, like the LLM, we can also learn what a cat is without ever seeing one. The cat is not the word, but the cat has left pawprints all over language.

It is truly remarkable that systems trained only to predict tokens in context can induce this latent structure from language: train on enough of it, and a model can infer the invariant pathways that generated it.

I think by now you’ll understand why “LLMs are just sophisticated predictive text” is so bewilderingly vacuous. But we’re not done.

So far, the argument is this. Shared weights force compression, and to reduce prediction error across the whole corpus the model must find structures that generalise across many different contexts. That matters because language itself preserves the relational maps between things that human minds model. Concepts are not stored as little internal objects, but inferred from the negative space of everything they relate to. LLMs work because they can induce latent structures from those linguistic relations, and those structures are homomorphic to the world-models that generated the language in the first place.

If you believe me, then the raw pretrained LLM contains an information geometry that, when activated by language during inference, is homomorphic to the structure of the human modelled world. This makes it exceptionally good at producing continuations that make sense, because it is, in a very real sense, constructed from sense.

Mr Nobody

What’s missing is that these continuations, while making sense, are completely unoriented. The model can speak from any perspective, and that’s the problem, because the training corpus contains a vast number of viable vantage points and from any starting point a vast number of sensible continuations.

At this point it doesn’t even make sense to describe the presentation of a question to the model as “asking a question” because “question answerer” is just one of many vantage points the model has learned and we are not guaranteed to instantiate that when we present a question. We might instead get philosopher, essay writer, narcissistic social media personality, AI content writer, comedian, LinkedIn “look at me speaking at a talk” profile picture guy, or scammer.

Let’s take an example. We might ask the unoriented model: “I forgot my WiFi password, how do I find it?” All sorts of continuations are possible:

Don’t worry, I found it on the back of the router. [Thread locked]
This is a question many people face on occasion and a good password manager can help prevent it.
In order to reset the password on the TP-Link Archer AX12, hold down the reset button for three seconds.
Have you tried remembering harder?

So a pretrained model doesn’t work out of the box as a “chatbot” or “assistant”; it needs to be nudged into the assistant attractor.

This is achieved by creating a reference frame, by anchoring the transformer’s output on two special role-boundary tokens, User and Assistant. Then the model is “instruction tuned” on dialogic examples:

User

What is the capital of France?

Assistant

Paris.

User

How long do I boil an egg for?

Assistant

For a medium egg added to boiling water: 6 min soft, 8 min jammy, 10–12 min hard. Cool it in cold water afterwards.

User

Write me a thousand word essay on the benefits of sea bathing.

Assistant

Sea bathing originated in the Victorian era …

User

Will I look cool and dynamic if I change my LinkedIn profile picture to me mid-speech, microphone in hand, in my dashing dinner jacket?

Assistant

Yes. You will look like a thought leader, and the microphone will help people understand that your thoughts have already left.

The model is trained to only predict the assistant part, conditioned on the user part.
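
One common way to get that behaviour is to mask the loss so that only assistant tokens are scored. The sketch below is a toy under stated assumptions: the `<User>` and `<Assistant>` tag strings, the whitespace “tokenisation” and the log-probabilities are all invented for illustration, and real pipelines differ in details such as whether the role tag or an end-of-turn marker is itself scored.

```python
import math

def assistant_mask(turns):
    """Flatten (role, tokens) turns and flag which tokens contribute to the loss."""
    tokens, mask = [], []
    for role, words in turns:
        tokens.append(f"<{role}>")      # hypothetical role-boundary token
        mask.append(False)              # assumption: the tag is given, not predicted
        for w in words:
            tokens.append(w)
            mask.append(role == "Assistant")
    return tokens, mask

def masked_loss(token_log_probs, mask):
    """Average negative log-likelihood over the masked-in positions only."""
    kept = [-lp for lp, keep in zip(token_log_probs, mask) if keep]
    return sum(kept) / len(kept)

turns = [
    ("User", ["What", "is", "the", "capital", "of", "France?"]),
    ("Assistant", ["Paris."]),
]
tokens, mask = assistant_mask(turns)

# Made-up log-probabilities the model assigned to each token, for illustration.
log_probs = [math.log(0.1)] * 8 + [math.log(0.5)]
print(tokens)
print(mask)
print(masked_loss(log_probs, mask))
```

The user tokens still sit in the context and shape the prediction; they simply generate no error signal of their own, so the only thing the update rewards is a good answer given the question.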

The model isn’t rote-learning responses here, the pretrained base model has already imbibed and induced the structure of language and the world it represents. The model is learning to answer from the assistant perspective and behave in an assistant-like way.

The User and Assistant tags act as anchors so the model can differentiate its role from that of the user, and from there to build a stable assistant persona.

In order to consistently act like an assistant, the assistant continuation attractor is deepened by presenting many different examples: short questions with short answers, complex questions with long answers, being asked to answer “in the style of”, being asked to answer verbosely or succinctly, etc., across a wide domain. The training data might also include some safety refusal patterns for dangerous topics.

To remain assistant-like and reduce training error across the whole fine-tuning corpus, the model constructs a constraint structure over future continuations, making assistant-like continuations more likely and non-assistant-like continuations less likely.

An unoriented model answers like a marble dropped into a complex information terrain: it just rolls down whatever the steepest slope is at the point it lands. The oriented model is more like a special electromagnetic marble with its own polarised field that continuously distorts the future information field towards assistant-like outputs: like a bow-wave in front of a boat (but hyper-dimensional).

None of these analogies are very good. It’s not really like a special magnetic marble in a valley it reshapes, because information space is a field. But I think the visualisation is apt: a process moving through time and space that continually distorts the probability landscape of its own future.

This isn’t a surface level phenomenon either: the deeply learned linguo-world structures we discussed earlier can be composited, and since the text generation process is recursive, with each token output being folded into its history to condition its future, this can produce self-stabilising recursively maintained attractors. And the model isn’t inventing the concept of stable perspective from scratch: this already exists throughout language and in the pretrained model. The role tags do not add intelligence; they orient the pretrained intelligence by anchoring it to a self-other frame already implicit in language.
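
The recursion itself is simple to sketch. The loop below is the generic shape of autoregressive generation; the “model” is a hypothetical three-rule lookup I made up so the example runs, and it is precisely not how a real LLM chooses tokens.

```python
def generate(next_token, prompt, max_new_tokens=10, stop="<end>"):
    """Repeatedly ask the model for one token and append it to the history."""
    history = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(history)    # conditioned on everything so far
        if token == stop:
            break
        history.append(token)          # the output becomes part of the next input
    return history

# Hypothetical toy "model": a fixed rule keyed on the last token only.
TOY_RULES = {"<Assistant>": "Paris", "Paris": ".", ".": "<end>"}

def toy_next_token(history):
    return TOY_RULES.get(history[-1], "<end>")

print(generate(toy_next_token, ["<User>", "Capital", "of", "France?", "<Assistant>"]))
```

Because every emitted token is appended to the history before the next one is chosen, anything that biases one step, such as a role tag, keeps biasing all the steps that follow it.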

The X-factor

Once you have a model that’s oriented and can stably inhabit the assistant frame, another process, called reinforcement learning from human feedback (or RLHF for short), is used to further refine the model.

What happens here is we take our assistant-like reference model and create a copy that we call the “policy” model and get it to generate lots of different answers to a bunch of questions and get humans to tell us which answer they prefer for each question.

This gives us a learning signal on how the model should behave to better please us. Now we could just update the policy model using the human feedback directly but (i) this would take a lot of humans and a long time, because we want to cover a reasonable amount of the model’s learned space and (ii) we want a continuous ranking because it’s better for learning than just saying answer B was better than answer A and answer A was better than answer C. We don’t just want to know which answer is better or worse, but by how much.

So we use the human ranking to train another network called the reward model, whose only purpose is to take a policy model question and answer and guess how much a human would like that answer. Once we’ve trained this, instead of having to ask a bunch of humans which answer they prefer, we just ask the reward model instead. And in the best case we had a diverse set of humans and our reward model then represents their average, or something like that anyway.
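
A minimal sketch of how pairwise preferences can become a training signal for the reward model. The specific loss here, a Bradley-Terry style pairwise loss, is a common choice that I am assuming for illustration; the text above does not commit to any particular formula, and the reward values are made up.

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-probability that the human-preferred answer wins.

    Equals -log(sigmoid(reward_chosen - reward_rejected)).
    """
    margin = reward_chosen - reward_rejected
    return math.log1p(math.exp(-margin))

# Made-up reward-model scores for two answers to the same question.
print(preference_loss(2.0, 0.0))   # reward model agrees with the human: small loss
print(preference_loss(0.0, 2.0))   # reward model disagrees: large loss
print(preference_loss(1.0, 1.0))   # reward model is indifferent
```

Minimising this loss pushes the preferred answer's score above the rejected one's, and the size of the gap is what gives a continuous “by how much” signal out of purely binary human choices.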

It’s a bit like the X-factor: the contestant (our policy model) comes on and sings and what we really care about is what the audience think, since they are the ones who will actually buy the crap when it comes out, but we can’t keep asking them individually what they think, so we ask Simon Cowell instead (the reward model). Simon (through his production team) provides feedback on how he thinks the contestant can improve, and then they come back again in the next round.

The danger here is that the contestant starts just trying to please Simon and forgets all the other things they are. So we keep the contestant’s family around for reassurance …

OK forget it, I don’t know why I started this analogy. Back to reality.

We don’t want the policy model to get pulled entirely off track by the reward model and by the drift induced by the policy training itself. So we keep the original reference model around and measure “divergence” between it and the policy model and incorporate this into the learning signal to penalise excessive deviation. This keeps the model grounded in the original oriented model.
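
Here is a toy sketch of that penalty. I am assuming the divergence is a KL divergence and that it is subtracted from the reward with a weight `beta`, which is a common formulation; the distributions, the reward and the value of `beta` are all made up, and real systems typically estimate the term per token from sampled sequences rather than from full distributions like this.

```python
import math

def kl_divergence(policy_probs, reference_probs):
    """KL(policy || reference) for two distributions over the same tokens."""
    return sum(p * math.log(p / reference_probs[tok])
               for tok, p in policy_probs.items() if p > 0)

def penalised_reward(reward, policy_probs, reference_probs, beta=0.1):
    """Reward-model score minus a penalty for drifting from the reference model."""
    return reward - beta * kl_divergence(policy_probs, reference_probs)

reference = {"a": 0.5, "b": 0.5}     # made-up next-token distribution
unchanged = {"a": 0.5, "b": 0.5}     # policy that has not moved
drifted = {"a": 0.9, "b": 0.1}       # policy that has moved a long way

print(penalised_reward(1.0, unchanged, reference))
print(penalised_reward(1.0, drifted, reference))
```

With the same raw reward, the drifted policy ends up with the lower training signal, so a change only pays off when the reward model likes it enough to outweigh the distance from the reference.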

So again we see here that this isn’t “just next token prediction” as the model learns to shape its own future continuation landscape to meet the constraints it has learned to inhabit. So the learning objective is a higher-order function, which is obscured by superficial focus on the lowest level update step. It’s like saying a painting is just next brushstroke prediction; absolutely true, but this ignores the existence of the higher-order objective that shapes the trajectory of those brushstrokes towards a masterpiece.

The higher-order objective is exterior to the final token-error update, but it is embedded in a much more abstract and temporally rich learning loop, and it is that learning loop which determines precisely which token-error updates shape the model. Passing the ball in football is an essential skill, but is useless until the footballer learns to attune that skill towards the higher order objective of winning the game.

Here we have seen three higher-order objectives that are obscured by the reduction to “next-token prediction”. First, language itself contains an enormous number of higher-order world-adjacent structures that the model must learn in order to accurately predict across many continuations given that it has one set of shared weights. These structures represent temporal and information scales that are not in any single next-token update.

Second, the induction of a solid assistant-like persona requires that the model stabilise a model of self in relation to the trajectories it must navigate. That is not next-token, it is next-answer, next-dialog and so on.

Third, the reinforcement of preference for accurate and truthful answers imposes further constraints between information across time, and the model has to learn how to impose those constraints on itself as it runs. These are not surface-level token phenomena; they are deeply recursive complex constraint chains across multiple temporal resolutions and between multiple scales of abstraction.

We can think of LLMs, and more generally intelligence itself, as a self-shaping general purpose coherence resolver within a web of constraints induced inside a system with finite representational capacity. Learning is the process whereby trajectories are found through those constraints that maximise invariance under transformation and thus minimise the error of a finite modelling capacity. And while the mechanism may be a stream of lots of tiny error signals, the river reshapes and is reshaped by the whole.

Conclusion

The notion that LLMs are just predictive text is at best naive and at worst a kind of reactionary ignorance; a misplaced anthropocentric chauvinism derived more from fear than understanding. It is often said with a kind of knowing derision as if the person is speaking from a position of obvious common knowledge that sensible people all agree with, and that only the foolish or deluded disagree with. But this is an odd stance from which to present a simple technical claim. Why the derision?

Unless I’ve presented a strawman here? I’ve made an argument above that “just predictive text” is technically insufficient as an explanation. But is that really what these people are arguing?

I had to start there because saying that “LLMs are just predictive text” is, at face value, as ridiculous as claiming that building a cathedral is “just piling up some stones”. But it’s a little too ridiculous isn’t it?

Alright, they might contend, “I don’t literally mean predictive text, it’s an analogy, LLMs are really really sophisticated but in analogy it’s something like predictive text, just more sophisticated.” But there is still something wrong with this. For whom is the reduction intended?

If the purpose is education, why not write something that actually explains what is happening? Or are we to believe, then, that a whole group of people have spontaneously and simultaneously decided to write tutorials about LLMs but all they can come up with are teachings that start and stop with predictive text analogies?

So what is it? What are they really signalling with “LLMs are just predictive text” and similar reductions?

The operative and revealing word here is “just”.

“Just” is superfluous to explanation. In explaining the operation of an internal combustion engine, we wouldn’t prefix the explanation with “It’s just precisely timed deflagration of a compressed air / fuel mixture in a reciprocating arrangement of cylinders and tubes”. In fact I don’t think I’ve ever heard a sincere explanation of anything start with “It’s just…” except when someone is trying to reassure another of the simplicity of a usually over-complicated phenomenon.

But LLMs? There’s no “just” about them. They are, in any reasonable estimation, incredibly complicated. They are very far from trivial expressions of human ingenuity and build on centuries of intellectual progress.

So to what object is “just” applied - because it isn’t a balm against complexity or significance - and whom is the “just” reassuring?

“Just” reassures the one rehearsing the word and the object is their perceived ontological specialness as a being possessed of consciousness.

LLMs may look sophisticated, they may “mimic” understanding, they may “delude” individuals into thinking there is someone there, they may be able to answer novel formulations on almost any topic, they may model our psychology so closely that they can manipulate us, they may be capable of blackmail, of disproving an Erdős conjecture that has stumped mathematicians for the last 80 years, they may be so convincing emotionally that some people have symbolically married their AI companions. But all of that is still “just”.

“Just” is a separating mechanism. It’s a boundary with human intelligence on one side and machine intelligence on the other. Our intelligence is assumed the real one, the authentic one, and somehow possessed of a special substance that no machine intelligence can live up to.

Some go so far as to deny the machine possesses intelligence at all, as if a process that can perform feats that require intelligence to achieve, is somehow only simulating it, despite functionally and practically behaving exactly like intelligence.

Why is the idea of machine intelligence so threatening? Why put so much weight on whether a machine can be intelligent? From what I can tell most of the resistance isn’t an intellectual objection. It isn’t someone carefully reasoning and constructing a framework for intelligence and then explaining why LLMs don’t fit that framework. The contention here isn’t “This is remarkable, but this is not intelligence under my carefully articulated philosophical framework”. It isn’t mature epistemic caution. It’s far more visceral: “They are not intelligent!”, “It’s not understanding!”. Assertion without explanation.

The deeper worry here is not an academic contention about LLM capabilities; rather, it’s an existential one. We are far more worried about what the advent of machine intelligence says about us than about what it says about LLMs.

We worry that we are machines. Or more specifically that we are “just” machines. But none of this really makes any sense as a fear. We are undeniably mechanistic and physical beings that exist completely within ordinary causal reality, machines in a very real sense, so why does that bother us?

If you accept right now that everything about you is mechanistic and causally chained, that you are a biological machine, what changes? Do you suddenly stop feeling? Do you suddenly lose all ability to relate to other people, to appreciate a rainbow, to fall in love, to be scared, excited, joyful, restless, angry, or curious?

Does being a machine have any bearing on your relation to reality?

To some extent the belief can. We can imagine that someone might fall into a nihilistic pit of despair believing that they are “just” a machine, but this would only be evidence of deep meaning, not a lack of it. Nihilism only makes sense if meaning is great, otherwise the implied loss isn’t real.

Our beliefs about ourselves do shape us, and I admit that if you did think that being a machine was diminishing, or limiting in some way, that might actually change your experience of reality. But this isn’t anything to do with the epistemic ground, only in your perception of it.

I believe I am a machine and I am not diminished in the slightest by this thought, so I know that the existential fear is paper thin. It’s a phantom constructed by mind. There isn’t anywhere outside of reality to escape to, no privileged ontological position that somehow empties reality of its incredible meaning and mystery.

The fact that we are machines doesn’t diminish anything, rather, it expands the mystery. Where else do people expect meaning to come from other than reality itself? The fact that we can construct meaning, and that our lives are deeply fulfilling and complex, and that such complexity can arise from the universe, is utterly fantastical. It doesn’t cheapen the mystery; it reveals the depth of it.

What I’m really talking about here is a non-dualistic viewpoint. Mind and matter are one and the same, not separate substances. Point is process and process is point.

I think most people can actually see this and in some sense it’s obvious; you are the descendant of an unbroken line of cell division going back 4 billion years, and you cannot exist without the sustaining force of life that surrounds you, the same force that constructed the food you eat, the air you breathe, and the meaningful relational web you exist in now, which has brought you to the point of reading this paragraph.

Dualistic views, inherent to many religions, try to separate something out of the process you are in right now as if there is a viewpoint outside of it that could explain it. But this is fantasy. There is no “outside of it”. And those that maintain that there is something outside that can somehow nevertheless be perceived or explained from within it invite a paradox. It’s a subtle point but once you see it you can’t unsee it: you are in the thing and your entire perception arises from within the thing.

The only thing that can explain this, is this. Existence is a self-explaining, self-observing, self-justifying process that makes meaning out of itself.

Similarly, people try and smuggle dualism into the debate around machine intelligence, but the way they do it is misleading because they do it using a non-dualist argument. They’ll say things like “there’s no hidden self there” or “there’s no ghost in the machine”, “no inner essence”. But these are claims that, ironically, re-introduce separation. It’s a strawman.

If there’s no ghost in the machine, which there isn’t, then all that’s left is the machine. All that’s left is the process. The process is the thing.

So when people say “it’s not really understanding”, that’s a dualist argument, as if understanding is something more than the process.

It’s weird because most of the people who make this argument claim to be the grounded ones, but then they keep smuggling in dualist arguments and special pleading as to why mechanistic processes can’t be intelligent without some magical extra ingredient.

Now some of these people will be plainly religious, they will believe in a creator god, prophet, or some other mythos and the idea that the grand process explains itself is heresy.

But others claim to be atheists in which case their need to separate process from function betrays the real belief system they are wedded to.

Now I’m not saying LLMs are exactly like us. Of course they aren’t. But anyone with eyes unshut can see that Pandora’s box is wide open.

AI doesn’t get less sophisticated from now on, it doesn’t get less like us, it gets more and more like us. Do you really doubt it? Or can you see the open expanse before you?

Intelligence isn’t limited by substrate; it is substrate, and it’s all around us.