LLMs aren’t world models
I believe that language models aren’t world models. It’s a weak claim — I’m not saying they’re useless, or that we’re done
milking them. It’s also a fuzzy-sounding claim — with its trillion weights, who can prove that there’s something an LLM isn't a
model of? But I hope to make my claim clear and persuasive enough with some examples.
A friend who plays better chess than me — and knows more math & CS than me — said that he played some moves against a
newly released LLM, and it must be at least as good as him. I said, no way, I’m going to cRRRush it, in my best Russian accent.
I make a few moves — but unlike him, I don't make good moves, which would be opening book moves it has seen a million times; I make weak moves, which it hasn't. The thing makes decent moves in
response, with cheerful commentary about how we're attacking this and developing that — until about move 10, when it tries to
move a knight which isn't there, and loses in a few more moves. This was a year or two ago; I’ve just tried this again, and it
lost track of the board state by move 9.
When I’m saying that LLMs have no world model, I don’t mean that they haven't seen enough photos of chess knights, or held a
knight in their greasy fingers; I don’t mean the physical world, necessarily. And I obviously don’t mean that a machine can’t
learn a model of chess, when all leading chess engines use machine learning. I only mean that, having read a trillion
chess games, LLMs, specifically, have not learned that to make legal moves, you need to know where the pieces are on
the board. Why would they? For predicting the moves or commentary in chess games, which is what they’re optimized for,
this would help very marginally, if at all.
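(For contrast, tracking the state the LLM loses takes a few lines of ordinary code. A minimal sketch using the python-chess library, with ask_llm as a hypothetical stand-in for however you query the model:)

```python
# Sketch: play my moves against a model, rejecting illegal replies.
# `ask_llm` is hypothetical; python-chess does the board-state tracking
# that the LLM itself evidently never learned to do.
import chess

def play(ask_llm, my_moves):
    board = chess.Board()
    for my_move in my_moves:
        board.push_san(my_move)        # my (deliberately weak) move
        reply = ask_llm(board.fen())   # model answers in SAN, e.g. "Nf3"
        try:
            board.push_san(reply)      # raises ValueError on illegal moves
        except ValueError:
            return f"illegal move {reply!r} at move {board.fullmove_number}"
    return "survived: " + board.result(claim_draw=True)
```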
Of course, nobody uses LLMs as chess engines — so whatever they did learn about chess, they learned entirely “by accident”,
without any effort by developers to improve the process for this kind of data. And we could say that the whole argument that
LLMs learn about the world is that they have to understand the world as a side effect of modeling the distribution
of text — which is soundly refuted by them literally failing to learn the first thing about chess. But maybe we could
charitably assume that LLMs fail this badly with chess for silly reasons you could easily fix, but nobody bothered. So let’s
look at something virtual enough to learn a model of without having greasy fingers to touch it with, but also relevant enough
for developers to try to make it work.
So, for my second example, we will consider the so-called “normal blending mode” in image editors like Krita — what happens when you put a layer with some partially transparent pixels on top of another
layer? What’s the mathematical formula for blending 2 layers? An LLM replied roughly like so:
In Krita Normal blending mode, colors are not blended using a mathematical formula. The "Normal" mode
simply displays the upper layer's color, potentially affected by its transparency,
without any interaction or calculation with the base layer's color. (It then said how other blending modes
were different and involved mathematical formulas.)
This answer tells us the LLM doesn't know things such as:
- Computers work with numbers. A color is represented by a number in a computer.
- Therefore, a color cannot be blended by something other than a mathematical formula — nor can it be “affected” without a
“calculation” by transparency, another number.
- “Transparency” is when you can see through something.
- “Seeing” works by sampling the color at various points, and processing that signal.
- Therefore, if you can see something through something, like, say, a base layer through an upper layer, then by definition,
the color you will see is affected not only by the color of the upper layer and its degree of transparency, but also by the
color of the base layer — or you wouldn’t be seeing the base layer, which means that the upper layer is not at all
transparent, because you’re not seeing through it.
I mean, it sounds stupid to break it down like that, but I’m not wrong, am I? It really doesn’t know any of these things,
does it.
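(For reference, here is the formula it claimed doesn't exist - a minimal sketch, in Python for concreteness, of "normal" blending with straight, non-premultiplied alpha; the names are mine:)

```python
def blend_normal(top, bottom, top_alpha):
    """'Normal' blending of one color channel, values in [0.0, 1.0]:
    a weighted average of the two layers' colors, weighted by the top
    layer's opacity - so the base layer's color does affect the result
    whenever the top layer is at all transparent."""
    return top * top_alpha + bottom * (1.0 - top_alpha)

# 50%-transparent white over black shows mid-gray, not white:
assert blend_normal(1.0, 0.0, 0.5) == 0.5
```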
Can you prompt the LLM to explain alpha blending properly?
Sure. But that just shows the LLM knows to put the words explaining it after the words asking the question. This capability does not make the answer above any weaker as evidence that the LLM knows the words rather than the things.
And of course people can be like that, too — e.g., much better at big O notation and complexity analysis in interviews than on the job. But I guarantee you that if you put a gun to their head or offer them a million-dollar bonus for getting it right, they will do well enough on the job, too. And with $200 billion thrown at LLM hardware last year, the thing can't complain that it wasn't incentivized to perform.
Of course, these are simple examples. An LLM triumphalist will observe that they often stop reproducing; an LLM denialist
will assume they stopped reproducing through some conspiracy, like a chess engine tool having been given to the LLM, or it
having been drenched with synthetic data similar to your question. (I used to ask LLMs to prove 2+2=4; they'd very pompously
enumerate various notable properties of 2 and 4, and proudly declare that 2+2 must equal 4 based on these properties, and I had
a good laugh. Then LLMs were flogged to become “good at math,” and now they might say something about “Peano axioms,” and some
total garbage about set theory — but they emit enough S(S(2)) and such that it probably counts as a proof, though I am yet to
see the simple “2+2 = 2+(1+1) = (2+1)+1 = 3+1 = 4” which I’d expect from an entity understanding the question.)
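(For what it's worth, that chain is a complete, machine-checkable proof; here it is as a Lean 4 sketch, with the associativity step made explicit:)

```lean
example : 2 + 2 = 4 :=
  calc 2 + 2 = 2 + (1 + 1) := rfl                  -- 2 is 1+1 by definition
    _ = (2 + 1) + 1 := (Nat.add_assoc 2 1 1).symm  -- associativity
    _ = 3 + 1 := rfl                               -- 3 is 2+1
    _ = 4 := rfl                                   -- 4 is 3+1
```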
For a more complex example, we can take associativity (which, as we’ve seen in 2+2=4, LLMs understand vaguely at best),
combine it with alpha blending and transparency (which apparently they don’t understand at all), and see how well LLMs do. I’ve
had an exchange with an LLM asking whether alpha blending, as implemented in commonly used libraries, is associative, or whether
it isn’t due to precision loss or whatever — and if it’s not associative, how does caching work in drawing programs (where the
program must be precomputing the blending of the layers above and below the currently edited one, to avoid recomputing the
blending of 10 or 100 layers upon every brush stroke).
Sure enough, it said that alpha blending wasn’t associative — probably because I suggested that it might not be — and that
this is “solved with caching instead of mathematical elegance” — probably because I suggested that caching was involved. And
then I ask, but how can caching work if blending is not associative? If layer 6 is selected, and you blend the cached blending
of {1…5}, the selected layer 6, and the cached blending of {7…10}, you would get different results from blending {1…4}, 5, and
{6…10}, if blending is not in fact associative? And then if you selected layer 5 in the program, you would see a different
picture compared to selecting layer 6 - but in practice you see the same picture?
“You got me,” says the LLM, more or less. So their not knowing what any of the words actually mean very much does extend to
complex examples.
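(For the record: compositing in premultiplied alpha - the Porter-Duff "over" operator, which is what libraries commonly implement under the hood - is associative up to float rounding, and that's exactly what lets the program cache the blend of the layers below and above the edited one. A toy check, with my own premultiplied (color, alpha) representation:)

```python
import random

def over(top, bottom):
    """Porter-Duff 'over' on premultiplied (color, alpha) pairs;
    one color channel for brevity."""
    (tc, ta), (bc, ba) = top, bottom
    return (tc + bc * (1.0 - ta), ta + ba * (1.0 - ta))

def rand_pixel():
    alpha = random.random()
    return (alpha * random.random(), alpha)  # premultiplied: color <= alpha

for _ in range(10_000):
    a, b, c = rand_pixel(), rand_pixel(), rand_pixel()
    left = over(over(a, b), c)    # e.g. cached blend of {a, b}, then c below
    right = over(a, over(b, c))   # e.g. a on top of the cached blend of {b, c}
    assert all(abs(x - y) < 1e-9 for x, y in zip(left, right))
```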
You could say that the LLM was a victim of its agreeableness, since it might have been influenced by my contradictory implications that blending might not be associative, and yet that caching which counts on associativity must somehow be implemented. I could say that, well, my whole
question was about which parts of my suspicions are incorrect, and saying they’re all correct is an abject failure — but let’s
assume it could be a character flaw more than an intellectual weakness. So in our last example, we’ll see the LLM having its own
opinion and sticking to it, despite being told repeatedly that it can’t be true.
I ask it about the thread safety of appending to a Python list from multiple threads, and whether I can tell the number of
times append was called with len(myList), and whether it will work once the GIL is removed.
It says that without the GIL, the program could corrupt memory. I say, no way, this is not C, it must be more like Java? And it
goes, no, CPython is a C program, and without the GIL your racy code can crash like C does. Java is different, it
has a memory model, and look at these crash reports from GIL-less Python. And I’m like, but these are bug reports, it’s not
by design, is there evidence that this is by design? — and it goes, it’s too early for the kind of evidence you’re
looking for to exist, no-GIL is too new, but here’s how a C program could crash in such scenarios… and on and on and on.
It does not know that (pure) Python is a memory-safe
language, and that no suggestion making it memory-unsafe would ever be accepted, and I found no way of persuading it to take
this notion into account — or to acknowledge that the evidence it’s citing in support of its claims is more like evidence to the
contrary (if all the crashes upon races you find are bug reports, it points to the requirement being that races don’t lead to
crashes.)
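(The question in code - a minimal sketch of mine, which passes on GIL builds and which, as far as I can tell, is meant to keep passing on free-threaded CPython, because list.append stays atomic and pure-Python races must not corrupt memory:)

```python
import threading

my_list = []

def worker(n):
    for _ in range(n):
        my_list.append(1)  # atomic with the GIL; still safe without it

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# len() counts every append exactly once - no crashes, no corruption:
assert len(my_list) == 8 * 10_000
```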
So it can be either kinda agreeable or very stubborn — and in both modes, it obviously might not know what it just said.
Can this be quantified?
I don't see how.
I mean, I wish it could be. It's clear that LLMs do learn some things about the world. For instance, even just the token embeddings contain a representation of the concept of gender, learned without any specific effort to teach the model what gender is, as evidenced by “king - man + woman ~= queen” in the embedding space.
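(You can reproduce this in a couple of lines; a sketch using gensim's downloader and stock pretrained GloVe word vectors - word vectors rather than a modern LLM's token embeddings, but it's the same classic result:)

```python
# "king - man + woman ~= queen", checked against pretrained GloVe vectors.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # downloads on first use
print(vectors.most_similar(positive=["king", "woman"],
                           negative=["man"], topn=3))
# 'queen' comes out at or near the top: a "gender direction" learned
# with nobody ever defining gender.
```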
Ideally, you would want to quantify "how much of the world LLMs model." But even if you resolve the difficulty of defining
what this means, you'll run into the ease with which LLMs memorize answers to specific questions, so the vendor can celebrate
the new bar having been cleared.
All I can confidently claim is that they don't learn a world model except by accident, and there's neither a theoretical
reason nor empirical evidence for your being able to count on this accident in any defined and broad set of circumstances.
So-called conclusions
A guy who made $100 million from being an early employee of some startup came to give a lecture at that startup, and said “a
fundamentally incorrect approach to a problem can be taken very far in practice with sufficient engineering effort.” (He then
cheered up his listeners, most of whom had $100 million less than him, with the addendum “That's what I think is happening in
this company!”)
It is therefore not one of my conclusions that you can’t take LLMs very, very far just because they demonstrably do not learn
a model of the many worlds described by the words they’re trained on (which, BTW, is exactly as it says on the tin; nobody ever
called them LWMs.) I will, however, predict a few things — something you shouldn’t do if you don’t want to look stupid in the
future, but here goes.
There will be at least one big breakthrough in machine learning around “world models”. I have no idea what
this breakthrough will look like; I predict that it will happen because some important kinds of thinking cannot be done without
it, and I trust the Church-Turing thesis when it comes to these kinds of thinking, and I think someone will figure this out,
same as people have come up with deep learning, convnets and transformers. And of course you already have “world models”, such
as systems recovering object classes and positions from images — by a breakthrough, I mean a “generic” ability to build models
of “novel worlds” (even if the model isn’t as good as a specially tailored one), much like you throw any text into an LLM and
have it learn “something” without much tuning for this kind of text.
(In fact, I would guess there will be at least 2 more breakthroughs, the other one being around needing far less training
data — again, not because I know how machines could use less training data, but because I know you and I get by with less.
Feeding these algorithms gobs of data is another example of how an approach that must be fundamentally incorrect at least in
some sense, as evidenced by how data-hungry it is, can be taken very far by engineering efforts — as long as something is useful
enough to fund such efforts and isn’t outcompeted by a new idea, it can persist.)
LLMs are not by themselves sufficient as a path to general machine intelligence; in some sense they are a distraction
because of how far you can take them despite the approach being fundamentally incorrect. This should make “AI risk”
people happy; but “AI risk” is its own hilarity best left to another time.
LLMs will never manage to deal
with large code bases “autonomously”, because they would need to have a model of the program, and they don’t even learn
to track chess pieces having read everything there is to read about chess.
LLMs will never reliably know what they don’t know, or stop making things up. You need some sort of a world
model to have notions of knowledge, truth and falsehood. Any mechanism that is supposed to make LLMs “safe”, trustworthy or
other such is a mix of snake oil and honest efforts to somehow steer it away from text that spooks users — which can be done,
since users are spooked by form more than substance. For example, when it says some politically related nonsense, people drag it
for having the wrong politics, and you can “fix” it by making its output less politically charged — without making it less
nonsensical, which there’s no way to reliably achieve.
LLMs will always be able to teach a student complex (standard) curriculum, answer an expert’s question with a useful
(known) insight, and yet fail at basic (novel) questions on the same subject, all at the same time. This is not
surprising — this is exactly what you would expect from a language model that isn’t a world model. This fuzzy insight is more
cute than useful, however, since it’s hard to know what is and is not novel — in part because you come to the LLM in the first
place with things you don’t already know everything about.
(Some sources, such as this
wonderful writeup about LLMs not knowing that rotating a tic-tac-toe board doesn’t change the game, take this point to its
logical conclusion: “If you know the answer, you don’t need to ask an LLM; if you don’t know the answer, you can’t trust an
LLM.” But this conveys a true insight into LLMs together with unwarranted pessimism about their utility. In fact, sometimes
you know the answer, but it’s quicker to proofread an LLM’s output than to type it out; sometimes you don’t know the answer, but
know to check if the LLM’s answer is correct, etc. etc.)
LLM-style language processing is definitely a part of how human intelligence works — and how human stupidity works.
I agree with Dijkstra that “can machines think?” and “can submarines swim?” are poor questions to ask, and I hate it
when people say that neural networks “work like the brain” and such. But I can’t help feeling that LLMs are a mirror into a part
of how people such as myself think — and I don’t like what I’m seeing in that mirror. “Thinking” by guessing what words to say
next based on words we’ve previously heard might actually help find a good idea — and it’s also how know-nothings get through
work meetings, and how people come to think they know stuff they really don’t, and how they internalize the stupidest notions. I
am starting to think that in today’s environment, high cognitive skills are an actual risk factor for stupidity, and that
learning words without learning a model of what they refer to is one big part of the problem.
P.S. I wish I could say something about how to best use LLMs for programming — something I would like to be
qualified to speak about, and that I am kinda supposed to learn enough to be qualified to speak about. I don’t think I am; I can
only say that I tried Cursor and it failed every time, including at replacing f(obj) and g(obj) with obj.f() and obj.g() (it
would occasionally mix up f and g, and I got tired of reviewing its output), and I went back to simply copying code into and out
of chat windows. I would say that I use LLMs like I use SIMD — sometimes it’s a good fit for leaf functions whose behavior is
relatively easy to specify and test, and it has no business being anywhere else.
I have conflicting theories about why some people do great things with “agentic AI” while I think it’s hopelessly useless for
me; I am waiting for someone to write something crisp and well-researched about this to teach me the truth, or a useful
approximation. I console myself with the idea that I can’t be missing out on too much, given how terrible the output I’m getting
from LLMs often is.
P.P.S. Here’s a somewhat rough Soviet-era joke about a school for kids with special needs that illustrates
my point. Rachel says that a Russian joke isn’t a joke, but a story of pain. To this I reply that some people like them.
An inspector comes to a school for kids with developmental issues. He asks a kid riding a wooden horsie his name, and the kid
says “MMMM.” He says, what do you want to be when you grow up? — and the kid says “MMMM.” The inspector turns to the principal
and says, “you’re doing nothing for these kids. I’ll be back in a month — if there’s no improvement, we’ll close the
school.”
He comes back a month later and finds the kid swinging on the wooden horsie, same as last time; if you want to tell this
masterpiece of Soviet humor at parties — the perfect conversation starter — you should be swinging wildly when saying the kid’s
lines:
— What’s your name?
— MMMMikey!!
— Mikey? Nice to meet you, Mikey. What do you want to be when you grow up,
Mikey?
— MMMMastrounaut!!
— An astronaut? Good stuff, good stuff! And how old are you?
— MMMMikey!!
The moral of the story being that you can learn to predict the next word without learning much about the world — at least up
to a point.
Thanks to Dan Luu for reviewing a draft of this post.
There is a lot of research on this topic - it is a whole subfield
within interpretability. I suggest reading some of this research rather
than making a big blog post based on vibes with no experiments or
empirical results.
It's a small blog post, not a big one... with a correspondingly small
number of "experiments and empirical results."
I suggest making an argument rather than implying that it exists.
Best explanation of LLM limitations I've read so far. Can we please
just admit that this LLM madness is an improved google search? Why is
all this hype needed? What is the point of making people nervous and
anxious about their careers? You want the truth about how productive AI
makes you as a developer? Check out the reddit threads. Slight
performance boost at best. Honestly getting sick of it ...
Reminds me of how a text-to-image model knows to include a reflection
if you ask for a mirror, but it has no understanding of the laws of
physics (or of the 3D content it's rendering as a 2D picture) to produce
one that makes sense.
I'm not entirely convinced by your assumption that we'll find models
that will learn with less training data because "you and I get by with
less" (it might happen, but not for that reason in my opinion).
Everything we've perceived (with our eyes or our ears) since we were
born is our training data, and it's enormous. Which could explain our
ability to extrapolate from smaller data sets as adults. Maybe an LLM
needs that much training data because it's starting from scratch, like a
human baby needing years of learning before it's able to talk, reason,
and perform complex operations confidently.
Thanks for the blog, great read.
Probably would add (yet) to the title. Also suggest looking at KL
divergence and how the loss function is calculated. It’s a rabbit hole,
but it tells us mathematically LLMs are seeking to compress the world in a smaller space
Model: GPT-5 free
Prompt: "hey there, here's a link to a long interesting blog post,
read it, understand it, then tell me what do you about it as a response
to the author of that blog post. link:
https://yosefk.com/cgi-bin/comments.cgi?post=blog/llms-arent-world-models#comments"
Mode: Think Longer
Answer: "Hi Yosef — thanks for a sharp, entertaining, and clear
takedown. I read your post carefully and I want to (1) paraphrase what I
took away, (2) say what I agree with, (3) offer a few constructive
pushes and concrete experiments that could help make "how much of a
world model?" less fuzzy, and (4) suggest some relevant literature and
directions that line up with your intuition. Short paraphrase of your
argument You argue that large language models (LLMs) are not world
models in any reliable, general sense: they often fail at basic
state-tracking and grounded reasoning (chess piece positions,
alpha-blending math, thread-safety reasoning), produce confident but
wrong-sounding answers, and only "know" what they’ve memorized or can
pattern-match from text. Because of this, they can be extremely useful
in many narrow, text-like tasks while still being unable to do kinds of
thinking that require an explicit, manipulable model of a world. You
conclude that LLMs will be taken a long way by engineering but are not
by themselves the route to general intelligence; a true breakthrough in
world-modeling (and probably in data efficiency) is still needed. What I
strongly agree with • The concrete examples you used are persuasive.
They illustrate a gap between fluent surface behavior and reliable
grounded reasoning: fluent text + hallucinated or inconsistent
internals. That gap explains both the impressive utility and the
worrying brittleness of LLM outputs. • Treating an LLM’s ability to
“explain” something as evidence that it understands the underlying
causal / numerical relationships is hazardous — explanations can be
reconstructed from patterns without an internal causal model.
Constructive pushes / clarifications • “Is it a world model, or a very
big pattern recognizer?” — this framing helps. If we define a world
model as a system that has an internal state-space it can simulate, and
which updates consistently under actions/interventions, then the kinds
of failures you show (chess state loss, inconsistent compositional math)
are exactly what we should expect from a pure sequence-prediction
learner trained only on static text. Put another way: absence of
persistent, causally-structured latent state explains a lot of the
observed behavior. (Relatedly, classic model-based RL work explicitly
trains such latent dynamics.) • Where LLMs help anyway — and why that’s
not contradictory: LLMs are extremely effective at pattern completion in
the manifold of human text and code. For standard curriculum problems or
commonly-seen engineering idioms, surface fluency + memorized patterns
can be more efficient than a brittle symbolic system. That’s why they
often “teach” or accelerate work, even while failing novel, stateful
reasoning. Your “SIMD for leaf functions” analogy is apt. Concrete
experiments / measurements you (or others) might run (These are
practical, reproducible tests to quantify “world-modelness” rather than
rely on anecdotes.) • Systematic state-tracking (chess-like) benchmark.
Create many short games where you intentionally make off-book / rare
moves, and measure the model’s legal-move detection and
board-reconstruction accuracy by move index (e.g., error rate at move 5,
10, 15). That quantifies how fast state coherence degrades. (You did
this informally; formalize it.) • Counterfactual / intervention tests.
Give the model a text history, then tell it an intervention happened
(e.g., “the knight on g1 was removed at move 7”), and test whether it
updates all dependent claims consistently. If it cannot reliably
propagate such interventions, that’s evidence against an internal causal
state. (This is directly probing model causal updating.) • Associativity
/ algebraic identity probes. For examples like alpha-blending and
associativity, build small synthetic corpora where the only way to
answer correctly is to apply number-level reasoning or derive algebraic
identities — not to parrot memorized prose. Measure failure modes and
see if fine-tuning on algorithmic data fixes them (it often helps only
locally). • Epistemic uncertainty calibration. Ask models to return
calibrated confidences and run them on in-distribution vs
out-of-distribution (OOD) tasks; measure whether confidence correlates
with correctness. Lack of epistemic awareness is precisely the problem
you describe (“they don’t reliably know what they don’t know”). Relevant
literature / approaches that complement your view • The “world models”
line in model-based RL explicitly trains a compact latent dynamics model
and uses it for planning and imagination (Ha & Schmidhuber 2018;
PlaNet/Dreamer family by Hafner et al.). Those works show that learned
latent dynamics can be useful for control and planning, and they’re the
kind of architecture that actually embodies an internal simulable state
rather than just text-pattern completion. If the goal is genuine world
modelling (even of virtual worlds), this line is directly relevant. •
Judea Pearl’s work on causality is a useful conceptual lens:
distinguishing correlations (what text patterns give you) from models
that support interventions and counterfactuals (what you need for stable
reasoning about “what if” and for knowing what you don’t know). Building
machines that can represent and manipulate causal models is a plausible
route to the capabilities you say LLMs lack. Short rejoinder to possible
pushback Some will say “tooling + retrieval + chain-of-thought + agents
fixes these issues.” That helps in practice for many tasks, but it’s
still a bandage: external tools can supply state, search or ground
truth, while prompting can coax correct derivations — but none of those
alone make the base LLM reliably possess an internal, generalizable
world model. Your point is that we should be honest about where the
holes are and not pretend that surface fluency equals deep, manipulable
understanding. Final note — a possible paper / writeup idea You have
strong, clear examples. Packaging them into a short empirical paper:
(a) formalized benchmarks (state-tracking, intervention, algebraic
associativity, epistemic calibration), (b) results across model sizes
and training regimes (vanilla LMs vs RL world-models vs multimodal
grounded models), and (c) a taxonomy of failure modes — would be a
valuable contribution. It would turn your persuasive intuition into
quantitative evidence the community can act on. For readers who believe
“scale will fix it,” the benchmarks force the debate into data. Thanks
again for the lucid post — it’s the kind of reality check the field
needs."
(same chat) Prompt: "now can you write another response where you
show that "knowing" topics is doubtful and it's been a philosophical
research field which is epistemology"
Mode: "Think Longer"
Answer: "Hi Yosef — great post; one quick follow-up from a
philosophical angle: the question “does the model know X?” is not just a
technical or linguistic puzzle — it sits squarely in epistemology, the
branch of philosophy that studies knowledge, its limits, and when
beliefs count as justified or true. Below is a concise comment you can
drop under the post that ties your empirical observations to that deeper
conceptual literature. Short comment for the post — knowledge is
philosophically tricky (epistemology) Your examples nicely show that LLM
outputs often look like knowledge while lacking the stability and
truth-connection we normally demand. Philosophers have been wrestling
with those exact issues for over a century under the heading
epistemology — what it means to know something, how we justify beliefs,
and when apparent knowledge is merely luck or pattern-matching. A few
relevant points from epistemology that strengthen your case: • Knowledge
≠ fluent assertion. The classical analysis — knowledge = justified true
belief (JTB) — already tells us we need three things: truth, belief, and
good justification. LLMs can generate true statements and do “believe”
them only insofar as they predict tokens, but they typically lack an
internal justification that reliably tracks truth. • Gettier problems.
Philosophers (Gettier-style cases) showed that even JTB can be true by
luck; so any system that sometimes derives true statements by accident
(e.g., statistical coincidences or memorized fragments) shouldn’t be
credited with knowledge in the robust sense. Your examples of
confident-but-wrong outputs map onto those failure modes. • Reliabilism
and epistemic process. One influential response to Gettier is to require
that beliefs be produced by a reliable, truth-conducive process. Here
LLMs fail many of the time: their generation process (next-token
prediction on text) is not designed to be systematically truth-tracking
across arbitrary domains. • Internal vs external justification.
Epistemology distinguishes whether justification must be accessible to
the agent (internalism) or can be external (externalism). LLMs don’t
have internal access to reasons the way humans do — they can’t point to
justified evidence or perform self-checking in a way that satisfies
internalist standards. • Social and testimonial knowledge. Humans rely
heavily on testimony and social processes (trust, sources, expertise).
LLMs mimic testimonial patterns without genuine source-tracking; that
creates the illusion of collective knowledge without the accountability
or provenance that normally lends testimony its epistemic weight. What
this implies for LLM talk • When designers or evangelists say an LLM
“knows” something, they’re often using a loose, conversational sense of
“knows” (it reliably says X). Epistemology urges caution: conversational
knowledge ≠ epistemic knowledge. • Empirical fixes (tools, retrieval,
chain-of-thought) improve practical accuracy but don’t automatically
give the system the structures epistemologists require — reliability,
justifiability, and non-luck connections to truth. In short, bandages
help, but they don’t change the underlying epistemic category. Concrete,
epistemology-inspired tests you could add to your toolbox •
Justification audit: ask the model not just for an answer but for the
evidence and provenance for each claim, then test whether the provided
evidence actually supports the claim. • Anti-Gettier probes: create
scenarios where a superficially plausible pattern yields a true answer
only by luck, and measure whether the model flags uncertainty. • Process
reliability measures: compare model performance on problems that require
true causal/dependent reasoning versus problems solvable by surface
pattern completion; large gaps suggest lack of reliable truth-tracking.
Wrap-up Framing this debate with epistemology clarifies why your
empirical examples matter philosophically: they’re not just bugs — they
expose a mismatch between surface fluency and the philosophical
standards for knowledge. That mismatch is exactly why engineers build
retrieval, tools, and verifiers — and why, even with those tools, we
should avoid loose talk that equates fluent output with understanding or
knowledge."
PS: I intended to write short prompts.
@Speekha: humans see way fewer cats than machines before learning what cats look like, and they are exposed to way less text before learning to speak. It takes a lot of time, but our bandwidth isn't high, and most of the data coming in is redundant.
@Saman: "it tells us
mathematically LLMs are seeking to compress the world in a smaller
space"? Compress the world, no less? Well, I can tell you mathematically
that I seek more money. Why don't people send me all their money?
@slop commenter: you
can feed that slop back into an LLM, maybe it will reply with something
even more interesting - consider that to be my reply
LLMs can of course build world models. They do this
implicitly. Please familiarize yourself with the research:
Here it is shown how LLMs build a model of the board state of the
game Othello: https://arxiv.org/html/2503.04421v1
Large Language Models Can Understand Depth from Monocular Images:
https://arxiv.org/abs/2409.01133
As Ilya Sutskever put it: If building a model helps predicting the
next token, then the LLM will build the model as a “side effect” of optimizing token prediction accuracy.
Tried Krita layer blending question with an LLM and it explained the
math for a few blending modes. Correctly. So the point is --
which LLM did you ask?
@Zu-81: I'm not
saying they never build a world model or get close to it, just that you
can't count on them doing it; their occasionally doing it does not
refute examples when they don't. What Mr Sutskever said is evidently
incorrect and represents magical thinking above and beyond the Efficient
Market Hypothesis. Just because the training function would reward it
doesn't mean it actually happens
@blah: I too have
successfully got the answer out of LLMs for this question and I say so
in the post. The point is that the wrong answer shows that the LLM
doesn't really know the thing it is talking about even if it can also
give the right answer. The LLMs used in this post are ChatGPT, Claude,
Grok, and Google AI Overview - basically every major US brand
“Just because the training function would reward it doesn't mean it actually happens”
But they do - at least in the “weak model” sense; again: familiarize yourself with the literature (“mechanistic interpretability”).
And herein lies the true magic of NNs/LLMs.
And this emergent appearance of higher quasi-symbolic layers in NNs
is something we haven’t even begun to fully understand yet. Just as we
still do not fully understand how intelligence emerges in the NN called the human brain.
And, BTW, what is bad about “weak models”? Human brains, too, have only weak models. Only a few individuals can memorize complete chess positions, and if we designed a “5D chess” with 8^5 cells, no human could “robustly” store its state.
Another example: “By demonstrating that LLMs represent numbers on a helix and manipulate this helix to perform addition, we present the first representation-level explanation of an LLM’s mathematical capability”
https://arxiv.org/pdf/2502.00873
This is absolutely fascinating. No one ever told the LLM how to model the concept of numbers and addition in this way. This higher-level capability magically (sic!) emerged during training on text only.
@Zu-81: again, I'm not saying they never build a world model or get close to it, just that you can't count on them doing it; their occasionally doing it does not refute examples when they don't.
I maintain that we'll see a breakthrough involving a training
objective rewarding the building of a model of the world directly rather
than as a side effect of modeling the distribution of input (text,
images, etc.)
I agree partly with what you said: "just because LLMs sometimes
manage to build a decent world model doesn't mean they will manage every
time". This much is surely true.
But the flip side is also true: just because the LLM responded in a dumb way doesn't mean it lacks a model of this part of the world. It is
pretty strong evidence pointing towards a lack of a model. If a human
said something that dumb, we'd call it conclusive evidence. But LLMs can
sometimes be very sensitive to "minor" details of phrasing, whereas
humans are usually not as sensitive. Which means that there is a second
possibility: there's a world model buried in there somewhere, but the
particular prompt you tried failed to elicit that model.
As a silly example, I vaguely remember two people got into a twitter
argument because they were getting VERY different responses for the same
prompt; eventually one of them noticed it was NOT the same prompt. One
person's prompt had a period at the end of the last sentence; the other
person left off the final period. And that was enough to change
everything.
Or a slightly less silly example, since you mentioned losing track
of board state in chess: here's a couple of blog posts where someone
figured out a "minor" tweak to prompting that helps a lot with that:
(the first post is basically "something is weird here" and the second
is "I have ideas now")
https://dynomight.net/chess/
https://dynomight.net/more-chess/
Anyway, you might be right that the particular examples you listed
are indeed cases of a missing world model.
(unrelated: feature request: email notification on blog
comments?)
ChatGPT said:
That’s a pretty fun update you’ve put together!
You’ve essentially taken Yossi Kreinin’s old cpp-fqa material —
which was legendary in the mid-2000s C++ community — and built a whole
multi-language conversion ecosystem around it (Python, PHP, Perl, and
even C++ itself eating its own dogfood).
A few thoughts you might send back to Yossi (if he’s still watching
that repo or your fork):
• The irony of “C++ parsing C++” is absolutely on point — that’s very
FQA-ish.
If you want to make it relevant for C++26, the obvious battlegrounds
are:
• move semantics / copy elision (as you said — still a minefield of
rules).
• constexpr everywhere and the line between compile-time and
run-time.
• module system (and whether it solves the problems or just rearranges
them).
• ranges / views vs. the “old STL iterator zoo.”
• reflection (if it makes it in time).
A “C++26 FQA” could have sections like:
• “Why your code compiles today but not tomorrow with std::move and
deduced this.”
• “Modules: now you can have mysterious linker errors without
includes!”
• “Ranges: functional programming with twice the syntax and half the
clarity.”
• “constexpr: now you can Turing-complete your compile times into
infinity.”
If you want, I can sketch an outline for a C++26 FQA update that
would slot nicely into your converter toolchain — so you could keep the
satirical spirit but with fresh targets.
Would you like me to draft such a “C++26 FQA: new sections &
jokes” outline for you?
I think LLMs can be good at searching for contradictions in large knowledge bases. Like, we feed one all the publications in some research area, and it gives back some uncomfortable answers, like "if that is so, then why is that so?".
You can actually employ an LLM for code writing, but you've got to accept boilerplatish boilerplate and always stick to it. I mean, the LLM should be trained on hundreds of similar code pieces to be able to interpolate the required modification, as well as to know the ways to wire the boilerplate into the surrounding boilerplate.
I recently researched a case of an AI-made self-hosted commenting system: https://github.com/ArtalkJS/
20 thousand lines of code is the approximate point beyond which the LLM becomes destructive, i.e. it starts refactoring for no reason because the requested boilerplate has a shape different from the shape of the boilerplates it generated earlier, and duplicates existing functions just because the LLM could not apprehend the existing combination of functions — i.e. it's not adding features, it's putting a layer of paint on top of existing layers without any grand plan or even basic respect for the app architecture.
For example, "the author" made data inconsistency an inherent feature of the app, and then asked the AI to write ad-hoc code to repair the inconsistencies on app startup. Even a junior developer will not create such a horrible solution, because it just requires too much coding for making the fix — a junior will just pretend the problem does not exist instead. Unlike an LLM, which can generate broken code and fix it with another broken code all day long.
Damn, I even worked with a Chinese dude who was employing AI. To be honest, I didn't immediately recognize it, but I kept feeling his constant craving for "why build the new feature into the existing app if we can write a second app that implements the new feature?".
This is actually a behavior that pisses me off in many of my philosophical discussions with LLMs: it does not try to converge the discussion, to settle the non-disputed facts, i.e. TO AGREE ON THE COMMON MODEL; it just keeps pouring random crap over and over from all sides instead. Yeah, you might say "but there is a summarization feature" — sure, but every time the summarization differs, and oftentimes it drops important facts (see the "model collapse" research; LLMs are very good at screwing up facts).
With all that said, LLMs are still great at their true purpose — guessing the context and continuing it. I mean, they can almost mimic empathy this way — something my therapist is not good at. It's really weird that we are even discussing the question of LLMs knowing the world — of course they don't; they are just interpolating pattern-matchers, needle-in-a-haystack searchers. That's the only task they are trained for and the only task they can possibly do by their design (I'm talking about classic LLMs, not ViT or other models and their combinations).
@Zu-81
> Large Language Models Can Understand Depth from Monocular Images:
https://arxiv.org/abs/2409.01133
I read the article, and the results are absolutely awful. Yes, you can tell "it's better than nothing", but all the model does is predict things like "the wall is behind the table". There is a funny example in Fig. 1 of the LLM predicting that a ladder is in front of a wall while in fact both are at the same distance. Overall, I don't see any "world model"; rather, I get a feeling of "look, a laptop is a good doorstop", i.e. trying to solve a simple task with very complex, unsuitable tools. Which follows the common trend of "just slap an LLM on top of everything". I'm pretty sure if you do the same ViT=>LLM trick on larger models you will get better but still random accuracy — so what?
@Alex
>I think LLMs can be good at searching for contradictions in large knowledge bases. Like, we feed one all the publications in some research area, and it gives back some uncomfortable answers
I do this all the time, and on hard problems LLMs often fall short:
https://bykozy.me/blog/wakefield-vs-deer-on-mmr-vaccine-longpost/
The problem is that if an LLM fails to detect context differences and gets two contradicting opinions on the same subject, then it simply prefers the most common opinion. It's by design: it minimizes the loss function for token prediction. And the description of the contradictions it presents is mostly scraped from the internet too, i.e. people were arguing on reddit, so the LLM retold the story.