LLMs aren’t world models

August 10th, 2025

I believe that language models aren’t world models. It’s a weak claim — I’m not saying they’re useless, or that we’re done milking them. It’s also a fuzzy-sounding claim — with its trillion weights, who can prove that there’s something an LLM isn't a model of? But I hope to make my claim clear and persuasive enough with some examples.

A friend who plays better chess than me — and knows more math & CS than me - said that he played some moves against a newly released LLM, and it must be at least as good as him. I said, no way, I’m going to cRRRush it, in my best Russian accent. I make a few moves – but unlike him, I don't make good moves1, which would be opening book moves it has seen a million times; I make weak moves, which it hasn't 2. The thing makes decent moves in response, with cheerful commentary about how we're attacking this and developing that — until about move 10, when it tries to move a knight which isn't there, and loses in a few more moves. This was a year or two ago; I’ve just tried this again, and it lost track of the board state by move 9.

When I’m saying that LLMs have no world model, I don’t mean that they haven't seen enough photos of chess knights, or held a knight in their greasy fingers; I don’t mean the physical world, necessarily. And I obviously don’t mean that a machine can’t learn a model of chess, when all leading chess engines use machine learning. I only mean that, having read a trillion chess games, LLMs, specifically, have not learned that to make legal moves, you need to know where the pieces are on the board. Why would they? For predicting the moves or commentary in chess games, which is what they’re optimized for, this would help very marginally, if at all.
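For contrast, here is the minimal "world model" behind legal chess moves: an explicit board state that every move is checked against. This is a sketch using the python-chess library, purely to illustrate what "tracking the pieces" means:

```python
import chess  # the python-chess library: an explicit, queryable board state

board = chess.Board()                  # starting position
for move in ["e4", "e5", "Qh5"]:       # each move updates the tracked state
    board.push_san(move)

knight_move = chess.Move.from_uci("g8f6")
print(knight_move in board.legal_moves)   # True: there really is a knight on g8

ghost_move = chess.Move.from_uci("d4e6")  # a knight-shaped move from an empty square
print(ghost_move in board.legal_moves)    # False: you can't move a piece that isn't there
```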

Of course, nobody uses LLMs as chess engines — so whatever they did learn about chess, they learned entirely “by accident”, without any effort by developers to improve the process for this kind of data. And we could say that the whole argument that LLMs learn about the world is that they have to understand the world as a side effect of modeling the distribution of text — which is soundly refuted by them literally failing to learn the first thing about chess. But maybe we could charitably assume that LLMs fail this badly with chess for silly reasons you could easily fix, but nobody bothered. So let’s look at something virtual enough to learn a model of without having greasy fingers to touch it with, but also relevant enough for developers to try to make it work.

So, for my second example, we will consider the so-called “normal blending mode” in image editors like Krita — what happens when you put a layer with some partially transparent pixels on top of another layer? What’s the mathematical formula for blending 2 layers? An LLM replied roughly like so:

In Krita Normal blending mode, colors are not blended using a mathematical formula. The "Normal" mode simply displays the upper layer's color, potentially affected by its transparency, without any interaction or calculation with the base layer's color. (It then said how other blending modes were different and involved mathematical formulas.)

This answer tells us the LLM doesn't know things such as:

- that "the upper layer's color, potentially affected by its transparency" is itself the result of a calculation, namely the standard alpha-blending formula, roughly result = alpha*top + (1 - alpha)*bottom;
- that this calculation very much involves the base layer's color, which is exactly the "interaction" its answer says isn't there;
- that this formula is what "normal" blending means in Krita and in pretty much every other image editor.

I mean, it sounds stupid to break it down like that, but I’m not wrong, am I? It really doesn’t know any of these things, does it.
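For concreteness, here is a minimal sketch of what the "normal" mode actually computes, including the case where the bottom layer is itself partially transparent. This is hypothetical illustration code, not Krita's implementation:

```python
# Straight-alpha "over" compositing: the "normal" blending mode.
def blend_normal(top_rgb, top_a, bottom_rgb, bottom_a):
    """Composite top over bottom; colors are (r, g, b) in [0, 1], alphas in [0, 1]."""
    out_a = top_a + bottom_a * (1.0 - top_a)
    if out_a == 0.0:
        return (0.0, 0.0, 0.0), 0.0          # both layers fully transparent
    out_rgb = tuple(
        (t * top_a + b * bottom_a * (1.0 - top_a)) / out_a
        for t, b in zip(top_rgb, bottom_rgb)
    )
    return out_rgb, out_a

# 50%-transparent red over opaque white gives pink, not "the upper layer's color":
print(blend_normal((1, 0, 0), 0.5, (1, 1, 1), 1.0))  # ((1.0, 0.5, 0.5), 1.0)
```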

Can you prompt the LLM to explain alpha blending properly? Sure. But that just shows the LLM knows to put the words explaining it after the words asking the question. That capability doesn’t make the answer above any weaker as evidence that the LLM knows the words rather than the things.

And of course people can be like that, too - eg much better at big O notation and complexity analysis in interviews than on the job. But I guarantee you that if you put a gun to their head or offer them a million-dollar bonus for getting it right, they will do well enough on the job, too. And with $200 billion thrown at LLM hardware last year, the thing can't complain that it wasn't incentivized to perform3.

Of course, these are simple examples. An LLM triumphalist will observe that they often stop reproducing; an LLM denialist will assume they stopped reproducing through some conspiracy, like a chess engine tool having been given to the LLM, or it having been drenched with synthetic data similar to your question. (I used to ask LLMs to prove 2+2=4; they'd very pompously enumerate various notable properties of 2 and 4, and proudly declare that 2+2 must equal 4 based on these properties, and I had a good laugh. Then LLMs were flogged to become “good at math,” and now they might say something about “Peano axioms,” and some total garbage about set theory — but they emit enough S(S(2)) and such that it probably counts as a proof, though I am yet to see the simple “2+2 = 2+(1+1) = (2+1)+1 = 3+1 = 4” which I’d expect from an entity understanding the question.)
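Spelled out, the derivation I keep hoping to see looks like this (a sketch in LaTeX, taking 2, 3 and 4 as defined by 2 := 1+1, 3 := 2+1 and 4 := 3+1):

```latex
\begin{align*}
2 + 2 &= 2 + (1 + 1) && \text{by the definition } 2 := 1 + 1 \\
      &= (2 + 1) + 1 && \text{by associativity of addition} \\
      &= 3 + 1       && \text{by the definition } 3 := 2 + 1 \\
      &= 4           && \text{by the definition } 4 := 3 + 1
\end{align*}
```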

For a more complex example, we can take associativity (which, as we’ve seen in 2+2=4, LLMs understand vaguely at best), combine it with alpha blending and transparency (which apparently they don’t understand at all), and see how well LLMs do. I’ve had an exchange with an LLM asking whether alpha blending, as implemented in commonly used libraries, is associative, or whether it isn’t due to precision loss or whatever — and if it’s not associative, how does caching work in drawing programs (where the program must be precomputing the blending of the layers above and below the currently edited one, to avoid recomputing the blending of 10 or 100 layers upon every brush stroke.)

Sure enough, it said that alpha blending wasn’t associative — probably because I suggested that it might not be — and that this is “solved with caching instead of mathematical elegance” — probably because I suggested that caching was involved. And then I ask, but how can caching work if blending is not associative? If layer 6 is selected, and you blend the cached blending of {1…5}, the selected layer 6, and the cached blending of {7…10}, you would get different results from blending {1…4}, 5, and {6…10}, if blending is not in fact associative? And then if you selected layer 5 in the program, you would see a different picture compared to selecting layer 6 - but in practice you see the same picture?

“You got me,” says the LLM, more or less. So their not knowing what any of the words actually mean very much does extend to complex examples.
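For the record, the "over" operator behind normal blending is associative (exactly so in premultiplied-alpha form), and that is precisely what lets a drawing program cache "everything below" and "everything above" the edited layer. A minimal sketch, assuming NumPy and premultiplied-alpha layers rather than anything Krita actually does:

```python
import numpy as np

def over(top, bottom):
    """Porter-Duff 'over' for premultiplied-alpha RGBA arrays."""
    return top + bottom * (1.0 - top[..., 3:4])

def composite(stack):
    """Blend a list of layers, bottom first, into one premultiplied RGBA layer."""
    out = stack[0]
    for layer in stack[1:]:
        out = over(layer, out)
    return out

rng = np.random.default_rng(0)
layers = list(rng.random((10, 4, 4, 4)))   # 10 layers of 4x4 RGBA pixels
for layer in layers:
    layer[..., :3] *= layer[..., 3:]       # premultiply color by alpha

full = composite(layers)
for i in range(1, 9):                      # pretend layer i is the one being edited
    below = composite(layers[:i])          # cached blend of layers {1..i-1} below it
    above = composite(layers[i + 1:])      # cached blend of the layers above it
    via_caches = over(above, over(layers[i], below))
    assert np.allclose(full, via_caches)   # same picture whichever layer is selected
```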

You could say that the LLM was a victim of its agreeableness4, since it might have been influenced by my contradictory implications that blending might not be associative, yet caching must be implemented in a way that counts on it being associative. I could say that, well, my whole question was about which parts of my suspicions are incorrect, and saying they’re all correct is an abject failure — but let’s assume it could be a character flaw more than an intellectual weakness. So in our last example, we’ll see the LLM having its own opinion and sticking to it, despite being told repeatedly that it can’t be true.

I ask it about the thread safety of appending to a Python list from multiple threads, and whether I can tell the number of times append was called with len(myList), and whether it will work once the GIL is removed. It says that without the GIL, the program could corrupt memory. I say, no way, this is not C, it must be more like Java? And it goes, no, CPython is a C program, and without the GIL your racy code can crash like C does. Java is different, it has a memory model, and look at these crash reports from GIL-less Python. And I’m like, but these are bug reports, it’s not by design, is there evidence that this is by design? — and it goes, it’s too early for the kind of evidence you’re looking for to exist, no-GIL is too new, but here’s how a C program could crash in such scenarios… and on and on and on.
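To make the question concrete, here is the kind of experiment being discussed, as a minimal sketch in plain CPython: many threads appending to one list, with len() used as the count of appends. Under the GIL the count comes out exact; the dispute above is about whether a free-threaded build is allowed to do anything worse than weaken that guarantee, let alone corrupt memory:

```python
import threading

my_list = []
N_THREADS, N_APPENDS = 8, 100_000

def worker():
    for _ in range(N_APPENDS):
        my_list.append(1)      # the operation whose thread safety is in question

threads = [threading.Thread(target=worker) for _ in range(N_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# With the GIL, list.append is effectively atomic, so this prints 800000.
# The argument above is about what a free-threaded (no-GIL) CPython may do here:
# at worst lose the counting guarantee, not crash or corrupt memory.
print(len(my_list))
```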

It does not know that (pure) Python is a memory-safe language, and that no suggestion making it memory-unsafe would ever be accepted, and I found no way of persuading it to take this notion into account — or to acknowledge that the evidence it’s citing in support of its claims is more like evidence to the contrary (if all the crashes upon races you find are bug reports, it points to the requirement being that races don’t lead to crashes.)

So it can be either kinda agreeable or very stubborn — and in both modes, it obviously might not know what it just said.

Can this be quantified?

I don't see how.

I mean, I wish it could be. It's clear that LLMs do learn some things about the world. For instance, even just the token embeddings contain the representation of the concept of gender learned without any specific effort to teach the model what gender is, as evidenced by “king - man + woman ~= queen” in the embedding space.
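The classic demonstration, for anyone who wants to run it: a minimal sketch assuming gensim and its downloadable GloVe word vectors (not an LLM's own token embeddings, but the same kind of arithmetic):

```python
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")   # pretrained word vectors (downloads once)

# "king - man + woman" lands near "queen" in the embedding space,
# although nothing ever explicitly taught the model what gender is.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# e.g. [('queen', 0.76...)]
```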

Ideally, you would want to quantify "how much of the world LLMs model." But even if you resolve the difficulty of defining what this means, you'll run into the ease with which LLMs memorize answers to specific questions, so the vendor can celebrate the new bar having been cleared.

All I can confidently claim is that they don't learn a world model except by accident, and there's neither a theoretical reason nor empirical evidence for your being able to count on this accident in any defined and broad set of circumstances.

So-called conclusions5

A guy who made $100 million from being an early employee of some startup came to give a lecture for that startup, and said “a fundamentally incorrect approach to a problem can be taken very far in practice with sufficient engineering effort.” (He then cheered up his listeners, most of whom had $100 million less than him, with the addendum “That's what I think is happening in this company!”)

It is therefore not one of my conclusions that you can’t take LLMs very, very far just because they demonstrably do not learn a model of the many worlds described by the words they’re trained on (which, BTW, is exactly as it says on the tin; nobody ever called them LWMs.) I will, however, predict a few things — something you shouldn’t do if you don’t want to look stupid in the future, but here goes.

There will be at least one big breakthrough in machine learning around “world models”. I have no idea what this breakthrough will look like; I predict that it will happen because some important kinds of thinking cannot be done without it, and I trust the Church-Turing thesis when it comes to these kinds of thinking, and I think someone will figure this out, same as people have come up with deep learning, convnets and transformers. And of course you already have “world models”, such as systems recovering object classes and positions from images — by a breakthrough, I mean a “generic” ability to build models of “novel worlds” (even if the model isn’t as good as a specially tailored one), much like you throw any text into an LLM and have it learn “something” without much tuning for this kind of text.

(In fact, I would guess there will be at least 2 more breakthroughs, the other one being around needing far less training data — again, not because I know how machines could use less training data, but because I know you and I get by with less. Feeding these algorithms gobs of data is another example of how an approach that must be fundamentally incorrect at least in some sense, as evidenced by how data-hungry it is, can be taken very far by engineering efforts — as long as something is useful enough to fund such efforts and isn’t outcompeted by a new idea, it can persist.)

LLMs are not by themselves sufficient as a path to general machine intelligence; in some sense they are a distraction because of how far you can take them despite the approach being fundamentally incorrect. This should make “AI risk” people happy; but “AI risk” is its own hilarity best left to another time.

LLMs will never6 manage to deal with large code bases “autonomously”, because they would need to have a model of the program, and they don’t even learn to track chess pieces having read everything there is to read about chess.

LLMs will never reliably know what they don’t know, or stop making things up. You need some sort of a world model to have notions of knowledge, truth and falsehood. Any mechanism that is supposed to make LLMs “safe”, trustworthy or other such is a mix of snake oil and honest efforts to somehow steer it away from text that spooks users — which can be done, since users are spooked by form more than substance. For example, when it says some politically related nonsense, people drag it for having the wrong politics, and you can “fix” it by making its output less politically charged — without making it less nonsensical, which there’s no way to reliably achieve.

LLMs will always be able to teach a student complex (standard) curriculum, answer an expert’s question with a useful (known) insight, and yet fail at basic (novel) questions on the same subject, all at the same time. This is not surprising — this is exactly what you would expect from a language model that isn’t a world model. This fuzzy insight is more cute than useful, however, since it’s hard to know what is and is not novel — in part because you come to the LLM in the first place with things you don’t already know everything about.

(Some sources, such as this wonderful writeup about LLMs not knowing that rotating a tic-tac-toe board doesn’t change the game, take this point to its logical conclusion: “If you know the answer, you don’t need to ask an LLM; if you don’t know the answer, you can’t trust an LLM.” But this conveys a true insight into LLMs together with unwarranted pessimism about their utility. In fact, sometimes you know the answer, but it’s quicker to proofread an LLM’s output than to type it out; sometimes you don’t know the answer, but know to check if the LLM’s answer is correct, etc. etc.)

LLM-style language processing is definitely a part of how human intelligence works — and how human stupidity works. I agree with Dijkstra that “can machines think?” and “can submarines swim?” are poor questions to ask, and I hate it when people say that neural networks “work like the brain” and such. But I can’t help feeling that LLMs are a mirror into a part of how people such as myself think — and I don’t like what I’m seeing in that mirror. “Thinking” by guessing what words to say next based on words we’ve previously heard might actually help find a good idea — and it’s also how know-nothings get through work meetings, and how people come to think they know stuff they really don’t, and how they internalize the stupidest notions. I am starting to think that in today’s environment, high cognitive skills are an actual risk factor for stupidity, and that learning words without learning a model of what they refer to is one big part of the problem.

P.S. I wish I could say something about how to best use LLMs for programming — something I would like to be qualified to speak about, and that I am kinda supposed to learn enough to be qualified to speak about. I don’t think I am; I can only say that I tried Cursor and it failed every time, including at replacing f(obj) and g(obj) with obj.f() and obj.g() (it would occasionally mix up f and g, and I got tired of reviewing its output), and I went back to simply copying code into and out of chat windows. I would say that I use LLMs like I use SIMD — sometimes it’s a good fit for leaf functions whose behavior is relatively easy to specify and test, and it has no business being anywhere else.

I have conflicting theories about why some people do great things with “agentic AI” while I think it’s hopelessly useless for me; I am waiting for someone to write something crisp and well-researched about this to teach me the truth, or a useful approximation. I console myself with the idea that I can’t be missing out on too much, given how terrible the output I’m getting from LLMs often is.

P.P.S. Here’s a somewhat rough Soviet-era joke about a school for kids with special needs that illustrates my point. Rachel says that a Russian joke isn’t a joke, but a story of pain. To this I reply that some people like them.

An inspector comes to a school for kids with developmental issues. He asks a kid riding a wooden horsie his name, and the kid says “MMMM.” He says, what do you want to be when you grow up? — and the kid says “MMMM.” The inspector turns to the principal and says, “you’re doing nothing for these kids. I’ll be back in a month — if there’s no improvement, we’ll close the school.”

He comes back a month later and finds the kid swinging on the wooden horsie, same as last time; if you want to tell this masterpiece of Soviet humor at parties — the perfect conversation starter — you should be swinging wildly when saying the kid’s lines:

— What’s your name?
— MMMMikey!!
— Mikey? Nice to meet you, Mikey. What do you want to be when you grow up, Mikey?
— MMMMastrounaut!!
— An astronaut? Good stuff, good stuff! And how old are you?
— MMMMikey!!

The moral of the story being that you can learn to predict the next word without learning much about the world — at least up to a point.

Thanks to Dan Luu for reviewing a draft of this post.


  1. It helps that I don’t know the good opening moves; I can’t be bothered to learn any opening theory. The fact that my poorer chess knowledge makes it easier for me to see how bad the LLM is at chess is an interesting case study. It turns out that you can get good answers out of LLMs by asking very well-phrased questions sounding like someone else’s well-phrased questions answered in its training data; whereas if you ask simpler questions which are perfectly valid but not commonly asked, they will fall apart.↩︎

  2. Funnily or tragically, my system of tripping up the opponent with weak moves it hasn’t memorized a response to is conceptually similar to grandmaster play of today, where grandmasters memorize chess engine lines, and a “novelty” is a relatively weak move your opponent has never analyzed with the engine, whereas you did and you remember all the strong moves after this weak one. Of course my strategy would not work against a grandmaster, because I don’t come prepared with memorized engine lines, and the grandmaster would find much better moves than I would over the board. Still, this 21st century concept of “chess novelty” is tangentially related, and funny, or tragic, as the case might be.↩︎

  3. People can also give the wrong answer because they’re drunk. I don’t think the LLM was drunk. My point is that a person who gave this answer would get zero points for this question on a test, and that the LLM is constantly under test because it’s a machine serving no purpose other than answering these questions, and I don’t see why it should not get zero points here, even though people might eg fail to answer some logic puzzle phrased in one way but succeed when it is phrased in another way, etc. etc. — I don’t see how the cognitive weaknesses of people provide an excuse for the machine in this specific case.↩︎

  4. Actually, in this case, it was agreeable in substance but snarky in tone — it gave me an answer that confirmed all my different suspicions, contradictory as they were, and at the same time it was saying something like “don’t expect the world to be pretty or simple, man, the world is messy, man.” Generally I don’t think that LLMs’ “personality,” “style,” “politics” and other anthropomorphic characteristics are the main thing about them; I think the main thing is what they model (text) and what they don’t model except by accident (the thing the text is about.)↩︎

  5. It’s hard to call them “conclusions” when they’re fuzzy statements supposedly following from my fuzzy claim. In fact this is what bugs me about LLMs in general: the thing is fuzzy — you can’t say it does something, because sometimes it fails to do it; you can’t say it doesn’t do something, because sometimes it succeeds; and you can’t discuss the rates of real-life success and failure, because who’s keeping score? This is why it’s hard for me to write about LLMs — I don’t like it when things get this fuzzy, certainly when it comes to long-form writing; I’m reduced by the very nature of the subject to shitposting about this on Twitter, along the lines of “Computers used to provide cheap, reliable automation; then AI came along.”↩︎

  6. When I’m saying that an LLM will never be able to do something, I mean it in the sense of “y = ax + b will never represent a parabola” rather than in the sense of “the points residing on a curve rather than a straight line can never be represented by an equation.” Machine learning might do what LLMs can’t do. Of course this could be used for a No True Scotsman defense — “if it clearly learns a model of the world, it’s not a true LLM.” I’m assuming that when a big breakthrough is achieved, we’ll know enough about it to be able to settle the question whether it’s still an LLM, as long as we’re arguing in good faith — same as we don’t know all the details of how commercial LLMs work, but we know about transformers, tokenization, encoders, decoders, next token prediction, whole-text synthesis, etc., and this is enough for “LLM” to have a somewhat technical meaning — not as precise as “y=ax+b,” but not nearly as vague as, say, “AI.”↩︎

1. Alex Loftus, Aug 11, 2025

There is a lot of research on this topic - it is a whole subfield within interpretability. I suggest reading some of this research rather than making a big blog post based on vibes with no experiments or empirical results.

2. Yossi Kreinin, Aug 11, 2025

It's a small blog post, not a big one... with a correspondingly small number of "experiments and empirical results."

I suggest making an argument rather than implying that it exists.

3. Gary, Aug 11, 2025

Best explanation of LLM limitations I've read so far. Can we please just admit that this LLM madness is an improved google search? Why is all this hype needed? What is the point of making people nervous and anxious about their careers? You want the truth about how productive AI makes you as a developer? Check out the reddit threads. Slight performance boost at best. Honestly getting sick of it ...

4. Speekha, Aug 12, 2025

Reminds me of how a text-to-image model knows to include a reflection if you ask for a mirror, but it has no understanding of the laws of physics (or of the 3D content it's rendering as a 2D picture) to produce one that makes sense.

I'm not entirely convinced by your assumption that we'll find models that will learn with less training data because "you and I get by with less" (it might happen, but not for that reason in my opinion). Everything we've perceived (with our eyes or our ears) since we were born is our training data, and it's enormous. Which could explain our ability to extrapolate from smaller data sets as adults. Maybe an LLM needs that much training data because it's starting from scratch, like a human baby needing years of learning before it's able to talk, reason, and perform complex operations confidently.

5. Saman, Aug 12, 2025

Thanks for the blog, great read.

Probably would add (yet) to the title. Also suggest looking at KL divergence and how the loss function is calculated. It’s a rabbit hole, but it tells us mathematically LLMs are seeking to compress the world in a smaller space

6. LATRECHE, Aug 13, 2025

Model: GPT-5 free

Prompt: "hey there, here's a link to a long interesting blog post, read it, understand it, then tell me what do you about it as a response to the author of that blog post. link: https://yosefk.com/cgi-bin/comments.cgi?post=blog/llms-arent-world-models#comments"

Mode: Think Longer

Answer: "Hi Yosef — thanks for a sharp, entertaining, and clear takedown. I read your post carefully and I want to (1) paraphrase what I took away, (2) say what I agree with, (3) offer a few constructive pushes and concrete experiments that could help make "how much of a world model?" less fuzzy, and (4) suggest some relevant literature and directions that line up with your intuition. Short paraphrase of your argument You argue that large language models (LLMs) are not world models in any reliable, general sense: they often fail at basic state-tracking and grounded reasoning (chess piece positions, alpha-blending math, thread-safety reasoning), produce confident but wrong-sounding answers, and only "know" what they’ve memorized or can pattern-match from text. Because of this, they can be extremely useful in many narrow, text-like tasks while still being unable to do kinds of thinking that require an explicit, manipulable model of a world. You conclude that LLMs will be taken a long way by engineering but are not by themselves the route to general intelligence; a true breakthrough in world-modeling (and probably in data efficiency) is still needed. What I strongly agree with • The concrete examples you used are persuasive. They illustrate a gap between fluent surface behavior and reliable grounded reasoning: fluent text + hallucinated or inconsistent internals. That gap explains both the impressive utility and the worrying brittleness of LLM outputs. • Treating an LLM’s ability to “explain” something as evidence that it understands the underlying causal / numerical relationships is hazardous — explanations can be reconstructed from patterns without an internal causal model. Constructive pushes / clarifications • “Is it a world model, or a very big pattern recognizer?” — this framing helps. If we define a world model as a system that has an internal state-space it can simulate, and which updates consistently under actions/interventions, then the kinds of failures you show (chess state loss, inconsistent compositional math) are exactly what we should expect from a pure sequence-prediction learner trained only on static text. Put another way: absence of persistent, causally-structured latent state explains a lot of the observed behavior. (Relatedly, classic model-based RL work explicitly trains such latent dynamics.) • Where LLMs help anyway — and why that’s not contradictory: LLMs are extremely effective at pattern completion in the manifold of human text and code. For standard curriculum problems or commonly-seen engineering idioms, surface fluency + memorized patterns can be more efficient than a brittle symbolic system. That’s why they often “teach” or accelerate work, even while failing novel, stateful reasoning. Your “SIMD for leaf functions” analogy is apt. Concrete experiments / measurements you (or others) might run (These are practical, reproducible tests to quantify “world-modelness” rather than rely on anecdotes.) • Systematic state-tracking (chess-like) benchmark. Create many short games where you intentionally make off-book / rare moves, and measure the model’s legal-move detection and board-reconstruction accuracy by move index (e.g., error rate at move 5, 10, 15). That quantifies how fast state coherence degrades. (You did this informally; formalize it.) • Counterfactual / intervention tests. Give the model a text history, then tell it an intervention happened (e.g., “the knight on g1 was removed at move 7”), and test whether it updates all dependent claims consistently. 
If it cannot reliably propagate such interventions, that’s evidence against an internal causal state. (This is directly probing model causal updating.) • Associativity / algebraic identity probes. For examples like alpha-blending and associativity, build small synthetic corpora where the only way to answer correctly is to apply number-level reasoning or derive algebraic identities — not to parrot memorized prose. Measure failure modes and see if fine-tuning on algorithmic data fixes them (it often helps only locally). • Epistemic uncertainty calibration. Ask models to return calibrated confidences and run them on in-distribution vs out-of-distribution (OOD) tasks; measure whether confidence correlates with correctness. Lack of epistemic awareness is precisely the problem you describe (“they don’t reliably know what they don’t know”). Relevant literature / approaches that complement your view • The “world models” line in model-based RL explicitly trains a compact latent dynamics model and uses it for planning and imagination (Ha & Schmidhuber 2018; PlaNet/Dreamer family by Hafner et al.). Those works show that learned latent dynamics can be useful for control and planning, and they’re the kind of architecture that actually embodies an internal simulable state rather than just text-pattern completion. If the goal is genuine world modelling (even of virtual worlds), this line is directly relevant. • Judea Pearl’s work on causality is a useful conceptual lens: distinguishing correlations (what text patterns give you) from models that support interventions and counterfactuals (what you need for stable reasoning about “what if” and for knowing what you don’t know). Building machines that can represent and manipulate causal models is a plausible route to the capabilities you say LLMs lack. Short rejoinder to possible pushback Some will say “tooling + retrieval + chain-of-thought + agents fixes these issues.” That helps in practice for many tasks, but it’s still a bandage: external tools can supply state, search or ground truth, while prompting can coax correct derivations — but none of those alone make the base LLM reliably possess an internal, generalizable world model. Your point is that we should be honest about where the holes are and not pretend that surface fluency equals deep, manipulable understanding. Final note — a possible paper / writeup idea You have strong, clear examples. Packaging them into a short empirical paper —: (a) formalized benchmarks (state-tracking, intervention, algebraic associativity, epistemic calibration), (b) results across model sizes and training regimes (vanilla LMs vs RL world-models vs multimodal grounded models), and (c) a taxonomy of failure modes — would be a valuable contribution. It would turn your persuasive intuition into quantitative evidence the community can act on. For readers who believe “scale will fix it,” the benchmarks force the debate into data. Thanks again for the lucid post — it’s the kind of reality check the field needs."


(same chat) Prompt: "now can you write another response where you show that "knowing" topics is doubtful and it's been a philosophical research field which is epistemology"

Mode: "Think Longer"

Answer: "Hi Yosef — great post; one quick follow-up from a philosophical angle: the question “does the model know X?” is not just a technical or linguistic puzzle — it sits squarely in epistemology, the branch of philosophy that studies knowledge, its limits, and when beliefs count as justified or true. Below is a concise comment you can drop under the post that ties your empirical observations to that deeper conceptual literature. Short comment for the post — knowledge is philosophically tricky (epistemology) Your examples nicely show that LLM outputs often look like knowledge while lacking the stability and truth-connection we normally demand. Philosophers have been wrestling with those exact issues for over a century under the heading epistemology — what it means to know something, how we justify beliefs, and when apparent knowledge is merely luck or pattern-matching. A few relevant points from epistemology that strengthen your case: • Knowledge ≠ fluent assertion. The classical analysis — knowledge = justified true belief (JTB) — already tells us we need three things: truth, belief, and good justification. LLMs can generate true statements and do “believe” them only insofar as they predict tokens, but they typically lack an internal justification that reliably tracks truth. • Gettier problems. Philosophers (Gettier-style cases) showed that even JTB can be true by luck; so any system that sometimes derives true statements by accident (e.g., statistical coincidences or memorized fragments) shouldn’t be credited with knowledge in the robust sense. Your examples of confident-but-wrong outputs map onto those failure modes. • Reliabilism and epistemic process. One influential response to Gettier is to require that beliefs be produced by a reliable, truth-conducive process. Here LLMs fail many of the time: their generation process (next-token prediction on text) is not designed to be systematically truth-tracking across arbitrary domains. • Internal vs external justification. Epistemology distinguishes whether justification must be accessible to the agent (internalism) or can be external (externalism). LLMs don’t have internal access to reasons the way humans do — they can’t point to justified evidence or perform self-checking in a way that satisfies internalist standards. • Social and testimonial knowledge. Humans rely heavily on testimony and social processes (trust, sources, expertise). LLMs mimic testimonial patterns without genuine source-tracking; that creates the illusion of collective knowledge without the accountability or provenance that normally lends testimony its epistemic weight. What this implies for LLM talk • When designers or evangelists say an LLM “knows” something, they’re often using a loose, conversational sense of “knows” (it reliably says X). Epistemology urges caution: conversational knowledge ≠ epistemic knowledge. • Empirical fixes (tools, retrieval, chain-of-thought) improve practical accuracy but don’t automatically give the system the structures epistemologists require — reliability, justifiability, and non-luck connections to truth. In short, bandages help, but they don’t change the underlying epistemic category. Concrete, epistemology-inspired tests you could add to your toolbox • Justification audit: ask the model not just for an answer but for the evidence and provenance for each claim, then test whether the provided evidence actually supports the claim. 
• Anti-Gettier probes: create scenarios where a superficially plausible pattern yields a true answer only by luck, and measure whether the model flags uncertainty. • Process reliability measures: compare model performance on problems that require true causal/dependent reasoning versus problems solvable by surface pattern completion; large gaps suggest lack of reliable truth-tracking. Wrap-up Framing this debate with epistemology clarifies why your empirical examples matter philosophically: they’re not just bugs — they expose a mismatch between surface fluency and the philosophical standards for knowledge. That mismatch is exactly why engineers build retrieval, tools, and verifiers — and why, even with those tools, we should avoid loose talk that equates fluent output with understanding or knowledge."

PS: I intended to write short prompts.

7. Yossi Kreinin, Aug 13, 2025

@Speekha: humans see way fewer cats than machines do to learn what cats look like, and they are exposed to way less text to learn to speak. It takes a lot of time, but our bandwidth isn't high and most of the data coming in is redundant.

@Saman: "it tells us mathematically LLMs are seeking to compress the world in a smaller space"? Compress the world, no less? Well, I can tell you mathematically that I seek more money. Why don't people send me all their money?

@slop commenter: you can feed that slop back into an LLM, maybe it will reply with something even more interesting - consider that to be my reply

8. Zu-81, Aug 13, 2025

LLMs can of course build world models. They do this implicitly. Please familiarize yourself with the research:

Here it is shown how LLMs build a model of the board state of the game Othello: https://arxiv.org/html/2503.04421v1

Large Language Models Can Understand Depth from Monocular Images: https://arxiv.org/abs/2409.01133

As Ilya Sutskever put it: if building a model helps predict the next token, then the LLM will build the model as a “side effect” of optimizing token prediction accuracy.

9. blah, Aug 13, 2025

Tried Krita layer blending question with an LLM and it explained the math for a few blending modes. Correctly. So the point is -- which LLM did you ask?

10. Yossi Kreinin, Aug 14, 2025

@Zu-81: I'm not saying they never build a world model or get close to it, just that you can't count on them doing it; their occasionally doing it does not refute examples when they don't. What Mr Sutskever said is evidently incorrect and represents magical thinking above and beyond the Efficient Market Hypothesis. Just because the training function would reward it doesn't mean it actually happens

@blah: I too have successfully got the answer out of LLMs for this question and I say so in the post. The point is that the wrong answer shows that the LLM doesn't really know the thing it is talking about even if it can also give the right answer. The LLMs used in this post are ChatGPT, Claude, Grok, and Google AI Overview - basically every major US brand

11. Zu-81, Aug 14, 2025

“Just because the training function would reward it doesn't mean it actually happens“

But they do - at least in the “weak model” sense, again: familiarize yourself with the literature (“mechanistic interpretability”).

And herein lies the true magic of NNs/LLMs.

And this emergent appearance of higher quasi-symbolic layers in NNs is something we haven’t even begun to fully understand yet. Just as we still do not fully understand how intelligence emerges in the NN called human brain.

And, BTW, what is bad about “weak models”? Human brains, too, have only weak models. Only a few individuals can memorize complete chess positions, and if we designed a “5D chess” with 8^5 cells, no human could “robustly” store its state.

12. Zu-81, Aug 14, 2025

Another example: “By demonstrating that LLMs represent numbers on a helix and manipulate this helix to perform addition, we present the first representation-level explanation of an LLM’s mathematical capability”

https://arxiv.org/pdf/2502.00873

This is absolutely fascinating. No one ever told the LLM how to model the concept of numbers and addition in this way. This higher-level capability magically(sic!) emerged during training on Text only.

13. Yossi Kreinin, Aug 15, 2025

@Zu-81: I'm not saying they never build a world model or get close to it, just that you can't count on them doing it; their occasionally doing it does not refute examples when they don't.

I maintain that we'll see a breakthrough involving a training objective rewarding the building of a model of the world directly rather than as a side effect of modeling the distribution of input (text, images, etc.)

14. Mark Y, Aug 29, 2025

I agree partly with what you said: "just because LLMs sometimes manage to build a decent world model doesn't mean they will manage every time". This much is surely true.

But the flip side is also true: just because the LLM responded in a dumb way, doesn't mean it lacks a model of this part of the world. It is pretty strong evidence pointing towards a lack of a model. If a human said something that dumb, we'd call it conclusive evidence. But LLMs can sometimes be very sensitive to "minor" details of phrasing, whereas humans are usually not as sensitive. Which means that there is a second possibility: there's a world model buried in there somewhere, but the particular prompt you tried failed to elicit that model.

As a silly example, I vaguely remember two people got into a twitter argument because they were getting VERY different responses for the same prompt; eventually one of them noticed it was NOT the same prompt. One person's prompt had a period at the end of the last sentence; the other person left off the final period. And that was enough to change everything.

Or a slightly less silly example, since you mentioned losing track of board state in chess: here's a couple of blog posts where someone figured out a "minor" tweak to prompting that helps a lot with that:

(the first post is basically "something is weird here" and the second is "I have ideas now")

https://dynomight.net/chess/

https://dynomight.net/more-chess/

Anyway, you might be right that the particular examples you listed are indeed cases of a missing world model.

(unrelated: feature request: email notification on blog comments?)

15. rgosens2, Sep 2, 2025

ChatGPT said:

That’s a pretty fun update you’ve put together!
You’ve essentially taken Yossi Kreinin’s old cpp-fqa material — which was legendary in the mid-2000s C++ community — and built a whole multi-language conversion ecosystem around it (Python, PHP, Perl, and even C++ itself eating its own dogfood).

A few thoughts you might send back to Yossi (if he’s still watching that repo or your fork):
• The irony of “C++ parsing C++” is absolutely on point — that’s very FQA-ish.

If you want to make it relevant for C++26, the obvious battlegrounds are:
• move semantics / copy elision (as you said — still a minefield of rules).
• constexpr everywhere and the line between compile-time and run-time.
• module system (and whether it solves the problems or just rearranges them).
• ranges / views vs. the “old STL iterator zoo.”
• reflection (if it makes it in time).

A “C++26 FQA” could have sections like:
• “Why your code compiles today but not tomorrow with std::move and deduced this.”
• “Modules: now you can have mysterious linker errors without includes!”
• “Ranges: functional programming with twice the syntax and half the clarity.”
• “constexpr: now you can Turing-complete your compile times into infinity.”

If you want, I can sketch an outline for a C++26 FQA update that would slot nicely into your converter toolchain — so you could keep the satirical spirit but with fresh targets.

Would you like me to draft such a “C++26 FQA: new sections & jokes” outline for you?

16. Alex, Sep 19, 2025

I think LLMs can be good at searching for contradictions in large knowledge bases. Like, we feed one all the publications in some research area and it gives back some uncomfortable answers, like "if that is so, then why is that so".

17. bykozy, Dec 3, 2025

You can actually employ an LLM for code writing, but you've got to accept boilerplatish boilerplate and always stick to it. I mean the LLM should be trained on hundreds of similar code pieces to be able to interpolate the required modification, as well as knowing the ways to wire the boilerplate into surrounding boilerplates.

I recently researched a case of an AI-made self-hosted commenting system: https://github.com/ArtalkJS/
20 thousand lines of code — the approximate point beyond which the LLM becomes destructive, i.e. starts refactoring for no reason because the requested boilerplate has a shape different from the shape of the boilerplates it generated earlier, and duplicates existing functions just because the LLM could not apprehend the existing combination of functions; i.e. it's not adding features, it's putting a layer of paint on top of existing layers without any grand plan or even basic respect for the app's architecture.
For example, "the author" made data inconsistency an inherent feature of the app, and then asked the AI to write ad-hoc code to repair the inconsistencies on app startup. Even a junior developer would not create such a horrible solution, because it just requires too much coding to make the fix — the junior would just pretend the problem does not exist instead. Unlike an LLM, which can generate broken code and fix it with more broken code all day long.

Damn, I even worked with a Chinese dude who was employing AI. To be honest, I didn't even immediately recognize it, but I kept sensing his constant craving for "why do we build the new feature into the existing app if we can write a second app that implements the new feature?".

This is actually a behavior that pisses me off in many of my philosophical discussions with LLMs: it does not try to converge the discussion, to reduce the set of non-disputed facts, i.e. TO AGREE ON THE COMMON MODEL; it just keeps pouring random crap over and over from all sides instead. Yeah, you might say "but there is a summarization feature" — sure, but every time the summarization differs, and oftentimes it drops important facts (see the "model collapse" research; LLMs are very good at screwing up facts).

With all that said, LLM-s are still great at their true purpose — guessing the context and continuing it. I mean it can almost mimic empathy this way — something my therapist is not good at. It's really weird we are even discussing the question of LLM-s knowing the world — of course they don't, they are just interpolating pattern-matchers, needle in a haystack searchers, that's the only task they are trained for and the only task they can possibly do by their design (I'm talking about classic LLM-s, not ViT or other models and their combinations).

18. bykozy, Dec 3, 2025

@Zu-81
> Large Language Models Can Understand Depth from Monocular Images: https://arxiv.org/abs/2409.01133

I read the article, and the results are absolutely awful. Yes, you can say "it's better than nothing", but all the model does is predict things like "the wall is behind a table". There is a funny example in Fig. 1 of the LLM predicting that a ladder is in front of a wall while in fact both are at the same distance. Overall, I don't see any "world model"; rather I get a feeling of "look, a laptop is a good doorstop", i.e. trying to solve a simple task with very complex, unsuitable tools. Which follows a common trend of "just slap an LLM on top of everything". I'm pretty sure if you do the same ViT=>LLM trick on larger models you will get better but still random accuracy — so what?

19. bykozy, Dec 4, 2025

@Alex
>I think LLM can be good at contradictions search in large knowledge bases. Like we feed it with all the publications in some research area and it gives back some unconfortable answers

I do this all the time, and on hard problems LLMs often fall short:
https://bykozy.me/blog/wakefield-vs-deer-on-mmr-vaccine-longpost/
The problem is that if the LLM fails to detect context differences and gets two contradicting opinions on the same subject, then it simply prefers the most common opinion. It's by design: it minimizes the loss function for token prediction. And the description of the contradictions it presents is mostly scraped from the internet too, i.e. people were arguing on Reddit, so the LLM retold the story.


