What is a world model?
Field note · The frontier
Published July 2, 2026 · Vita Indarra
Short answer: A world model is AI that doesn't describe the world — it simulates one. Give it a single photograph and it imagines the next moment, then the next, fast enough that you can walk around inside the dream like a video game. Nothing is drawn, stored, or programmed; every frame is generated as you move.
The AI most people know predicts the next word. The video generators that followed predict the next frame — but of a fixed clip: you type a prompt, you get a film, you watch it. A world model is a third thing, and the difference is the whole point: it predicts the next frame in response to your actions. Press forward and it imagines forward. Turn left and it invents what was always going to be on your left. It is not a movie of a place. It is, for as long as the dream holds, a place.
Engineers sometimes call it a game engine with no engine. A video game shows you a world by storing one — level files, 3D models, physics code, all built by hand. A world model stores none of that. There is no map, no geometry, no objects. There is only a model that has learned, from enormous amounts of video, how the world tends to behave — and imagines yours into existence one frame at a time, as you move through it.
Training is conceptually simple: show a model vast amounts of video, some of it paired with the actions being taken (a player's inputs, a camera's motion), and make it predict what happens next. Do that at scale and the model is forced to learn the things that make "next" predictable — that rooms have consistent layouts, that objects persist, that walking forward brings the far wall closer. At run time you flip the loop around: current frame plus your control input goes in, the imagined next frame comes out, and it repeats many times a second.
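The run-time loop described above can be sketched in a few lines of Python. This is a toy under stated assumptions, not how any particular world model is built: `toy_world_model` is a hypothetical stand-in for a learned network, frames are tiny grids of numbers, and the action names are invented for illustration.

```python
from typing import Callable, List

Frame = List[List[int]]  # a tiny grayscale "image": rows of pixel values


def toy_world_model(frame: Frame, action: str) -> Frame:
    """Hypothetical stand-in for a learned model: returns the next frame.

    A real world model is a neural network trained on video. Here we only
    shift pixels so the loop has something deterministic to run.
    """
    if action == "left":
        # Turning left: the scene slides right by one column (wrapping).
        return [row[-1:] + row[:-1] for row in frame]
    if action == "right":
        # Turning right: the scene slides left by one column (wrapping).
        return [row[1:] + row[:1] for row in frame]
    return [row[:] for row in frame]  # any other action: nothing changes


def rollout(
    model: Callable[[Frame, str], Frame], seed: Frame, actions: List[str]
) -> List[Frame]:
    """Autoregressive loop: each predicted frame becomes the next input."""
    frames = [seed]
    for action in actions:
        frames.append(model(frames[-1], action))
    return frames


seed_frame = [[1, 2, 3], [4, 5, 6]]  # stands in for the single seed photo
frames = rollout(toy_world_model, seed_frame, ["left", "left", "right"])
print(frames[-1])  # → [[3, 1, 2], [6, 4, 5]]
```

The point to notice is that `rollout` never consults a map or a stored level. The only inputs at each step are the previous frame and the control input, which is why errors in one frame can carry into the next.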
What holds the world together between frames is something like short-term imagination: the model's memory of what it has already shown you. That memory is finite, and it is only honest to say so up front, because it is also the technology's defining weakness today.
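One way to picture a finite memory is a fixed-size window over recent frames. The sketch below assumes a simple sliding window with a made-up size, `CONTEXT_FRAMES`, and uses text labels in place of images. Real models may manage memory quite differently, so read it as an illustration of the limit rather than a description of any specific system.

```python
from collections import deque

CONTEXT_FRAMES = 4  # hypothetical window size, chosen only for illustration

# deque(maxlen=N) silently discards the oldest item once N items are stored.
memory = deque(maxlen=CONTEXT_FRAMES)

# Labels stand in for the frames the model has shown you, in order.
views = ["desk", "window", "door", "hallway", "stairs", "street"]
for view in views:
    memory.append(view)

print(list(memory))      # → ['door', 'hallway', 'stairs', 'street']
print("desk" in memory)  # → False
```

Once "desk" has left the window, nothing still in memory records what the desk looked like, so a return trip would have to be imagined afresh.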
World models went from a research niche to one of the loudest bets in AI because of what they might mean beyond games. Some of the field's most decorated researchers argue that language alone cannot ground real understanding — that a mind which has never modeled space, objects, and cause-and-effect is reciting the world rather than comprehending it, and that models which learn by predicting the world will understand it more deeply. That's the "spatial intelligence" argument, and it is a serious bet, not a settled fact.
The nearer-term uses are more concrete: robots and AI agents can practice in imagined environments instead of expensive, breakable real ones; a machine can rehearse an action in a simulated copy of a situation before taking it for real; and creative worlds stop needing builders at all — a photograph becomes a walkable set.
This is the part demo reels skip, and the reason we published a field guide instead of a hype piece. The author of The World in the Machine ran an openly released world model on a single consumer graphics card — no lab, no cluster — seeded it with a photo of his own room, and walked in. The first minutes are genuinely uncanny: a coherent, controllable world at video-game frame rates, conjured from one image of a real place.
And then the dream frays, gorgeously and instructively. Look away and look back, and the room has quietly redecorated. Objects morph mid-glance. Walk far enough and you are somewhere that never existed, with no way back — the model's short-term imagination has a horizon, and past it, the world is being invented rather than remembered. Today's open models hold coherence for minutes, not hours. Anyone telling you otherwise is selling something.
The mundane limits (drift, morphing, approximate physics) will improve with scale, and quickly. The deeper catch won't fix itself: a world model is the end of "seeing is believing." We have spent a few years learning to distrust AI images and clips. An explorable fake, a place you can walk through, look around, and examine from any angle, is a new category of persuasion, because inspection is exactly how humans have always separated real from fake. When anyone can author a reality, the question of who authored the one you're looking at stops being philosophical.
A world model is not the same thing as a video generator. A video generator produces a fixed clip you watch; a world model responds to your actions in real time. Interactivity is the line: one is a film, the other is a place.
You can run one yourself: openly released world models now run on a single consumer graphics card at playable frame rates, which is exactly how the book behind this note was written. Expect wonder and drift in equal measure.
Whether world models lead to deeper machine understanding is still open. Some leading researchers believe they will; the argument is that predicting the world grounds understanding in a way language can't. Treat it as a live scientific bet, not a conclusion.
Go deeper
This note is the trailhead of a short, vivid book: what world models unlock, what they break, and a first-person walk through a dreamed copy of a real room — including exactly where the dream frays. The World in the Machine — written by someone who actually ran one, honest about every limit. Live on Amazon.