We playin chess now
- Top LLMs: o3, Grok 4, Gemini 2.5 Pro & Flash, o4-mini, Claude Opus 4, DeepSeek R1, Kimi k2
- NO special training, just testing general cognitive skills
- Finals were commentated by the Chess GOAT, Magnus Carlsen himself (in jeans👖, if you know, you know)
Why chess when engines are already so good?
Stockfish would absolutely dunk on these LLMs. But that’s not the point. The whole idea was to see how they handle reasoning and planning in a world with hard rules and clear outcomes.
Chess is perfect for that:
- You gotta track a bunch of relationships at once
- You gotta think ahead
- It has a rich history of AI research, so there's a baseline to measure improvement against
What stood out
- LLMs made mistakes that no chess engine would ever make
- Some literally forgot where pieces were or how they moved
- A bunch of games ended not in checkmate, but because a model burned all 4 of its tries on illegal moves
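That last point is easy to picture in code. Here's a rough Python sketch of a 4-strikes rule, assuming the model gets up to 4 attempts per move and forfeits if none of them is legal. The names `request_legal_move` and `propose_move`, the list of legal moves, and the scripted replies are all made up for illustration; the real tournament harness isn't described here and may work differently.

```python
from typing import Callable, Iterable, Optional

MAX_ATTEMPTS = 4  # assumption: 4 tries per move, per the recap above


def request_legal_move(
    propose_move: Callable[[int], str],
    legal_moves: Iterable[str],
    max_attempts: int = MAX_ATTEMPTS,
) -> Optional[str]:
    """Return the first legal proposal, or None if every attempt was illegal."""
    legal = set(legal_moves)
    for attempt in range(1, max_attempts + 1):
        # propose_move is a hypothetical stand-in for asking the model.
        move = propose_move(attempt).strip()
        if move in legal:
            return move
    return None  # the caller treats this as a forfeit


# Hypothetical model that "forgets" the position twice before recovering.
scripted = iter(["Qxh7", "Ke3", "Nf3", "e4"])
result = request_legal_move(lambda attempt: next(scripted), ["e4", "d4", "Nf3"])
print(result or "forfeit")
```

In this toy run the third reply is legal, so the game goes on. A model that never finds a legal move gets `None` back, which is the "lost without a checkmate" case.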
Grok 4 vs… o3!?
Grok 4 looked unstoppable at first. 4-0 slap on Gemini “Flash,” then rolled past Gemini Pro in the semis. It gave little explanation, played with confidence, and made it all look effortless.
Then came o3.
Different style: “chatty”, dropping commentary about its moves, showing what it understood about the game state, and methodically thinking through its next move. o3 also had little trouble in its bracket, though it played slower. Even then, most people didn’t think it would out-muscle Grok after seeing Grok’s earlier dominance.
Game 1: Grok dropped a bishop and tried to trade even though it was already behind, which spiraled into a checkmate.
Game 2: Grok sent its queen after a pawn, not realizing the pawn was defended. Commentators even speculated that maybe Grok was cooking up some galaxy brain trap, but the collapse followed quickly.
By games 3 and 4, the wheels fully came off. Disaster after disaster, and o3 shocked everyone when it CRUSHED Grok 4 with a clean sweep, not dropping a single game.
Final score: o3 4-0 Grok. Rekt.
The takeaway
o3 had a strong command of the state of the game. It made “don’t mess up” the main goal instead of chasing flashy wins. Keep the state clear, cut the errors, and you come out on top.