The cake is a lie (and maybe a cult)
Sam Altman, the CEO of OpenAI, has been known to say some rather strange things. ("The most successful founders do not set out to create companies. They are on a mission to create something closer to a religion.")
Well, he's truly outdone himself in his most recent post on his personal blog:
"We are past the event horizon; the takeoff has started. Humanity is close to building digital superintelligence, and at least so far it’s much less weird than it seems like it should be."
Dear reader, we are nowhere near artificial general intelligence, never mind superintelligence. This post will delight in the receipts, because Altman's audacious claims happen to coincide perfectly with the release of a research paper from Apple.
Apple's research team stress-tested “reasoning” in LLMs, including Claude 3.7 Sonnet, DeepSeek's R1/V3 systems, and OpenAI's o3-mini. Using a series of puzzles, the researchers assessed the models not just on their final output but on the steps each took to reach it.
The puzzles used included:
- Tower of Hanoi: A puzzle involving moving a stack of different-sized discs between three pegs, one disc at a time, never placing a larger disc on a smaller one (see the sketch after this list).
- Checker Jumping: A puzzle requiring the elimination of pieces by jumping over them.
- River Crossing: A classic logic puzzle where individuals or objects must cross a river under constraints, often involving a limited-capacity boat.
- Blocks World: A task that involves stacking blocks in a particular order.
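To get a feel for how complexity gets dialed up, here's a quick sketch of the textbook recursive Tower of Hanoi solution in Python. This is my illustration, not the paper's test harness: the point is simply that n discs require 2^n − 1 moves, so each added disc roughly doubles the length of the answer a model has to produce.

```python
# A quick sketch (not from the Apple paper): the textbook recursive solution
# to Tower of Hanoi. Moving n discs takes 2**n - 1 moves, so each added disc
# roughly doubles the length of the solution a model is asked to generate.

def hanoi(n, source="A", target="C", spare="B", moves=None):
    """Return the list of (source, target) moves that shifts n discs."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, source, spare, target, moves)   # park the n-1 smaller discs
    moves.append((source, target))               # move the largest disc
    hanoi(n - 1, spare, target, source, moves)   # restack the smaller discs
    return moves

for discs in (3, 5, 8, 10):
    print(f"{discs} discs -> {len(hanoi(discs))} moves")  # 7, 31, 255, 1023
```

That exponential blow-up is the knob the researchers turned: same rules, more discs, vastly longer solutions.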
These puzzles yielded five key findings:
- Accuracy collapse beyond a certain complexity: At a certain point, reasoning doesn't decline gradually; it plummets to zero. This indicates that the models lack generalizable problem-solving abilities for planning tasks: they can't adapt or produce solutions when the structure of a problem falls significantly outside their training data.
- Decreased effort with increased difficulty: We'd expect models to think harder on more complex problems, which would show up as higher token usage. Instead, beyond a certain point of complexity, token usage decreases. Much like a bored middle school student, the models effectively gave up on the cognitive effort.
- Models have three performance zones: They overthought “easy” puzzles, excelled at “medium” puzzles, and collapsed completely on “hard” ones. Capabilities simply don't scale to match complexity.
- Inability to follow explicit instructions: For one test, the researchers handed the models the solution algorithm and asked them to simply execute it. Once again, the darn things channeled their inner middle schooler and couldn't follow the instructions. Instead of carrying out the logical steps procedurally, they kept predicting the next token based on learned patterns (see the sketch after this list).
- Inconsistent “reasoning”: Models could solve puzzles requiring 100 complex moves, then fail another requiring only 5. The researchers concluded that this variability strongly suggests the successes reflect memorization rather than reasoning or problem solving.
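To make the instruction-following finding concrete: executing, or even just checking, the given procedure is a purely mechanical job. Here's a hypothetical validator, not the paper's evaluation code, that replays a proposed Tower of Hanoi move list and enforces the rules. That's roughly the task the models were handed once the algorithm was spelled out for them.

```python
# Hypothetical sketch (not the paper's evaluation code): replaying a proposed
# Tower of Hanoi solution and checking the rules is purely mechanical, which is
# roughly what the models were asked to do once the algorithm was given to them.

def is_valid_solution(num_discs, moves):
    """Replay (source, target) moves from peg A and verify the stack ends on C."""
    pegs = {"A": list(range(num_discs, 0, -1)), "B": [], "C": []}
    for source, target in moves:
        if not pegs[source]:
            return False                              # moving from an empty peg
        disc = pegs[source][-1]
        if pegs[target] and pegs[target][-1] < disc:
            return False                              # larger disc onto a smaller one
        pegs[target].append(pegs[source].pop())
    return pegs["C"] == list(range(num_discs, 0, -1))  # all discs on C, in order

print(is_valid_solution(2, [("A", "B"), ("A", "C"), ("B", "C")]))  # True
print(is_valid_solution(2, [("A", "C"), ("A", "C")]))              # False (illegal move)
```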
This matters because if you're steering your business toward an AI-first model, you may want to know that the Emperor has no clothes. Klarna replaced 700 workers with AI and was rewarded with a $40B failure. Users turned on Duolingo after its AI pivot faster than a green owl flaps its wings. And the latest models are showing us that the smarter they get, the more they hallucinate.
Maybe this is our cue to hedge our bets. Maybe we shouldn't go all in on the new Metaverse until they've got the legs figured out. Altman said it himself: some business founders don't want to build a great product. They want to build religions.