Apple’s AI Reasoning Study: The Pushback Begins

Read Part 1 here

Last week, Apple made a bold claim: today's smartest AI models don't actually "think"; they just pretend really well. Their research paper, The Illusion of Thinking, showed that when you throw difficult logic puzzles at these chatbots, they fall apart. But the story doesn't end there.

Over the weekend, the backlash began.

Researchers, developers, and AI companies have started picking apart Apple’s study—and some of the counterpoints are pretty convincing. So let’s dive into what’s happening now, what people are saying, and why it all matters.


Quick Reminder: What Did Apple Say Again?

Apple tested some of the top AI models—like Claude 3.7, GPT-4o, and Google’s Gemini—on logic puzzles like the Tower of Hanoi and River Crossing.

Their conclusion?

  • Easy problems? Basic models did better.
  • Medium difficulty? The “reasoning” models pulled ahead.
  • Hard puzzles? Everyone failed. Even when given help.

According to Apple, this proves that these models aren’t actually reasoning—they’re just mimicking patterns they’ve seen before.


Not So Fast: Lawsen and Anthropic Respond

The most detailed response so far comes from Alex Lawsen, who (in a cheeky touch) credits Anthropic's Claude Opus model as co-author of the rebuttal. Their argument: Apple's test setup was flawed, and the models might be smarter than Apple gave them credit for.

Here’s what they pointed out:

1. Token Limits Are a Real Thing

Apple asked the models to solve puzzles step by step, writing out every move in detail. But the bigger the puzzle, the more text that takes: for the Tower of Hanoi, the number of moves roughly doubles with every extra disk, so the models eventually just ran out of space to keep going. That's not a failure of reasoning; it's like trying to explain something but being told you can only use one page.
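To get a feel for the scale problem, here's a rough back-of-envelope sketch. The tokens-per-move cost and the output budget below are my own illustrative assumptions, not figures from Apple's paper; the only hard fact is that an n-disk Tower of Hanoi needs 2^n − 1 moves.

```python
# Back-of-envelope sketch: how fast a full Tower of Hanoi move list
# outgrows a fixed output budget. Both constants are assumptions for
# illustration, not figures from Apple's paper.

TOKENS_PER_MOVE = 10      # assumed cost of one line like "move disk 3 from A to C"
OUTPUT_BUDGET = 64_000    # assumed maximum output tokens for a single reply

for n in range(5, 21):
    moves = 2 ** n - 1                    # minimum number of moves for n disks
    tokens = moves * TOKENS_PER_MOVE      # tokens needed to write them all out
    verdict = "fits" if tokens <= OUTPUT_BUDGET else "exceeds budget"
    print(f"{n:2d} disks: {moves:>9,} moves ~ {tokens:>10,} tokens -> {verdict}")
```

On those assumptions, the complete move list stops fitting somewhere around 12 or 13 disks, no matter how capable the model is.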

2. Some Puzzles Were Impossible

In the River Crossing test, Apple included puzzle setups that literally couldn’t be solved. But when the models correctly said, “This can’t be done,” Apple still marked them as wrong.
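If you want to see why "impossible" really means impossible here, a brute-force search settles it. The sketch below is hypothetical (not Apple's harness) and assumes the standard jealous-husbands-style rules the benchmark's River Crossing is usually described with: N actors, N agents, an actor can't be around another actor's agent unless their own agent is present, and a boat that carries at most a few people and never crosses empty.

```python
from collections import deque
from itertools import combinations

# Hypothetical sketch (not Apple's harness): exhaustive search over a
# jealous-husbands-style River Crossing with n actor/agent pairs and a
# boat holding up to `capacity` people. If the search runs out of states
# without reaching the goal, the instance is provably unsolvable.

def safe(group):
    """A bank or boatload is safe if every actor present has their own
    agent present, or no agents are present at all."""
    actors = {i for kind, i in group if kind == "actor"}
    agents = {i for kind, i in group if kind == "agent"}
    return not agents or actors <= agents

def solvable(n_pairs, capacity):
    people = frozenset((kind, i) for kind in ("actor", "agent") for i in range(n_pairs))
    start = (people, "left")              # everyone and the boat start on the left bank
    seen, queue = {start}, deque([start])
    while queue:
        left, boat = queue.popleft()
        if not left:                      # goal: left bank empty
            return True
        here = left if boat == "left" else people - left
        for size in range(1, capacity + 1):
            for movers in combinations(here, size):
                movers = frozenset(movers)
                new_left = left - movers if boat == "left" else left | movers
                if safe(movers) and safe(new_left) and safe(people - new_left):
                    state = (new_left, "right" if boat == "left" else "left")
                    if state not in seen:
                        seen.add(state)
                        queue.append(state)
    return False

print(solvable(3, 2))   # True: the classic three-pair puzzle has a solution
print(solvable(6, 3))   # False: six pairs with a three-seat boat is impossible
```

With a state space this small (a few thousand states), "no solution found" is a proof rather than a guess, so a model that answers "this can't be done" deserves full marks, not a zero.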

3. Change the Question, Get a Different Answer

When researchers asked the same AI models to solve the Tower of Hanoi by writing a small bit of computer code (instead of listing every move), they passed with flying colours—even on puzzles much harder than the ones Apple used.

So maybe the problem wasn’t the AI—it was the way the test was written.
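For context, the alternative framing looks something like the snippet below: rather than printing tens of thousands of moves, the model only has to produce a dozen lines of code that generate them. This version is my own illustration, not the exact program any model wrote.

```python
# Illustrative version of the "write a program instead" framing:
# a dozen lines that generate the full optimal solution for any number of disks.

def hanoi(n, source="A", target="C", spare="B", moves=None):
    """Return the optimal move list for n disks (2**n - 1 moves)."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, source, spare, target, moves)   # park the n-1 smaller disks on the spare peg
    moves.append((n, source, target))            # move the largest disk to the target
    hanoi(n - 1, spare, target, source, moves)   # stack the smaller disks back on top of it
    return moves

solution = hanoi(15)
print(len(solution))    # 32767 moves, far more than could sensibly be listed as prose
print(solution[:3])     # the first few moves: which disk goes from which peg to which
```

Running hanoi(15) produces all 32,767 moves in milliseconds, a list no chatbot could realistically write out token by token.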

You can read the full response here:
The Illusion of the Illusion of Thinking – Lawsen (with Claude Opus listed as co-author)


The Wider Buzz: This One’s Got Legs

Lawsen’s response has kicked off a proper conversation in the AI world. You’ll find threads on X (Twitter), Reddit, and various AI forums all debating what’s going on.

Here’s what else is happening:

  • Engineers are re-running Apple’s puzzle prompts in different formats to see how much the setup matters.
  • Researchers are calling for better ways to test reasoning, saying puzzles like these don’t really show how AI thinks.
  • More papers and preprints are expected soon—this is the kind of research that tends to snowball.

In short, it’s not just one rebuttal. There’s a growing crowd saying: Apple’s findings aren’t wrong exactly, but they’re based on a narrow, maybe even unfair setup.


So Who’s Right?

To quote Harry Hill’s TV Burp:

“There’s only one way to find out… FIGHT!”

In reality, it’s not that dramatic. This is a normal part of science—one group puts out a bold claim, others test it, challenge it, or expand on it.

But this back-and-forth does matter. It could influence how companies evaluate AI, where research goes next, and even how investors think about what these models can actually do.


The Big Lesson Here?

Just because an AI struggles with a specific test doesn’t mean it’s broken—or dumb. And just because a model sounds clever doesn’t mean it truly understands what it’s doing.

The truth is more complicated, and that’s why this debate is a good thing. It pushes everyone to ask better questions and build better tools.

And let’s be honest—a little rivalry is good for everyone. It keeps the big players on their toes, pushes research forward, and gives us—the people actually using these tools—new features, better performance, and more transparency.


Stay Tuned

This conversation is far from over. Expect more blog posts, academic papers, and spicy Twitter threads in the weeks ahead. And if you’re using AI in your work, here’s the takeaway:

  • Be aware of token limits—they’re real and they matter.
  • Be smart about how you ask a model to solve a problem.
  • And don’t take AI claims (or critiques) at face value—dig into the details.

More soon.