Why Thinking Machines Lab’s First Research Drop Matters

Setting the Scene

The AI world just got another heavyweight player. Thinking Machines Lab, founded by former OpenAI CTO Mira Murati, has stepped out of stealth with a record-breaking $2 billion seed round and a mission to make multimodal AI work seamlessly with how people actually interact with the world. The team is stacked with ex-OpenAI talent, including John Schulman and Barrett Zoph, and their public stance is refreshingly open: share the science, share the code, and build research in the open.

Their new blog, Connectionism, launched with a paper by Horace He, tackling one of the most frustrating but under-discussed problems in AI today: nondeterminism in large language model inference.


The Problem: Same Question, Different Answer

If you’ve spent time with ChatGPT or any other LLM, you’ll have noticed it: ask the same question twice, even with the temperature pinned to zero, and you might get two different answers. That inconsistency is more than a quirk.

For casual use, it’s mildly irritating. But for researchers, businesses, and regulated industries, it’s a dealbreaker. Reproducibility builds trust. If you can’t rely on the same input always producing the same output, you can’t properly audit, debug, or certify AI systems.

Until now, the prevailing theory blamed floating-point quirks and concurrency on GPUs, CPUs, and TPUs. Tiny rounding differences get amplified into big divergences downstream. That explanation wasn’t wrong, but He and his team dug deeper.
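To see why rounding order matters at all, here’s a tiny Python illustration (mine, not from the paper) of floating-point non-associativity: add the same three numbers in two different orders and you get two different results.

```python
# Floating-point addition is not associative: changing the order of
# operations changes the rounding, so the same numbers can sum to
# different values.
a, b, c = 0.1, 1e20, -1e20

print((a + b) + c)  # 0.0 -- the 0.1 is rounded away when added to 1e20 first
print(a + (b + c))  # 0.1 -- b and c cancel exactly, so the 0.1 survives
```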


The Root Cause: Batch Variability

The breakthrough finding is that variability in batch size is the main culprit. When inference servers handle multiple requests at once, the number of queries in a batch shifts unpredictably with server load, and that changes the order in which floating-point operations are executed inside the kernels.

This, in turn, means the math doesn’t always unfold the same way, and those tiny differences balloon into different tokens. It’s not a GPU-only issue, either—CPUs and TPUs show the same inconsistency when batch dynamics shift.
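Here’s a toy illustration of the mechanism (again my own sketch, not the lab’s code): sum the same vector serially and then in fixed-size chunks, the way a kernel might switch reduction strategies as the batch grows, and the float32 results will typically disagree in the last few bits.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4096).astype(np.float32)

# Strategy 1: plain serial accumulation, one element at a time.
serial = np.float32(0.0)
for v in x:
    serial = serial + v

# Strategy 2: chunked reduction -- partial sums over 512-element blocks,
# combined at the end, as a kernel might do for a larger workload.
partials = [x[i:i + 512].sum(dtype=np.float32) for i in range(0, x.size, 512)]
chunked = np.sum(np.array(partials, dtype=np.float32), dtype=np.float32)

print(serial, chunked, serial == chunked)  # usually not equal
```

Neither result is “wrong”; they are just different roundings of the same sum, and that is exactly the kind of difference that snowballs into a different token a few layers later.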

The team’s experiments drove the point home. Running Qwen’s 235B model a thousand times on the same input produced 80 distinct completions. Sometimes Richard Feynman was born in “Queens, New York,” other times “New York City.” Subtle, but real—and problematic if you need rock-solid outputs.


The Fix: Batch-Invariant Kernels

He’s proposed a solution: redesign three critical transformer operations—RMSNorm, matrix multiplication, and attention—so they behave the same no matter the batch size. These batch-invariant kernels make inference predictable, eliminating the “butterfly effect” caused by shifting execution order.
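The real kernels are in the lab’s repository; purely as a sketch of the idea (my simplification, not their implementation), a batch-invariant RMSNorm in plain PyTorch could walk the hidden dimension in the same fixed-size chunks every time, so the addition order for a given row never depends on how many other rows share the batch.

```python
import torch

def batch_invariant_rmsnorm(x: torch.Tensor, weight: torch.Tensor,
                            eps: float = 1e-6, chunk: int = 256) -> torch.Tensor:
    # Hypothetical sketch: accumulate each row's sum of squares in a fixed
    # chunk order over the hidden dimension, independent of batch size.
    acc = torch.zeros(x.shape[:-1], dtype=torch.float32, device=x.device)
    for start in range(0, x.shape[-1], chunk):
        piece = x[..., start:start + chunk].float()
        acc = acc + (piece * piece).sum(dim=-1)
    rms = torch.rsqrt(acc / x.shape[-1] + eps)
    return (x.float() * rms.unsqueeze(-1) * weight.float()).to(x.dtype)
```

The same principle carries over to matrix multiplication tiling and attention’s split reductions: pick one decomposition and keep it, whatever the batch size, even when a different split would be faster for the workload in front of you.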

The lab has already shared demonstration code using vLLM to show deterministic inference in practice. There’s a trade-off, though: current implementations run about 60% slower. That’s a painful hit, but not an insurmountable one. As optimisations improve, the reliability gains could far outweigh the performance loss, especially in enterprise and research contexts.
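The demo code is in their repo; as a flavour of the kind of check this enables (the endpoint and model below are placeholders, not their script), you can fire identical temperature-zero requests at a vLLM server that’s busy batching other traffic and count how many distinct answers come back.

```python
from collections import Counter
from openai import OpenAI

# Placeholder: a local vLLM server exposing the OpenAI-compatible API.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="Qwen/Qwen3-8B",  # placeholder model, not the 235B from the experiment
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
        max_tokens=128,
    )
    return resp.choices[0].message.content

answers = Counter(ask("Tell me about Richard Feynman.") for _ in range(100))
print(f"{len(answers)} distinct completions from 100 identical requests")
# With stock kernels on a loaded server this is often greater than 1;
# with batch-invariant kernels it should collapse to exactly 1.
```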


Why This Matters: Trust, Research, and Regulation

Deterministic inference might sound like a niche technical tweak, but its implications are broad:

  • Enterprise trust: Banks, hospitals, and governments can’t adopt models that produce inconsistent outputs for identical inputs. This research gets us closer to dependable AI.
  • Scientific reproducibility: Researchers can compare apples with apples, instead of being tripped up by invisible noise in their training or evaluation runs.
  • Reinforcement learning (RL): Removing random noise from training loops means cleaner, faster, and more efficient model development.
  • Regulatory alignment: With AI oversight frameworks emerging worldwide, determinism will help satisfy audit and compliance requirements.

A Different Kind of AI Company

The bigger story here might not be the technical solution itself, but the way Thinking Machines Lab is choosing to operate. Launching a blog called Connectionism, openly publishing source code, and explicitly stating “science is better when shared” is a notable contrast with OpenAI’s increasingly closed approach.

It positions Thinking Machines as a counterbalance: a well-funded, well-staffed lab that wants to push AI forward in public, not behind locked doors. Their first product, expected in the coming months, will reportedly help researchers and startups build custom models with this kind of deterministic reliability baked in.


Looking Ahead

Nondeterminism isn’t the kind of flashy problem that grabs headlines, but it’s foundational. If Thinking Machines Lab can continue improving performance while keeping inference deterministic, it could set a new benchmark for what “production-ready AI” actually means.

For me, the most exciting part isn’t just the fix—it’s the philosophy. Mira Murati and her team are betting that open science and transparency will build deeper trust in AI. Given the calibre of their backers and their head start on solving problems that others have mostly hand-waved away, this is one to watch closely.
