$4 Proofs at Million-Token Scale

Mistral built an AI that solved 587 of 672 competition math problems. Each proof cost about $4. Older systems charged hundreds of dollars for the same work.

The model is called Leanstral 1.5. It shipped June 30, 2026, under an Apache-2.0 license, and the weights are free to download on Hugging Face.

Here is the part that matters for engineers. It ran one proof across 2.7 million tokens. It survived 22 rounds of context compaction. It never gave up in the middle.

Many AI models lose the thread on long tasks. Leanstral does the opposite. The more you let it think, the more it solves. I think that single trait is the real story here, and it changes how we should think about verifying software.

The Compression Principle

Here is the framework. Small model, big budget, verified truth. Leanstral proves that raw parameter count is not the moat anymore.

Proof economics, June 2026 (Mistral AI, PutnamBench, FLTEval): more tokens spent, more proofs solved, and it never stalls.

- PutnamBench at 4M tokens, Pass@8: 587 of 672 solved
- PutnamBench at 50k tokens, Pass@8: 44 solved
- FLTEval Pass@8, real Lean pull requests: Leanstral 43.2
- FLTEval Pass@8: Opus 4.6 39.6, at roughly 7x the cost of Leanstral

The model has 119 billion total parameters. But only 6.5 billion fire per token. That is a mixture-of-experts design, which means it wakes up only the specialists it needs for each step.

So you get large-model brains at mid-size inference cost. The Compression Principle says this: compress what runs, expand what thinks. Cut the active parameters, spend the savings on longer reasoning.
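To put a number on "compress what runs," here is a minimal sketch. The two parameter counts come from the article. The `route_top_k` function is a toy, hypothetical router added only to show the general idea of waking a few experts per token. The article does not describe Leanstral's actual routing scheme or expert count.

```python
import math

TOTAL_PARAMS_B = 119.0   # total parameters, in billions (from the article)
ACTIVE_PARAMS_B = 6.5    # parameters that fire per token, in billions (from the article)


def active_fraction(active: float, total: float) -> float:
    """Share of the model's weights that run for a single token."""
    return active / total


def route_top_k(scores: list[float], k: int) -> dict[int, float]:
    """Toy router: keep the k highest-scoring experts, softmax their scores.

    Hypothetical illustration only, not Leanstral's real router.
    """
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    peak = max(scores[i] for i in top)            # subtract the max for numerical stability
    exps = {i: math.exp(scores[i] - peak) for i in top}
    norm = sum(exps.values())
    return {i: e / norm for i, e in exps.items()}  # mixing weights sum to 1


if __name__ == "__main__":
    print(f"active share per token: {active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B):.1%}")
    print(route_top_k([0.1, 2.0, -1.0, 1.5], k=2))
```

By this arithmetic, roughly one weight in eighteen does work on any given token. The rest sit idle until the router picks them.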

That trade shows up in the numbers. On FLTEval, a benchmark built from real pull requests to a Lean math repository, Leanstral hit 43.2 Pass@8. Mistral says that beats Opus 4.6's 39.6 at roughly one-seventh the cost.
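If Pass@8 is new to you, the usual reading is "did at least one of 8 attempts succeed." The sketch below uses the standard unbiased pass@k estimator common in code-generation evaluations. Whether Mistral computed its scores exactly this way is an assumption on my part, since the article does not say.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k samples is correct.

    n: attempts actually run, c: attempts that succeeded, k: samples drawn.
    Standard estimator: 1 - C(n - c, k) / C(n, k).
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb() returns 0 when fewer than k failures exist, so the result is 1.0 there.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    print(pass_at_k(n=8, c=1, k=8))   # one success in eight attempts
    print(pass_at_k(n=16, c=2, k=8))  # two successes in sixteen attempts
```

With exactly 8 attempts and k = 8, a problem counts as solved as soon as a single attempt compiles.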

How a 500 IQ Prover Actually Grinds Through a Proof

Think of Leanstral as a 500 IQ librarian that never gets bored. You hand it a Lean 4 file. It reads the goals, tries a proof, reads the compiler errors, and tries again. It loops until the proof compiles or the token budget runs out.

"Write the wrong spec, and you get a perfectly proved wrong thing. The Lean kernel checks your logic, not your intentions." (Koda Editorial, June 2026)

That loop is the whole trick. General coding tools like Copilot or Cursor guess and stop. Leanstral guesses, checks against the Lean kernel, and refines. The kernel is the referee that cannot be fooled, so a proof either passes or it does not.
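The guess, check, refine loop is easy to sketch. Everything named here is hypothetical: `propose` stands in for the model, `check` stands in for the Lean compiler and kernel, and the token counter is a crude word count. It shows the shape of the loop, not Mistral's agent.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CheckResult:
    ok: bool            # True when the checker accepts the proof
    errors: str = ""    # checker output to feed back into the next attempt


def prove_loop(
    goal: str,
    propose: Callable[[str, str], str],       # (goal, last errors) -> candidate proof
    check: Callable[[str], CheckResult],      # candidate proof -> verdict
    token_budget: int,
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> Optional[str]:
    """Retry until the checker accepts a proof or the budget is spent."""
    feedback = ""
    spent = 0
    while spent < token_budget:
        candidate = propose(goal, feedback)
        spent += max(1, count_tokens(candidate))  # always make progress toward the budget
        result = check(candidate)
        if result.ok:
            return candidate
        feedback = result.errors                  # the errors steer the next guess
    return None                                   # budget exhausted, no proof found
```

The design point is that `check` is the only thing allowed to say yes. The model can be wrong as often as it likes, because a wrong guess just costs tokens.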

Watch how the token budget maps to results on PutnamBench, all under Pass@8:

- 50k tokens per attempt: 44 problems solved
- 200k tokens: 244 solved
- 1M tokens: 493 solved
- 4M tokens: 587 solved

That is monotonic. More thinking, more solved. No stall. This is why "burn compute to buy correctness" is now a real lever you can pull.
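Here is a quick check of that claim, using only the four PutnamBench data points quoted in this piece and the 672-problem total. The helper names are my own.

```python
TOTAL_PROBLEMS = 672

# Token budget per attempt -> problems solved under Pass@8 (figures from the article).
SOLVED_BY_BUDGET = {
    50_000: 44,
    200_000: 244,
    1_000_000: 493,
    4_000_000: 587,
}


def solve_rates(solved: dict[int, int], total: int) -> dict[int, float]:
    """Fraction of the benchmark solved at each budget."""
    return {budget: count / total for budget, count in solved.items()}


def is_monotonic(solved: dict[int, int]) -> bool:
    """True when a larger budget never solves fewer problems."""
    counts = [solved[b] for b in sorted(solved)]
    return all(lo <= hi for lo, hi in zip(counts, counts[1:]))


def marginal_gains(solved: dict[int, int]) -> list[int]:
    """Extra problems solved at each step up in budget."""
    counts = [solved[b] for b in sorted(solved)]
    return [hi - lo for lo, hi in zip(counts, counts[1:])]


if __name__ == "__main__":
    for budget, rate in solve_rates(SOLVED_BY_BUDGET, TOTAL_PROBLEMS).items():
        print(f"{budget:>9,} tokens: {rate:.1%}")
    print("monotonic:", is_monotonic(SOLVED_BY_BUDGET))
    print("gains per step:", marginal_gains(SOLVED_BY_BUDGET))
```

The gains per step are 200, 249, and 94. So the curve never dips, but the last 4x of budget buys fewer new proofs than the earlier jumps did.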

The showcase was an AVL tree, a self-balancing data structure used in databases and file systems. Leanstral proved its operations run in O(log n) time. It used structural induction and a TimeM monad to track cost, landing a bound around 48 steps per height unit.
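Why does an AVL tree give O(log n)? The sketch below uses the textbook recurrence for the fewest nodes an AVL tree of a given height can hold, then inverts it to get the tallest tree possible for n nodes. Reading the article's bound as "at most 48 steps times the height" is my assumption. The proof's exact statement is not quoted here.

```python
STEPS_PER_HEIGHT_UNIT = 48  # from the article; its exact meaning is an assumption


def min_nodes(height: int) -> int:
    """Fewest nodes in an AVL tree of this height: N(h) = N(h-1) + N(h-2) + 1."""
    a, b = 0, 1  # N(0) = 0 for the empty tree, N(1) = 1 for a single node
    for _ in range(height):
        a, b = b, a + b + 1
    return a


def max_height(nodes: int) -> int:
    """Tallest AVL tree that fits in the given number of nodes."""
    height = 0
    while min_nodes(height + 1) <= nodes:
        height += 1
    return height


def step_bound(nodes: int) -> int:
    """Worst-case step count under the assumed 48-steps-per-level reading."""
    return STEPS_PER_HEIGHT_UNIT * max_height(nodes)


if __name__ == "__main__":
    for n in (7, 1_000, 1_000_000):
        print(n, max_height(n), step_bound(n))
```

A million nodes can never stack more than 28 levels high, which is why a per-level cost bound turns into a logarithmic total.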

That proof ate 2.7 million tokens across 22 compactions. Compaction means the agent squeezes its own history to stay inside the 256k-token window. It is like an intern rewriting messy notes into clean bullet points, then keeping only the bullets.
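Compaction can be sketched in a few lines. This is a hedged illustration, not Mistral's method: `summarize` is a hypothetical callback (in practice a model call), tokens are approximated by word count, and keeping the last two messages verbatim is an arbitrary choice.

```python
from typing import Callable


def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: one token per word."""
    return len(text.split())


def compact_history(
    history: list[str],
    window: int,
    summarize: Callable[[list[str]], str],
    keep_recent: int = 2,
) -> tuple[list[str], bool]:
    """Squeeze old messages into one summary when the history outgrows the window.

    Returns the (possibly compacted) history and whether a compaction happened.
    """
    total = sum(count_tokens(message) for message in history)
    if total <= window or len(history) <= keep_recent:
        return history, False
    cut = len(history) - keep_recent
    old, recent = history[:cut], history[cut:]
    return [summarize(old)] + recent, True   # bullets first, then the fresh notes


if __name__ == "__main__":
    notes = ["tried induction on n", "goal 2 failed to close", "new lemma helps", "retry goal 2"]
    compacted, did_compact = compact_history(
        notes, window=8, summarize=lambda msgs: f"summary of {len(msgs)} notes"
    )
    print(did_compact, compacted)
```

This does a single pass and does not check that the summary itself fits. A real agent would repeat the step, which is how a long run ends up with many compactions.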

Then they pointed it at real code. Across 57 open-source repositories, it flagged 47 violated properties and found 5 previously unknown bugs. One was a buffer overflow in a Rust zigzag decoding library that silently corrupted data in release builds.
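To make "violated property" concrete, here is the kind of property a zigzag codec should satisfy: decode undoes encode for every 64-bit signed integer. This is an illustration in Python, not the Rust library in question, and the article does not say which property actually failed there. The round-trip property is my assumption about what a natural spec looks like.

```python
import random

MASK64 = (1 << 64) - 1
INT64_MIN, INT64_MAX = -(1 << 63), (1 << 63) - 1


def zigzag_encode(n: int) -> int:
    """Map a signed 64-bit integer to an unsigned one: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    if not INT64_MIN <= n <= INT64_MAX:
        raise ValueError("value does not fit in a signed 64-bit integer")
    return ((n << 1) ^ (n >> 63)) & MASK64


def zigzag_decode(z: int) -> int:
    """Inverse of zigzag_encode for unsigned 64-bit inputs."""
    if not 0 <= z <= MASK64:
        raise ValueError("value does not fit in an unsigned 64-bit integer")
    return (z >> 1) ^ -(z & 1)


def sample_inputs(count: int, seed: int = 0) -> list[int]:
    """Boundary values plus seeded random values across the full signed range."""
    rng = random.Random(seed)
    edges = [0, 1, -1, INT64_MIN, INT64_MAX]
    return edges + [rng.randint(INT64_MIN, INT64_MAX) for _ in range(count)]


def round_trip_holds(samples: list[int]) -> bool:
    """The property under test: decoding an encoded value returns the original."""
    return all(zigzag_decode(zigzag_encode(n)) == n for n in samples)


if __name__ == "__main__":
    print(round_trip_holds(sample_inputs(1_000)))
```

A test like this samples the property. A proof covers every input, including the edge cases a fixed-width language can get wrong when overflow checks are off.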

Here is my honest read. This is not "your code is now bug-free." It is a Ferrari for Lean-shaped problems, not a unicorn for all software. It only proves what you actually specify.

Write the wrong spec, and you get a perfectly proved wrong thing. The Lean kernel checks your logic, not your intentions. Whether teams outside a small Lean community will write specs carefully enough to catch the failures that matter most is still an open question.
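Here is a small Lean 4 sketch of a perfectly proved wrong thing. Both `badSort` and the theorem name are hypothetical, invented for this example. The spec only asks that the output be sorted, so a function that throws away its input passes.

```lean
-- Hypothetical "sort" that discards its input.
def badSort (_xs : List Nat) : List Nat := []

-- Weak spec: the output is sorted. True, and trivially provable.
theorem badSort_sorted (xs : List Nat) :
    List.Pairwise (· ≤ ·) (badSort xs) := by
  unfold badSort
  exact List.Pairwise.nil

-- A stronger spec would also demand that the output is a permutation
-- of the input. That statement is false for badSort, so no proof exists.
```

The kernel is happy with `badSort_sorted`. The missing permutation clause is the part only a human reading the spec would catch.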

There is also the adoption gap. Lean trails Coq and Isabelle in industry. The tool is strong. The developer culture around it is still early. An ounce of good specification is worth a pound of proof search.

2031

Three signals inside the same shift

- Compression Principle (6.5B): Cut what runs, spend on what thinks. Leanstral fires only 6.5B of 119B parameters per token via mixture-of-experts. That mid-size inference cost frees budget for longer reasoning, and the more tokens you allow, the more it solves.
- Monotonic scaling (587): Compute now buys correctness. On PutnamBench, 50k tokens solved 44 problems, 1M solved 493, and 4M solved 587. No stall in the middle, which makes burning compute a real lever for verification.
- Adoption gap (2031): The tool is strong, the culture is early. Lean trails Coq and Isabelle in industry. As AI writes more code, the scarce asset becomes verifying, but teams still need to write specs carefully enough to catch the failures that matter.

Look ahead five years. Right now 25 to 30 percent of new code at Google and Microsoft is AI-generated, per their own executives. One CTO predicts 95 percent by 2030. That trend does not slow down.

So here is the asymmetric bet. If machines write almost all the code, the scarce asset is not writing. It is verifying. The Golden Goose is the checker, not the coder.

Cheap proofs flip the economics. When a correctness proof drops from hundreds of dollars to $4, verification stops being a luxury for aerospace and finance. It becomes a line item you can run in CI, the automated pipeline that tests every change before it merges.

I think the winners in 2031 will treat proofs like tests. Not a final audit, but a constant flywheel. Every pull request tries to prove its own properties, and the ones that cannot get flagged.
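A proofs-as-tests gate can be as small as the sketch below. It assumes a Lean project managed with Lake, where `lake build` exits non-zero when a proof fails to check. The function name and the idea of calling it from a CI step are my own illustration.

```python
import subprocess
import sys
from typing import Sequence


def proofs_pass(command: Sequence[str] = ("lake", "build"), cwd: str = ".") -> bool:
    """Run the proof build and report whether it succeeded.

    Returns False when the command fails or the tool is not installed.
    """
    try:
        result = subprocess.run(list(command), cwd=cwd, capture_output=True, text=True)
    except FileNotFoundError:
        print(f"command not found: {command[0]}", file=sys.stderr)
        return False
    if result.returncode != 0:
        # Surface the compiler output so the pull request shows what broke.
        print(result.stdout + result.stderr, file=sys.stderr)
    return result.returncode == 0
```

In a pipeline you would exit with a failing status when `proofs_pass()` returns False. One caveat: Lean reports `sorry` as a warning rather than an error, so a stricter gate would also scan the build output for it.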

But do not oversell it. Formal proofs cover software logic, not hardware faults, sensor noise, or a human misconfiguring the deploy. If proofs kill 20 percent of your risk, the other 80 percent still lives in the physical and organizational world. Proof abundance is a huge asymmetric advantage, not a cure.

What to Build This Weekend

You do not need a math degree to try this. You need one small property and one weekend.

First, pick something tiny. A single function you already trust, like a sort or a parser. Write down in plain English what it should always do. That English sentence is your spec.

Second, grab Leanstral through the free API in Mistral Labs or pull the weights from Hugging Face. Ask it to translate your plain-English property into a Lean 4 theorem. Let the agent loop until it compiles.

Third, when it breaks, and it will, read the compiler error out loud. The error tells you where your spec and your code disagree. That gap is usually the real bug.
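To show what the translation step produces, here is about the smallest example I can think of. The plain-English spec is "reversing a list never changes its length." It is a simpler stand-in for a sort or parser property, and the theorem name is made up.

```lean
-- Plain-English spec: "reversing a list never changes its length."
theorem reverse_keeps_length (xs : List Nat) :
    xs.reverse.length = xs.length := by
  simp
```

If you change the right-hand side to `xs.length + 1`, the proof stops compiling. That error is the feedback the steps above tell you to read out loud.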

While you wait on proof runs, wire up the rest of your stack. Use Boxchat to compare how two models phrase the same Lean theorem side by side. Use bolt.new to spin up a small web dashboard that shows which proofs passed. Use typedesk to save your best Lean prompt snippets as shortcuts so you stop retyping them.

Start with one proof, not one hundred. Get your reps in. Learn in public and post the proof that finally compiled.

The lesson holds even if you never touch Lean again. Cheap verification is coming for every serious codebase. Build the muscle now, while the tool is free and the field is still small.

DOJO · BUILD THIS WEEKEND

Prove one property before you touch a hundred.

- Pick one tiny function. Choose a sort or parser you already trust, then write down in plain English what it should always do. That sentence is your spec.
- Loop it through Leanstral. Grab the free API in Mistral Labs or pull the weights from Hugging Face, ask it to translate your property into a Lean 4 theorem, and let the agent loop until it compiles.
- Read the error out loud. When it breaks, the compiler message shows where your spec and your code disagree. That gap is usually the real bug. Once the proof finally compiles, post it.

THE BOTTOM LINE

Cheap verification is coming for every serious codebase.

Leanstral 1.5 proves raw parameter count is no longer the moat: 6.5B active parameters plus a big token budget beat Opus 4.6 on FLTEval at roughly one-seventh the cost, by Mistral's numbers. When a proof drops from hundreds of dollars to $4, verification stops being an aerospace luxury and becomes a CI line item. This is not a bug-free guarantee, since proofs cover logic and not hardware, sensors, or human error. But if machines write almost all the code, the checker becomes the golden goose, and the muscle is worth building now while the tool is free.
