The 11 Day Treadmill

Seven frontier models shipped in 78 days between February and April 2026. That is one new "state of the art" system every 11 days. In 2024, the entire industry managed three or four major releases per year. Anthropic alone pushed Claude Opus 4.6, Sonnet 4.6, and Opus 4.7 across a single quarter. OpenAI went from GPT-5.3 to GPT-5.5 with roughly six weeks between versions.

And here is the part nobody wants to say out loud: the top score on WhatLLM's Intelligence Index has not moved since April. GPT-5.5 xhigh hit 60.24 and the ceiling held through mid-May. The version numbers kept climbing. The capability gains did not.

So the question is not "which model is best?" The question is "does the version number even tell you what you need to know anymore?" I think the answer is no. And that forces a different kind of infrastructure decision than most teams are making right now.

The Version Treadmill Trap

Here is a framework for what is happening. Call it the Version Treadmill Trap.

RELEASE VELOCITY · MAY 2026ANTHROPIC · GOOGLE · ALIBABA · OPENAI

Four labs, four version surges, one stalling capability ceiling.

Claude Opus Anthropic · latest flagship

4.7

Gemini Pro Google · Cloud Next 2026

3.1

Qwen Series Alibaba · rapid iteration

3.5-3.6

Gemma Open-Weight Google · pre-I/O release

It works like this. A lab ships a new model. Your team evaluates it. You start migrating prompts, testing edge cases, updating your pipeline. Before the migration finishes, the next version drops. You pause. Evaluate again. Restart. The work compounds but the progress does not.

The Version Treadmill Trap has three symptoms. First, your engineering team spends more time evaluating models than building product. Second, your "model selection" decisions have a half-life shorter than your sprint cycle. Third, you optimize for a version number instead of a capability contract.

The escape is simple to describe and hard to execute: stop choosing models and start choosing capability tiers. Pin your production system to a stability guarantee, not a version string. Treat the model like a utility with a service level, not a product with a changelog.

Google Cloud AI adoption hit 75% of customers at Cloud Next 2026, up sharply from prior quarters. That number tells you enterprises are locking in infrastructure commitments right now. They are choosing platforms, not point releases. The teams that escape the treadmill are the ones building above the version layer, not inside it.

Why Your Model Selection Process Is Already Obsolete

The real problem is structural, and it requires systems-level thinking to solve. Most teams still evaluate AI models the way they evaluated software libraries in 2019: pick the best one, integrate it, revisit in six months. That process assumed slow-moving dependencies. It assumed version numbers carried stable semantic meaning across months or years.

Stop choosing models and start choosing capability tiers. Pin your production system to a stability guarantee, not a version string. Treat the model like a utility with a service level, not a product with a changelog.· KODA ANALYSIS · MAY 2026

Neither assumption holds in May 2026.

OpenAI compressed the gap between GPT-5.4 and GPT-5.5 to about six weeks. Google runs a two-track strategy with a slower flagship Gemini line and rapid iteration on Flash and Nano variants. DeepSeek ships every release with a full technical report and open weights under MIT license.

The interesting shift, though, is not the speed. It is the control surface. You no longer pick "Claude Opus 4.7 versus GPT-5.5." You pick a model family and then dial a knob: non-think, think, think-max. The product surface moved from model selection to effort selection within a single family.

This changes the architecture of your entire AI stack. Think of it like a thermostat versus a furnace replacement. The old approach: rip out the furnace every quarter and install a new one. The new approach: install one system and adjust the temperature dial based on the task.

Here is where teams get this wrong. They build thick abstraction layers to be "model agnostic" and then auto-upgrade to the latest version without re-evaluating. That creates two dangerous gaps.

Gap one: hidden behavioral regressions. Minor model changes can break subtle prompt-dependent logic even when aggregate benchmarks improve. Your extraction schema that worked on Opus 4.6 might silently degrade on 4.7 because the model's default formatting shifted. Benchmarks went up. Your production accuracy went down.

Gap two: security exposure. Rubrik's May 2026 analysis put it bluntly: every frontier model release is now also a cyber capability release. Each model upgrade changes your attack surface. New tool-use behaviors. New jailbreak vectors. New prompt-injection failure modes. If you auto-upgrade without security re-assessment, you are deploying capabilities you have not threat-modeled.

The system that actually works looks like this. Build three layers.

Layer one: a capability contract. Define what you need in plain language. "Summarize 50-page legal documents with 95% factual accuracy in under 8 seconds." That contract does not mention a model name.

Layer two: a pinned production version. Lock a specific model and reasoning-effort level that meets the contract. Do not touch it until you have run a full evaluation cycle on the replacement candidate. Digital Applied's data shows labs are now providing explicit migration playbooks and backwards-compatible API contracts. Use them.

Layer three: a canary pipeline. Run every new model release through your evaluation suite automatically. Compare against the pinned version on your specific tasks, not on public benchmarks. Only promote to production when the canary passes your contract thresholds and a security review.

I genuinely don't know whether most teams will build this discipline before they get burned. The 75% Google Cloud adoption number suggests enterprises are committing to platforms fast, possibly faster than their evaluation processes can keep up. My read: the teams that win are the ones treating model upgrades like database migrations, not like app store updates.

METR's analysis of frontier AI safety policies reinforces this. Twelve companies have now published safety protocols requiring evaluations before, during, and after deployment. Anthropic's Frontier Safety Roadmap explicitly notes that frontier models risk taking longer than initial release dates when critical capability thresholds are approached. The responsible labs are building speed bumps. Smart infrastructure teams should build their own.

2031

Three signals inside the same shift

CEILING STALL

60.24

The Intelligence Index has not moved since April.

GPT-5.5 xhigh hit 60.24 on WhatLLM's Intelligence Index and the ceiling held through mid-May. Version numbers kept climbing. Capability gains did not. The rising floor and stalling ceiling compress differentiation between models.

EFFORT ROUTING

3 tiers

Model selection is becoming effort selection within a single family.

You no longer pick Opus 4.7 versus GPT-5.5. You pick a model family and dial a knob: non-think, think, think-max. The product surface has moved from choosing models to choosing reasoning effort per task. Architecture must follow.

PLATFORM LOCK-IN

75%

Google Cloud AI adoption surged to 75% of customers.

Announced at Cloud Next 2026, the 75% adoption figure shows enterprises are committing to platforms, not point releases. Teams escaping the version treadmill are building above the model layer. Those chasing versions are running in place.

Pull back five years from now. The version treadmill is a 2025-2026 problem. By 2031, I expect model identity to matter about as much as your browser version number matters today.

Chrome updates every week. Nobody notices. Nobody cares. The reason is that the interface contract between the browser and the web is stable enough that version changes are invisible to the user. AI infrastructure is heading to the same place, but we are stuck in the ugly middle period where the interface contracts are still forming.

The asymmetric bet right now is on the abstraction layer, not the model layer. The teams building robust capability contracts, evaluation pipelines, and reasoning-effort routing today are building a compounding advantage. Every model release makes their system smarter without requiring a rewrite. Every model release forces their competitors back onto the treadmill.

Consider the Costco hot dog principle. Costco has sold a hot dog and soda for $1.50 since 1985. The ingredients change. The suppliers change. The logistics change. The price and the promise stay the same. That is a capability contract. Your AI infrastructure should work the same way. The model underneath can change every six weeks. The contract your product makes with your users should not.

Context windows standardized at one million tokens across all four major labs in H1 2026. DeepSeek V4 shipped open-weight at frontier quality. SubQ 1M-Preview demonstrated 12 million token context with subquadratic attention. The floor keeps rising. The ceiling, according to the Intelligence Index, has temporarily stalled at 60.24.

This pattern (rising floor, stalling ceiling) is the signal that matters. It means the differentiation between models is compressing. It means the value is migrating from "which model" to "which system around the model." Five years from now, the companies that built systems will own the market. The companies that chased versions will have spent five years running in place.

What to Build This Weekend

You do not need to overhaul your entire stack. You need to build one small thing that proves the concept.

Step one: define one capability contract for your most important AI workflow. Write it in plain English. "Classify support tickets into 12 categories with 90% accuracy in under 2 seconds." No model names. No version numbers. Just the outcome.

Step two: set up a simple evaluation pipeline. Use n8n to wire together an automated test. Pull 100 real inputs from your production data. Run them through your current model. Score the outputs against your contract. Save the results. This is your baseline.

Step three: test the contract against a second model. If you are on Claude, try Gemini 3.1 Pro. If you are on GPT-5.5, try DeepSeek V4. Run the same 100 inputs. Compare. You will learn more about your actual requirements in one afternoon than in a month of reading benchmark tables.

Step four: use Serno to stress-test your decision. Serno pits multiple AI models against each other in structured debate. Feed it your evaluation results and ask it to argue for and against switching. It will surface tradeoffs you missed.

Step five: document the result in Granola. Record a 10-minute meeting with yourself walking through what you found. Granola will turn it into searchable, actionable notes you can share with your team on Monday.

The whole exercise takes three to four hours. You will come out of it with a capability contract, a baseline evaluation, a comparison data point, and documentation. That is more infrastructure maturity than most teams achieve in a quarter of chasing version numbers.

The models will keep shipping. The version numbers will keep climbing. Build the system that makes that someone else's problem.

DOJO · BUILD THIS WEEKEND

Define one capability contract and prove it against two models in an afternoon.

Write one capability contract in plain English. Pick your most important AI workflow. Define the outcome without mentioning a model name or version. Example: "Classify support tickets into 12 categories with 90% accuracy in under 2 seconds."
Build a baseline evaluation pipeline. Use n8n or a simple script to pull 100 real inputs from production data. Run them through your current pinned model. Score outputs against your contract thresholds. Save the results as your baseline.
Test the contract against a second model. If you run Claude, try Gemini 3.1 Pro. If you run GPT-5.5, try DeepSeek V4. Run the same 100 inputs, compare against your baseline, and learn more about your real requirements in one afternoon than a month of benchmark browsing.

THE BOTTOM LINE

Build systems, not version dependencies.

The version treadmill is a 2025-2026 problem. By 2031, model identity will matter about as much as your browser version number. The asymmetric bet right now is on the abstraction layer: capability contracts, automated evaluation pipelines, and reasoning-effort routing that compounds with every release instead of requiring a rewrite. The companies that build systems will own the market. The companies that chase version numbers will have spent five years running in place.

The Version Number Is Dead.
Welcome to the Capability Contract Era.