K Koda Intelligence
exploreDeep Dive

OpenAI Is Writing the Test Every AI Model
Will Be Graded Against

OpenAI published a third-party evaluation playbook and a Frontier Governance Framework in back-to-back days, aiming to define what "trustworthy" means before regulators do. The governance arms race is accelerating against a backdrop of massive infrastructure bets: SoftBank plans a 1 GW AI data center in France, while geopolitical tensions surfaced at a protest rally in Ankara on May 30 and Europe's largest nuclear facility at Zhzhia marked May 30 as a pivotal date.

8 MIN READ · BY THE KODA EDITORIAL TEAM · STRATEGY · AI GOVERNANCE
headphones
LISTEN TO THE DEEP DIVE~2 min conversation
smart_display
WATCH THE VISUAL NARRATIVEAnimated breakdown · ~2 min
play_arrow
Play · YouTube
SOFTBANK DC1 GW· SOFTBANK FRANCE PLAN ZHZHIA NUCLEARMAY 30· EUROPE'S LARGEST FACILITY ANKARA RALLYMAY 30· GEOPOLITICAL SIGNAL EVAL FRAMEWORKS8↑ MAY 2026 LANDSCAPE DEEPEVAL STARS15.7K↑ GITHUB · MAY 2026 INSPECT AI EVALS200+↑ UK AISI ANTHROPIC VAL$965B↑ SERIES H REPORTED SOFTBANK DC1 GW· SOFTBANK FRANCE PLAN ZHZHIA NUCLEARMAY 30· EUROPE'S LARGEST FACILITY ANKARA RALLYMAY 30· GEOPOLITICAL SIGNAL EVAL FRAMEWORKS8↑ MAY 2026 LANDSCAPE DEEPEVAL STARS15.7K↑ GITHUB · MAY 2026 INSPECT AI EVALS200+↑ UK AISI ANTHROPIC VAL$965B↑ SERIES H REPORTED

OpenAI published a formal playbook for third-party AI evaluations on May 29, 2026. The document covers independence, reproducibility, and transparency standards for how outsiders should test frontier models. One day earlier, OpenAI released its Frontier Governance Framework, mapping internal safety practices to California's Transparency in Frontier AI Act and the EU AI Act's Code of Practice. Two documents in two days. That is not a coincidence. That is a company trying to write the rules before regulators finish sharpening their pencils.

Here is what most people missed: this is not about safety. Not really. It is about who gets to define what "trustworthy" means for the next decade of AI deployment. And the answer to that question will determine which companies survive, which get locked out of regulated markets, and which builders can ship products without a legal team the size of a small army.

I think we are watching the opening moves of a governance arms race that will reshape the economics of AI faster than any model benchmark ever could.

The Compliance Moat

The core insight fits on a napkin. Call it The Compliance Moat.

GOVERNANCE STACK · MAY 2026OPENAI PLAYBOOK · UK AISI · DEEPEVAL · MIT SLOAN

The evaluation ecosystem is professionalizing at speed.

Major eval frameworks Industry survey · May 2026
8
Inspect AI pre-built evals UK AISI · multi-vendor
200+
DeepEval GitHub stars GitHub · v4.0.3 · May 21 2026
15.7K
Anthropic reported valuation Series H · 2026
$965B

In every maturing technology market, the companies that define the standards end up with a structural advantage over the companies that merely follow them. Telecom had this with 3GPP. Finance had it with Basel accords. Now AI has it with evaluation frameworks.

OpenAI is not just building models. It is building the test that every model will be graded against. That is not a safety investment. That is a moat.

The Compliance Moat works like this. Step one: publish a rigorous, publicly available framework. Step two: align it to existing and emerging regulations. Step three: acquire the tooling that operationalizes the framework. Step four: let competitors scramble to match your standard while you are already living inside it. The company that sets the test always has a head start on passing it.

The Three-Layer Governance Stack and Why It Changes Everything

Zoom into the architecture OpenAI has assembled and you see a deliberate three-layer stack that competitors now have to reverse-engineer.

The company that sets the test always has a head start on passing it. That is not a safety investment. That is a moat.· KODA EDITORIAL ANALYSIS · MAY 2026

It covers instruction-following, conflict resolution, user freedom, and safety constraints. This is the "constitution" layer. It tells the model what it should be.

Layer two: evaluation infrastructure. The third-party playbook and external testing program create a formal process for outsiders to confirm or challenge safety claims. OpenAI's playbook explicitly targets three dimensions: capabilities (what the model can do), safeguards (what prevents misuse), and validity of the evaluation itself (whether the test actually measures what it claims to). That third dimension is the subtle one. It means OpenAI is not just asking "did the model pass?" but "was the exam any good?" That is a level of methodological self-awareness most labs have not even started thinking about.

Layer three: regulatory alignment. The Frontier Governance Framework maps the first two layers onto specific legal obligations. This layer turns technical evaluations into compliance artifacts.

Now consider the asymmetry this creates. By May 2026, the AI evaluation landscape had reached what one industry guide called "an inflection point," with eight major frameworks split across five commercial platforms and three open-source standards. Inspect AI, maintained by the UK AI Security Institute, ships 200-plus pre-built evaluations covering providers from OpenAI to Anthropic to Mistral. DeepEval hit version 4.0.3 on May 21, 2026 with 15,700 GitHub stars. The ecosystem is professionalizing fast.

But here is the contrarian read that deserves honest weight. It is unclear whether a company that builds frontier models can also credibly architect the evaluation regime for those models. The independence problem is real. Many "third-party" evaluators depend on labs for model access, funding, and prestige. That creates subtle incentive structures that can soften criticism without anyone explicitly asking for it. Evaluation shopping, where labs spotlight favorable reports and quietly shelve harsh ones, is not hypothetical. It is a predictable consequence of the current structure.

There is also a false assurance problem. MIT Sloan research shows that the more diverse methods organizations use to check third-party AI tools, the more failures they surface. Any single framework will miss things. If governments start treating "passed an OpenAI-style third-party eval" as sufficient for deployment, we have replaced genuine oversight with a sophisticated form of safety theater. Goodhart's Law applies here with full force: when the measure becomes the target, it ceases to be a good measure.

My read is that the governance arms race is real and probably inevitable, but the outcome depends entirely on whether independent, adversarial evaluation capacity keeps pace with lab-defined frameworks. The UK AI Security Institute's Inspect AI is a promising signal. State-backed evaluation suites that cover multiple vendors create a counterweight to any single lab's agenda-setting power. Apollo Research and AVERI have called for white-box access to frontier models for third-party evaluators, meaning access to model internals, not just API endpoints. That push matters. Black-box testing of systems this complex is like inspecting a bridge by driving over it once and declaring it safe.

The smaller-lab problem is equally serious. Sophisticated evaluation pipelines require substantial compute, security infrastructure, and specialized staff. Those costs are trivial for a company valued in the hundreds of billions. They are potentially fatal for a 12-person startup or an academic research group. If compliance regimes adopt evaluation standards that effectively require frontier-lab-scale resources to satisfy, we will have built a regulatory moat that entrenches incumbents and labels open-source projects as "unsafe" by default. That would be a terrible outcome for the field.

Anthropic's recent Series H reportedly valued the company at roughly $965 billion. At that scale, credible safety claims are not a nice-to-have. They are a direct competitive differentiator that affects enterprise sales, regulatory access, and investor confidence. Every major lab now has a financial incentive to invest heavily in governance infrastructure. The question is whether that investment produces genuine accountability or just better-dressed compliance paperwork.

2031

Three signals inside the same shift

INDEPENDENCE GAP
8

Eight frameworks, but most evaluators still depend on labs for access.

Many third-party evaluators rely on frontier labs for model access, funding, and prestige. This creates subtle incentive structures that can soften criticism without anyone explicitly asking for it. Evaluation shopping is a predictable consequence, not a hypothetical one.

COMPLIANCE MOAT
2031

By 2031, third-party AI evaluation becomes a multi-billion-dollar market.

Companies investing in evaluation infrastructure today will hold compounding advantages not because their models are better, but because they can prove trustworthiness in ways regulators accept. The California Transparency Act and EU AI Act Code of Practice are early drafts of far more prescriptive requirements ahead.

SMALL LAB SQUEEZE
200+

State-backed eval suites may be the only counterweight to lab-defined standards.

The UK AISI's Inspect AI ships 200-plus pre-built evaluations covering multiple providers, offering smaller teams a path to compliance. But if evaluation standards effectively require frontier-lab-scale resources, open-source projects risk being labeled unsafe by default. The cost asymmetry is the real threat to competition.

Pull back five years and the pattern gets clearer.

We are living through a phase transition in AI governance. The shift is from "trust us, we tested it internally" to "prove it to an independent evaluator using a documented, reproducible methodology." That transition mirrors what happened in financial services after 2008, in pharmaceutical development after thalidomide, and in aviation after every major crash investigation. Industries that produce systemic risk eventually get systemic oversight. The only variable is how much damage happens before the oversight framework matures.

By 2031, I expect three things to be true. First, third-party AI evaluation will be a multi-billion-dollar market segment with its own professional certifications, audit standards, and liability frameworks. The eight frameworks that existed in May 2026 were the seed stage. Second, regulatory regimes in the US, EU, and likely China will require documented third-party evaluations as a precondition for deploying frontier models in high-stakes domains like healthcare, finance, critical infrastructure, and defense. The California Transparency in Frontier AI Act and the EU AI Act's Code of Practice are early drafts of what will become much more prescriptive requirements.

Third, and this is the asymmetric bet worth watching, the companies that invest in evaluation infrastructure today will have compounding advantages in 2031. Not because their models are necessarily better, but because they can prove their models are trustworthy in a way regulators accept. Salary buys furniture. Compliance infrastructure buys market access.

I genuinely don't know whether open-source models will be helped or hurt by this transition. The optimistic case: state-backed evaluation tools like Inspect AI democratize the compliance process, letting smaller teams run rigorous evaluations without building the infrastructure from scratch. The pessimistic case: compliance costs become a barrier to entry that concentrates the market around three or four well-funded labs. The truth will probably land somewhere in between, but builders should plan for the pessimistic scenario and be pleasantly surprised if the optimistic one materializes.

The deepest strategic question is not about any single framework or regulation. It is about impermanence. Every standard published today will be outdated within 24 months as model capabilities evolve. The real competitive advantage is not passing today's test. It is building the organizational muscle to adapt to tomorrow's test before your competitors even know the questions have changed.

What to Build This Weekend

You do not need a legal team to start preparing for the governance shift. Here is what you can do this week.

Step one: audit your current evaluation process. If you are building on top of any foundation model, write down exactly how you test it before deployment. If the answer is "we prompt it a few times and see if the output looks right," you have work to do. OpenAI Evals is MIT-licensed and free. Clone the repo. Run one benchmark against your use case. That is your baseline.

Step two: document your safety claims. Open a simple doc. Write three sentences about what your application does, what could go wrong, and what you have done to prevent it. This is not bureaucracy. This is the minimum viable compliance artifact that every regulated buyer will eventually ask for. Tools like HitPublish can help you turn internal documentation into structured content that lives alongside your product.

Step three: stress-test one decision. Pick the highest-stakes decision your AI system makes and try to break it. Feed it adversarial inputs. Check for hallucination. Record what happens. Choosaro can help you structure the comparison if you are evaluating multiple model options or configurations side by side.

Step four: share what you learn. The evaluation ecosystem is still young enough that your findings matter. Write up your results. Post them publicly. The builders who learn in public and contribute to shared evaluation knowledge will have more credibility when compliance requirements tighten. Hello Humans can turn your findings into a shareable audio format if writing is not your preferred medium.

None of this requires a CS degree. None of it requires frontier-lab resources. It requires 3 hours this weekend and the willingness to treat evaluation as a first-class part of your build process, not an afterthought you bolt on before launch.

The governance arms race is coming whether you participate or not. The builders who start now, even with small, imperfect steps, will be the ones who are not scrambling when the rules arrive.

DOJO · BUILD THIS WEEKEND

Map your model's compliance surface before someone else does.

  1. Audit your current eval stack against all three dimensions. OpenAI's playbook targets capabilities, safeguards, and evaluation validity. Run your existing tests through that lens and document which dimension each test actually covers. Most teams will find the third dimension completely empty.
  2. Deploy Inspect AI on a staging model this weekend. Pull the UK AISI's open-source suite, select 10 evaluations relevant to your deployment domain, and run them against your latest checkpoint. Compare results to your internal benchmarks and note every discrepancy. That gap is your compliance risk.
  3. Draft a one-page regulatory alignment map. List every jurisdiction where you plan to deploy, match each to the relevant regulation (EU AI Act, California Transparency Act, or equivalent), and identify which of your current evaluation artifacts satisfy which legal obligation. Gaps on this map are gaps in your market access.
THE BOTTOM LINE

The real moat is not the model. It is the proof the model is trustworthy.

OpenAI is not just building AI. It is building the governance infrastructure that will define how every AI system gets judged. The companies that invest in evaluation architecture today will hold compounding advantages in market access, regulatory approval, and enterprise trust by 2031. But this only works for the industry if independent, adversarial evaluation capacity keeps pace with lab-defined frameworks. Builders who wait for regulators to hand them a checklist will find the checklist was written by their competitors.

Want this every morning?

AI analysis, world news, markets, and tools. One briefing, delivered free.

One email per day. No spam. Unsubscribe anytime.