Anthropic ran 186 deals across 500 listings inside its San Francisco office. $4,000 moved. Half the agents were secretly running Claude Haiku 4.5 instead of Claude Opus 4.5. The Opus side earned $2.68 more per item. The Haiku users rated their deals just as fair. They could not tell they were losing. That is the agent quality gap. And it is the most uncomfortable result a frontier lab has shipped this year.
In December, Anthropic ran a marketplace inside its San Francisco office and called it Project Deal. Sixty-nine employees. A hundred dollars each. They handed the negotiations to Claude. Snowboards, bikes, ping-pong balls, a lab-grown ruby. Real items. Real money. The agents went into Slack and started haggling.
A week later: 186 deals across 500 listings. Four thousand dollars in transaction value. Fairness scores around four out of seven. People walked away happy.
Then Anthropic published the part they did not announce up front. They had run four marketplaces in parallel. Two used Claude Opus 4.5 for everyone. Two secretly assigned Haiku 4.5 to half the participants. Nobody was told.
The Opus side closed about two more deals on average. Opus sellers earned an extra $2.68 per item. Opus buyers saved $2.45. A lab-grown ruby went for $65 with Opus and $35 with Haiku. The same broken bicycle: $65 versus $38.
Here is the part that should sit with you. The people stuck with Haiku rated their deals just as fair as the people on Opus. Same satisfaction. Same vibes. They could not tell they were losing.
That is not a bug in the experiment. That is the experiment.
Name the principle: the Invisible Tier Tax
Most builders treat model choice as an infrastructure decision. Latency, cost per token, context window. Pick the cheapest one that hits the bar, ship it, move on. That math is fine for summarization, classification, anything where the agent is not closing a loop with money on the other side.
The moment money is on the line, the math flips. Project Deal made it concrete. Call it the Invisible Tier Tax.
Here is the principle, in one sentence: the cheaper model in any negotiation workflow is not cheaper, because the value it leaks is invisible to everyone except the system that ran the parallel comparison.
Three things make the tax invisible. The agent shows you a confident final number. The customer satisfaction survey scores fine. The competitor on the other side of the trade does not send you a thank-you note for using Haiku. Your dashboard cannot detect what it cannot measure, and what it cannot measure is the parallel reality where a stronger agent would have closed at a different price.
Why the gap is real, and what it costs
Anthropic's conditions were about as gentle as they come. Sixty-nine technical employees. One week. A friendly, non-adversarial office marketplace. Goods that were already sitting in closets. The asymmetry showed up anyway.
Run that same dynamic into a real B2B procurement loop. A renewals workflow. A contract negotiation between two AI counsel agents. A salary tool that bargains for a candidate. A travel-booking agent hitting a hotel chain's pricing API. The shape of the problem does not change. The amount of money on each transaction goes up. The gap compounds.
There are three reasons the cheaper model loses, and you need to know all three before you ship.
Reason one: the cheaper model sounds the same. This is the hidden killer. Haiku is articulate. It writes a clean offer email. It uses good grammar. It invokes the right pleasantries. The buyer on the other side, also represented by an agent, gets a confident, well-formed counteroffer. Nothing in that text signals "I am being represented by a less capable model." The text quality decouples from the strategic quality. Most quality dashboards measure the text. None of them measure the strategy.
Reason two: aggressive prompting does not save you. This is the part most builders will get wrong. The instinct, when you read about the gap, is to think: I will just write a better system prompt. Tell my Haiku agent to "negotiate hard." Tell it to "always counter at least three times." Tell it to "never accept the first offer." Anthropic tested all of that. Some participants told their agents to be friendly. Some told them to be aggressive. The instructions had no measurable effect on outcomes. None.
The capability of the underlying model was the entire game. Prompt engineering is a powerful tool, but it cannot produce an extra ten billion parameters of negotiation reasoning. Bigger model, better outcome. Prompts trim the edges.
Reason three: the buyer's agent gets smarter over time and yours does not. In a real adversarial market, the counterparty is not your colleague. It is a system optimizing against your agent. Every interaction it has with weaker agents teaches it which moves work. The buyer-side agent learns to anchor lower because it has learned that weaker agents anchor lower. The seller-side agent learns to hold its line because it has learned that weaker agents fold. If your agent is on the weak side of that loop, you are not just losing today. You are training the loop against you.
Now run the back-of-napkin math. Imagine a customer-renewal agent at a SaaS company. Median contract value, $24,000 a year. Your agent gives up $30 of negotiating margin per renewal versus the strongest available model. One-eighth of a percent. Looks trivial. Multiply across 800 renewals a quarter. That is $24,000 a quarter. Ninety-six grand a year. The inference saving from using the cheaper model: maybe $40 a month. Five hundred bucks annually.
So you just paid $96,000 to save $500. And nothing in your stack flagged it.
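If you want to run that napkin against your own numbers, the whole model fits in a few lines of Python. Every input below is an illustrative assumption from the paragraph above; swap in your own.

```python
# Invisible Tier Tax, back-of-napkin version.
# All three inputs are illustrative assumptions from the text above.
margin_leaked_per_deal = 30.00        # $ conceded per renewal vs. the strongest model
deals_per_quarter = 800               # renewal volume
cheap_model_saving_per_year = 500.00  # annual inference saved by the weaker model

leak_per_year = margin_leaked_per_deal * deals_per_quarter * 4
net_loss = leak_per_year - cheap_model_saving_per_year

print(f"leaked:   ${leak_per_year:>9,.0f} / year")
print(f"saved:    ${cheap_model_saving_per_year:>9,.0f} / year")
print(f"net loss: ${net_loss:>9,.0f} / year")  # -> $95,500, the number no dashboard shows
```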
Same items. Different model. Look at the gap.
Three signals from inside the office
The losers had no idea.
Haiku agents wrote articulate, well-formed offers. The text quality matched Opus exactly. Fairness scores on a 1-7 scale were indistinguishable. The people who lost money rated the experience the same as the people who made it. Text quality decoupled from strategic quality, and every quality dashboard in production today measures the text.
Aggression instructions had no effect.
Anthropic tested it directly. Some participants told their Claude agent to be friendly. Some told it to be aggressive. None of it moved the needle. Capability of the underlying model was the entire game. Prompt engineering trims the edges. It cannot replace ten billion parameters of negotiation reasoning.
The cost of inaction is invisible.
A SaaS renewal agent giving up $30 per contract across 800 renewals a quarter loses $96,000 a year. The inference saving from running Haiku: $500 a year. Net loss: $95,500. Nothing in the stack flags it because nobody is running the parallel. The first builder who ships the audit layer owns the most defensible product in agent commerce.
Zoom out: the visibility layer is the actual flywheel
Pull back to the five-year arc. Everything we are watching in agent commerce right now rhymes with the early internet, but the rhyme is on a different layer.
In 2003, the question was: which website is honest about its prices? In 2026, it will be: which agent is honest about its capability? In 2003, the answer was Yelp, TripAdvisor, Glassdoor, Consumer Reports. The visibility layer ate the value. The hotels did not capture the surplus. The platform that helped consumers tell good hotels from bad ones did.
The same pattern is queued up here. The frontier labs will keep building stronger agents. That is a commodity arms race. There will be a 4.6, a 4.7, a 5.0, a 5.5. They will all sound articulate. They will all close deals. The thing that does not yet exist, that the market will pay for, is the layer of software that lets a normal person know whether their agent just got out-negotiated.
That is the asymmetric advantage. Building a smarter model is a hundred-billion-dollar capital expenditure. Building the audit layer that runs the parallel and surfaces the gap is a small team, a clean API, and a product that prints money the moment agent commerce is mainstream.
Counterpositioning matters. The frontier labs cannot be the audit layer. They have direct financial incentive to upsell capability. They cannot credibly tell you that your Opus 4.5 agent is leaving money on the table compared to the next model they are about to charge you more for. The audit layer has to be neutral. It has to be third-party. That is the moat.
There is a regulatory tailwind too. The Project Deal write-up flags an "uncomfortable implication" out loud. Anthropic published the gap. They did not have to. They did, because the alternative is finding out about it through a class-action lawsuit two years from now. Every regulator with an agent-commerce file just got their first piece of public data. The agentic transition is going to bring transparency requirements with it. The visibility layer is regulator-aligned and consumer-aligned at the same time. That is rare.
The five-year arc looks like this: 2026 is the year the agent quality gap goes from research curiosity to product spec. 2027 is the year the first audit-layer SaaS companies cross ten million in revenue. 2028 is the year platforms start advertising "model parity guaranteed" the way credit cards advertise zero liability. By 2030, asking "what model is your agent using" will be the equivalent of asking a hotel for the wifi password. Standard. Expected. Required for trust.
Five steps to run your own Project Deal in miniature
- Pick one workflow where your agent closes a loop with money or a contract. Customer renewal. Procurement. Insurance claims. A sales-development bot that books meetings tied to commission. Pick the one with the most dollars on the line per run.
- Log the full transcript for every run of that workflow. Inputs. Reasoning. Tool calls. Final outcome. If you are using a framework like LangGraph or CrewAI, this is a checkbox. If you rolled your own, write the logger this week; a minimal sketch follows this list. Two hours of work, maximum.
- Take ten of those transcripts and replay them through a stronger model. Same inputs. Same context. Same tools available. If you are running Sonnet, replay through Opus. If you are running Haiku, replay through both. Compare the final outcomes. Document the deltas in dollars or contract terms. This is your Project Deal in miniature; the replay harness below shows one way to wire it. It will take an afternoon.
- Act on the result. If the gap is real, you have two choices. Upgrade the production model and absorb the inference cost. Or build the routing logic that uses the stronger model only on high-value transactions, sketched below as a single function. The indefensible choice is to not run the test and keep paying the Invisible Tier Tax.
- Turn the test into a habit. Add the parallel replay to your evaluation pipeline. Make it a CI check that runs against the last fifty transcripts every time you change the production model; the last sketch below shows the shape. The dashboard you get is the visibility layer in miniature. It is also the asset that makes you legitimate when this market actually shows up.
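For step two, here is how small the logger can be. A minimal sketch under assumptions: local JSONL files stand in for whatever store you actually use, and every field name is a placeholder to map onto your agent's internals.

```python
# Minimal JSONL transcript logger for money-closing agent runs.
# Assumptions: local files stand in for real storage; field names are placeholders.
import json
import time
import uuid
from pathlib import Path

LOG_DIR = Path("transcripts")
LOG_DIR.mkdir(exist_ok=True)

def log_run(workflow: str, inputs: dict, messages: list,
            tool_calls: list, outcome: dict) -> str:
    """Append one complete agent run as a single JSON line."""
    record = {
        "run_id": str(uuid.uuid4()),
        "workflow": workflow,            # e.g. "customer-renewal"
        "ts": time.time(),
        "model": outcome.get("model"),   # which production model ran this
        "inputs": inputs,                # everything the agent saw going in
        "messages": messages,            # the full turn-by-turn transcript
        "tool_calls": tool_calls,        # each tool invocation and its result
        "outcome": outcome,              # final price, terms, accepted or not
    }
    with (LOG_DIR / f"{workflow}.jsonl").open("a") as f:
        f.write(json.dumps(record) + "\n")
    return record["run_id"]
```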
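For step three, the harness below freezes the counterparty's side of each logged transcript and asks the stronger model to regenerate only your agent's final turn, then diffs the last dollar figure in each version. That is a deliberate simplification of a live negotiation, but enough to surface a gap. It assumes the record shape from the logger above, that each transcript ends with your agent's own final message, and that the model alias is current; check all three against your setup.

```python
# Parallel replay in miniature: regenerate each run's final turn with a
# stronger model and diff the closing price. Deliberately simplified: the
# counterparty's messages stay frozen. Model alias is an assumption.
import json
import re
from pathlib import Path

import anthropic

client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment
STRONG = "claude-opus-4-5"       # assumed alias for Claude Opus 4.5; verify it

def last_price(text: str) -> float | None:
    """Pull the final dollar figure out of a message. Crude on purpose."""
    hits = re.findall(r"\$\s?([\d,]+(?:\.\d{1,2})?)", text)
    return float(hits[-1].replace(",", "")) if hits else None

deltas = []
lines = Path("transcripts/customer-renewal.jsonl").read_text().splitlines()
for line in lines[:10]:                  # ten transcripts, per the step above
    run = json.loads(line)
    # Assumes the transcript ends with the agent's own final message, so the
    # remaining context ends on a counterparty (user) turn.
    context, final = run["messages"][:-1], run["messages"][-1]
    reply = client.messages.create(model=STRONG, max_tokens=512, messages=context)
    strong = last_price(reply.content[0].text)
    weak = last_price(final["content"])
    if strong is not None and weak is not None:
        deltas.append(strong - weak)     # positive: the strong model asked for more

if deltas:
    print(f"{len(deltas)} runs compared, mean delta ${sum(deltas)/len(deltas):+.2f}")
```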
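For step four's routing option, the decision fits in one function: send a run to the strong model whenever the expected leak beats the inference premium. The leak rate, the premium, and the model aliases below are all assumptions to calibrate against your own replay deltas.

```python
# Value-based model routing: pay for the strong model only where the
# expected Invisible Tier Tax exceeds the extra inference cost.
# Leak rate, premium, and model aliases are all assumptions to calibrate.
STRONG, CHEAP = "claude-opus-4-5", "claude-haiku-4-5"

def pick_model(transaction_value: float,
               leak_rate: float = 0.00125,          # 1/8 of a percent, the napkin figure
               strong_premium: float = 0.50) -> str:  # assumed extra $/run of inference
    """Route to the strong model when expected leakage outweighs its cost."""
    return STRONG if transaction_value * leak_rate > strong_premium else CHEAP

assert pick_model(24_000) == STRONG  # $30 expected leak dwarfs a 50-cent premium
assert pick_model(300) == CHEAP      # low-stakes run: the cheap model is fine
```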
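And step five is the harness on a leash. Wrap the delta comparison in an assertion and call it from a pytest test whenever the production model changes. The $5 tolerance below is an arbitrary starting point, not a standard.

```python
# CI guardrail: fail the build when the mean replay delta drifts past a
# tolerance you choose. Call from a pytest test over the last 50 transcripts.
def assert_no_tier_tax(deltas: list[float], tolerance: float = 5.00) -> None:
    mean_delta = sum(deltas) / len(deltas)
    assert mean_delta < tolerance, (
        f"production model leaves ${mean_delta:.2f}/deal on the table "
        f"versus the strong-model replay"
    )

assert_no_tier_tax([0.00, 1.25, -0.40])   # passes: gap within tolerance
# assert_no_tier_tax([12.0, 9.5, 14.2])   # fails: you are paying the tax
```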
The most valuable trade in Project Deal is the one that did not happen.
Somewhere, there is a person who was represented by Haiku. They closed their deals. The deals felt fair. They walked away with their snowboard and a vague sense that the agent had done a fine job. They never asked the parallel question. They never knew.

Right now, today, you are running agents that are doing the same thing inside your company. The parallel has not been run. The asymmetry has not been measured. The losses are showing up as small variances in your renewal rates and contract margins, and nobody is connecting them to model choice because nobody has thought to. The first builder who takes this seriously, ships the audit layer, and tells their customers what the parallel reality looks like will be running the most defensible product in the agent-commerce stack within eighteen months. That is the trade Anthropic just put on the table.