That is faster than most startups ship a mobile app. They call it Jalapeño, and Broadcom's CEO Hock Tan says it performs on par with Nvidia's Blackwell and Google's TPUs for inference. Early tests reportedly show roughly 50% better cost efficiency versus standard AI GPUs. Here is why that number should change how you think about deploying LLMs at scale.
OpenAI and Broadcom unveiled Jalapeño on June 24, 2026, in San Francisco. It is OpenAI's first custom silicon, and it is built for one job only: running large language models in production. Not training. Inference. That distinction is the whole story, and most builders are missing it.
The Inference Tax Principle
Here is the framework: every token you serve carries an Inference Tax.
[Figure: Four numbers behind the shift to purpose-built inference silicon.]
That tax is the cost of running a model after it is already trained. Training is a one-time bill. Inference is the meter that never stops. Every chat reply, every API call, every agent loop adds to it.
For years, that meter ran on general-purpose GPUs. Nvidia's H100 and Blackwell chips handle both training and inference. That flexibility is great for a research lab. It is expensive for a company serving billions of queries a day.
The Inference Tax Principle is simple. The phase that repeats forever should run on hardware built for that phase. You do not buy a Ferrari to deliver pizzas. Jalapeño is OpenAI's purpose-built delivery truck for tokens.
The numbers make the case. If Jalapeño cuts cost-per-query by ~50%, OpenAI halves the meter on its biggest recurring expense. That is not a feature. It is a structural shift in unit economics.
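To make the meter concrete, here is a minimal sketch of the arithmetic. Every dollar figure and volume below is a hypothetical placeholder, not an OpenAI number; only the ~50% cut comes from the claim discussed above.

```python
def inference_cost(cost_per_query, queries_per_day, days):
    """Recurring bill: the meter runs on every query, every day."""
    return cost_per_query * queries_per_day * days


def total_cost(training_cost, cost_per_query, queries_per_day, days):
    """One-time training bill plus the recurring inference bill."""
    return training_cost + inference_cost(cost_per_query, queries_per_day, days)


# Hypothetical inputs, chosen only to keep the arithmetic easy to follow.
TRAINING_COST = 100_000_000        # dollars, paid once
COST_PER_QUERY = 0.002             # dollars per query on general-purpose GPUs
QUERIES_PER_DAY = 1_000_000_000    # one billion queries a day
DAYS = 365

baseline = inference_cost(COST_PER_QUERY, QUERIES_PER_DAY, DAYS)
halved = inference_cost(COST_PER_QUERY * 0.5, QUERIES_PER_DAY, DAYS)

print(f"Training (one-time):       ${TRAINING_COST:,.0f}")
print(f"Inference, baseline/year:  ${baseline:,.0f}")
print(f"Inference, ~50% cut/year:  ${halved:,.0f}")
print(f"Saved per year:            ${baseline - halved:,.0f}")
```

With these placeholder inputs the recurring bill is several times the training bill within a year, which is why a percentage cut on the meter matters more than a one-off saving.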
Why a Specialized Chip Wins the Utilization Game
Think of a data center as a system with three bottlenecks: compute, memory, and networking. The chip that wins is not the one with the most raw power. It is the one where all three resources stay busy at the same time.
This is the real problem with general GPUs running LLM inference. They often run well below their theoretical peak. Memory bottlenecks, small batches, and network overhead leave expensive silicon sitting idle. You pay for 100% of the chip and use a fraction of it.
OpenAI built Jalapeño to attack exactly this gap. According to OpenAI's June 24 statement, the chip reduces data movement and balances compute, memory, and networking to push realized utilization closer to theoretical peak. In systems terms, they removed the slack from the pipeline.
This is the most underrated part of the announcement. Performance-per-watt headlines grab attention. Utilization is where the real money hides. A chip that sustains 90% of its peak throughput beats a nominally faster chip stuck at 50%, unless the faster chip's peak is more than 1.8 times higher.
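A small sketch of the utilization arithmetic, assuming two hypothetical chips that cost the same per hour. The peak throughputs, utilization rates, and hourly price are invented for illustration and describe no real hardware.

```python
def effective_throughput(peak_tokens_per_s, utilization):
    """Tokens per second actually delivered, given realized utilization (0..1)."""
    return peak_tokens_per_s * utilization


def cost_per_million_tokens(hourly_cost, peak_tokens_per_s, utilization):
    """Dollars per one million tokens served at the realized utilization."""
    tokens_per_hour = effective_throughput(peak_tokens_per_s, utilization) * 3600
    return hourly_cost / tokens_per_hour * 1_000_000


HOURLY_COST = 4.0  # hypothetical dollars per chip-hour, same for both chips

# Hypothetical specialized chip: lower peak, high utilization.
specialized = cost_per_million_tokens(HOURLY_COST, 10_000, 0.90)
# Hypothetical general GPU: 50% higher peak, but half of it sits idle.
general = cost_per_million_tokens(HOURLY_COST, 15_000, 0.50)

print(f"Specialized: ${specialized:.4f} per million tokens")
print(f"General:     ${general:.4f} per million tokens")
```

The crossover is simple: at 50% utilization, the general chip needs a peak more than 0.90 / 0.50 = 1.8 times higher before it serves tokens more cheaply than the specialized one at 90%.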
Here is the systems lens. When you co-design the model, the kernels, the serving software, and the silicon together, you cut the translation losses between layers. Generic stacks like CUDA carry overhead because they have to serve everyone. A closed loop serves one workload perfectly.
OpenAI has shown the loop working in the lab. Engineering samples are already running GPT-5.3-Codex-Spark at production target frequency and power. That means the chip is not a slide deck. It is running a real model today, even if only on lab workloads.
Now the honest part. OpenAI has published zero benchmark numbers. No FLOPS, no TOPS, no MLPerf results. The ~50% figure comes from Broadcom's CEO, not an independent test.
It is unclear whether that advantage survives full deployment at gigawatt scale. Yield problems, driver bugs, and cooling at scale can erase lab gains fast. The data is mixed on how cleanly early marketing numbers translate into sustained system cost wins. My read is that the direction is right even if the exact percentage moves.
There is also a trade-off. OpenAI is swapping dependence on Nvidia for dependence on Broadcom and TSMC. Celestica will contribute board, rack, and system integration expertise, and Reuters reports those systems serve OpenAI exclusively. You cannot buy a Jalapeño chip. The specialization that makes it cheap also makes it locked.
2031
Pull back five years. The pattern here is not really about one chip. It is about who controls the inference layer.
OpenAI is following a path Amazon and Google walked first. AWS built Inferentia and markets up to 40% lower inference cost on certain workloads. Google built TPUs and runs Search and Gemini on them. Microsoft built Maia. Now OpenAI joins the club with a stated target of 10 gigawatts of custom-chip compute by 2029.
The strategic shift is this. The cheapest, most reliable inference capacity is moving behind proprietary APIs. It is leaving the public cloud's generic GPU menu. That is a counterpositioning move you cannot easily copy by renting hardware.
Here is the asymmetric bet for builders. If you build on top of OpenAI's API, you inherit Jalapeño's economics without owning a single chip. Lower cost-per-token could turn always-on agentic workloads that were uneconomical in 2025 into profitable ones. The downside is deeper lock-in to one vendor's runtime.
If you self-host, the calculus flips. You cannot buy a Jalapeño chip, so your real choice set stays Nvidia, AMD, TPUs, and Trainium. You inherit none of the savings and pay the full market price for GPUs.
By 2031, I expect the AI infrastructure map to split in two. Vertically integrated giants serving cheap tokens behind APIs. And a commodity GPU market for everyone running their own models. The gap between those two cost curves is the thing every builder needs to watch. Where you sit on that map decides your margins for the next decade.
What to Build This Weekend
You cannot buy a Jalapeño chip. But you can build the discipline that makes its economics work for you. The goal this weekend is to measure your own Inference Tax.
First, instrument your token usage. If you use the OpenAI API, log every call with input tokens, output tokens, and the model used. A simple spreadsheet works. You need the raw data before you can cut anything.
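A minimal sketch of that logging step, using only the standard library and a CSV file that opens directly in a spreadsheet. The function names, file layout, and column names are assumptions made for illustration. The token counts are whatever your API client reports in its usage metadata for each call.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

# Column layout is an assumption; add fields (feature, user, latency) as needed.
LOG_FIELDS = ["timestamp", "model", "input_tokens", "output_tokens"]


def log_call(log_path, model, input_tokens, output_tokens):
    """Append one API call to a CSV log, writing the header on first use."""
    path = Path(log_path)
    is_new = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=LOG_FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "model": model,
            "input_tokens": int(input_tokens),
            "output_tokens": int(output_tokens),
        })


def read_log(log_path):
    """Load the log back as a list of dicts with integer token counts."""
    with open(log_path, newline="") as f:
        rows = list(csv.DictReader(f))
    for row in rows:
        row["input_tokens"] = int(row["input_tokens"])
        row["output_tokens"] = int(row["output_tokens"])
    return rows
```

Call `log_call("usage_log.csv", model_name, input_count, output_count)` right after each API response returns, so no call escapes the log.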
Second, calculate your cost-per-outcome, not cost-per-token. Take your total monthly API spend and divide by the number of real outcomes your product delivers. That might be support tickets resolved or documents summarized. This number is the one that actually matters.
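As a sketch, the division looks like this. The spend and outcome counts are hypothetical, and what counts as an "outcome" is something you define for your own product.

```python
def cost_per_outcome(total_monthly_spend, outcomes_delivered):
    """Monthly API spend divided by the real outcomes the product delivered."""
    if outcomes_delivered <= 0:
        raise ValueError("need at least one delivered outcome to compute a cost")
    return total_monthly_spend / outcomes_delivered


# Hypothetical month: two features with the same API spend.
support_bot = cost_per_outcome(1200.0, 4000)   # 4,000 tickets resolved
summarizer = cost_per_outcome(1200.0, 600)     # 600 documents summarized

print(f"Support bot: ${support_bot:.2f} per ticket resolved")
print(f"Summarizer:  ${summarizer:.2f} per document summarized")
```

Both features could show an identical cost-per-token and still differ several times over in cost-per-outcome, which is why this is the number worth tracking.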
Third, find your waste. Look for calls using a large model where a smaller one would do. Look for verbose prompts you can trim. An ounce saved in prompt design beats a pound spent on bigger infrastructure.
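One rough way to hunt for that waste is to scan your call log with a couple of simple rules. The sketch below is an assumption-heavy starting point: the thresholds, the model names, and the idea that a small call signals an easy job are all placeholders to tune against your own traffic.

```python
def flag_waste(calls, large_models, small_call_tokens=500, long_prompt_tokens=4000):
    """Return (call, reasons) pairs for calls that look wasteful.

    calls: iterable of dicts with "model", "input_tokens", "output_tokens".
    large_models: set of model names you consider expensive (your own list).
    """
    flagged = []
    for call in calls:
        total = call["input_tokens"] + call["output_tokens"]
        reasons = []
        # Heuristic 1: an expensive model handling a tiny request.
        if call["model"] in large_models and total <= small_call_tokens:
            reasons.append("large model on a small call")
        # Heuristic 2: a prompt long enough to be worth trimming.
        if call["input_tokens"] >= long_prompt_tokens:
            reasons.append("verbose prompt")
        if reasons:
            flagged.append((call, reasons))
    return flagged


# Hypothetical log rows and model names.
sample_calls = [
    {"model": "frontier-model", "input_tokens": 80, "output_tokens": 40},
    {"model": "frontier-model", "input_tokens": 6000, "output_tokens": 900},
    {"model": "cheap-model", "input_tokens": 200, "output_tokens": 100},
]

for call, reasons in flag_waste(sample_calls, large_models={"frontier-model"}):
    print(call["model"], call["input_tokens"], "->", ", ".join(reasons))
```

Treat the flags as candidates to review by hand, not as verdicts. Some short calls really do need the large model.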
Fourth, build a model-routing layer. Send easy queries to a cheap model and hard ones to the frontier model. Tools like n8n or a simple if-statement in your code can do this. Routing is your personal version of what OpenAI did with specialized hardware: match the job to the right resource.
Expect things to break. Your first router will misroute queries. Test it with real traffic and watch the logs. Fix one rule at a time.
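The "simple if-statement" version of a router can be sketched in a few lines. The model names, keyword list, and length threshold here are hypothetical; they are exactly the rules you would expect to fix one at a time after watching real traffic.

```python
# Hypothetical model identifiers; substitute the ones your provider offers.
CHEAP_MODEL = "cheap-model"
FRONTIER_MODEL = "frontier-model"

# Hypothetical signals that a query is hard. Tune from your own logs.
HARD_KEYWORDS = ("prove", "debug", "refactor", "step by step")


def route(query, max_easy_words=40):
    """Pick a model for a query using two plain rules."""
    text = query.lower()
    # Rule 1: long queries go to the frontier model.
    if len(text.split()) > max_easy_words:
        return FRONTIER_MODEL
    # Rule 2: queries that mention a hard-task keyword go to the frontier model.
    if any(keyword in text for keyword in HARD_KEYWORDS):
        return FRONTIER_MODEL
    # Everything else is treated as easy.
    return CHEAP_MODEL


for q in ("What are your opening hours?", "Please debug this stack trace for me"):
    print(f"{route(q):>15}  <-  {q}")
```

Log each routing decision next to the token counts for the same call. That way a misroute shows up as a cheap-model call the user retried, or a frontier-model call that did trivial work.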
The lesson from Jalapeño is not about silicon. It is about owning your unit economics. You can cut a meaningful chunk of yours this weekend with a spreadsheet and a router. Start there.
OpenAI built a delivery truck for tokens. The honest caveat is that it has published zero benchmarks, so the 50% figure comes from Broadcom's CEO rather than an independent test. The direction is right even if the exact percentage moves. If you build on the API, you inherit Jalapeño's economics without owning a chip, at the cost of deeper lock-in. Either way, the discipline that makes specialized hardware pay off is yours to copy: match each job to the right resource and start with a spreadsheet and a router this weekend.