Koda Intelligence · Deep Dive

The 100x AI scaling claim is a composite, not a benchmark

FP4 quantization stores each weight in 4 bits instead of 16, and the savings compound across memory, throughput, and energy. NVIDIA reports that NVFP4 cuts memory about 3.5x versus FP16, and that Blackwell Ultra delivers up to 50x better energy efficiency per token than Hopper. End-to-end inference reportedly runs up to 2.2x faster than BF16 on a B200. Multiply the honest pieces and you land between 30x and 180x, not a clean 100x.

6 MIN READ · BY THE KODA EDITORIAL TEAM · STRATEGY · AI ECONOMICS

MEMORY CUT 3.5x ↓ (NVIDIA) · B200 INFERENCE 2.2x ↑ (ICLR 2026) · RTX 5090 FP4 4x ↑ (VS FP16) · ENERGY/TOKEN 50x ↑ (BLACKWELL) · PER-TOKEN COST 1.4x ↓ (SPHERON) · 70B WEIGHTS 35 GB ↓ (INT4/AWQ) · FP4 TRAINING 13B (ICML 2025) · COMPOSITE 100x (FORECAST)

Same model. In some cases, similar quality, give or take 1%. The trick is squeezing each weight into 4 bits instead of 16.

Here is the honest version. The 100x figure is not one benchmark. It is a stack of smaller wins multiplied together. Some of those wins are real and measured. Some are forecasts dressed up as facts. I think the truth sits in the middle, and the middle is still a big deal.

The Compression Principle

Here is the framework: cheaper math compounds. When you cut precision, you do not save in one place. You save in three at once. Memory, throughput, and energy all drop together.

FP4 ECONOMICS · JUNE 2025
NVIDIA · SPHERON · ICLR 2026 · ICML 2025

The honest multipliers behind the 100x story.

Memory density vs FP16: 3.5x (NVIDIA · NVFP4 June 2025 blog)
B200 inference vs BF16: 2.2x (MR-GPTQ · ICLR 2026)
Energy efficiency per token: 50x (Blackwell Ultra vs Hopper)
Per-token cost vs H100 FP8: 1.4x (Spheron · B200 FP4 cloud pricing)

Call it The Compression Principle. Lower the bits per number, and the whole system gets lighter at the same time.

FP4 means each weight gets stored in 4-bit floating point instead of 16-bit or 8-bit. A 70B model drops from about 140 GB in BF16 to roughly 35 to 40 GB at 4 bits, per Spheron's published table, which lists the figure for INT4/AWQ, a comparable 4-bit integer format. Same parameter count, a quarter of the storage.
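
The storage arithmetic is easy to check yourself. Here is a minimal sketch, assuming decimal gigabytes and, for the block-scaled variants, one 8-bit scale shared per block (the 8-bit scale width is my assumption; the block sizes of 16 and 32 are the ones this piece cites for NVFP4 and MXFP4). The function names are made up for illustration.

```python
def weight_memory_gb(params: float, bits_per_weight: float) -> float:
    """Weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


def effective_bits(value_bits: int, scale_bits: int, block_size: int) -> float:
    """Bits per weight once a shared per-block scale is amortized over the block."""
    return value_bits + scale_bits / block_size


PARAMS_70B = 70e9

bf16_gb = weight_memory_gb(PARAMS_70B, 16)       # 16-bit baseline
plain_4bit_gb = weight_memory_gb(PARAMS_70B, 4)  # 4 bits, no scale overhead

# Assumption: one 8-bit scale per block of 16 (NVFP4) or 32 (MXFP4) values.
nvfp4_bits = effective_bits(4, 8, 16)
mxfp4_bits = effective_bits(4, 8, 32)

print(f"BF16:        {bf16_gb:.1f} GB")
print(f"4-bit, bare: {plain_4bit_gb:.1f} GB")
print(f"NVFP4-style: {weight_memory_gb(PARAMS_70B, nvfp4_bits):.1f} GB "
      f"({16 / nvfp4_bits:.2f}x smaller than 16-bit)")
print(f"MXFP4-style: {weight_memory_gb(PARAMS_70B, mxfp4_bits):.1f} GB")
```

Under those assumptions the scale overhead is what pushes a bare 35 GB toward the upper end of the quoted range, and 16 bits divided by 4.5 effective bits lands close to the 3.5x memory figure rather than a clean 4x.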

The "microscaling" part is what makes this work. Instead of one scale factor for a whole tensor, you assign a shared scale to each small block of values, 16 per block in NVFP4 and 32 in the MXFP4 standard. Each block zooms in on its own range. That is how you keep accuracy while crushing the bits.
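
To see why per-block scales matter, here is a toy sketch in plain Python. It assumes the 4-bit values use the E2M1 layout (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) and keeps each scale in full precision, whereas real formats store the scale in 8 bits. The weights are synthetic and the helper names are hypothetical. This is not NVIDIA's implementation, just the idea.

```python
import random

# Magnitudes representable by a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_MAX = 6.0


def quantize_block(block):
    """Round one block to the FP4 grid with a single shared scale, then dequantize."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return [0.0] * len(block)
    scale = max_abs / FP4_MAX  # simplification: scale kept in full precision
    out = []
    for x in block:
        scaled = abs(x) / scale
        mag = min(FP4_GRID, key=lambda g: abs(scaled - g))  # nearest grid point
        out.append((mag if x >= 0 else -mag) * scale)
    return out


def quantize(values, block_size):
    """Quantize a flat list of weights block by block."""
    out = []
    for i in range(0, len(values), block_size):
        out.extend(quantize_block(values[i:i + block_size]))
    return out


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


random.seed(0)
# Synthetic weights: mostly small values plus a few large outliers.
weights = [random.gauss(0.0, 0.02) for _ in range(1024)]
for i in range(0, 1024, 256):
    weights[i] = 1.0

per_tensor_err = mean_abs_error(weights, quantize(weights, len(weights)))
per_block_err = mean_abs_error(weights, quantize(weights, 16))

print(f"one scale for the whole tensor: {per_tensor_err:.5f}")
print(f"one scale per block of 16:      {per_block_err:.5f}")
```

With a single scale, the outliers set the range and the small weights collapse toward zero. With a scale per block of 16, only the blocks that contain an outlier pay that price.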

Why the Multipliers Stack and Where They Break

Let me think about this like a system, not a slogan. A system has inputs, bottlenecks, and a real output. The 100x claim treats every input as if it stacks cleanly. It does not. Some stages bottleneck the others.


Start with the inputs that are actually measured. NVFP4 cuts memory about 3.5x versus FP16, per NVIDIA's June 2025 blog. End-to-end inference throughput reportedly runs up to 2.2x faster than BF16 on a B200 for large LLMs, per the MR-GPTQ work at ICLR 2026. On an RTX 5090, FP4 hits up to 4x end-to-end speedups (6x layer-wise) versus FP16 on the same card.

Then there is the energy number, which is the biggest lever. NVIDIA reports up to 50x better throughput per megawatt on Blackwell Ultra versus Hopper for reasoning inference. That gain is not pure FP4. It bundles denser tensor cores, a better memory subsystem, and liquid cooling.

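So multiply the honest pieces: 3x memory density, times 3x throughput per chip, times 10 to 20x generational energy gains. Multiplied straight through, that is 90x to 180x. The gains overlap, though, because moving fewer bytes is part of why throughput rises, so a cautious floor counts memory and throughput once and sits nearer 30x. You land somewhere between 30x and 180x cost-adjusted compute. That range is the real basis for the 100x story. It is a composite, not a single result.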
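
Here is the composite as arithmetic, a small sketch using the rounded multipliers quoted above. A straight product of the three factors gives 90x to 180x. The 30x floor appears only if you treat the memory and throughput gains as one overlapping 3x, which is an assumption of this sketch, not something the cited sources state.

```python
from math import prod

# Rounded multipliers from the text, as (low, high) per factor.
FACTORS = {
    "memory_density": (3.0, 3.0),
    "throughput_per_chip": (3.0, 3.0),
    "generational_energy": (10.0, 20.0),
}


def composite_range(factors):
    """Multiply the low ends together and the high ends together."""
    low = prod(lo for lo, _ in factors.values())
    high = prod(hi for _, hi in factors.values())
    return low, high


naive_low, naive_high = composite_range(FACTORS)

# Assumption: memory and throughput overlap fully, so count them as a single 3x.
overlapping = {k: v for k, v in FACTORS.items() if k != "throughput_per_chip"}
floor_low, _ = composite_range(overlapping)

print(f"straight product: {naive_low:.0f}x to {naive_high:.0f}x")
print(f"overlap-adjusted floor: {floor_low:.0f}x")
```

The spread between the floor and the ceiling is the point: the headline number depends on how much independence you grant each factor.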

Now the bottleneck. A system is only as fast as its slowest stage. Training is that stage. The best peer-reviewed FP4 training work, Wang et al. at ICML 2025, scaled to 13B parameters on 100B tokens. "FP4 All the Way" at NeurIPS 2025 trained a 7B model on 256 Gaudi2 chips up to 200B tokens. Neither touched frontier scale end to end.

There is also a quality crack. Rounding weights to 4 bits introduces small quantization errors. We do not yet know whether those errors stay quiet at trillion-parameter scale or compound badly.

My read on this: FP4 is a genuine 2 to 5x arithmetic win that becomes a 30x-plus system win across hardware generations. The 100x label is the optimistic top of a wide range. Build your plans on the floor, not the ceiling.

2031

Three signals inside the same shift

STACKED WINS
30x

The multipliers compound but do not stack cleanly.

3x memory density times 3x throughput per chip times 10 to 20x generational energy gains is 90x to 180x multiplied straight through, and closer to 30x once overlapping gains are counted only once. That 30x to 180x range of cost-adjusted compute is the real basis for the 100x story. It is a composite, not a single measured result.

TRAINING WALL
13B

Training is the slowest stage.

The best peer-reviewed FP4 training, Wang et al. at ICML 2025, scaled to 13B parameters on 100B tokens. NeurIPS 2025's FP4 All the Way trained a 7B model on 256 Gaudi2 chips. Neither touched frontier scale end to end.

NEW MOAT
2031

The advantage moves from chip count to chip generation.

FP4 gains are native only on Blackwell and AMD MI355, not on H100 or H200. The moat does not disappear. It shifts from who owns the most chips to who owns the newest chips and the tooling to use them right.

Pull back five years. The real shift is not the number. It is who gets to play.

When you can serve a frontier-class model on a quarter of the memory, the asymmetric advantage moves. The barrier to running large models stops being raw VRAM and starts being engineering skill. Small teams get access to capability that used to require a data center.

There is a counterposition here too. The gains are native only on Blackwell and AMD MI355, not on H100 or H200. Anyone on older hardware must migrate to get FP4 at all. So the moat does not disappear. It moves from "who owns the most chips" to "who owns the newest chips and the tooling to use them right."

History rhymes. FP4 in 2026 looks like FP8 did when it first arrived: promising, brittle, and one generation from boring. The teams that learn it early will compound that head start.

But only cash is real, and the cash question is per-token cost. Spheron's data shows B200 FP4 running about 1.4x cheaper per token than H100 FP8 at current cloud prices. That is a clean win, not a revolution. The revolution lives in the forecast, and forecasts are not invoices.
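
Per-token cost is just hourly price divided by tokens served. Here is a minimal sketch of that calculation. The prices and throughputs below are placeholders I picked so the ratio lands near 1.4x; they are not Spheron's actual figures, and the function name is hypothetical.

```python
def cost_per_million_tokens(price_per_gpu_hour: float, tokens_per_second: float) -> float:
    """Dollars per one million generated tokens on a single GPU."""
    tokens_per_hour = tokens_per_second * 3600
    return price_per_gpu_hour / tokens_per_hour * 1_000_000


# Placeholder inputs, NOT measured data: the newer card costs 2x per hour
# but serves 2.8x the tokens per second.
older_fp8 = cost_per_million_tokens(price_per_gpu_hour=2.00, tokens_per_second=1000)
newer_fp4 = cost_per_million_tokens(price_per_gpu_hour=4.00, tokens_per_second=2800)

advantage = older_fp8 / newer_fp4

print(f"older card: ${older_fp8:.3f} per 1M tokens")
print(f"newer card: ${newer_fp4:.3f} per 1M tokens")
print(f"per-token advantage: {advantage:.2f}x")
```

Swap in your own cloud quote and measured throughput. A large speedup shrinks quickly once the newer hardware's higher hourly price is in the denominator.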

What to Build This Weekend

You do not need a Blackwell cluster to learn this. You need a small model and a willingness to break things. Start tiny.

First, grab an open 7B or 8B model and run it at FP16. Note the memory it eats and the tokens per second. That is your baseline.
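
A baseline needs a repeatable measurement. Here is a library-agnostic sketch of a tokens-per-second harness. The `generate` callable and `stub_generate` are stand-ins I made up so the code runs anywhere; wire in your real model call where the stub sits.

```python
import time
from typing import Callable, List


def measure_tokens_per_second(generate: Callable[[str], List[str]],
                              prompts: List[str]) -> float:
    """Run every prompt once and return generated tokens per wall-clock second."""
    total_tokens = 0
    start = time.perf_counter()
    for prompt in prompts:
        total_tokens += len(generate(prompt))
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed if elapsed > 0 else float("inf")


def stub_generate(prompt: str) -> List[str]:
    """Stand-in for a real model call: sleeps briefly, echoes the prompt's words."""
    time.sleep(0.001)
    return prompt.split()


PROMPTS = [
    "Explain why the sky is blue in two sentences.",
    "List three prime numbers greater than fifty.",
    "Summarize the plot of a heist movie in one line.",
]

baseline_tps = measure_tokens_per_second(stub_generate, PROMPTS)
print(f"baseline: {baseline_tps:.1f} tokens/s")
```

Keep the prompt list fixed so the FP16 and 4-bit runs are comparable. For the memory side, if you are on PyTorch with a CUDA card, `torch.cuda.max_memory_allocated()` is one way to read peak usage after a run.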

Then quantize it. Use a post-training quantization pipeline like TensorRT Model Optimizer if you have access to a Blackwell or RTX 50-series card. Quantization just means converting those 16-bit weights down to 4-bit. Run the same prompts again and compare.

Now test for the cracks. Throw hard, multi-step reasoning and math at both versions. This is where FP4 errors hide. If the quantized model fumbles a chain it used to nail, you just found the limit yourself. That is the lesson, not a failure.
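
Hunting for cracks is easier with a tiny regression check: find prompts the baseline gets right and the quantized model gets wrong. This is a sketch under simple assumptions (exact-match answers, models wrapped as plain callables). The two stub models and their answers are invented for illustration only.

```python
from typing import Callable, Dict, List, Tuple


def find_regressions(baseline: Callable[[str], str],
                     quantized: Callable[[str], str],
                     cases: List[Tuple[str, str]]) -> List[str]:
    """Return prompts the baseline answers correctly but the quantized model misses."""
    regressions = []
    for prompt, expected in cases:
        baseline_ok = baseline(prompt).strip() == expected
        quantized_ok = quantized(prompt).strip() == expected
        if baseline_ok and not quantized_ok:
            regressions.append(prompt)
    return regressions


CASES = [
    ("What is 17 * 23?", "391"),
    ("What is 2 + 2?", "4"),
    ("If x + 3 = 10, what is x?", "7"),
]

# Invented stub answers: the "quantized" stub slips on the multi-digit product.
BASELINE_ANSWERS: Dict[str, str] = {p: a for p, a in CASES}
QUANTIZED_ANSWERS: Dict[str, str] = dict(BASELINE_ANSWERS)
QUANTIZED_ANSWERS["What is 17 * 23?"] = "381"


def stub_baseline(prompt: str) -> str:
    return BASELINE_ANSWERS[prompt]


def stub_quantized(prompt: str) -> str:
    return QUANTIZED_ANSWERS[prompt]


regressions = find_regressions(stub_baseline, stub_quantized, CASES)
print("regressed prompts:", regressions)
```

Exact match is a blunt grader. For free-form reasoning you would want to extract the final answer first, but the shape of the test stays the same: same prompts, two models, diff the failures.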

While you wait on downloads, play with the lighter tools from today's digest to keep your reps fresh. Same.dev copies a UI from a URL with high precision, so you can clone a clean dashboard layout. Hairstyle AI previews looks from a photo, and BarBot AI suggests cocktails from what is already in your cabinet. Small builds, fast feedback.

The point is simple. Measure your baseline, compress it, then test where it breaks. Do that this weekend on one tiny model, and you will understand The Compression Principle better than any roadmap slide can teach you.

DOJO · BUILD THIS WEEKEND

Measure your baseline, compress it, then find where it breaks.

  1. Set the baseline. Grab an open 7B or 8B model and run it at FP16. Note the memory it eats and the tokens per second before you touch anything.
  2. Quantize it down. Use a post-training pipeline like TensorRT Model Optimizer on a Blackwell or RTX 50-series card to convert those 16-bit weights to 4-bit, then rerun the same prompts and compare.
  3. Hunt the cracks. Throw hard multi-step reasoning and math at both versions. If the quantized model fumbles a chain it used to nail, you just found the FP4 limit yourself.
THE BOTTOM LINE

The number is a forecast. The shift is real.

FP4 is a genuine 2 to 5x arithmetic win that compounds into a 30x-plus system gain across hardware generations, and the 100x label sits at the optimistic top of that range. Only cash is real, and Spheron's data shows B200 FP4 running about 1.4x cheaper per token than H100 FP8, a clean win rather than a revolution. The revolution lives in the forecast, and forecasts are not invoices. Build on the floor, not the ceiling, and the teams that learn FP4 early will compound that head start.
