Same model. In some cases, similar quality, give or take 1%. The trick is squeezing each weight into 4 bits instead of 16.
Here is the honest version. The 100x figure is not one benchmark. It is a stack of smaller wins multiplied together. Some of those wins are real and measured. Some are forecasts dressed up as facts. I think the truth sits in the middle, and the middle is still a big deal.
The Compression Principle
Here is the framework: cheaper math compounds. When you cut precision, you do not save in one place. You save in three at once. Memory footprint, time per token, and energy per token all drop together.
Figure: The honest multipliers behind the 100x story.
Call it The Compression Principle. Lower the bits per number, and the whole system gets lighter at the same time.
FP4 means each weight gets stored in 4-bit floating point instead of 16-bit or 8-bit. A 70B model drops from about 140 GB in BF16 to roughly 35 to 40 GB at 4 bits per weight. Spheron's published table lists that figure for INT4/AWQ, an integer format with the same bit width. Same parameter count, a quarter of the storage.
The "microscaling" part is what makes this work. Instead of one scale factor for a whole tensor, you assign a shared scale to each small block of values, 16 per block in NVFP4 and 32 in the MXFP4 standard. Each block zooms in on its own range. That is how you keep accuracy while crushing the bits.
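Here is a minimal sketch of why the per-block scale matters. It is a toy in plain Python, not NVIDIA's or the MX standard's actual implementation. The grid is the set of magnitudes a 4-bit E2M1 float can represent. The function names, the toy weights, and keeping the scale in full precision are all my assumptions for illustration.

```python
# Magnitudes representable by a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_MAX = FP4_GRID[-1]


def round_to_fp4(x):
    """Snap x to the nearest signed value on the FP4 grid."""
    mag = min(FP4_GRID, key=lambda g: abs(abs(x) - g))
    return mag if x >= 0 else -mag


def quantize_block(block):
    """Quantize, then dequantize, one block that shares a single scale."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return list(block)
    # Assumption: the scale stays in full precision. Real formats store it in 8 bits.
    scale = amax / FP4_MAX
    return [round_to_fp4(v / scale) * scale for v in block]


def quantize(values, block_size):
    """Split values into blocks of block_size and quantize each one separately."""
    out = []
    for i in range(0, len(values), block_size):
        out.extend(quantize_block(values[i:i + block_size]))
    return out


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Toy tensor: 16 small weights followed by 16 large ones.
weights = [0.01 * (i + 1) for i in range(16)] + [1.0 * (i + 1) for i in range(16)]

per_tensor = quantize(weights, block_size=len(weights))  # one scale for everything
per_block = quantize(weights, block_size=16)             # one scale per 16 values

err_tensor = mean_abs_error(weights, per_tensor)
err_block = mean_abs_error(weights, per_block)
print(f"one scale for the tensor: {err_tensor:.4f}")
print(f"one scale per 16 values:  {err_block:.4f}")
```

With one scale for the whole tensor, the large weights set the range and every small weight rounds to zero. With a scale per block, the small weights get their own range and survive.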
Why the Multipliers Stack and Where They Break
Let me think about this like a system, not a slogan. A system has inputs, bottlenecks, and a real output. The 100x claim treats every input as if it stacks cleanly. It does not. Some stages bottleneck the others.
Start with the inputs that are actually measured. NVFP4 cuts memory about 3.5x versus FP16, per NVIDIA's June 2025 blog. End-to-end inference throughput reportedly runs up to 2.2x faster than BF16 on a B200 for large LLMs, per the MR-GPTQ work at ICLR 2026. On an RTX 5090, FP4 hits up to 4x end-to-end speedups (6x layer-wise) versus FP16 on the same card.
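The 3.5x memory figure falls out of simple arithmetic once you count the shared scales. A rough sketch, assuming an 8-bit scale per block in both formats (my understanding of NVFP4 and MXFP4) and counting weights only. It ignores NVFP4's extra per-tensor scale, the KV cache, and activations. The helper names are mine.

```python
def effective_bits(element_bits, scale_bits, block_size):
    """Bits per weight once the shared block scale is spread over the block."""
    return element_bits + scale_bits / block_size


def footprint_gb(params, bits_per_weight):
    """Weight storage in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


PARAMS_70B = 70e9

formats = {
    "BF16": 16.0,
    "NVFP4": effective_bits(4, 8, 16),  # 4 + 8/16 = 4.5 bits
    "MXFP4": effective_bits(4, 8, 32),  # 4 + 8/32 = 4.25 bits
}

for name, bits in formats.items():
    gb = footprint_gb(PARAMS_70B, bits)
    ratio = 16.0 / bits
    print(f"{name:6s} {bits:5.2f} bits/weight  {gb:6.1f} GB  {ratio:.2f}x vs 16-bit")
```

Under these assumptions a 70B model lands near 39 GB in NVFP4, which is why the saving is about 3.5x rather than a clean 4x.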
Then there is the energy number, which is the biggest lever. NVIDIA reports up to 50x better throughput per megawatt on Blackwell Ultra versus Hopper for reasoning inference. That gain is not pure FP4. It bundles denser tensor cores, a better memory subsystem, and liquid cooling.
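So multiply the honest pieces: 3x memory density, times 3x throughput per chip, times 10 to 20x generational energy gains. Taken at face value, that product is 90x to 180x. But the factors overlap. The throughput-per-megawatt figure likely already includes some of the per-chip speedup. Count one of the 3x factors as overlap and the floor falls to 30x. So you land somewhere between 30x and 180x cost-adjusted compute. That range is the real basis for the 100x story. It is a composite, not a single result.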
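You can check the composite yourself. A small sketch using the three factors from the text. The `skip` option is my own addition to show what happens when one factor is treated as double-counted, which is one reading of where a 30x floor comes from.

```python
from math import prod

# (low, high) estimate for each factor, taken from the text.
factors = {
    "memory density": (3.0, 3.0),
    "throughput per chip": (3.0, 3.0),
    "generational energy gain": (10.0, 20.0),
}


def composite(factors, skip=()):
    """Multiply the low ends and the high ends, leaving out any skipped factor."""
    kept = [rng for name, rng in factors.items() if name not in skip]
    low = prod(lo for lo, _ in kept)
    high = prod(hi for _, hi in kept)
    return low, high


full = composite(factors)
discounted = composite(factors, skip=("throughput per chip",))
print(f"all three factors:       {full[0]:.0f}x to {full[1]:.0f}x")
print(f"one 3x factor discarded: {discounted[0]:.0f}x to {discounted[1]:.0f}x")
```

The spread between those two lines is the point. How much the factors overlap matters more than any single multiplier.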
Now the bottleneck. A system is only as fast as its slowest stage. Training is that stage. The best peer-reviewed FP4 training work, Wang et al. at ICML 2025, scaled to 13B parameters on 100B tokens. "FP4 All the Way" at NeurIPS 2025 trained a 7B model on 256 Gaudi2 chips up to 200B tokens. Neither touched frontier scale end to end.
There is also a quality crack. Rounding to 4 bits introduces small quantization errors in every layer. We do not yet know whether those errors stay quiet at trillion-parameter scale or compound badly.
My read on this: FP4 is a genuine 2 to 5x arithmetic win that becomes a 30x-plus system win across hardware generations. The 100x label is the optimistic top of a wide range. Build your plans on the floor, not the ceiling.
2031
Pull back five years. The real shift is not the number. It is who gets to play.
When you can serve a frontier-class model on a quarter of the memory, the asymmetric advantage moves. The barrier to running large models stops being raw VRAM and starts being engineering skill. Small teams get access to capability that used to require a data center.
There is a counterposition here too. The gains are native only on Blackwell and AMD MI355, not on H100 or H200. Anyone on older hardware must migrate to get FP4 at all. So the moat does not disappear. It moves from "who owns the most chips" to "who owns the newest chips and the tooling to use them right."
History rhymes. FP4 in 2026 looks like FP8 did when it first arrived: promising, brittle, and one generation from boring. The teams that learn it early will compound that head start.
But only cash is real, and the cash question is per-token cost. Spheron's data shows B200 FP4 running about 1.4x cheaper per token than H100 FP8 at current cloud prices. That is a clean win, not a revolution. The revolution lives in the forecast, and forecasts are not invoices.
What to Build This Weekend
You do not need a Blackwell cluster to learn this. You need a small model and a willingness to break things. Start tiny.
First, grab an open 7B or 8B model and run it at FP16. Note the memory it eats and the tokens per second. That is your baseline.
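A baseline is only useful if you measure it the same way each time. Here is a minimal timing harness, assuming you wrap your model in a `generate(prompt)` function that returns the list of generated tokens. That wrapper and the `fake_generate` stand-in are hypothetical. Swap in your own inference stack, and read memory use from its tooling.

```python
import time
from statistics import median


def tokens_per_second(generate, prompts, runs=3):
    """Median generation rate over several runs of each prompt.

    generate(prompt) must return the list of tokens it produced.
    """
    rates = []
    for _ in range(runs):
        for prompt in prompts:
            start = time.perf_counter()
            tokens = generate(prompt)
            elapsed = max(time.perf_counter() - start, 1e-9)  # avoid dividing by zero
            rates.append(len(tokens) / elapsed)
    return median(rates)


def fake_generate(prompt):
    """Stand-in for a real model so the sketch runs anywhere."""
    time.sleep(0.01)
    return prompt.split() * 4


prompts = ["explain block scaling in one line"]
baseline = tokens_per_second(fake_generate, prompts)
print(f"baseline: {baseline:.0f} tokens/s")
```

Keep the prompt list fixed. You will run the exact same harness against the quantized model later.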
Then quantize it. Use a post-training quantization pipeline like TensorRT Model Optimizer if you have access to a Blackwell or RTX 50-series card. Quantization just means converting those 16-bit weights down to 4-bit. Run the same prompts again and compare.
Now test for the cracks. Throw hard, multi-step reasoning and math at both versions. This is where FP4 errors hide. If the quantized model fumbles a chain it used to nail, you just found the limit yourself. That is the lesson, not a failure.
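The crack hunt can be as simple as a diff. A sketch, assuming each model is wrapped in a function that takes a prompt and returns its final answer as a string. The two fake models and the three test cases are placeholders I made up to show the shape, not real results.

```python
def find_regressions(baseline_answer, quantized_answer, cases):
    """Return prompts the baseline got right and the quantized model got wrong.

    cases is a list of (prompt, expected_answer) pairs.
    """
    regressions = []
    for prompt, expected in cases:
        base_ok = baseline_answer(prompt).strip() == expected
        quant_ok = quantized_answer(prompt).strip() == expected
        if base_ok and not quant_ok:
            regressions.append(prompt)
    return regressions


# Hypothetical cases with known answers.
cases = [
    ("17 * 23 = ?", "391"),
    ("(12 + 30) / 6 = ?", "7"),
    ("2 ** 10 - 24 = ?", "1000"),
]
answers = dict(cases)


def fake_baseline(prompt):
    """Stand-in for the FP16 model: always right."""
    return answers[prompt]


def fake_quantized(prompt):
    """Stand-in for the 4-bit model: slips on one multiplication."""
    return "390" if prompt.startswith("17") else answers[prompt]


broken = find_regressions(fake_baseline, fake_quantized, cases)
print(broken)
```

Exact string matching is crude. It works for math with a single correct answer, which is where the text says to look first.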
While you wait on downloads, play with the lighter tools from today's digest to keep your reps fresh. Same.dev copies a UI from a URL with high precision, so you can clone a clean dashboard layout. Hairstyle AI previews looks from a photo, and BarBot AI suggests cocktails from what is already in your cabinet. Small builds, fast feedback.
The point is simple. Measure your baseline, compress it, then test where it breaks. Do that this weekend on one tiny model, and you will understand The Compression Principle better than any roadmap slide can teach you.
The number is a forecast. The shift is real.
FP4 is a genuine 2 to 5x arithmetic win that compounds into a 30x-plus system gain across hardware generations, and the 100x label sits at the optimistic top of that range. Only cash is real, and Spheron's data shows B200 FP4 running about 1.4x cheaper per token than H100 FP8, a clean win rather than a revolution. The revolution lives in the forecast, and forecasts are not invoices. Build on the floor, not the ceiling, and the teams that learn FP4 early will compound that head start.