The 1000x Attention Gap

A 4-person startup in Miami just shipped a model with a 12 million token context window. For context, that is roughly 12 times what Claude Opus or GPT-5.5 can handle today. SubQ 1M-Preview, released May 5, 2026, claims to cut attention compute by up to 1,000 times compared to frontier models at long context lengths. The inference cost? Roughly one-twentieth of Claude Opus for similar coding performance, according to Towards AI.

Those numbers sound impossible. And honestly, they might be. But the architecture underneath them is real, the $29 million in seed funding is real, and the implications for anyone building long-context applications deserve a serious look. Here is what matters, what is hype, and what you should do about it right now.

The Compression Principle

The core insight behind SubQ is simple enough to fit on a napkin. I call it the Compression Principle: the model that processes the most context per dollar wins the long-context market.

CONTEXT ECONOMICS · MAY 2026
Sources: TOWARDS AI · ARXIV · LESSERWRONG · GRAND VIEW RESEARCH

The numbers behind the subquadratic attention bet.

Frontier Index Ceiling (GPT-5.5 · April 23 release): 60.24
MRCR v2 Lab Score (SubQ Research · 8-needle retrieval): 83%
MRCR v2 Production Score (SubQ 1M-Preview · deployed): 65.9%
Opus Pricing Tier (Anthropic · monthly subscription): $20/mo

Standard transformer attention works like a room where every person must shake hands with every other person. Ten people means 45 handshakes. A hundred people means 4,950. A million people means roughly 500 billion. That is quadratic scaling, and it is why sending a million tokens through GPT-5.5 costs a fortune and takes forever.
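The handshake arithmetic is easy to check. This is a minimal sketch of the analogy only: the function name `handshakes` is made up for illustration, and real attention implementations count token pairs somewhat differently (a causal model, for example, scores each token against itself and everything before it), but the growth rate is the same.

```python
def handshakes(n: int) -> int:
    """Number of distinct pairs among n participants: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Reproduce the three room sizes from the analogy.
for people in (10, 100, 1_000_000):
    print(f"{people:>9,} people -> {handshakes(people):,} handshakes")
```

The last line prints 499,999,500,000, which is the "roughly 500 billion" figure.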

Subquadratic Sparse Attention, or SSA, changes the handshake rule. Instead of everyone greeting everyone, each person only talks to the people who matter most to them. The compute drops from O(n²) to something closer to O(n). Linear. Doubling your context roughly doubles your bill instead of quadrupling it.
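To make the doubling claim concrete, here is a small sketch that models attention cost as n raised to an exponent. It is an idealized model with hypothetical function names: it ignores constant factors, memory bandwidth, and everything in the model besides attention, so treat the ratios as a shape, not a price quote.

```python
def relative_cost(n_tokens: int, exponent: float) -> float:
    """Attention cost in arbitrary units, modeled as n ** exponent."""
    return float(n_tokens) ** exponent


def growth_on_doubling(n_tokens: int, exponent: float) -> float:
    """How many times more expensive the same job gets when the input doubles."""
    return relative_cost(2 * n_tokens, exponent) / relative_cost(n_tokens, exponent)


# exponent 2.0 stands in for O(n^2) attention, exponent 1.0 for O(n).
for label, exponent in (("quadratic", 2.0), ("linear", 1.0)):
    print(f"{label}: x{growth_on_doubling(128_000, exponent):.1f} per doubling")
```

Under this model the quadratic case grows by a factor of 4 per doubling and the linear case by a factor of 2, whatever the starting length.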

This is the Compression Principle in one sentence: architectures that compress attention cost per token will unlock applications that quadratic models literally cannot afford to run. Full-repo code analysis. Multi-document legal review. Chat histories that stretch back months instead of hours. The constraint was never intelligence. It was math.

How SSA Actually Works (and Where It Might Break)

Let me walk you through the 20% of this architecture that explains 80% of the results.

"The constraint was never intelligence. It was math." · KODA EDITORIAL ANALYSIS · MAY 2026

SSA does three things differently from standard attention. First, selective attention. For each token, the model scores every other token on relevance and then only computes full attention over a small, high-scoring subset. Think of it like a 500 IQ intern who reads the table of contents before reading the book. It skips the noise and focuses on keywords, entities, and contextual signals.
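A toy version of selective attention for a single query looks like the sketch below. It is not SubQ's implementation, which the source does not describe at this level. The names `topk_attention` and `k` are assumptions for illustration. Note one honest caveat: this toy still scores every key before picking the top k, so it saves the softmax and value mixing but not the scoring pass. A genuinely subquadratic system would need a cheaper way to find candidates.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def topk_attention(query, keys, values, k):
    """Attend only to the k highest-scoring keys for one query vector.

    Returns the mixed output vector and the sorted indices that were kept.
    """
    scale = 1.0 / math.sqrt(len(query))
    scores = [dot(query, key) * scale for key in keys]
    # Keep the k most relevant positions; everything else is skipped.
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    # Numerically stable softmax over the selected scores only.
    peak = max(scores[i] for i in top)
    weights = [math.exp(scores[i] - peak) for i in top]
    total = sum(weights)
    output = [0.0] * len(values[0])
    for weight, i in zip(weights, top):
        for d, component in enumerate(values[i]):
            output[d] += (weight / total) * component
    return output, sorted(top)


query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.9, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [2.0, 0.0]]
output, kept = topk_attention(query, keys, values, k=2)
print(kept, output)
```

With k=2 the query keeps positions 0 and 3, the two keys pointing in its own direction, and ignores the rest.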

Second, local plus global patterns. SSA always attends to nearby tokens (your neighbors in the sequence) and a handful of globally important tokens across the entire input. This is the "don't lose the forest for the trees" mechanism. It keeps both micro-context and macro-context alive.
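The local-plus-global idea can be sketched as a function that lists which positions a given token is allowed to see. The window size and the choice of global positions here are hypothetical parameters, not values from SubQ. The pattern itself is similar in spirit to the sliding-window-plus-global layout used by Longformer.

```python
def attended_positions(i, seq_len, window, global_positions):
    """Positions token i attends to: a local window plus shared global tokens."""
    lo = max(0, i - window)
    hi = min(seq_len - 1, i + window)
    local = set(range(lo, hi + 1))
    # Global tokens are visible from everywhere, no matter how far away.
    shared = {g for g in global_positions if 0 <= g < seq_len}
    return sorted(local | shared)


# Token 500 in a 1,000-token sequence, window of 2, first and last token global.
print(attended_positions(500, 1000, 2, [0, 999]))
```

Each token sees at most `2 * window + 1 + len(global_positions)` positions, a number that does not depend on sequence length. That is why total work grows linearly with the input.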

Third, hierarchical clustering. Similar tokens get grouped into clusters. Attention is computed at the cluster level first, then zoomed into individual tokens only where needed. This is where the real compute savings come from. Instead of evaluating every pair, you evaluate every cluster pair, then drill down. The math drops dramatically.
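Here is one way the coarse-to-fine step could look for a single query. This is a sketch under simplifying assumptions: the "clusters" are just contiguous blocks of tokens (the article describes grouping by similarity, which would need a real clustering step), each block is summarized by the mean of its keys, and the names `two_stage_candidates`, `block_size`, and `top_blocks` are invented for illustration.

```python
def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def two_stage_candidates(query, keys, block_size, top_blocks):
    """Score block centroids first, then expand only the best blocks.

    Returns the token indices worth a fine-grained pass and the number of
    query-time scores needed (coarse pass plus fine pass).
    """
    blocks = [list(range(start, min(start + block_size, len(keys))))
              for start in range(0, len(keys), block_size)]
    block_scores = [dot(query, centroid([keys[i] for i in block])) for block in blocks]
    best = sorted(range(len(blocks)), key=lambda b: block_scores[b], reverse=True)[:top_blocks]
    candidates = sorted(i for b in best for i in blocks[b])
    scores_needed = len(blocks) + len(candidates)
    return candidates, scores_needed


keys = [[1.0, 0.0]] * 4 + [[-1.0, 0.0]] * 4
candidates, scores_needed = two_stage_candidates([1.0, 0.0], keys, block_size=4, top_blocks=1)
print(candidates, scores_needed, "scores instead of", len(keys))
```

In this tiny example the savings are small (6 scores instead of 8), but the gap widens as the sequence grows while `top_blocks` stays fixed. The sketch assumes block centroids are computed once and reused across queries.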

The benchmarks look strong on paper.

But here is where I put on the skeptic hat. The MRCR v2 benchmark tells a more complicated story. SubQ's research model scored 83% on 8-needle retrieval at 1 million tokens. The production model, SubQ 1M-Preview, scored 65.9%. That is a 17-point drop between lab and deployment. Nobody has explained why.
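The size of that drop is worth computing both ways, since percentage points and relative change tell slightly different stories. The two scores are the ones quoted above. The variable names are only for illustration.

```python
lab_score = 83.0         # research model, MRCR v2 8-needle at 1M tokens (%)
production_score = 65.9  # SubQ 1M-Preview on the same benchmark (%)

absolute_drop = lab_score - production_score  # in percentage points
relative_drop = absolute_drop / lab_score     # as a share of the lab score

print(f"{absolute_drop:.1f} points absolute, {relative_drop:.1%} relative")
```

In relative terms, the deployed model gives up about a fifth of the lab model's retrieval score.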

It is unclear whether this gap reflects overfitting to synthetic benchmarks, deployment instability, or something more fundamental. An arXiv paper by Alman and Yu (2410.04271, revised May 2025) proves, under a standard hardness conjecture from fine-grained complexity theory, that no truly subquadratic algorithm can solve certain document similarity tasks that quadratic transformers can. If that conjecture holds, SSA either secretly uses quadratic components in critical layers or underperforms on similarity-heavy workloads like RAG and multi-document QA.

History backs up the skepticism. Longformer, BigBird, Reformer, Mamba, RWKV. Every one of these promised subquadratic sequence modeling, whether through sparse attention or by replacing attention altogether. Every one either underperformed pure transformers on quality benchmarks or reverted to quadratic attention in hybrid layers to stay competitive. A detailed LessWrong analysis from 2025 found that Kimi Linear, which claimed linear scaling, still used quadratic Multi-Head Latent Attention (MLA) in 25% of its layers because the full-linear version suffered unacceptable performance hits.

My read on this: SubQ's architecture is genuinely novel. The SSA approach with hierarchical clustering is more sophisticated than previous attempts. But the 1,000x compute reduction claim is a research number, not a production number. The production reality is probably closer to 5 to 20 times cheaper, which is still significant. Treat the 52x speed claim and the 1,000x efficiency claim as upper bounds, not guarantees. The real test comes when independent researchers get API access and run their own evaluations.

One more thing worth noting. SubQ 1M-Preview uses only 760 million active parameters per token. That is tiny compared to frontier models. The efficiency comes from the architecture, not from brute-force scale. This is a fundamentally different bet on how intelligence should be structured. Simple architecture, massive context. Not massive parameters, limited context.

2031

Three signals inside the same shift

ARCHITECTURE SHIFT · 1,000×

SubQ claims a 1,000x compute reduction at long context lengths.

Subquadratic Sparse Attention replaces O(n²) with near-O(n) scaling through selective attention, local-global patterns, and hierarchical clustering. The production reality is likely 5 to 20x cheaper, but even the conservative end reshapes inference economics for million-token workloads.

LAB-TO-PROD GAP · 17 points

A 17-point MRCR v2 drop between research and production remains unexplained.

SubQ's research model scored 83% on 8-needle retrieval at 1M tokens. The deployed 1M-Preview scored 65.9%. The track record of Longformer, BigBird, and Kimi Linear shows that subquadratic architectures tend to fall back on quadratic components to maintain quality. The gap demands independent verification.

MARKET UNLOCK · 2031

Hybrid architectures will likely dominate frontier models by 2031.

The AI inference market is projected past $50B by 2028. Long-context applications represent under 10% of current workloads, not because demand is low but because cost is prohibitive. Convergence toward quadratic attention for dense tasks and subquadratic layers for everything else is the most probable outcome.

Zoom out five years. The AI inference market was roughly $5 billion in 2025. Grand View Research projects it past $50 billion by 2028. Long-context applications (agents, enterprise search, repository-scale coding) represent less than 10% of current workloads. Not because demand is low. Because the cost is too high.

This is the asymmetric opportunity. If subquadratic architectures deliver even half of what SubQ claims, the cost barrier drops enough to unlock an entirely new category of applications. Full codebase reasoning. Legal discovery across thousands of documents in a single pass. Medical records spanning a patient's entire history. These are not incremental improvements. They are new products that cannot exist under quadratic economics.

The compounding effect matters here. Every reduction in cost-per-token at long context lengths makes new use cases viable. Those use cases generate data. That data trains better models. Better models attract more users. This is the flywheel that turned search from a research project into a trillion-dollar industry.

But the contrarian case deserves equal weight. Big Tech is not standing still. xAI shipped Grok-3 with a 1 million token context in February 2025. Mistral Large 3 launched in April 2026 with competitive long-context performance. FlashAttention-3, RoPE-based context extension, and grouped-query attention keep pushing transformer efficiency and usable context length forward without abandoning the quadratic architecture. The optimization path for transformers is well-funded, well-understood, and backed by mature tooling.

I think the most likely outcome is convergence. By 2031, frontier models will be hybrids. Quadratic attention for tasks that need it (dense similarity, precise retrieval) and subquadratic layers for everything else. SubQ's contribution will not be replacing transformers. It will be proving that linear-scaling attention works well enough in production to force every major lab to adopt it selectively.

The salary-versus-equity framing applies here. Sticking with pure transformer APIs buys you reliability today. Investing time in subquadratic workflows buys you a cost advantage that compounds over the next five years. The developers who prototype million-token pipelines now, even imperfect ones, will have a structural edge when the architecture matures.

What to Build This Weekend

You do not need SubQ API access to start preparing for the long-context future. Here are three concrete things you can build this week.

First, stress-test your current context assumptions. Take your longest workflow, whether that is a RAG pipeline, a code analysis chain, or a document summarizer, and measure exactly where it breaks. How many tokens before quality degrades? How much does cost increase per doubling of input? Write these numbers down. You need a baseline before you can evaluate any new architecture.
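A baseline like this can live in a few lines of Python. The sketch below assumes you record three things per run yourself: token count, cost, and a quality score on whatever 0 to 1 scale you trust. The `Run` class, both function names, and the sample numbers are placeholders, not measurements.

```python
from dataclasses import dataclass


@dataclass
class Run:
    tokens: int      # input size for this run
    cost_usd: float  # what the run cost you
    quality: float   # your own 0-1 score for the output


def cost_growth_per_doubling(runs):
    """Cost ratio between consecutive runs, assuming each run doubles the input."""
    ordered = sorted(runs, key=lambda r: r.tokens)
    return [later.cost_usd / earlier.cost_usd for earlier, later in zip(ordered, ordered[1:])]


def degradation_point(runs, min_quality):
    """Smallest input size at which quality falls below the threshold, or None."""
    for run in sorted(runs, key=lambda r: r.tokens):
        if run.quality < min_quality:
            return run.tokens
    return None


# Made-up placeholder runs; replace with your own measurements.
baseline = [Run(100_000, 0.1, 0.95), Run(200_000, 0.2, 0.9), Run(400_000, 0.8, 0.6)]
print(cost_growth_per_doubling(baseline), degradation_point(baseline, min_quality=0.8))
```

A growth factor near 2 per doubling suggests you are paying roughly linearly. A factor near 4 suggests quadratic cost is showing through.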

Second, set up a model comparison pipeline using Serno. Serno lets you send the same prompt to Claude, GPT, and Gemini simultaneously and compare outputs side by side. Build a test suite of 5 long-context prompts (100K plus tokens each) and run them through all three models. Document where each model starts hallucinating, losing track of instructions, or returning incomplete answers. This gives you a decision framework for when SubQ opens its API waitlist.
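If you want the same comparison as a script, the bookkeeping is simple. This sketch deliberately assumes nothing about Serno's interface or any vendor SDK: each model is a hypothetical callable that takes a prompt and returns text, and `check` is whatever pass/fail rule you define for an answer. The stub models below are stand-ins so the code runs on its own.

```python
from typing import Callable, Dict, List


def compare_models(models: Dict[str, Callable[[str], str]],
                   prompts: List[str],
                   check: Callable[[str, str], bool]) -> Dict[str, List[bool]]:
    """Run every prompt through every model and record pass/fail per prompt."""
    return {name: [check(prompt, ask(prompt)) for prompt in prompts]
            for name, ask in models.items()}


def pass_rates(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """Share of prompts each model passed."""
    return {name: sum(flags) / len(flags) for name, flags in results.items()}


# Stub models for illustration; swap in real API calls of your choice.
stub_models = {
    "model_a": lambda prompt: prompt.upper(),  # always answers
    "model_b": lambda prompt: "",              # always returns nothing
}
results = compare_models(stub_models, ["first prompt", "second prompt"],
                         check=lambda prompt, answer: len(answer) > 0)
print(pass_rates(results))
```

The interesting work is in `check`: a rule that catches dropped instructions or truncated answers will tell you more than one that only tests for non-empty output.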

Third, wire up an n8n workflow that monitors SubQ's waitlist status and benchmarks. The April 2026 AI nodes release for n8n lets you embed LLM calls directly into automation pipelines. Build a simple flow: check SubQ's API page daily, parse any new benchmark results, and send yourself a summary via Spokenly's voice transcription so you can review updates hands-free during your commute.

The SubQ waitlist is filling fast as of May 14, 2026. Sign up now even if you are skeptical. Access costs nothing. The worst case is you get early data on whether the Compression Principle holds in production. The best case is you are building on a 12 million token context window while your competitors are still chunking documents into 128K pieces.

Things will break. Benchmarks will not match production. Your first long-context pipeline will probably return garbage. That is normal. The goal this weekend is not perfection. It is getting your reps in so you are ready when the architecture catches up to the ambition.

DOJO · BUILD THIS WEEKEND

Prototype your long-context baseline before the architecture matures.

1. Stress-test your current context ceiling. Take your longest RAG pipeline or code analysis chain and measure exactly where quality degrades. Record token count, cost per doubling, and hallucination onset. You need this baseline before evaluating any subquadratic alternative.
2. Build a multi-model comparison suite in Serno. Send 5 identical long-context prompts (100K+ tokens each) to Claude, GPT, and Gemini simultaneously. Document where each model loses track of instructions or returns incomplete answers. This becomes your decision framework when SubQ opens API access.
3. Wire an n8n monitoring workflow for SubQ benchmarks. The April 2026 AI nodes release lets you embed LLM calls directly into automation pipelines. Build a flow that checks SubQ's API waitlist status and logs independent benchmark results as they appear, so you are first in line when production access ships.

THE BOTTOM LINE

The constraint was never intelligence. It was math and money.

SubQ 1M-Preview is not a finished product. The 17-point lab-to-production gap, the unresolved theoretical limits from complexity theory, and the graveyard of prior subquadratic attempts all warrant real skepticism. But the architecture is genuinely novel, the cost advantage is directionally massive, and the developers who build long-context baselines now will have a structural edge when hybrid models become the default. Bet on convergence. Prototype on the frontier. Measure everything.
