12 Jun 2026

TL;DR

  • Within roughly three years, "context length" stops being a meaningful headline number. Frontier model memory becomes three-tiered, like the registers/RAM/disk hierarchy in computing.
  • The tiers: a small exact-attention window, a large learned-compressed working memory, and an unbounded indexed corpus with retrieval built into the model instead of bolted on as RAG.
  • A bigger advertised window doesn't mean the model reasons over it well. Judge models by multi-hop reasoning across large corpora, not by token count.
  • If you're building on AI: keep chunking logic thin and replaceable, invest in clean well-permissioned data, and watch multi-hop benchmarks.

The Headline Number That Won't Matter

Every few months, a lab announces a bigger context window. One million tokens. Ten million. Twelve million. The numbers make for great launch posts, but they hide a more interesting story: the flat context window, one undifferentiated sequence of tokens the model attends over, is an architectural dead end at scale.

Here's my prediction: within roughly three years, "context length" stops being a meaningful headline number, because frontier models stop having a single context at all. Instead, model memory becomes three-tiered, much like the memory hierarchy that has organized computing for fifty years (registers, RAM, disk):

  1. An inner tier: a relatively small window of exact, dense attention. Full fidelity, every token visible to every other token.
  2. A middle tier: a large, learned-compressed representation of recent context. Millions of tokens stored at reduced resolution, but still differentiable and natively understood by the model.
  3. An outer tier: an effectively unbounded indexed corpus, with retrieval fused directly into the model rather than bolted on as a separate RAG pipeline.

A larger advertised context window also doesn't guarantee that the model handles that much context well. Performance at the advertised maximum can fall well short of what the same model delivers when the input fits within its native context window.

That's why the interesting benchmark stops being window size and becomes multi-hop reasoning quality across tiers: can the model chase a fact from the outer tier into working memory, combine it with three others, and reason precisely over the result?

The reason this is inevitable is physics and economics. Dense attention compares every token against every other token, which is quadratic compute. And the KV cache (the model's working memory of what it has read) grows linearly until memory bandwidth, not compute, becomes the binding constraint. Both walls hit hard past a million tokens. You don't beat that with bigger GPUs; you beat it with hierarchy.
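To put rough numbers on those two walls, here is a back-of-the-envelope sketch. The model shape (64 layers, 8 key/value heads, head dimension 128, 16-bit values) is a made-up assumption for illustration, not the configuration of any particular model.

```python
def dense_attention_pairs(n_tokens: int) -> int:
    """Token-to-token score computations in one dense attention head (no masking)."""
    return n_tokens * n_tokens


def kv_cache_bytes(n_tokens: int, n_layers: int = 64, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """KV cache size for one sequence. All default shape values are hypothetical."""
    # Factor of 2: every layer stores one key and one value vector per KV head.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * n_tokens


for n in (10_000, 100_000, 1_000_000):
    gib = kv_cache_bytes(n) / 2**30
    print(f"{n:>9,} tokens: {dense_attention_pairs(n):.1e} score pairs per head, "
          f"{gib:,.1f} GiB of KV cache")
```

Under these assumptions, every 10× increase in length means 100× more score computations and 10× more cache. One token costs 256 KiB of cache, so a single million-token sequence needs roughly 244 GiB before any compression.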

Let's look at the candidate technologies for each tier.

Tier 1: The Exact Window. Making Dense Attention Cheap Enough to Keep

The inner tier is the easiest to predict, because the work is already well underway. The goal here isn't to make the exact window enormous. It's to make a few hundred thousand tokens of genuinely precise attention cheap enough to serve at scale.

Trained sparse attention becomes the default, not a retrofit. Most token-to-token comparisons in dense attention are wasted compute. Content-dependent sparse attention, where the model learns which positions matter rather than using fixed patterns, has moved from research curiosity (DeepSeek's NSA, Moonshot's MoBA) to commercial claims of 50×+ speedups at a million tokens. The key shift coming: sparsity learned during pretraining rather than bolted on afterward. A model that grows up sparse routes information fundamentally differently than one that had sparsity grafted on, and it pays far less of a quality tax for it.
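As a rough illustration of content-dependent sparsity, the sketch below scores blocks of keys by their mean, keeps the best few blocks, and runs exact attention only inside them. It is a toy under stated assumptions: the function name and the mean-key scoring rule are my own choices, nothing here is learned, and it is not the NSA or MoBA implementation.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def block_sparse_attention(query, keys, values, block_size, top_k_blocks):
    """Exact softmax attention restricted to the top_k_blocks best-scoring key blocks."""
    n = len(keys)
    blocks = [range(s, min(s + block_size, n)) for s in range(0, n, block_size)]

    # Cheap routing step: summarize each block by its mean key and score it.
    def mean_key(block):
        return [sum(keys[i][d] for i in block) / len(block) for d in range(len(query))]

    block_scores = [dot(query, mean_key(b)) for b in blocks]
    chosen = sorted(range(len(blocks)), key=lambda j: block_scores[j],
                    reverse=True)[:top_k_blocks]
    token_ids = [i for j in sorted(chosen) for i in blocks[j]]

    # Expensive step: exact attention, but only over the selected tokens.
    scale = 1.0 / math.sqrt(len(query))
    scores = [dot(query, keys[i]) * scale for i in token_ids]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    output = [sum((e / total) * values[i][d] for e, i in zip(exps, token_ids))
              for d in range(len(values[0]))]
    return output, token_ids


keys = [[0, 1], [0, 1], [1, 0], [1, 0], [-1, 0], [-1, 0], [0, -1], [0, -1]]
values = [[float(i)] for i in range(len(keys))]
output, attended = block_sparse_attention([1, 0], keys, values,
                                          block_size=2, top_k_blocks=1)
print(attended, output)
```

With one block selected out of four, the exact-attention step touches a quarter of the tokens. Setting top_k_blocks to the number of blocks recovers ordinary dense attention.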

KV-cache compression compounds the win. Because the binding constraint at long context is memory bandwidth, expect aggressive stacking of three techniques: latent-space caches that store a compressed projection instead of full keys and values; learned eviction policies that drop tokens the model decides it won't need again; and 2- to 4-bit cache quantization. Individually these are incremental; compounded, 10-50× effective cache reduction is a reasonable expectation, which means today's "long context" becomes tomorrow's cheap inner tier.
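The compounding is simple multiplication. The sketch below uses illustrative numbers I picked for the example (a 4× latent projection, 16-bit to 4-bit quantization, and eviction that keeps half the tokens), and it assumes the three techniques stack independently, which real systems will only approximate.

```python
def compounded_cache_reduction(full_values_per_token: int,
                               latent_values_per_token: int,
                               full_bits: int,
                               quantized_bits: int,
                               keep_fraction: float) -> float:
    """Overall KV cache shrink factor if the three techniques multiply cleanly."""
    latent_factor = full_values_per_token / latent_values_per_token
    quantization_factor = full_bits / quantized_bits
    eviction_factor = 1.0 / keep_fraction
    return latent_factor * quantization_factor * eviction_factor


# Hypothetical settings: 2048 stored values per token per layer shrunk to a
# 512-value latent, 16-bit values quantized to 4 bits, half the tokens evicted.
factor = compounded_cache_reduction(2048, 512, 16, 4, keep_fraction=0.5)
print(f"{factor:.0f}x smaller cache")
```

Each step alone gives 2× to 4×; together they give 32× in this example, inside the 10-50× range above.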

What won't work: hoping hardware alone rescues dense quadratic attention. The quadratic term always wins eventually. The inner tier stays bounded by design. That's what makes it a tier.

Tier 2: Compressed Working Memory. The Genuinely New Layer

The middle tier is where I expect the real architectural novelty, because it barely exists today. Right now the gap between "in the window" and "outside the window" is a cliff. The middle tier turns that cliff into a slope: millions of tokens held at reduced resolution, still inside the model's native representation space.

Several pathways are competing to build it:

Hybrid architectures: linear layers carry the gist, sparse layers do exact recall. State-space models (the Mamba family) and linear attention process sequences in constant memory by compressing everything they read into a fixed-size running state. Pure versions of these architectures consistently fail at exact recall: ask for a verbatim detail from 2M tokens ago and the compressed state has smeared it away. But the hybrid recipe, mostly-linear layers with a minority of full-attention layers interleaved, keeps showing up in the research literature because it matches the actual structure of the problem: most tokens need compression, a few need exact lookup. I expect this mix to become the standard backbone.
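A minimal sketch of why a fixed-size running state smears exact detail. This is unnormalized linear attention with no feature map, which is a simplification of what the Mamba family and real linear-attention layers do; the class and method names are hypothetical.

```python
class RunningState:
    """Fixed-size memory: a d_k x d_v matrix, no matter how many tokens are written."""

    def __init__(self, d_k: int, d_v: int):
        self.matrix = [[0.0] * d_v for _ in range(d_k)]

    def write(self, key, value):
        # Add the outer product key * value^T into the state.
        for i, k_i in enumerate(key):
            for j, v_j in enumerate(value):
                self.matrix[i][j] += k_i * v_j

    def read(self, query):
        # Project the query through the state: query^T @ matrix.
        d_v = len(self.matrix[0])
        return [sum(q_i * row[j] for q_i, row in zip(query, self.matrix))
                for j in range(d_v)]


state = RunningState(d_k=2, d_v=1)
state.write([1, 0], [10.0])
state.write([0, 1], [20.0])
print(state.read([1, 0]))   # two orthogonal keys: the first value comes back intact

state.write([1, 0], [30.0])
print(state.read([1, 0]))   # a third token reuses the key direction and blends in
```

With a 2-dimensional key space there is room for two cleanly separable memories. The third write collides with the first, and the read returns their sum rather than either original value. Interleaved full-attention layers exist to cover exactly that case.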

Learned hierarchical memory: summaries as first-class latent objects. Today, when an agent's context fills up, it writes a text summary of its own history. That's a crude, lossy, manual hack. The natural evolution is to make this native: the model builds multi-resolution representations of what it has read (sentence-level, section-level, document-level latents) and attends over the hierarchy, descending to raw tokens only when a question demands precision. This is the difference between a 12M-token window and actually reasoning over 12M tokens. Flat attention doesn't give you the latter even when the sequence fits.

Test-time memory consolidation: the model rewrites its own memory mid-task. Agentic workloads make this nearly inevitable. An agent running for days cannot keep raw history; it must consolidate, the way biological memory consolidates during sleep. Expect models that periodically compress their own context into denser learned representations as a trained, differentiable operation rather than a prompt-engineering trick. The agent harnesses of today are manually prototyping what the architectures of tomorrow will internalize.

The wildcard: weight-space memory. The most speculative pathway dissolves the context/finetuning distinction entirely. Fast-weight and test-time-training approaches distill what the model reads into temporary low-rank weight updates, so "reading your codebase" becomes a brief, cheap finetune rather than a prompt. This offers effectively unbounded capacity, but imports hard new problems: catastrophic forgetting, isolation between users, and auditability of what the model has absorbed. I'd bet on it appearing first in single-tenant enterprise deployments, where the isolation problem is already solved by the deployment model.

Tier 3: The Unbounded Outer Tier. Retrieval Stops Being a Bolt-On

Today's answer to "more knowledge than fits in context" is RAG: an external embedding model, a vector database, and a pipeline that pastes retrieved chunks into the prompt. It works, but it's a seam. The retriever and the reasoner are separate systems with separate failure modes, and the model can't learn to retrieve better because retrieval happens outside it.

The outer tier prediction: retrieval and attention merge into one mechanism. Squint at sparse attention and at vector search and you see the same operation: find the most relevant items, weight them, combine them. A learned, end-to-end retrieval layer is just extremely sparse attention over an indexed corpus. When that lands, the long-running "RAG vs. long context" debate dissolves: they were always the same thing at different sparsity levels.
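The "same operation" claim can be shown in a few lines. In this sketch a single hypothetical function, topk_attention, behaves like nearest-neighbor retrieval when k is 1 and like dense attention when k covers every item. It is a toy with hand-written vectors, not a trained retrieval layer.

```python
import math


def topk_attention(query, keys, values, k):
    """Find the k most relevant items, weight them with softmax, combine them."""
    scores = [sum(q * x for q, x in zip(query, key)) for key in keys]
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]

    peak = max(scores[i] for i in top)
    exps = [math.exp(scores[i] - peak) for i in top]
    total = sum(exps)
    weights = [e / total for e in exps]

    combined = [sum(w * values[i][d] for w, i in zip(weights, top))
                for d in range(len(values[0]))]
    return combined, top


keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
values = [[1.0], [2.0], [3.0]]
query = [1.0, 0.0]

retrieved, hit = topk_attention(query, keys, values, k=1)          # vector search
blended, seen = topk_attention(query, keys, values, k=len(keys))   # dense attention
print(hit, retrieved)
print(sorted(seen), blended)
```

Only the value of k changes between the two calls. Everything in between is a point on the same sparsity dial.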

Pathways here:

End-to-end trained retrieval layers. The model's own representations index the corpus, and retrieval quality improves with model training rather than depending on a frozen third-party embedding model. Multi-hop retrieval, where finding the second document depends on understanding the first, becomes a learned model behavior instead of orchestration code.

Context as a managed runtime. On the systems side, treat context the way an operating system treats memory: page cold KV cache to CPU and disk with learned prefetching; share immutable prefix caches across users and sessions; "compile" large corpora offline into model-native compressed representations that load like files. Your company's entire document history becomes a preprocessed artifact the model mounts, not a string it re-reads. The economics of long context will be solved as much by this infrastructure layer as by any architecture change.
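A small sketch of the operating-system analogy: a two-tier store that keeps hot cache blocks in a bounded fast tier and pages cold ones out. The class name is hypothetical, payloads are plain strings standing in for KV blocks, and plain least-recently-used eviction stands in for the learned prefetching described above.

```python
from collections import OrderedDict


class TieredBlockStore:
    """Bounded fast tier (think GPU memory) backed by an unbounded slow tier."""

    def __init__(self, fast_capacity: int):
        self.fast_capacity = fast_capacity
        self.fast = OrderedDict()   # block_id -> payload, least recently used first
        self.slow = {}              # paged-out blocks
        self.page_ins = 0

    def put(self, block_id, payload):
        self.slow.pop(block_id, None)
        self.fast[block_id] = payload
        self.fast.move_to_end(block_id)
        self._evict_if_needed()

    def get(self, block_id):
        if block_id in self.fast:
            self.fast.move_to_end(block_id)   # mark as recently used
            return self.fast[block_id]
        payload = self.slow.pop(block_id)     # raises KeyError for unknown blocks
        self.page_ins += 1
        self.fast[block_id] = payload
        self._evict_if_needed()
        return payload

    def _evict_if_needed(self):
        while len(self.fast) > self.fast_capacity:
            cold_id, cold_payload = self.fast.popitem(last=False)
            self.slow[cold_id] = cold_payload


store = TieredBlockStore(fast_capacity=2)
for block_id in ("intro", "contract", "appendix"):
    store.put(block_id, f"kv-for-{block_id}")

store.get("intro")   # cold block: paged back in, evicting the coldest hot block
print(list(store.fast), list(store.slow), store.page_ins)
```

The interesting engineering is in replacing the eviction rule: a policy that predicts which blocks the next reasoning step will need would cut page-ins further than recency alone.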

Persistent, structured memory for agents. The outer tier is also where agent memory lives. Not transcripts of past sessions, but consolidated knowledge distilled from them, indexed and retrievable. The systems that get tier-3-to-tier-1 promotion right (surfacing exactly the right prior knowledge into working memory at exactly the right moment) will define what "an AI that knows your business" actually means.

What This Means If You're Building on AI Today

A forecast like this earns its keep only if it changes decisions. Three practical implications:

Stop architecting around the context limit as a permanent constraint. Systems built today with elaborate chunking and stuffing strategies tuned to current window sizes are building against a wall that's moving. Keep that logic thin and replaceable.
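One way to keep that logic thin is to hide it behind a single small interface, so the strategy can be swapped without touching application code. The sketch below is illustrative only: ContextBuilder, ChunkAndStuff, WholeCorpus, and the call_model callback are all hypothetical names, and the keyword-overlap ranking is a deliberately naive placeholder.

```python
from typing import Callable, Protocol


class ContextBuilder(Protocol):
    def build(self, question: str, documents: list[str]) -> str: ...


class ChunkAndStuff:
    """Today's strategy: rank by keyword overlap, stuff what fits in the budget."""

    def __init__(self, max_chars: int):
        self.max_chars = max_chars

    def build(self, question: str, documents: list[str]) -> str:
        words = set(question.lower().split())
        ranked = sorted(documents,
                        key=lambda d: len(words & set(d.lower().split())),
                        reverse=True)
        picked, used = [], 0
        for doc in ranked:
            if used + len(doc) <= self.max_chars:
                picked.append(doc)
                used += len(doc)
        return "\n".join(picked + [question])


class WholeCorpus:
    """A possible later strategy: hand over everything and let the model sort it out."""

    def build(self, question: str, documents: list[str]) -> str:
        return "\n".join(documents + [question])


def answer(question: str, documents: list[str], builder: ContextBuilder,
           call_model: Callable[[str], str]) -> str:
    # Application code depends only on the interface, never on window sizes.
    return call_model(builder.build(question, documents))


docs = ["refund policy is 30 days", "office hours are 9 to 5"]
echo = lambda prompt: prompt   # stand-in for a real model call
print(answer("what is the refund policy", docs, ChunkAndStuff(max_chars=30), echo))
```

Swapping ChunkAndStuff for WholeCorpus, or for whatever a tiered-memory API eventually looks like, is then a one-line change at the call site.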

Don't bet against retrieval. Bet on its interface changing. The investment that holds value isn't your vector database configuration; it's clean, well-structured, well-permissioned data. Whatever form tier 3 takes, it will reward organizations whose corpus is an asset rather than a swamp.

Watch multi-hop benchmarks, not window sizes. When evaluating models and vendors, ignore the headline token count. Ask how well the system chains multiple facts spread across a large corpus. That's the capability the three-tier architecture exists to deliver, and it's where today's flat-window systems quietly fail.
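When you run such an evaluation yourself, the useful view is accuracy broken down by the number of facts a question chains together. The sketch below shows only that bookkeeping; the sample results are invented placeholders, not measurements of any model.

```python
from collections import defaultdict


def accuracy_by_hops(results):
    """results: iterable of (hops_required, answered_correctly) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for hops, is_correct in results:
        totals[hops] += 1
        correct[hops] += int(is_correct)
    return {hops: correct[hops] / totals[hops] for hops in sorted(totals)}


# Placeholder data for illustration only.
sample = [(1, True), (1, True), (2, True), (2, False), (3, False)]
for hops, accuracy in accuracy_by_hops(sample).items():
    print(f"{hops}-hop questions: {accuracy:.0%} correct")
```

A single aggregate score hides the pattern that matters here: how quickly accuracy drops as the hop count rises.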

The labs shipping million-token windows are solving the right problem with what will look, in hindsight, like a transitional design. The context window isn't getting bigger forever. It's getting layered. And the moment that happens, the question changes from "how much can it read?" to "how well does it remember?"

Thinking through what an AI memory strategy means for your organization's data and systems? Schedule a free consultation with LOJI.
