The model keeps making the same architectural choices: keep context small, push work into sub-agents, resolve only the relevant slice. Those choices aren't stylistic - they fall straight out of how the technology costs money. This page is the short version of why.
01 - The cost
A long context is expensive on every step, not just once. A transformer's attention layer compares every token with every other token - a matrix multiplication that scales with the square of the context length.
Quadratic, not exponential, but the practical pain is similar: double the context and the attention work roughly quadruples. That's the root fact everything else follows from.
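A minimal sketch of that scaling, counting only token-to-token comparisons in one full attention pass. The function name is illustrative, and the count ignores constant factors (layers, heads, head size) and causal masking, which roughly halves the work without changing the shape of the curve.

```python
def attention_pairs(context_len: int) -> int:
    """Token-to-token comparisons in one full self-attention pass (n * n)."""
    return context_len * context_len


baseline = attention_pairs(1_000)
for n in (1_000, 2_000, 4_000):
    # Each doubling of the context multiplies the comparison count by four.
    print(f"{n:>5} tokens -> {attention_pairs(n) // baseline}x the baseline work")
```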
02 - The KV cache
The KV cache trades compute for memory. To avoid recomputing the past on every new token, the model caches the key and value vectors for tokens it has already seen.
That saves compute, but the cache grows with context length and eats a lot of RAM/VRAM - often the real limit on how long a context you can run and how many requests you can serve at once.
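To put a rough number on it, here is a back-of-envelope estimate of cache size. The model dimensions (32 layers, 32 key/value heads, head size 128, 16-bit values) are assumptions chosen for illustration, not any particular model's real configuration, and the function name is hypothetical.

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2, batch_size: int = 1) -> int:
    """Approximate KV cache size: one key and one value vector per token,
    per layer, per key/value head."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size


GIB = 1024 ** 3
for seq_len in (4_096, 32_768, 131_072):
    # Hypothetical model shape; real models vary (many share K/V heads to shrink this).
    size = kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128)
    print(f"{seq_len:>7} tokens -> {size / GIB:.1f} GiB per request")
```

The growth is linear in context length, and it multiplies again by the number of requests held in memory at once.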
A long context costs you twice: quadratic compute on every step, and linear memory you have to hold the whole time.
03 - Stay small
The whole craft is keeping the live context lean. Three methods carry most of the load.
Context compression: Summarising or compacting earlier context so fewer tokens stay live. Claude does this.
Retrieval: Pull in only what's relevant instead of stuffing everything into the window.
Caching: Reuse the work already done on tokens the model has seen before.
This is the technical reason the federated store resolves only the relevant slice - a smaller context is cheaper, faster, and lighter on memory.
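Compaction, the first of the methods above, can be sketched in a few lines. This is an illustration under loud assumptions: `count_tokens` is a crude whitespace stand-in for a real tokenizer, `toy_summarise` stands in for what would normally be a model call, and none of it describes how Claude or any specific product actually compacts context.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words as a rough proxy for tokens.
    return len(text.split())


def compact(history: List[str], budget: int, keep_recent: int,
            summarise: Callable[[List[str]], str]) -> List[str]:
    """If the history exceeds the budget, replace older turns with one summary
    and keep the most recent turns verbatim."""
    total = sum(count_tokens(turn) for turn in history)
    cut = len(history) - keep_recent
    if total <= budget or cut <= 0:
        return history
    older, recent = history[:cut], history[cut:]
    return [summarise(older)] + recent


def toy_summarise(turns: List[str]) -> str:
    # Hypothetical placeholder for a real summarisation step.
    return f"[summary of {len(turns)} earlier turns]"


history = [f"turn {i}: " + "word " * 50 for i in range(10)]
compacted = compact(history, budget=200, keep_recent=2, summarise=toy_summarise)
print(sum(map(count_tokens, history)), "->", sum(map(count_tokens, compacted)))
```

A single pass like this does not guarantee the result fits the budget; a real system would likely repeat or tighten the summary until it does.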
04 - Retrieval-augmented
RAG gives an agent a large knowledge base without paying the quadratic cost of holding it all in the window.
Store the corpus outside the model; at query time, retrieve only the relevant pieces - typically by embedding the query and finding the nearest chunks by vector similarity - then inject those into the context.
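The retrieve-then-inject loop can be sketched as follows. The `embed` function here is a toy bag-of-words counter standing in for a learned embedding model, and the corpus, function names, and prompt layout are all assumptions for illustration; a real system would likely use dense vectors and a vector index.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: word counts as a sparse vector.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: List[str], k: int = 2) -> List[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(query: str, chunks: List[str], k: int = 2) -> str:
    # Only the retrieved chunks enter the context; the rest of the corpus stays outside.
    context = "\n".join(retrieve(query, chunks, k))
    return f"Context:\n{context}\n\nQuestion: {query}"


corpus = [
    "the kv cache stores key and value vectors for earlier tokens",
    "hooks are deterministic triggers that fire on events",
    "sub-agents run in their own context window",
]
top = retrieve("what does the kv cache store", corpus, k=1)
print(build_prompt("what does the kv cache store", corpus, k=1))
```

The context grows with `k`, not with the size of the corpus, which is the whole point.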
05 - Four tools
Development runs as plan, then execute, across four tools, each chosen to manage context, cost, and reliability.
Main thread: The live context window where the agent plans and acts. Everything here is paid for repeatedly, so it should stay lean.
Skills: Packaged capability loaded into the main context for a task, then dropped. Focused expertise without a separate model.
Sub-agents: Run in their own context window and return only a result. This isolates context, cuts cost (you don't pay for their intermediate work in the main thread), and lets each use a cheaper model or run in parallel. This is the computational reason specialist sub-agents save money, not just effort.
Hooks: Deterministic triggers that fire on events, with no model call. Cheap, reliable automation - and a natural place to capture decisions for the feedback loop.
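The sub-agent saving can be made concrete with a toy token ledger. Everything below is hypothetical: `run_subagent` fakes the work rather than calling a model, and token counts use a whitespace approximation. It only shows where the tokens land.

```python
from typing import List, Tuple


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words as a rough proxy for tokens.
    return len(text.split())


def run_subagent(task: str, work_log: List[str]) -> Tuple[str, int]:
    """Hypothetical sub-agent: works in a private context, returns only a result
    plus the number of tokens that private context held."""
    private_context = [task] + work_log  # never copied into the main thread
    result = f"done: {task}"
    return result, sum(count_tokens(t) for t in private_context)


main_context = ["plan: audit three modules"]
subagent_tokens = 0
for module in ("auth", "billing", "search"):
    # Simulated intermediate work: five long file reads per module.
    log = [f"read {module} file {i} " + "line " * 200 for i in range(5)]
    result, used = run_subagent(f"audit {module}", log)
    main_context.append(result)  # only the short result reaches the main thread
    subagent_tokens += used

main_tokens = sum(count_tokens(t) for t in main_context)
print(f"main thread holds {main_tokens} tokens; sub-agents handled {subagent_tokens}")
```

The intermediate reads are still paid for once, inside each sub-agent, but the main thread never carries them forward into later steps.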
Deciding what work lives in the main thread versus a skill, a sub-agent, or a hook is the AI pipeline engineer's and platform architect's job - and that allocation is the paved road.
Get the allocation right and the builder runs fast and cheap; get it wrong and every feature drags a bloated context behind it.