The AI-Native Product Org · A1

How AI development works.

Background for why the model is built around context and sub-agents. Skip it if you already know this.

The model keeps making the same architectural choices: keep context small, push work into sub-agents, resolve only the relevant slice. Those choices aren't stylistic - they follow directly from what the technology costs to run. This page is the short version of why.


01 - The cost

Context is the cost.

A long context is expensive on every step, not just once. A transformer's attention layer compares every token with every other token - a matrix multiplication that scales with the square of the context length.

Quadratic, not exponential, but the practical pain is similar: double the context and the attention work roughly quadruples. That's the root fact everything else follows from.
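To make the doubling rule concrete, here is a minimal sketch that counts token-to-token comparisons in one full attention pass. It is an illustration only: the function names are invented for this page, and the count ignores layers, heads, causal masking, and the parts of a model whose cost grows linearly.

```python
def attention_pairs(context_tokens: int) -> int:
    """Token-to-token comparisons in one full attention pass (n * n)."""
    return context_tokens * context_tokens


def relative_cost(base_tokens: int, new_tokens: int) -> float:
    """How many times more attention work new_tokens needs than base_tokens."""
    return attention_pairs(new_tokens) / attention_pairs(base_tokens)


if __name__ == "__main__":
    # Each doubling of the context multiplies the pair count by four.
    for n in (1_000, 2_000, 4_000, 8_000):
        pairs = attention_pairs(n)
        ratio = relative_cost(1_000, n)
        print(f"{n:>6,} tokens -> {pairs:>12,} pairs ({ratio:.0f}x the 1k baseline)")
```

Going from 1,000 to 8,000 tokens is an 8x longer context but 64x the comparisons, which is the gap the rest of this page is about.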

A long context is expensive on every single step, not just once.

Attention cost grows with the square of context length, and it isn't paid once: every step that re-reads the context pays it again, which is why keeping the live context lean matters so much.

02 - The KV cache

The KV cache moves the cost to memory.

The KV cache trades compute for memory. To avoid recomputing the past on every new token, the model caches the key and value vectors for tokens it has already seen.

That saves compute, but the cache grows with context length and eats a lot of RAM/VRAM - often the real limit on how long a context you can run and how many requests you can serve at once.

A long context costs you twice: attention compute that grows quadratically with its length, and a cache whose memory grows linearly and has to be held for as long as the request is live.
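A rough way to see the memory side is to estimate the cache size directly. The sketch below uses the common back-of-envelope formula (two tensors, key and value, per layer per token). The model shape is hypothetical, chosen only to give round numbers, and real serving stacks add overheads this ignores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical model dimensions, for illustration only."""
    layers: int
    kv_heads: int
    head_dim: int
    bytes_per_value: int = 2  # assumes 16-bit values (fp16 / bf16)


def kv_cache_bytes(shape: ModelShape, context_tokens: int,
                   concurrent_requests: int = 1) -> int:
    """Estimate KV cache size: one key and one value vector per layer per token."""
    per_token = (2 * shape.layers * shape.kv_heads
                 * shape.head_dim * shape.bytes_per_value)
    return per_token * context_tokens * concurrent_requests


if __name__ == "__main__":
    shape = ModelShape(layers=32, kv_heads=8, head_dim=128)
    for tokens in (8_192, 32_768, 131_072):
        gib = kv_cache_bytes(shape, tokens) / 2**30
        print(f"{tokens:>7,} tokens -> {gib:.0f} GiB per request")
```

The growth is linear, so it looks gentler than the compute curve, but it multiplies by the number of requests held at once, which is why memory is often what caps concurrency.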

03 - Stay small

Fighting back: keep the context small.

The whole craft is keeping the live context lean. Three methods carry most of the load.

Compress

Context compression

Summarising or compacting earlier context so fewer tokens stay live. Claude does this.

Retrieve

Retrieval, not stuffing

Pull in only what's relevant instead of stuffing everything into the window.

Cache

Prompt / KV caching

Reuse the work already done on tokens the model has seen before.

This is the technical reason the federated store resolves only the relevant slice - a smaller context is cheaper, faster, and lighter on memory.
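As one illustration of the first method, the sketch below compacts a message history once it passes a token budget: older messages are collapsed into a single summary and the most recent ones are kept verbatim. Everything here is an assumption made for the example. Tokens are approximated by whitespace-separated words, and `placeholder_summarise` is a stand-in for what would normally be a model call. It does not describe how any particular product implements compaction.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated words (an assumption)."""
    return len(text.split())


def placeholder_summarise(messages: List[str]) -> str:
    """Hypothetical stand-in for a model-written summary: first five words of each."""
    heads = (" ".join(m.split()[:5]) for m in messages)
    return "Summary: " + " | ".join(heads)


def compact(history: List[str], budget: int, keep_recent: int = 2,
            summarise: Callable[[List[str]], str] = placeholder_summarise) -> List[str]:
    """Collapse older messages into one summary once the history exceeds budget.

    The result is shorter than the input but is not guaranteed to fit the budget.
    """
    total = sum(count_tokens(m) for m in history)
    if total <= budget or len(history) <= keep_recent:
        return list(history)
    split = len(history) - keep_recent
    older, recent = history[:split], history[split:]
    return [summarise(older)] + recent
```

The design choice worth noting is that recent messages stay verbatim: they are the ones the next step is most likely to depend on, so only the older tail is traded for a summary.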

04 - Retrieval-augmented

RAG deserves its own name.

RAG gives an agent a large knowledge base without paying the quadratic cost of holding it all in the window.

Store the corpus outside the model; at query time, retrieve only the relevant pieces - typically by embedding the query and finding the nearest chunks by vector similarity - then inject those into the context.

Access a large knowledge base without paying the quadratic cost of holding it all in the window.

Retrieve only the relevant chunks, inject those, leave the rest in store. The federated store's scope resolution is essentially RAG over the guideline corpus - retrieve the platform, product, feature, and touched-domain slices, and nothing else.
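The retrieve-then-inject loop can be sketched in a few lines. This is a toy under stated assumptions: the "embedding" is just a bag-of-words count standing in for a real embedding model, the corpus is three invented guideline sentences, and `retrieve` and `build_prompt` are hypothetical names, not the federated store's API.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, chunks: List[str], top_k: int = 2) -> List[str]:
    """Return the top_k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(query: str, chunks: List[str], top_k: int = 2) -> str:
    """Inject only the retrieved chunks into the context; the rest stays in store."""
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks, top_k))
    return f"Context:\n{context}\n\nQuestion: {query}"


if __name__ == "__main__":
    corpus = [
        "payments api must use idempotency keys",
        "buttons use the primary brand colour",
        "checkout feature retries failed payments twice",
    ]
    print(build_prompt("how should payments retries work", corpus))
```

However large the corpus grows, the prompt only ever carries `top_k` chunks, so context size is set by the retrieval budget rather than by the size of the knowledge base.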

05 - Four toolcases

The four toolcases.

Development runs as plan, then execute, across four tools, each chosen to manage context, cost, and reliability.

main thread (paid repeatedly) · skill (loaded, then dropped) · sub-agent (own context) · hook (no model call)

Four places work can live - and deciding which is the AI pipeline engineer's and platform architect's job.

Main thread

Paid repeatedly, keep lean

The live context window where the agent plans and acts. Everything here is paid for repeatedly, so it should stay lean.

Skills

Loaded in, then dropped

Packaged capability loaded into the main context for a task, then dropped. Focused expertise without a separate model.

Sub-agents

Own context, return a result

Run in their own context window and return only a result. This isolates context, cuts cost (you don't pay for their intermediate work in the main thread), and lets each use a cheaper model or run in parallel. This is the computational reason specialist sub-agents save money, not just effort.
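A simple token-accounting model shows where the saving comes from. The sketch assumes each step re-reads the whole context accumulated so far, and it counts input tokens only. It ignores output tokens, prompt caching, and per-model pricing, and every number in it is hypothetical.

```python
from typing import List


def input_tokens_billed(base_context: int, added_per_step: List[int]) -> int:
    """Total input tokens when every step re-reads the context built up so far."""
    total, context = 0, base_context
    for added in added_per_step:
        total += context   # this step pays for everything already in the window
        context += added   # and leaves its own output behind for later steps
    return total


# Scenario: five research steps (2,000 tokens each) then five follow-up steps.
RESEARCH = [2_000] * 5
FOLLOW_UP = [100] * 5

# Inline: the research output stays in the main thread for every later step.
inline = input_tokens_billed(5_000, RESEARCH + FOLLOW_UP)

# Delegated: a sub-agent with a small context of its own does the research
# and hands back a 300-token result, which is all the main thread carries.
sub_agent = input_tokens_billed(1_000, RESEARCH)
main_thread = input_tokens_billed(5_000, [300] + FOLLOW_UP)
delegated = sub_agent + main_thread

if __name__ == "__main__":
    print(f"inline:    {inline:,} input tokens")
    print(f"delegated: {delegated:,} input tokens")
```

Under these assumptions the inline run bills 121,000 input tokens and the delegated run 57,500. The research itself costs something either way, but only the inline version keeps paying for it on every follow-up step.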

Hooks

Deterministic, no model call

Deterministic triggers that fire on events, with no model call. Cheap, reliable automation - and a natural place to capture decisions for the feedback loop.
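A hook, in this sense, is ordinary event-driven code. The sketch below is a generic registry, not the hook API of any specific tool: `HookRegistry`, the `after_edit` event name, and the payload fields are all hypothetical. It shows the two properties described above, deterministic dispatch with no model call, and a handler that records decisions for later review.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Hook = Callable[[dict], None]


class HookRegistry:
    """Maps event names to plain functions; firing an event calls no model."""

    def __init__(self) -> None:
        self._hooks: Dict[str, List[Hook]] = defaultdict(list)

    def on(self, event: str, hook: Hook) -> None:
        """Register a handler for an event."""
        self._hooks[event].append(hook)

    def fire(self, event: str, payload: dict) -> int:
        """Run every handler registered for the event, in registration order."""
        handlers = self._hooks.get(event, [])
        for handler in handlers:
            handler(payload)
        return len(handlers)


# A decision log that a feedback loop could read later.
decision_log: List[dict] = []


def record_decision(payload: dict) -> None:
    """Capture what was decided and where, with no interpretation."""
    decision_log.append({"file": payload["file"], "decision": payload["decision"]})


if __name__ == "__main__":
    registry = HookRegistry()
    registry.on("after_edit", record_decision)
    registry.fire("after_edit", {"file": "checkout.py", "decision": "retry twice"})
    print(decision_log)
```

Because the handlers are plain functions, the same event with the same payload always produces the same effect, which is what makes hooks cheap to run and easy to trust.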

Deciding what work lives in the main thread versus a skill, a sub-agent, or a hook is the AI pipeline engineer's and platform architect's job - and that allocation is the paved road.

Get the allocation right and the builder runs fast and cheap; get it wrong and every feature drags a bloated context behind it.