The AI-Native Product Org · 17

Operations & the living product.

A product isn't done at ship - it lives.

The model was strong on building and silent on running - so apply the backbone one more time, to operations. Build → verify → ship is only half a product's life. After that it lives: it runs, it's watched, it breaks, it accretes a maintenance tail, and eventually it's retired. Leave that half undesigned and the lean build side gets throttled by an unmanaged run side.


01 - Build & run

You build it, you run it - and the agent runs the routine.

The builder owns operations, but an ops agent absorbs the routine. The principle is Amazon's old rule: you build it, you run it - because operational pain should feed design, and the builder knows the product best.

But single ownership can't mean the builder drowns in pages. An ops agent - encoded runbooks plus diagnostics plus auto-remediation - absorbs monitoring, known-failure fixes, dependency updates, and the long tail. The builder is paged only for a genuinely novel incident.
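The split between routine and novel work can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the `OpsAgent` class, the failure signatures, and the remediation strings are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class OpsAgent:
    """Hypothetical ops agent: encoded runbooks keyed by failure signature."""
    runbooks: Dict[str, Callable[[], str]] = field(default_factory=dict)
    pages: List[str] = field(default_factory=list)  # incidents sent to the builder

    def handle(self, signature: str) -> str:
        remediation = self.runbooks.get(signature)
        if remediation is not None:
            # Known failure: apply the encoded fix; no human is paged.
            return remediation()
        # Novel failure: no runbook matches, so page the builder.
        self.pages.append(signature)
        return "paged builder"


# Illustrative usage with made-up signatures.
agent = OpsAgent(runbooks={"disk_full": lambda: "rotated logs"})
known = agent.handle("disk_full")
novel = agent.handle("unexpected_schema_drift")
```

Here the known failure is remediated silently, and only the unmatched signature reaches the builder's pager.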

The person paged at 3am builds things that don't page. That's the healthy incentive of build-and-run ownership.

02 - Three-layer observability

Observability is a sensor, not a dashboard.

Observability is active: it triggers the ops agent and feeds the maintenance loop, rather than waiting for someone to notice. It watches three layers.

Three layers feed one active sensor. It doesn't wait to be glanced at - it triggers the ops agent and feeds the maintenance loop the moment a layer drifts.

Layer 1

The product

Errors, latency, uptime - the classic operational signals of the thing users touch.

Layer 2

The AI system

Sub-agents and proxies drifting or erroring - the machinery that builds and runs the product.

Layer 3

The outcome

The layer beneath both: is the leadership metric still moving? A product can be green on errors and latency and still be failing the only test that matters.
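One way to picture the three layers as a single active sensor is a check that returns whichever layers have drifted. This is a sketch under stated assumptions: the signal names (`error_rate`, `agent_drift`, `outcome_delta`) and every threshold are hypothetical, and a real system would derive them from its own metrics.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Signals:
    error_rate: float     # layer 1, the product (hypothetical signal)
    agent_drift: float    # layer 2, the AI system (hypothetical signal)
    outcome_delta: float  # layer 3, change in the leadership metric


def drifted_layers(
    s: Signals,
    max_error_rate: float = 0.01,     # assumed threshold
    max_agent_drift: float = 0.1,     # assumed threshold
    min_outcome_delta: float = 0.0,   # assumed: at or below this, the metric has stalled
) -> List[str]:
    """Return the layers that should trigger the ops agent or the maintenance loop."""
    layers = []
    if s.error_rate > max_error_rate:
        layers.append("product")
    if s.agent_drift > max_agent_drift:
        layers.append("ai_system")
    if s.outcome_delta <= min_outcome_delta:
        layers.append("outcome")
    return layers


# Green on errors and agent health, yet the outcome metric has stopped moving.
flagged = drifted_layers(Signals(error_rate=0.001, agent_drift=0.02, outcome_delta=0.0))
```

In this example only the outcome layer is flagged, which is the case the text warns about: operationally green, but no longer moving the metric.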

03 - Response

Incident response by blast radius.

Friction and escalation scale with how far an incident can spread - the same dial as everywhere else.

On-call splits accordingly: product ops is the builder plus their ops agent; platform ops is a shared SRE function, because the substrate is a genuine shared concern.

Routine / known

The ops agent self-heals

An encoded runbook fires, diagnostics run, remediation applies. No human is paged - the routine is absorbed.

Novel but bounded

The builder

A genuinely new failure, contained to one product. The builder responds, with agents diagnosing fast alongside them.

Systemic / shared

The architect / platform

Failure that crosses products or hits the substrate. Escalate to the technical architect or the platform team.

Routine self-heals, novel reaches the builder, systemic reaches the architect. The dial is blast radius.
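The three tiers above can be expressed as a small routing function. The inputs (`has_runbook`, `products_affected`, `hits_substrate`) are hypothetical stand-ins for however an incident is actually classified, and checking the systemic case first is an assumption of this sketch: it treats a failure that crosses products as an escalation even when a runbook exists.

```python
from enum import Enum


class BlastRadius(Enum):
    ROUTINE = "routine"
    NOVEL_BOUNDED = "novel_bounded"
    SYSTEMIC = "systemic"


# Responder per tier, as described in the text.
RESPONDER = {
    BlastRadius.ROUTINE: "ops agent",
    BlastRadius.NOVEL_BOUNDED: "builder",
    BlastRadius.SYSTEMIC: "architect / platform",
}


def classify(has_runbook: bool, products_affected: int, hits_substrate: bool) -> BlastRadius:
    # Assumption: anything crossing products or hitting the substrate escalates first.
    if hits_substrate or products_affected > 1:
        return BlastRadius.SYSTEMIC
    if has_runbook:
        return BlastRadius.ROUTINE
    return BlastRadius.NOVEL_BOUNDED


def route(has_runbook: bool, products_affected: int, hits_substrate: bool) -> str:
    """Return who responds to the incident."""
    return RESPONDER[classify(has_runbook, products_affected, hits_substrate)]
```

For example, `route(False, 1, False)` returns the builder, while `route(False, 1, True)` escalates to the architect or platform team.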

04 - The learning loop

Incidents are a learning loop.

A novel incident the builder resolves gets promoted into the ops agent's runbook - rule of three - so it's never novel again.

Operational judgment compounds like every other encoding. And the ops agent is itself a named, maintained proxy, with an owner and a lifecycle.

The resolution doesn't stay in one person's head. It's promoted into the runbook, so the next occurrence is routine - the ops agent self-heals it.
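The promotion step might look like the sketch below. The text mentions a rule of three without defining it; reading it as "encode the fix on its third occurrence" is an assumption of this example, exposed as the hypothetical `promote_after` parameter. Setting it to 1 promotes a fix the first time it is resolved.

```python
from collections import Counter
from typing import Dict, Optional


class Runbook:
    """Hypothetical runbook that encodes a fix once it has recurred often enough."""

    def __init__(self, promote_after: int = 3):
        self.promote_after = promote_after  # assumed reading of "rule of three"
        self.fixes: Dict[str, str] = {}
        self.seen: Counter = Counter()

    def record_resolution(self, signature: str, fix: str) -> bool:
        """Record a builder's fix; return True once it is encoded in the runbook."""
        self.seen[signature] += 1
        if self.seen[signature] >= self.promote_after:
            self.fixes[signature] = fix
        return signature in self.fixes

    def lookup(self, signature: str) -> Optional[str]:
        """Return the encoded fix, or None if the failure is still novel."""
        return self.fixes.get(signature)


# Illustrative usage: the same made-up failure is resolved three times.
rb = Runbook()
results = [rb.record_resolution("cache_stampede", "warm cache") for _ in range(3)]
```

After the third resolution, `rb.lookup("cache_stampede")` returns the encoded fix, so the next occurrence can be handled as routine.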

05 - Portfolio

Operational load feeds the portfolio.

The maintenance tail is a cost, and someone has to own that fact. A product with a big tail and low value should be simplified or killed - the portfolio's call.

Because the builder runs what they build, they internalise that cost at design time and build operable, self-healing things in the first place.

06 - Paved road

Operability is a paved-road default.

Builders shouldn't reinvent ops. The platform and golden paths bake in observability, standard error handling, and auto-rollback, so monitoring and self-healing come for free.

DevOps and the platform architect own making the substrate operable.

Monitoring and self-healing come for free when the road is paved for them.

Regulated ops

Governed break-glass.

In regulated domains, even an incident hotfix can't bypass segregation of duties. The answer is a pre-authorised, logged break-glass procedure: the responder acts fast, but every action is recorded and reviewed afterward.

Speed and control are reconciled by governing the emergency path itself, rather than pretending emergencies won't happen.
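A governed emergency path can be sketched as a small object that checks pre-authorisation, logs every use, and tracks the follow-up review. The `BreakGlass` class and its fields are hypothetical. Requiring a reviewer other than the actor is this sketch's own reading of segregation of duties, not something the text specifies.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Set


@dataclass
class BreakGlass:
    authorised: Set[str]                           # pre-authorised responders
    log: List[dict] = field(default_factory=list)  # record of every use

    def invoke(self, actor: str, action: str, reason: str) -> dict:
        if actor not in self.authorised:
            raise PermissionError(f"{actor} is not pre-authorised for break-glass")
        entry = {
            "actor": actor,
            "action": action,
            "reason": reason,
            "at": datetime.now(timezone.utc).isoformat(),
            "reviewed_by": None,  # filled in by the after-the-fact review
        }
        self.log.append(entry)
        return entry

    def review(self, entry: dict, reviewer: str) -> None:
        # Assumption: the review must come from someone other than the actor.
        if reviewer == entry["actor"]:
            raise PermissionError("reviewer must differ from the actor")
        entry["reviewed_by"] = reviewer

    def pending_review(self) -> List[dict]:
        return [e for e in self.log if e["reviewed_by"] is None]


# Illustrative usage with made-up names.
bg = BreakGlass(authorised={"alice"})
entry = bg.invoke("alice", "hotfix deploy", "payment outage")
pending_before = len(bg.pending_review())
bg.review(entry, "bob")
pending_after = len(bg.pending_review())
```

The emergency action goes through immediately, but it stays on the pending list until someone else signs it off.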

07 - Capacity

Run-capacity is a real limit.

Just as a product can be too big for one builder to verify, a portfolio can be too big for one builder to run. There's a carrying capacity - how many live products one builder plus their ops agent can sustainably operate.

Exceeding it forces a choice: split ownership, add people, or retire products. A product nobody can sustainably run is a liability; transfer it (with its proxies and runbooks) or retire it.
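Carrying capacity can be checked mechanically once it has a number. In this sketch the capacity value, builder names, and product names are all hypothetical; the text does not say what a sustainable count is, so it is left as a parameter.

```python
from typing import Dict, List


def over_capacity(live_products: Dict[str, List[str]], capacity: int) -> Dict[str, int]:
    """Return each builder whose live product count exceeds capacity, with the excess."""
    return {
        builder: len(products) - capacity
        for builder, products in live_products.items()
        if len(products) > capacity
    }


# Illustrative usage: one builder runs three products against an assumed capacity of two.
live = {"ana": ["billing", "search", "alerts"], "ben": ["docs"]}
excess = over_capacity(live, capacity=2)
```

A non-empty result is the signal to split ownership, add people, or retire products.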


08 - The tension to keep honest

"You build it, you run it" can become "you build it, then you drown."

That's what happens if operational debt accumulates faster than the ops agent absorbs it. Three defences hold the line.

1. A good ops agent - runbooks, diagnostics, and auto-remediation that genuinely absorb the routine tail.
2. Ruthless retirement of low-value products, so the maintenance tail doesn't grow unchecked.
3. Operability-by-default on the platform, so every product starts already monitored and self-healing.

Let those slip, and the model's lean build side is silently strangled by its run side.

The one-line version

A product lives after ship: it runs, breaks, accretes a tail, and is retired. Apply the backbone - you build it, you run it, but an ops agent runs the routine, paging the builder only for the novel. Three-layer observability triggers a tiered response by blast radius; resolved incidents teach the runbook; operational load feeds the portfolio; operability is a paved-road default; and run-capacity is a real limit. Encode the routine, reserve the human for the novel incident.