The model was strong on building and silent on running - so apply the backbone one more time, to operations. Build → verify → ship is only half a product's life. After that it lives: it runs, it's watched, it breaks, it accretes a maintenance tail, and eventually it's retired. Leave that half undesigned and the lean build side gets throttled by an unmanaged run side.
01 - Build & run
The builder owns operations, but an ops agent absorbs the routine. The principle is Amazon's old rule: you build it, you run it - because operational pain should feed design, and the builder knows the product best.
But single ownership can't mean the builder drowns in pages. An ops agent - encoded runbooks plus diagnostics plus auto-remediation - absorbs monitoring, known-failure fixes, dependency updates, and the long tail. The builder is paged only for a genuinely novel incident.
The person paged at 3am builds things that don't page. That's the healthy incentive of build-and-run ownership.
02 - Three-layer observability
Observability is active: it triggers the ops agent and feeds the maintenance loop, rather than waiting for someone to notice. It watches three layers.
The product: errors, latency, uptime - the classic operational signals of the thing users touch.
The machinery: sub-agents and proxies drifting or erroring - everything that builds and runs the product.
And a third layer beneath both, the outcome: is the leadership metric still moving? A product can be green on errors and latency and still be failing the only test that matters.
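The three layers can be sketched as a single check. This is a minimal illustration, not a monitoring design: the `Signals` fields, the drift score, and every threshold below are assumptions made for the example, and uptime is left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    error_rate: float      # fraction of failed requests, 0.0-1.0
    p95_latency_ms: float  # 95th-percentile latency in milliseconds
    agent_drift: float     # hypothetical 0.0-1.0 drift score for sub-agents and proxies
    metric_trend: float    # change in the leadership metric over the review window


# Illustrative thresholds only; real values depend on the product.
MAX_ERROR_RATE = 0.01
MAX_P95_MS = 500.0
MAX_DRIFT = 0.2


def failing_layers(s: Signals) -> list[str]:
    """Return the layers that need attention, in the order the text lists them."""
    failing = []
    if s.error_rate > MAX_ERROR_RATE or s.p95_latency_ms > MAX_P95_MS:
        failing.append("product")
    if s.agent_drift > MAX_DRIFT:
        failing.append("machinery")
    if s.metric_trend <= 0:
        failing.append("outcome")
    return failing


# Green on errors and latency, yet the leadership metric has stalled.
print(failing_layers(Signals(0.001, 120.0, 0.05, 0.0)))  # ['outcome']
```

The last line is the case the text warns about: the first two layers pass and the outcome layer still fails.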
03 - Response
Friction and escalation scale with how far an incident can spread - the same dial as everywhere else.
On-call splits accordingly: product ops is the builder plus their ops agent; platform ops is a shared SRE function, because the substrate is a genuine shared concern.
Routine: an encoded runbook fires, diagnostics run, remediation applies. No human is paged - the routine is absorbed.
Novel: a genuinely new failure, contained to one product. The builder responds, with agents diagnosing fast alongside them.
Systemic: a failure that crosses products or hits the substrate. Escalate to the technical architect or the platform team.
Routine self-heals, novel reaches the builder, systemic reaches the architect. The dial is blast radius.
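The dial can be written down as a small routing function. This is a sketch under assumptions: the `route` signature and its three inputs are invented for illustration, and checking spread before looking for a runbook is one reasonable reading of "the dial is blast radius", not something the text prescribes.

```python
from enum import Enum


class Tier(Enum):
    ROUTINE = "ops agent"
    NOVEL = "builder"
    SYSTEMIC = "technical architect or platform team"


def route(has_runbook: bool, products_affected: int, hits_substrate: bool) -> Tier:
    """Pick who responds, based on how far the incident can spread."""
    # Assumption: spread is checked first, so a cross-product failure
    # escalates even if a runbook happens to match.
    if hits_substrate or products_affected > 1:
        return Tier.SYSTEMIC
    if has_runbook:
        return Tier.ROUTINE
    return Tier.NOVEL


print(route(has_runbook=True, products_affected=1, hits_substrate=False).value)   # ops agent
print(route(has_runbook=False, products_affected=1, hits_substrate=False).value)  # builder
print(route(has_runbook=False, products_affected=3, hits_substrate=False).value)  # technical architect or platform team
```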
04 - The learning loop
A novel incident the builder resolves gets promoted into the ops agent's runbook - rule of three - so it's never novel again.
Operational judgment compounds like every other encoding. And the ops agent is itself a named, maintained proxy, with an owner and a lifecycle.
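A sketch of the promotion step follows. The text does not define "rule of three", so the code assumes one reading: a failure is promoted into the runbook once the builder has resolved the same signature three times. The `RunbookLearner` class, the string signatures, and the `promote_after` parameter are all hypothetical.

```python
from collections import Counter


class RunbookLearner:
    """Tracks builder-resolved incidents and promotes repeat failures to the runbook."""

    def __init__(self, promote_after: int = 3):
        # Assumption: "rule of three" means three resolutions of the same signature.
        self.promote_after = promote_after
        self.resolved = Counter()
        self.runbook = {}  # failure signature -> encoded fix

    def record_resolution(self, signature: str, fix: str) -> bool:
        """Record one resolution; return True if this call promoted the fix."""
        if signature in self.runbook:
            return False  # already routine, nothing to learn
        self.resolved[signature] += 1
        if self.resolved[signature] >= self.promote_after:
            self.runbook[signature] = fix
            return True
        return False

    def is_routine(self, signature: str) -> bool:
        return signature in self.runbook


learner = RunbookLearner()
for _ in range(3):
    promoted = learner.record_resolution("queue-backlog", "restart consumer")
print(promoted, learner.is_routine("queue-backlog"))  # True True
```

Keeping the threshold as a parameter leaves room for a stricter or looser reading of the rule.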
05 - Portfolio
The maintenance tail is a cost, and someone has to own that fact. A product with a big tail and low value should be simplified or killed - the portfolio's call.
Because the builder runs what they build, they internalise that cost at design time and build operable, self-healing things in the first place.
06 - Paved road
Builders shouldn't reinvent ops. The platform and golden paths bake in observability, standard error handling, and auto-rollback, so monitoring and self-healing come for free.
DevOps and the platform architect own making the substrate operable.
Regulated ops
In regulated domains, even an incident hotfix can't bypass segregation of duties. The answer is a pre-authorised, logged break-glass procedure: act fast, but it's recorded and reviewed afterward.
Speed and control are reconciled by governing the emergency path itself, rather than pretending emergencies won't happen.
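One way to picture a governed emergency path, as a minimal sketch: the `BreakGlass` class, its method names, and the rule that the reviewer must differ from the actor are assumptions chosen to illustrate pre-authorisation, logging, and after-the-fact review. They do not model any particular regulation.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class BreakGlassEntry:
    actor: str
    action: str
    reason: str
    at: datetime
    reviewed_by: Optional[str] = None


class BreakGlass:
    def __init__(self, authorised: set):
        self.authorised = authorised  # people pre-authorised to use the emergency path
        self.log = []                 # every invocation is recorded here

    def invoke(self, actor: str, action: str, reason: str) -> BreakGlassEntry:
        """Act fast, but only if pre-authorised, and always leave a record."""
        if actor not in self.authorised:
            raise PermissionError(f"{actor} is not pre-authorised for break-glass")
        entry = BreakGlassEntry(actor, action, reason, datetime.now(timezone.utc))
        self.log.append(entry)
        return entry

    def review(self, entry: BreakGlassEntry, reviewer: str) -> None:
        """Segregation of duties is restored afterward: someone else signs off."""
        if reviewer == entry.actor:
            raise ValueError("reviewer must differ from the actor")
        entry.reviewed_by = reviewer

    def pending_review(self) -> list:
        return [e for e in self.log if e.reviewed_by is None]


bg = BreakGlass(authorised={"alice"})
fix = bg.invoke("alice", "deploy hotfix", "payments outage")
print(len(bg.pending_review()))  # 1
bg.review(fix, "bob")
print(len(bg.pending_review()))  # 0
```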
07 - Capacity
Just as a product can be too big for one builder to verify, a portfolio can be too big for one builder to run. There's a carrying capacity - how many live products one builder plus their ops agent can sustainably operate.
Exceeding it forces a choice: split ownership, add people, or retire products. A product nobody can sustainably run is a liability; transfer it (with its proxies and runbooks) or retire it.
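Carrying capacity can be made concrete with a rough load calculation. This is an illustrative sketch only: measuring load in hours per week, the `absorbed` fraction, and the budget figure are assumptions, not numbers from the model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LiveProduct:
    name: str
    ops_hours_per_week: float  # total operational work the product generates
    absorbed: float            # fraction the ops agent handles without a human, 0.0-1.0


def builder_load(products: list) -> float:
    """Hours per week that still land on the builder after the ops agent's share."""
    return sum(p.ops_hours_per_week * (1 - p.absorbed) for p in products)


def within_capacity(products: list, budget_hours: float) -> bool:
    return builder_load(products) <= budget_hours


portfolio = [
    LiveProduct("billing", 10.0, 0.5),
    LiveProduct("search", 8.0, 0.75),
    LiveProduct("reports", 4.0, 0.5),
]
print(builder_load(portfolio))          # 9.0
print(within_capacity(portfolio, 6.0))  # False
```

When the check fails, the levers are the ones the text names: split ownership, add people, or retire products. Raising `absorbed` by improving the ops agent also lowers the load.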
08 - The tension to keep honest
The risk is that the run side strangles the build side. That's what happens if operational debt accumulates faster than the ops agent absorbs it. Three defences hold the line.
A good ops agent - runbooks, diagnostics, and auto-remediation that genuinely absorb the routine tail.
Ruthless retirement of low-value products, so the maintenance tail doesn't grow unchecked.
Operability-by-default on the platform, so every product starts already monitored and self-healing.
Let those slip, and the model's lean build side is silently strangled by its run side.
A product lives after ship: it runs, breaks, accretes a tail, and is retired. Apply the backbone - you build it, you run it, but an ops agent runs the routine, paging the builder only for the novel. Three-layer observability triggers a tiered response by blast radius; resolved incidents teach the runbook; operational load feeds the portfolio; operability is a paved-road default; and run-capacity is a real limit. Encode the routine, reserve the human for the novel incident.