A report arriving changes nothing — the broken link was 'route', not 'measure'
· Ascendy Engineering
TL;DR
- A report or alert that just arrives on a schedule is worth zero by itself. Value appears only when it drives a closed loop (measure → analyze → improve → re-measure).
- To improve monitoring people usually reach for “collect more metrics,” but splitting our loop into stages showed the broken link wasn’t ‘measure’ — it was ‘capture→route’: a surfaced improvement only became action once a human remembered it and re-entered it somewhere. The human was the manual relay bottleneck.
- The fix: ① a durable backlog (a file holds it instead of memory), ② an agent drafts the handoff/PR directly (the human only approves). Take the human out of transport, keep them as approver.
- Two missteps: first, recommending “run it on a scheduler, permanently,” then finding that the scheduler is session-scoped and expires in ~7 days (an automation that expires isn’t automation). Second, a durable backlog can itself become the next dead queue, so it needs an anti-rot lifecycle.
Source note. This post distills a top-level infra-team intake (docs/intake/from-infra/2026-06-04-monitoring-closed-loop.md). The currently-open operational items the first cycle captured are excluded from the body (to avoid signaling an unresolved gap), and backlog paths / internal identifiers are generalized. It’s in the same “process / AI collaboration” vein as “pairing two AIs made things slower” and “how you call the second AI, and when to stop it.”
A report arriving, by itself
Working on a report that arrives on a schedule, I made one thing explicit. That report is worth zero by itself. Value appears only when it drives a closed loop.
measure → analyze → improve → re-measure
The report can land every week, but unless it carries through to fix-it → re-measure, it’s just numbers arriving on a schedule. Holding this principle against our operational monitoring made it clear exactly where the loop was broken.
The broken link wasn’t ‘measure’ — it was ‘route’
To improve monitoring, people usually go “collect more metrics / build more dashboards.” But when we split our loop into stages, the break wasn’t in measurement.
[1 surface]           → [2 capture] → [3 route to actor]     → [4 act]      → [5 re-measure]
alerts·reports·trends   (depends on human memory + re-relay)   PR / handoff
                       ^^^^^^^ this is where it broke ^^^^^^^
- Stage 1 (surface) — alerts, reports, trends were already emitting improvements.
- Stages 4–5 (act, re-measure) — PR / handoff / re-measure mechanisms already existed.
- The break was at stages 2–3 — a surfaced improvement only became action once a human remembered it and re-entered it in the right place.
So the human was the manual relay bottleneck. Monitoring emits improvements; if the human doesn’t re-relay them, the signal dies right there. Human memory is volatile, and signals outnumber the human. The real gap wasn’t “not enough measuring” — it was “what was measured didn’t get routed to where action happens; it evaporated.” Collecting more metrics would have left this gap untouched.
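One way to make the stage split concrete is to count how many signals survive each stage and look for the steepest drop. The sketch below is only an illustration: the stage names mirror the diagram above, while the counts and the `weakest_link` helper are hypothetical and not taken from our actual monitoring.

```python
from typing import Dict, Tuple

# Hypothetical number of signals that reached each stage in one period.
# These figures are made up for illustration only.
STAGE_COUNTS: Dict[str, int] = {
    "surface": 40,
    "capture": 6,
    "route": 5,
    "act": 5,
    "re-measure": 4,
}


def weakest_link(counts: Dict[str, int]) -> Tuple[str, str, float]:
    """Return (from_stage, to_stage, survival_rate) for the worst transition."""
    stages = list(counts)  # dicts preserve insertion order
    worst = None
    for prev, nxt in zip(stages, stages[1:]):
        if counts[prev] == 0:
            continue  # nothing entered this stage, so no rate to compute
        rate = counts[nxt] / counts[prev]
        if worst is None or rate < worst[2]:
            worst = (prev, nxt, rate)
    if worst is None:
        raise ValueError("need at least two stages with a non-zero count")
    return worst


if __name__ == "__main__":
    src, dst, rate = weakest_link(STAGE_COUNTS)
    print(f"weakest link: {src} -> {dst} ({rate:.0%} survive)")
```

With numbers shaped like these, adding more metrics would only grow the first count; the survival rate into capture would stay just as low.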
Take the human out of transport, keep them as approver
The fix took two things out of the human’s head.
- Durable capture — write each surfaced improvement as a row in a tracked backlog. A file holds it instead of memory.
- Agent-driven routing — in the weekly review, an agent reads the backlog and the trends, and drafts the PR or handoff directly for each item.
There’s a decisive separation here.
Killing the bottleneck isn’t about “what triggers the review” — it’s about “what the review does.”
The trigger can still be a human pulling it once a week. But if, behind that one trigger, an agent has already drafted the handoffs, then “the human remembers and re-relays each signal, or it dies” becomes “the human triggers once a week and approves what the agent drafted.” The per-signal relay disappears. The human stays an approver, not transport.
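The separation between “what triggers the review” and “what the review does” can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not our actual tooling: `BacklogItem`, `Draft`, `weekly_review`, and the `draft_fn` callback (standing in for whatever agent writes the PR or handoff text) are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class BacklogItem:
    item_id: str
    summary: str
    state: str = "open"  # becomes "routed" once a draft exists


@dataclass
class Draft:
    item_id: str
    body: str
    approved: bool = False  # only a human flips this


def weekly_review(backlog: List[BacklogItem],
                  draft_fn: Callable[[BacklogItem], str]) -> List[Draft]:
    """One human trigger: draft a handoff for every open item."""
    drafts: List[Draft] = []
    for item in backlog:
        if item.state != "open":
            continue
        drafts.append(Draft(item.item_id, draft_fn(item)))
        item.state = "routed"
    return drafts


def approve(draft: Draft) -> Draft:
    """The human's remaining job: approve (or ignore) each draft."""
    draft.approved = True
    return draft


if __name__ == "__main__":
    backlog = [BacklogItem("a", "alert threshold too noisy"),
               BacklogItem("b", "already handled", state="done")]
    for d in weekly_review(backlog, lambda it: "Proposed handoff: " + it.summary):
        print(d.item_id, d.body, d.approved)
```

Nothing here schedules anything. A human calls `weekly_review` once, and the per-signal work happens behind that single call.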
Misstep ① — “run it on a scheduler” was wrong
My first answer was “run the weekly review on a scheduler job, permanently.” But when I checked the actual scheduler feature, that job turned out to be session-scoped and to auto-expire in about 7 days. It wasn’t a “permanent automation engine.”
Left as-is, that recommendation would have written down “it runs automatically every week” and then quietly stopped a week later, reproducing the exact “silent evaporation” we were trying to fix. So I corrected it: the trigger is a human starting it once a week (with at most a light reminder as a nudge), and the load-bearing part of killing the bottleneck is the agent’s drafting described above.
Lesson: before you write down “solved by automation,” check that automation’s lifetime and scope. An automation that expires isn’t automation.
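The lifetime-and-scope check can be written down as a small guard. This is a sketch under assumptions: the `Automation` record and its `session_scoped` / `expires_at` fields are hypothetical, and the real values have to come from reading the scheduler's own documentation, not from any API shown here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Automation:
    name: str
    session_scoped: bool            # dies with the session that created it
    expires_at: Optional[datetime]  # None means no known expiry


def is_durable(job: Automation, now: datetime, horizon: timedelta) -> bool:
    """True only if the job should outlive the period we plan to rely on it."""
    if job.session_scoped:
        return False
    if job.expires_at is not None and job.expires_at < now + horizon:
        return False
    return True


if __name__ == "__main__":
    now = datetime(2026, 6, 4)
    weekly = Automation("weekly-review", True, now + timedelta(days=7))
    print(is_durable(weekly, now, horizon=timedelta(days=90)))
```

A job like the one described above fails on both counts: it is session-scoped, and its roughly 7-day expiry falls well inside any horizon over which “every week” is supposed to hold.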
Misstep ② — keeping the backlog from becoming the next ‘dead queue’
A durable list has its own trap. A list with no lifecycle just moves the bottleneck from “human memory” to “a pile of stale TODOs.” Write it down, nobody looks, it evaporates all the same.
So we made the backlog a state machine, not a free-form note.
- Each item has fields: when it surfaced / what surfaced it / owner / next action / state (open→routed→acting→done/dropped) / priority / last-reviewed date.
- The weekly hard rule: every open item must be touched each cycle: advanced, re-prioritized, or closed with a reason. No untouched items left behind.
- An item untouched for 2 cycles auto-escalates to STALE → it can’t rot quietly.
- A WIP cap (route at most N per cycle) → triage by priority instead of dumping everything.
Without this hygiene, the backlog is just a bigger grave.
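The lifecycle rules above can be sketched as a small state machine. Several details are assumptions made for illustration: that `dropped` is reachable from any non-terminal state, that “open item” means any item not yet done or dropped, that a lower priority number is more urgent, and that the WIP cap N is 3. The `Item`, `advance`, `touch`, `escalate_stale`, and `pick_to_route` names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Set

# Allowed transitions for open -> routed -> acting -> done/dropped.
ALLOWED: Dict[str, Set[str]] = {
    "open": {"routed", "dropped"},
    "routed": {"acting", "dropped"},
    "acting": {"done", "dropped"},
}
TERMINAL = {"done", "dropped"}
STALE_AFTER = 2  # untouched cycles before escalation
WIP_CAP = 3      # placeholder for the cap "N"


@dataclass
class Item:
    item_id: str
    priority: int           # assumption: lower number = more urgent
    state: str = "open"
    last_reviewed: int = 0  # cycle number when the item was last touched
    stale: bool = False


def touch(item: Item, cycle: int) -> None:
    """Record that the item was reviewed this cycle."""
    item.last_reviewed = cycle
    item.stale = False


def advance(item: Item, new_state: str, cycle: int) -> None:
    """Move the item along the state machine, rejecting illegal jumps."""
    if new_state not in ALLOWED.get(item.state, set()):
        raise ValueError(f"illegal transition {item.state} -> {new_state}")
    item.state = new_state
    touch(item, cycle)


def escalate_stale(items: List[Item], cycle: int) -> List[Item]:
    """Flag non-terminal items untouched for STALE_AFTER or more cycles."""
    flagged = []
    for it in items:
        if it.state not in TERMINAL and cycle - it.last_reviewed >= STALE_AFTER:
            it.stale = True
            flagged.append(it)
    return flagged


def pick_to_route(items: List[Item]) -> List[Item]:
    """Apply the WIP cap: route only the most urgent open items."""
    open_items = [it for it in items if it.state == "open"]
    return sorted(open_items, key=lambda it: it.priority)[:WIP_CAP]
```

The point of `escalate_stale` is that neglect becomes visible state, not silence: an item nobody touched shows up flagged in the next review.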
Takeaways
- A scheduled report or alert is worth zero by itself. It only matters inside a closed loop (measure → analyze → improve → re-measure).
- Split the loop into stages to find the broken link. Often it’s not “not enough measuring” — it’s “what was measured doesn’t get routed, and evaporates.”
- If a human is the relay bottleneck, take them out of transport but keep them as approver — a durable backlog replaces memory, an agent drafts, the human approves.
- Trigger automation ≠ killing the bottleneck. And don’t mistake an expiring automation for a permanent engine.
- Give a durable list an anti-rot lifecycle — touch every item each cycle / escalate after N untouched cycles / WIP cap. Otherwise the list becomes the next grave.
Authorship & citation: Written by Ascendy Engineering; quotable with attribution. Found something wrong? Let us know via a GitHub issue.
Tags: monitoring, observability, feedback-loop, process-design, automation, agent-ops