Nyx runs long, unattended marathons. The Supervisor is the layer that keeps them alive: an out-of-process, zero-dependency guardian that watches the engine from the outside, checks ground truth, and recovers or escalates — it never trusts the thing it is watching.
Marathons run for hours with no operator awake. The engine can livelock, leave a worker dead mid-increment, fail a merge, corrupt its own toolchain, or quietly stop ticking. Any guardian that lives inside that process shares its fate. So the Supervisor lives outside it.
Without a supervisor, one stuck increment hard-wedges the whole epic. A reparented worker holds gigabytes of RAM. The daemon stops ticking and reports nothing, because the thing meant to report is the thing that died. The operator finds it cold in the morning.
With the Supervisor, a separate process, supervised by the OS, reads objective state every tick. It sees the stall, restarts or re-pends the work, kills the zombie, and pages a human only when bounded retries are exhausted, with one message rather than a firehose.
The Supervisor is not a module inside the daemon — it is a separate process owned by the operating system, looking in. It reads the engine's ground truth (git, processes, the database), acts on the daemon from the outside, and pages a human only as a last resort. When the daemon dies, the watcher does not.
Every tick the Supervisor establishes ground truth, acts within bounds, and records what it saw. The cycle never throws and never blocks on a human.
Six read-only detectors inspect objective state (git, processes, the DB), never the daemon's self-report.
A matching bounded action restarts, re-pends, kills zombies, heals deps, or re-merges.
Only when retries are exhausted: one deduplicated Telegram alert with exponential backoff.
Every incident is written to a durable improvement backlog — file and DB — for the next build.
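The four steps above (detect, act, escalate, record) can be sketched as a single tick function. This is a minimal illustration, not the Supervisor's actual code: the names `Finding`, `Outcome`, `TickDeps`, `safely`, and `tick` are all hypothetical, and the detectors, recovery, alerting, and recording are passed in as plain functions so the cycle can be exercised without a live process.

```typescript
// Hypothetical types; none of these names come from the Supervisor's source.
type Finding = { kind: string; detail: string };
type Outcome = "recovered" | "retry" | "exhausted";
type TickResult = { finding: Finding; outcome: Outcome | "error" };

interface TickDeps {
  detectors: Array<() => Finding | null>; // read-only ground-truth probes
  recover: (f: Finding) => Outcome; // bounded action for one finding
  alert: (f: Finding) => void; // last-resort page to a human
  record: (f: Finding, o: Outcome | "error") => void; // durable backlog write
}

// Run fn and return its result, or the fallback if it throws.
function safely<T>(fn: () => T, fallback: T): T {
  try {
    return fn();
  } catch {
    return fallback;
  }
}

// One tick: detect, act, escalate, record. Every step is wrapped, so a
// failure in one detector or action cannot abort the rest of the cycle.
function tick(deps: TickDeps): TickResult[] {
  const results: TickResult[] = [];
  for (const detect of deps.detectors) {
    const finding = safely<Finding | null>(detect, null);
    if (finding === null) continue; // nothing found, or the detector failed

    const outcome = safely<Outcome | "error">(() => deps.recover(finding), "error");
    if (outcome === "exhausted") {
      safely<void>(() => deps.alert(finding), undefined); // escalate only at the ceiling
    }
    safely<void>(() => deps.record(finding, outcome), undefined);
    results.push({ finding, outcome });
  }
  return results;
}
```

In this sketch a detector that throws is treated as "no finding"; whether the real Supervisor does the same or records the detector failure itself is not stated in the text.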
Each detector returns a finding from objective state and is engineered to never throw — malformed input, a half-written DB row, a missing process all resolve to a safe answer, because a guardian that crashes on bad data is no guardian.
An increment that has not advanced past its idle threshold while no worker is doing real work.
An epic blocked behind a single increment that the engine cannot get past on its own.
A verified plan that will not merge back — the classic overnight killer that halts everything after it.
Reparented worker processes holding memory with nobody reading their output.
A polluted node_modules or bad install that makes the verify gate fail for the wrong reason.
The daemon stopped ticking or stopped answering — detected from outside, where the daemon cannot hide it.
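As an illustration of the never-throw contract, here is a sketch of the first detector in the list (a stalled increment). The row shape, the field names `lastAdvancedAtMs` and `activeWorkers`, and the choice to answer "not stalled" on malformed input are all assumptions made for the example; the text only says that bad input resolves to a safe answer.

```typescript
// Hypothetical finding shape for the stall detector.
type StallFinding = { stalled: boolean; reason: string };

// Decide whether an increment is stalled from one database row.
// The row is typed as unknown because it may be missing or half-written.
function detectStall(row: unknown, nowMs: number, idleThresholdMs: number): StallFinding {
  try {
    if (typeof row !== "object" || row === null) {
      return { stalled: false, reason: "no-row" };
    }
    const r = row as Record<string, unknown>;
    const last = r["lastAdvancedAtMs"]; // assumed field name
    const workers = r["activeWorkers"]; // assumed field name

    if (typeof last !== "number" || !Number.isFinite(last)) {
      return { stalled: false, reason: "bad-timestamp" };
    }
    // A worker doing real work means the increment is slow, not stalled.
    const busy = typeof workers === "number" && workers > 0;
    const idleForMs = nowMs - last;
    if (idleForMs > idleThresholdMs && !busy) {
      return { stalled: true, reason: "idle-past-threshold" };
    }
    return { stalled: false, reason: "ok" };
  } catch {
    // Anything unexpected (for example a throwing property accessor)
    // still resolves to an answer instead of crashing the guardian.
    return { stalled: false, reason: "detector-error" };
  }
}
```

The clock and the row are parameters rather than lookups, which keeps the detector read-only and lets a test feed it malformed data directly.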
Each finding maps to a recovery action with idempotency probes and a hard retry ceiling. Recovery preserves work where it can; it never loops forever.
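One way to combine an idempotency probe with a hard retry ceiling is sketched below. `BoundedRecovery`, `RecoveryAction`, and the result names are hypothetical, and the exact policy (probe before and after the action, count only real attempts) is an assumption rather than the Supervisor's documented behavior.

```typescript
interface RecoveryAction {
  alreadyDone: () => boolean; // idempotency probe: is the desired state already true?
  apply: () => void; // the recovery step itself (restart, re-pend, ...)
}

type RecoveryResult = "noop" | "recovered" | "retry" | "exhausted";

class BoundedRecovery {
  private readonly maxAttempts: number;
  private readonly attempts = new Map<string, number>();

  constructor(maxAttempts: number) {
    this.maxAttempts = maxAttempts;
  }

  // Run one recovery attempt for the incident identified by key.
  run(key: string, action: RecoveryAction): RecoveryResult {
    try {
      if (action.alreadyDone()) {
        this.attempts.delete(key);
        return "noop"; // nothing to do, so repeating the call is harmless
      }
    } catch {
      // Probe failed: treat the state as not done and fall through.
    }

    const used = this.attempts.get(key) ?? 0;
    if (used >= this.maxAttempts) return "exhausted"; // ceiling reached, do not act

    this.attempts.set(key, used + 1);
    try {
      action.apply();
      if (action.alreadyDone()) {
        this.attempts.delete(key);
        return "recovered";
      }
    } catch {
      // A throwing action counts as a failed attempt.
    }
    return used + 1 >= this.maxAttempts ? "exhausted" : "retry";
  }
}
```

With a ceiling of 2, an action that never succeeds yields `"retry"` and then `"exhausted"`, after which further calls return `"exhausted"` without running the action again. That last state is the point at which escalation to a human would apply.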
When a failure is in the work itself, the Supervisor reproduces it through a gate that does not depend on the daemon, spawns an independent fix-worker, and verifies the result against reality before trusting it.
The failure is reproduced and re-run through a gate that runs without the moderator, then confirmed against the live tree — so a fix is real, not a green-looking accident.
A repair worker is spawned outside the daemon and bounded by a semaphore, with a fast git-revert primitive as the always-available floor.
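The repair path described above can be sketched as a semaphore-guarded function with a revert fallback. Everything here is illustrative: `Semaphore`, `RepairDeps`, and `repair` are hypothetical names, the worker and gate are injected as functions, and the sketch is synchronous for brevity, whereas a real fix-worker would be a spawned process.

```typescript
// Minimal non-blocking counting semaphore.
class Semaphore {
  private inUse = 0;
  private readonly limit: number;

  constructor(limit: number) {
    this.limit = limit;
  }

  tryAcquire(): boolean {
    if (this.inUse >= this.limit) return false;
    this.inUse += 1;
    return true;
  }

  release(): void {
    if (this.inUse > 0) this.inUse -= 1;
  }
}

type RepairResult = "fixed" | "reverted" | "deferred";

interface RepairDeps {
  runFixWorker: () => boolean; // hypothetical: run the fix-worker, report whether it produced a change
  verifyGate: () => boolean; // independent gate re-run against the live tree
  revert: () => void; // floor: undo the offending change (for example via git revert)
}

// Attempt one repair. At most `limit` repairs run at once; extra requests are deferred.
function repair(sem: Semaphore, deps: RepairDeps): RepairResult {
  if (!sem.tryAcquire()) return "deferred";
  try {
    let fixed = false;
    try {
      // A fix only counts if the independent gate also passes.
      fixed = deps.runFixWorker() && deps.verifyGate();
    } catch {
      fixed = false;
    }
    if (fixed) return "fixed";

    try {
      deps.revert(); // always-available fallback when the fix did not verify
      return "reverted";
    } catch {
      return "deferred"; // leave it for a later tick rather than throw
    }
  } finally {
    sem.release();
  }
}
```

The ordering matters: the gate runs after the worker and before the result is trusted, so a worker that merely reports success is not enough to return `"fixed"`.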
The Supervisor is built to be the one thing that stays up when everything else is on fire.
It runs as its own launchd job with run-at-load and keep-alive, so a daemon crash, a wedged epic, or a machine hiccup cannot take the watcher with it.
The core runtime has no npm dependencies — the exact failure class it guards against (a bad install) cannot disable it.
It retires the old marathon-watchdog and absorbs the memory reaper into a single authority, closing a process-group leak along the way.
Every path resolves to a safe action and logs it; it recovers silently and pages a human only when it has truly run out of moves.
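The paging policy mentioned earlier (one deduplicated alert with exponential backoff) could look roughly like the gate below. `AlertGate`, its parameters, and the doubling schedule capped at `maxMs` are assumptions for illustration; the text does not give the real intervals or the keying scheme.

```typescript
// Decides whether an alert for a given incident key may be sent right now.
class AlertGate {
  private readonly baseMs: number;
  private readonly maxMs: number;
  private readonly state = new Map<string, { sent: number; nextAtMs: number }>();

  constructor(baseMs: number, maxMs: number) {
    this.baseMs = baseMs;
    this.maxMs = maxMs;
  }

  // Returns true at most once per backoff window for each key.
  shouldSend(key: string, nowMs: number): boolean {
    const prev = this.state.get(key);
    if (prev !== undefined && nowMs < prev.nextAtMs) {
      return false; // duplicate inside the current window: suppress
    }
    const sent = prev === undefined ? 0 : prev.sent;
    // Window doubles with every alert already sent for this key, up to maxMs.
    const delayMs = Math.min(this.baseMs * Math.pow(2, sent), this.maxMs);
    this.state.set(key, { sent: sent + 1, nextAtMs: nowMs + delayMs });
    return true;
  }

  // Forget an incident once it is resolved, so a recurrence alerts immediately.
  resolve(key: string): void {
    this.state.delete(key);
  }
}
```

With `new AlertGate(1000, 8000)`, repeated calls for the same key are allowed at 0 ms, then no earlier than 1000 ms, then no earlier than 3000 ms, and so on, while a different key is unaffected.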
Built by Nyx as a marathon, every increment gated by deterministic tests with injected seams — no live process, model, or launchd in the gate.
Dependency-free tick loop, the launchd plist, and the indestructibility contract.
The read-only ground-truth detection layer.
One Telegram channel with dedup and exponential backoff.
Restart, re-pend, kill-zombies, ensure-deps, remerge — with retry ceilings.
Independent gate re-run and a moderator-independent repair path.
Every incident persisted to file and DB for the next build to act on.
Sole watcher: retired the watchdog, absorbed the reaper, fixed a process-group leak.