Nyx Supervisor — the watcher that cannot be wedged

The problem

A watcher inside the daemon dies with the daemon.

Marathons run for hours with no operator awake. The engine can livelock, leave a worker dead mid-increment, fail a merge, corrupt its own toolchain, or quietly stop ticking. Any guardian that lives inside that process shares its fate. So the Supervisor lives outside it.

Without it

One stuck increment hard-wedges the whole epic. A reparented worker holds gigabytes of RAM. The daemon stops ticking and reports nothing — because the thing meant to report is the thing that died. The operator finds it cold in the morning.

With it

A separate process, supervised by the OS, reads objective state every tick. It sees the stall, restarts or re-pends the work, kills the zombie, and only pages a human when bounded retries are exhausted — with one message, not a firehose.

Where it sits

Outside the blast radius.

The Supervisor is not a module inside the daemon — it is a separate process owned by the operating system, looking in. It reads the engine's ground truth (git, processes, the database), acts on the daemon from the outside, and pages a human only as a last resort. When the daemon dies, the watcher does not.

Figure: the watcher lives under launchd, outside the daemon it guards. Diagram labels: reads ground truth (never the daemon's self-report); recovery actions; human escalation.

How it works

Detect → Recover → Escalate → Learn.

Every tick the Supervisor establishes ground truth, acts within bounds, and records what it saw. The cycle never throws and never blocks on a human.

Detect

Six read-only detectors inspect objective state (git, processes, the DB), never the daemon's self-report.

Recover

A matching bounded action restarts, re-pends, kills zombies, heals deps, or re-merges.

Escalate

Only when retries are exhausted: one deduplicated Telegram alert with exponential backoff.

Learn

Every incident is written to a durable improvement backlog — file and DB — for the next build.
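
As an illustration of the four-step cycle above, here is a minimal sketch of a single tick. It assumes a hypothetical `TickDeps` bundle of injected functions; none of these names or shapes are taken from the Supervisor's source, and the real loop may differ.

```typescript
// Hypothetical shapes for illustration; not the Supervisor's actual types.
type Finding = { kind: string; detail: string };
type Outcome = "recovered" | "retry-later" | "exhausted";

interface TickDeps {
  detectors: Array<() => Finding | null>;    // Detect: read-only probes
  recover: (f: Finding) => Outcome;          // Recover: bounded action
  escalate: (f: Finding) => void;            // Escalate: page a human
  record: (f: Finding, o: Outcome) => void;  // Learn: durable backlog
}

// One tick. Each stage is wrapped so that a failure while handling one
// finding cannot abort the rest of the cycle or reach the caller.
function tick(deps: TickDeps): Outcome[] {
  const outcomes: Outcome[] = [];
  for (const detect of deps.detectors) {
    let finding: Finding | null = null;
    try {
      finding = detect();
    } catch {
      continue; // a broken detector is skipped, not fatal
    }
    if (finding === null) continue;

    let outcome: Outcome = "retry-later";
    try {
      outcome = deps.recover(finding);
      if (outcome === "exhausted") deps.escalate(finding);
    } catch {
      // a throwing recovery or escalation leaves the outcome as last set
    }
    try {
      deps.record(finding, outcome);
    } catch {
      // recording must never take the tick down
    }
    outcomes.push(outcome);
  }
  return outcomes;
}
```

Because every dependency is passed in, the same function can be driven by fakes in a test and by real probes in production.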

Ground truth · never-throw

Six detectors that read reality, not reports.

Each detector returns a finding from objective state and is engineered to never throw. Malformed input, a half-written DB row, or a missing process each resolve to a safe answer, because a guardian that crashes on bad data is no guardian.
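
One way to get the never-throw property is to wrap every probe in a guard. The sketch below is an assumption about how that could look: the `neverThrow` helper and the `Finding` shape are hypothetical, and it treats "no problem found" as the safe answer, which the source does not specify.

```typescript
// Hypothetical finding shape, for illustration only.
type Finding = { detector: string; problem: boolean; note: string };

// Wraps any probe so that exceptions and malformed return values both
// collapse to a "no problem found" finding instead of escaping.
function neverThrow(name: string, probe: () => unknown): () => Finding {
  const safe = (note: string): Finding => ({ detector: name, problem: false, note });
  return () => {
    try {
      const raw = probe();
      if (typeof raw !== "object" || raw === null) return safe("malformed result");
      const r = raw as { problem?: unknown; note?: unknown };
      if (typeof r.problem !== "boolean") return safe("malformed result");
      return {
        detector: name,
        problem: r.problem,
        note: typeof r.note === "string" ? r.note : "",
      };
    } catch (err) {
      return safe("probe threw: " + String(err));
    }
  };
}
```

A probe that parses a half-written row, for example `neverThrow("zombie-workers", () => JSON.parse("{half-written"))`, returns a finding with `problem: false` rather than raising.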

stalled-increment

No forward motion

An increment that has made no progress for longer than its idle threshold while no worker is doing real work.

wedged-epic

Whole epic stuck

An epic blocked behind a single increment that the engine cannot get past on its own.

failed-merge

Integration broke

A verified plan that will not merge back — the classic overnight killer that halts everything after it.

zombie-workers

Orphaned + heavy

Reparented worker processes holding memory with nobody reading their output.

broken-toolchain

Self-inflicted breakage

A polluted node_modules or bad install that makes the verify gate fail for the wrong reason.

moderator-health

Is it even alive?

The daemon stopped ticking or stopped answering — detected from outside, where the daemon cannot hide it.
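
To make the detector idea concrete, here is a minimal sketch of a stalled-increment check written as a pure function over a snapshot of state. The `IncrementState` fields, the `findStalled` name, and the choice to skip unreadable rows are all assumptions for illustration; the real detector reads git, processes, and the DB.

```typescript
// Hypothetical snapshot of objective state; field names are illustrative.
interface IncrementState {
  id: string;
  lastProgressMs: number; // e.g. newest commit time or row update time
  activeWorkers: number;  // live worker processes attached to the increment
}

// Returns the ids of increments that look stalled: idle for longer than the
// threshold while no worker is active. Unreadable rows are skipped, never
// thrown on. The clock is passed in so the check is deterministic.
function findStalled(rows: unknown[], nowMs: number, idleThresholdMs: number): string[] {
  const stalled: string[] = [];
  for (const row of rows) {
    if (typeof row !== "object" || row === null) continue;
    const r = row as Partial<IncrementState>;
    if (typeof r.id !== "string") continue;
    if (typeof r.lastProgressMs !== "number" || !isFinite(r.lastProgressMs)) continue;
    if (typeof r.activeWorkers !== "number") continue;
    const idleMs = nowMs - r.lastProgressMs;
    if (idleMs > idleThresholdMs && r.activeWorkers === 0) stalled.push(r.id);
  }
  return stalled;
}
```

Injecting `nowMs` instead of reading the system clock is what lets a check like this run in a test gate with no live process behind it.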

Bounded recovery

Acts within limits, then asks for help.

Each finding maps to a recovery action with idempotency probes and a hard retry ceiling. Recovery preserves work where it can; it never loops forever.

restart-moderator

Brings the daemon back with pre- and post-probes so a restart is idempotent and never doubles up.

repend-increment

Cancels a stalled increment and re-pends it so the engine retries it cleanly instead of wedging the epic.

kill-zombies

Reaps orphaned, heavy worker processes. The RAM guard that used to be a separate memory reaper now lives inside the Supervisor.

ensure-deps

Repairs a broken or polluted toolchain so the verify gate fails only for real reasons.

remerge

Re-attempts a failed integration against a known-good base instead of leaving the branch stuck.
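
The combination of idempotency probes and a hard retry ceiling can be sketched as follows. This is a hedged illustration: `RecoveryAction`, `BoundedRecovery`, and the probe names are hypothetical, and the actual ceiling values are not given in the source.

```typescript
// Hypothetical action shape; the probe names are illustrative.
interface RecoveryAction {
  name: string;
  alreadyHealthy: () => boolean; // pre-probe: is there anything to do?
  apply: () => void;             // the side effect, e.g. restart the daemon
  healed: () => boolean;         // post-probe: did it work?
}

type Result = "noop" | "recovered" | "failed" | "exhausted";

class BoundedRecovery {
  // Attempts used per incident key; null prototype avoids inherited keys.
  private attempts: Record<string, number> = Object.create(null);
  constructor(private readonly maxAttempts: number) {}

  run(key: string, action: RecoveryAction): Result {
    let healthy = false;
    try { healthy = action.alreadyHealthy(); } catch { healthy = false; }
    if (healthy) {
      delete this.attempts[key]; // healthy again: reset the budget
      return "noop";
    }
    const used = this.attempts[key] || 0;
    if (used >= this.maxAttempts) return "exhausted"; // hand over to escalation
    this.attempts[key] = used + 1; // count the attempt before acting
    try {
      action.apply();
      if (action.healed()) {
        delete this.attempts[key];
        return "recovered";
      }
    } catch {
      // a throwing action or post-probe is just a failed attempt
    }
    return "failed";
  }
}
```

The pre-probe is what makes a repeated call harmless: if the daemon is already back, `run` returns `"noop"` without touching it. The attempt counter is what guarantees the loop ends in `"exhausted"` rather than retrying forever.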

Code self-heal

It can fix the code, on its own terms.

When a failure is in the work itself, the Supervisor reproduces it through a gate that does not depend on the daemon, spawns an independent fix-worker, and verifies the result against reality before trusting it.

Independent gate

Reproduce → gate → confirm-live

The failure is reproduced and re-run through a gate that runs without the moderator, then confirmed against the live tree — so a fix is real, not a green-looking accident.

Fix-worker

Moderator-independent, semaphore-gated

A repair worker is spawned outside the daemon and bounded by a semaphore, with a fast git-revert primitive as the always-available floor.
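
A semaphore-gated repair with a revert fallback might look like the sketch below. It assumes a non-blocking semaphore (a repair that cannot get a slot is simply skipped until a later tick), which is one reading of "bounded by a semaphore", not a confirmed design. `attemptFix` and `revert` are hypothetical injected functions; in practice the revert could wrap `git revert`.

```typescript
// A non-blocking counting semaphore: tryAcquire never waits.
class Semaphore {
  private inUse = 0;
  constructor(private readonly limit: number) {}

  tryAcquire(): boolean {
    if (this.inUse >= this.limit) return false;
    this.inUse += 1;
    return true;
  }

  release(): void {
    if (this.inUse > 0) this.inUse -= 1;
  }
}

type RepairResult = "fixed" | "reverted" | "busy";

// Hypothetical fix-worker entry point with injected seams.
function repair(gate: Semaphore, attemptFix: () => boolean, revert: () => void): RepairResult {
  if (!gate.tryAcquire()) return "busy"; // too many repairs in flight
  try {
    let ok = false;
    try { ok = attemptFix(); } catch { ok = false; }
    if (ok) return "fixed";
    revert(); // the always-available floor
    return "reverted";
  } finally {
    gate.release(); // the slot is returned on every path
  }
}
```

Returning `"busy"` instead of waiting keeps the caller from ever blocking on a repair slot.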

Why it survives

Indestructible by construction.

The Supervisor is built to be the one thing that stays up when everything else is on fire.

Out of process

Owned by launchd, not the daemon

It runs as its own launchd job with RunAtLoad and KeepAlive set, so a daemon crash, a wedged epic, or a machine hiccup cannot take the watcher with it.
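
For readers unfamiliar with launchd, a job definition along these lines would give the behavior described. This is a config sketch, not the project's actual plist: the label and both paths are placeholders, and only the `RunAtLoad` and `KeepAlive` keys correspond to what the text states.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <!-- Placeholder label; the real job name is not given in the source. -->
  <key>Label</key>
  <string>com.example.nyx-supervisor</string>
  <!-- Placeholder paths to the runtime and the entry script. -->
  <key>ProgramArguments</key>
  <array>
    <string>/usr/local/bin/node</string>
    <string>/path/to/supervisor.js</string>
  </array>
  <!-- Start the job as soon as launchd loads it. -->
  <key>RunAtLoad</key>
  <true/>
  <!-- Restart the job whenever it exits. -->
  <key>KeepAlive</key>
  <true/>
</dict>
</plist>
```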

Zero dependencies

Nothing to break in the core

The core runtime has no npm dependencies — the exact failure class it guards against (a bad install) cannot disable it.

One watcher

Sole guardian, no overlap

It retires the old marathon-watchdog and absorbs the memory reaper into a single authority, closing a process-group leak along the way.

Never blocks

Fail-open, fail-loud

Every path resolves to a safe action and logs it. The Supervisor recovers without paging anyone and alerts a human only when it has truly run out of moves.
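
The "one message, not a firehose" behavior comes from deduplicating alerts and backing off exponentially. The sketch below shows only the gating decision, under assumed parameters: `AlertGate`, `baseMs`, and `capMs` are hypothetical names, and the actual Telegram send is left out.

```typescript
// Decides whether an alert for a given incident key may be sent now.
// The same key is allowed again only after base, 2*base, 4*base, ... (capped).
class AlertGate {
  private state: Record<string, { sent: number; lastMs: number }> = Object.create(null);
  constructor(private readonly baseMs: number, private readonly capMs: number) {}

  shouldSend(key: string, nowMs: number): boolean {
    const s = this.state[key];
    if (s === undefined) {
      this.state[key] = { sent: 1, lastMs: nowMs };
      return true; // first sighting always alerts
    }
    const wait = Math.min(this.capMs, this.baseMs * Math.pow(2, s.sent - 1));
    if (nowMs - s.lastMs < wait) return false; // duplicate inside the window
    this.state[key] = { sent: s.sent + 1, lastMs: nowMs };
    return true;
  }

  // Call when the incident is resolved so a recurrence alerts immediately.
  clear(key: string): void {
    delete this.state[key];
  }
}
```

With `new AlertGate(1000, 8000)`, repeated calls for the same key at 0, 500, 1000, 1500, and 3000 ms would send at 0, 1000, and 3000 ms and suppress the rest. A different key is tracked independently.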

How it was built

Seven increments, each green before the next.

Built by Nyx as a marathon, every increment gated by deterministic tests with injected seams — no live process, model, or launchd in the gate.

Core runtime + launchd + README

Dependency-free tick loop, the launchd plist, and the indestructibility contract.

Six never-throw detectors

The read-only ground-truth detection layer.

Consolidated escalation

One Telegram channel with dedup and exponential backoff.

Bounded recovery

Restart, re-pend, kill-zombies, ensure-deps, remerge — with retry ceilings.

Self-heal + completion gate + fix-worker

Independent gate re-run and a moderator-independent repair path.

Post-incident learning backlog

Every incident persisted to file and DB for the next build to act on.

Consolidation + cutover

Sole watcher: retired the watchdog, absorbed the reaper, fixed a process-group leak.