SystemsDec 8, 20256 min read

The cron job that quietly stopped

Silent success is indistinguishable from silent failure, which is why every job that matters needs to tell you it ran.

Most jobs that fail loudly get fixed by Tuesday. The dangerous ones are the jobs that fail by not running at all — no error, no stack trace, no page at 3am. A cron job doesn't throw an exception when it doesn't fire. It just sits there, scheduled, and the scheduler quietly forgets it. Six weeks later someone asks why the weekly export is missing a month and a half of data, and you discover the job stopped the day someone renamed the server.

The trap is that healthy and dead look identical from the outside. A backup that ran perfectly produces silence. A backup that never started produces the same silence. You are not monitoring the job — you are monitoring the absence of complaints, and absence of complaints is the worst signal in software because it arrives on a delay measured in disasters.

Absence is not a signal you can observe

Every alerting setup I've inherited has the same blind spot. It watches for the thing that happened: the exception, the 500, the disk filling up. None of it watches for the thing that should have happened and didn't. You cannot write a log line from a process that never woke up. You cannot catch an error in code that never ran. The failure lives in the gap, and the gap emits nothing.

This is why "it'll alert us if something breaks" is a comforting lie for scheduled work. The alert is wired to the job's execution path. If the job doesn't execute, the path is never reached, and the alert is structurally incapable of firing. You built a smoke detector and put it inside the thing that's supposed to catch fire.

The fix is to invert the question. Stop asking "did the job report a failure?" and start asking "did the job report at all?" Health becomes a heartbeat someone outside the job is listening for, not a verdict the job renders on itself.

A job that can only tell you when it fails will tell you nothing on the day it matters most.

Make success speak, then watch for its silence

The pattern is a dead-man's switch. The job, on a successful run, pings an external watcher. The watcher knows the schedule. If the expected ping doesn't arrive inside the window, the watcher — not the job — raises the alarm. The thing that detects the problem is the one component guaranteed to still be alive.

You can buy this as a service or build it in an afternoon. The shape barely changes either way.

nightly-export.ts


await runNightlyExport();
await fetch("https://hc.example.com/ping/export-job", {
method: "POST",
});

The ping is the last line, not the first — it only fires if the real work finished.

Two details earn their keep. The ping goes last, after the work, so a success signal means success and not merely "the process started." And the watcher owns the schedule, because only something external to the job can notice the job's absence. Move the clock outside the thing being clocked.

The same discipline scales down. A job I rely on emails a one-line digest every morning — "export complete, 14,902 rows, 38s." Most days I delete it without reading it. That reflex is the point: the morning it doesn't arrive, I notice the hole in my inbox faster than any dashboard would have shown me. The boring confirmation trains the instinct that catches the silence.

Tell me it ran, and tell me enough

A heartbeat that only says "alive" is a start, but the richest signal is a confirmation that carries proof of work. Not "the job ran" but "the job ran and did the thing it exists to do."

→Count the output. Zero rows exported is a failure wearing a success costume.
→Compare to last time. A backup that's a tenth of yesterday's size ran, technically, and is still a catastrophe.
→Stamp the window. "Processed orders through 23:59" tells you coverage; "processed orders" tells you nothing.

I've watched a sync run green for a month while silently filtering out every record because an upstream field got renamed. The job succeeded. The job did nothing. Exit code zero is a statement about the process, not about the work, and the two diverge exactly when you can least afford it.

So the confirmation should say what changed, not just that the code reached the end. The cheapest version is a single line with a number in it. The number is what turns a heartbeat into evidence.

Every job that matters needs a witness — something outside it that expects to hear from it and gets loud when it doesn't. Build the witness before you need it, because the day you need it is the day you'll have no idea anything is wrong.

#Systems#ReliabilityShare ↗

→ / AUTHOR

Ionut Dumitru

Full-stack engineer and product designer. Writes about building products where the engineering and the design are the same job.

GitHub ↗X ↗

→ / NEXT

AIDec 1, 2025

Prompts are code, so treat them that way →

← All writingionutdumitru.com