DNS is always the problem, eventually
Half of debugging infrastructure is just remembering to check DNS an hour earlier than you did.
Most outages start the same way. Something that worked at 4 p.m. stops working at 4:15, nobody deployed anything, and the logs are full of timeouts that point everywhere and nowhere. You stare at application code, then at the load balancer, then at the database connection pool. An hour later, exhausted, you finally run dig against the thing you've been assuming was fine — and there it is. The record changed, the TTL expired, the resolver is handing back stale or wrong answers. DNS was the problem. DNS is usually the problem.
I've stopped treating this as a punchline and started treating it as a checklist. The joke that "it's always DNS" is true often enough that the only real mistake is checking it last. Half of debugging infrastructure is just remembering to look an hour earlier than you did.
Why it hides so well
DNS is the one dependency every other system quietly assumes is free and instant. Your service connects to a database by hostname. Your load balancer health-checks an upstream by hostname. Your CI runner pulls an image from a registry by hostname. None of that code mentions DNS, so none of your mental models include it. It's the plumbing behind the wall — invisible until it leaks.
It also fails in ways that seem designed to mislead. A bad deploy fails immediately and loudly. DNS fails on a delay, partially, and inconsistently:
- A record changes but a resolver three hops away is still serving the old answer until its TTL expires.
- Half your pods get the new IP and half get the old one, so the bug is real for 50% of traffic and invisible to whoever's debugging the other 50%.
- A negative cache pins an NXDOMAIN for five minutes after the record already exists, so "it's fixed" and "it's still broken" are both true.
Every one of those symptoms looks like an application bug, a network partition, or a flaky dependency. You chase the symptom because the symptom is where the error message points. The error message is lying to you — not maliciously, just because the failure happened a layer below where your logs can see.
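The negative-cache case is the easiest of the three to put a number on. As a rough sketch: RFC 2308 has resolvers cache an NXDOMAIN for the smaller of the SOA record's own TTL and the SOA MINIMUM field, so the zone's SOA tells you the longest a "does not exist" answer should linger. The zone name and values below are made up for illustration, in the format `dig +noall +authority` prints, and individual resolvers may cap the value lower.

```python
# Sketch: upper bound on how long an NXDOMAIN can stay cached for a zone.
# Per RFC 2308, negative TTL = min(TTL of the SOA record, SOA MINIMUM field).
# SAMPLE_SOA is hypothetical data, not output from a real zone.

def negative_cache_ttl(soa_line):
    """Return the negative-caching TTL in seconds from one SOA answer line."""
    fields = soa_line.split()
    # Layout: name ttl class SOA mname rname serial refresh retry expire minimum
    if len(fields) != 11 or fields[3].upper() != "SOA":
        raise ValueError(f"not an SOA record: {soa_line!r}")
    soa_ttl = int(fields[1])
    soa_minimum = int(fields[10])
    return min(soa_ttl, soa_minimum)


SAMPLE_SOA = (
    "example.test. 3600 IN SOA ns1.example.test. hostmaster.example.test. "
    "2024010101 7200 900 1209600 300"
)

print(negative_cache_ttl(SAMPLE_SOA))  # 300 seconds: the five-minute pin
```

If that number is larger than you expected, that is how long "it's still broken" can stay true after the record exists.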
The cache is the whole story
Almost every confusing DNS incident is really a caching incident. There is no single answer to "what does this name resolve to" — there are a dozen answers, one per resolver, each with its own expiry. The authoritative record is correct, and every layer between you and it can still be wrong for a while.
This is why "but I already updated the record" is the most dangerous sentence in an incident. Updating the record is the easy part. What matters is who is still holding the old one, and for how long. A 24-hour TTL you set six months ago and forgot about is a 24-hour outage waiting for the day you need to move fast.
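The "who is still holding the old one, and for how long" question is plain arithmetic. Here is a minimal sketch, assuming every cache honors the TTL exactly; the resolver names and timestamps are hypothetical.

```python
# Sketch: how long can the old record survive after you change it?
# Assumes resolvers honor the TTL as published; names and times are invented.
from datetime import datetime, timedelta


def last_stale_moment(change_time, old_ttl_seconds):
    """Worst case: a resolver cached the old record just before the change,
    so it may keep serving it for one full old TTL afterwards."""
    return change_time + timedelta(seconds=old_ttl_seconds)


def still_stale(cached_at, old_ttl_seconds, now):
    """Names of resolvers whose cached copy of the old record has not expired."""
    ttl = timedelta(seconds=old_ttl_seconds)
    return sorted(name for name, t in cached_at.items() if t + ttl > now)


change = datetime(2024, 1, 1, 16, 0)  # record updated at 4 p.m.
caches = {
    "office-resolver": datetime(2024, 1, 1, 15, 59),  # cached a minute before
    "cluster-resolver": datetime(2024, 1, 1, 4, 0),   # cached early that morning
}

print(last_stale_moment(change, 86400))  # 2024-01-02 16:00:00
print(still_stale(caches, 86400, datetime(2024, 1, 2, 5, 0)))  # ['office-resolver']
```

With a 24-hour TTL, the resolver that refreshed one minute before the change is still wrong the next morning, while the one that refreshed twelve hours earlier has already moved on.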
The practical defenses are boring and they work:
- Lower TTLs before a planned migration, not during it. The lower TTL only helps once caches have picked it up, so publish it at least one full old-TTL period ahead of the change.
- When you debug, query the authoritative server directly and then a public resolver, and compare. If they disagree, you've found your layer. If they agree, move on, fast.
- Trust dig, not your application. The app caches inside the runtime too. A JVM whose networkaddress.cache.ttl works out to "cache forever" (the default when a security manager is installed, or an explicit -1) will happily pin a resolution for the life of the process and never tell you.
That last one bites hardest. You can fix the record, flush every resolver you control, confirm with dig from the same host — and the service is still broken because the process resolved the name once at startup and cached it in memory for the next three weeks.
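The second defense, comparing the authoritative answer with a resolver's, can be sketched as a small wrapper around dig. This is one possible shape, not a definitive tool: it assumes dig is installed, and the hostname, nameserver, and sample answers are placeholders. Only the parsing and comparison run here; `dig_answer` is defined but not called.

```python
# Sketch: compare what the authoritative server says with what a resolver says.
# Assumes the `dig` CLI is available; server and host names are hypothetical.
import subprocess


def dig_answer(name, server, rtype="A"):
    """Run dig against one server and return only the answer section."""
    cmd = ["dig", f"@{server}", name, rtype, "+noall", "+answer"]
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=10, check=True)
    return result.stdout


def parse_answer(text):
    """Map each record's data (for an A record, the IP) to its remaining TTL."""
    records = {}
    for line in text.splitlines():
        fields = line.split()
        if line.startswith(";") or len(fields) < 5:
            continue  # skip comments and blank or malformed lines
        records[" ".join(fields[4:])] = int(fields[1])
    return records


def compare(authoritative, resolver):
    """Report which answers only one side is serving."""
    auth, res = set(authoritative), set(resolver)
    return {
        "match": auth == res,
        "only_authoritative": sorted(auth - res),
        "only_resolver": sorted(res - auth),
    }


# Made-up sample output standing in for two live dig calls.
auth = parse_answer("app.example.test.\t300\tIN\tA\t203.0.113.10\n")
cached = parse_answer("app.example.test.\t41\tIN\tA\t198.51.100.7\n")
print(compare(auth, cached))
```

Against live servers the two inputs would come from something like `dig_answer("app.example.test", "ns1.example.test")` and `dig_answer("app.example.test", "1.1.1.1")`. A resolver TTL that is counting down on an address the authoritative server no longer returns tells you both which layer is stale and roughly when it will stop being stale. None of this sees inside the application's own in-process cache.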
Make it the first question, not the last
The fix isn't more DNS knowledge. It's reordering your instincts. When something that worked stops working and nothing shipped, resolution is one of a handful of things that can change underneath you on its own — certificates expiring, disks filling, and DNS records or caches turning over. Those deserve to be checked first precisely because nobody touched them, which is exactly why they don't show up in the deploy log you're scanning for a culprit.
So I keep a thirty-second reflex at the top of the runbook: resolve the hostname from the affected host, resolve it from the authoritative server, and compare the answers and the TTLs. If they match and the timing is sane, I've spent thirty seconds to rule out the most common cause of "it worked yesterday." If they don't match, I've just skipped the hour I would otherwise have spent reading application traces that were never going to mention the real problem.
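As a rough sketch of the "from the affected host" half of that reflex: the snippet below asks the operating system's resolver path, the one most applications go through, and compares the result with a set of addresses you obtained from the authoritative server by some other means. The hostname and addresses are placeholders, and because `socket.getaddrinfo` does not expose TTLs, this version compares answers only.

```python
# Sketch: what does this host resolve the name to, versus the authoritative answer?
# Hostname and addresses are hypothetical; TTLs are not visible via getaddrinfo.
import socket


def local_view(hostname, resolve=socket.getaddrinfo):
    """Addresses the host's own resolver path returns for hostname."""
    infos = resolve(hostname, None, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def verdict(local, authoritative):
    """Turn the comparison into a one-line runbook answer."""
    if local == authoritative:
        return "match: resolution looks sane, move on"
    if local & authoritative:
        return "partial: some answers are stale or missing, suspect a cache"
    return "mismatch: this host is resolving to something else entirely"


# Example with placeholder data instead of a live lookup.
print(verdict({"198.51.100.7"}, {"203.0.113.10"}))
```

On a real host the first argument would be `local_view("app.example.test")`. The `resolve` parameter exists only so the lookup can be stubbed out in a test.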
DNS isn't special because it breaks more than other systems. It breaks about as often as anything else. It's special because it breaks quietly, on a delay, behind a cache you forgot existed — and because checking it costs almost nothing. The lesson was never "DNS is fragile." The lesson is that the cheapest check is the one you keep saving for last.
