DNS monitoring 101: what to check beyond the A record
DNS is the most reliable infrastructure in the world right up until the moment it's the only thing in the world that's broken, at which point absolutely nothing in your stack will work. The 2021 Facebook outage, the 2019 Google Cloud event, the 2021 Slack incident — all DNS-shaped. This post is about what to actually monitor at the DNS layer beyond "is the A record there."
Why DNS keeps causing outages
Three structural reasons. First, DNS changes propagate everywhere — through caches you don't own, ISPs you can't debug, browsers with their own resolvers. A wrong record is a wrong record for hours, not seconds. Second, DNS is invisible until it fails: nothing reminds you of it on a good day, so the knowledge atrophies. Third, every "simple" DNS change touches multiple records (A, AAAA, CNAME, MX, TXT, NS) and misconfiguring any one of them produces a different breakage.
What you should actually monitor
The minimum useful DNS monitoring set is six things. Most teams have one or two.
1. The A and AAAA records resolve to the expected IPs
From multiple resolvers (Google's 8.8.8.8, Cloudflare's 1.1.1.1, Quad9's 9.9.9.9, plus regional resolvers). If the answer differs across resolvers and you aren't intentionally serving location-specific answers, you've got a propagation issue. A common alert: "example.com resolves to an IP outside the expected pool of three."
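A minimal Python sketch of that check, assuming dig is installed and that shelling out to it is acceptable in your monitor. The function names, the resolver table, and the 192.0.2.x addresses in the usage note are illustrative assumptions, not part of any particular product.

```python
import subprocess

# Public resolvers named in the text; add regional resolvers as needed.
RESOLVERS = {"google": "8.8.8.8", "cloudflare": "1.1.1.1", "quad9": "9.9.9.9"}


def query_a_records(resolver_ip: str, name: str) -> set[str]:
    """Ask one resolver for A records via `dig +short` (requires dig on PATH)."""
    out = subprocess.run(
        ["dig", f"@{resolver_ip}", name, "A", "+short"],
        capture_output=True, text=True, timeout=10, check=False,
    ).stdout
    # +short may also print CNAME targets; keep only lines that look like IPv4.
    return {line for line in out.split() if line.replace(".", "").isdigit()}


def find_unexpected(answers: dict[str, set[str]], expected: set[str]) -> dict[str, set[str]]:
    """Per resolver, return IPs outside the expected pool.

    A resolver that returned nothing is reported too, with an empty set.
    """
    problems = {}
    for resolver, ips in answers.items():
        extra = ips - expected
        if extra or not ips:
            problems[resolver] = extra
    return problems
```

In a real check you might build the input with something like {label: query_a_records(ip, "example.com") for label, ip in RESOLVERS.items()} and alert whenever find_unexpected returns a non-empty dict.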
2. The CNAME chain is intact and short
CNAME → CNAME → CNAME → A is legal but slow and fragile. Each hop is another resolver round trip and another opportunity for a TTL mismatch. Monitor that the chain resolves and that it's no deeper than 2 hops for production records.
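One way to sketch the depth check in Python. The lookup_cname callable is hypothetical: it stands in for whatever DNS client you use and is assumed to return the CNAME target of a name, or None when the name has no CNAME. The hostnames in the usage note are made up.

```python
from typing import Callable, Optional

MAX_HOPS = 2  # production limit suggested in the text


def cname_chain(
    name: str,
    lookup_cname: Callable[[str], Optional[str]],
    limit: int = 10,
) -> list[str]:
    """Follow CNAMEs from `name`; return every name visited, starting with `name`."""
    chain = [name]
    # Hops so far = len(chain) - 1; stop following once we pass `limit` hops.
    while len(chain) - 1 <= limit:
        target = lookup_cname(chain[-1])
        if target is None:
            return chain  # chain ends here (presumably at A/AAAA records)
        if target in chain:
            raise ValueError(f"CNAME loop at {target}")
        chain.append(target)
    raise ValueError(f"CNAME chain longer than {limit} hops")


def chain_too_deep(chain: list[str]) -> bool:
    """Hops are CNAME links, so a chain of N names has N - 1 hops."""
    return len(chain) - 1 > MAX_HOPS
```

With a dict of test records, cname_chain("www.example.com", records.get) walks the chain without touching the network, which makes the depth rule easy to unit test before wiring in real lookups.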
3. Nameservers respond
Query each of the authoritative nameservers directly (not via a resolver). If one of your four nameservers stops responding, your domain is at 75% capacity; if two stop, you're minutes from a full outage. Tools like dig +trace follow the chain from root to the authoritative servers — that's the right diagnostic at 3am.
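A rough Python sketch of a direct liveness probe, assuming dig is on the PATH. The function names are illustrative, and the "warn on one failure" tier is an assumption added here; the text itself only says that two failures are serious.

```python
import subprocess


def nameserver_responds(ns_host: str, zone: str) -> bool:
    """Ask one authoritative server directly for the zone's SOA record."""
    result = subprocess.run(
        ["dig", f"@{ns_host}", zone, "SOA",
         "+norecurse", "+short", "+time=3", "+tries=1"],
        capture_output=True, text=True, check=False,
    )
    # dig exits non-zero when no reply arrives; an empty answer is also a failure.
    return result.returncode == 0 and bool(result.stdout.strip())


def severity(results: dict[str, bool]) -> str:
    """Map per-nameserver results (True = responded) to an alert level."""
    failed = [ns for ns, ok in results.items() if not ok]
    if len(failed) >= 2:
        return "page"
    if failed:
        return "warn"  # assumption: one dead nameserver is worth a warning
    return "ok"
```

Keeping the probe and the severity rule separate means the rule can be tested with canned results, and the probe can later be swapped for a DNS library without touching the alert logic.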
4. TTLs are sane
Before a planned migration, lower TTLs to 60-300 seconds so propagation is fast when you cut over. After the migration, raise them back to 3600+ to reduce query load and resolver fanout. The most common mistake: leaving 86400-second TTLs in place forever, which means rollbacks take a full day.
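The thresholds above fit in a few lines of Python. This is a sketch under assumptions: the function name is made up, and treating "a change within the next 24 hours" as the migration window is a choice made here, not a rule from the text.

```python
from typing import Optional

MIGRATION_MAX_TTL = 300  # upper end of the 60-300 second pre-cutover range
STEADY_MIN_TTL = 3600    # the "3600+" steady-state floor
STALE_TTL = 86400        # day-long TTLs make rollbacks take a day


def ttl_finding(ttl: int, hours_to_change: Optional[float] = None) -> Optional[str]:
    """Return a short finding, or None when the TTL looks sane."""
    if hours_to_change is not None and hours_to_change <= 24:
        if ttl > MIGRATION_MAX_TTL:
            return f"lower TTL from {ttl}s to {MIGRATION_MAX_TTL}s or less before cutover"
        return None
    if ttl >= STALE_TTL:
        return f"TTL {ttl}s means a rollback takes about a day"
    if ttl < STEADY_MIN_TTL:
        return f"raise TTL from {ttl}s back to {STEADY_MIN_TTL}s or more"
    return None
```

For example, ttl_finding(3600, hours_to_change=12) flags a record that is fine day to day but too sticky for tomorrow's cutover.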
5. SPF, DKIM, DMARC for email
Email deliverability lives in DNS. Monitor the TXT records: an SPF record that needs more than 10 DNS lookups fails evaluation with a permanent error, and nothing tells you. A DKIM key rotated on the email provider's end without a matching update to your DNS record starts failing signature checks. DMARC reports going to a dead inbox tell you nothing.
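The SPF lookup budget is easy to check mechanically. Below is a Python sketch that counts the terms that cost a DNS lookup (include, a, mx, ptr, exists, and the redirect modifier). The fetch_spf callable is hypothetical: it is assumed to return the SPF record text for a domain so nested includes can be counted. Without it, only the top-level record is counted, which understates the real total. The domains in the usage note are invented.

```python
from typing import Callable, Optional

# Mechanisms that trigger a DNS lookup during SPF evaluation.
LOOKUP_MECHANISMS = {"include", "a", "mx", "ptr", "exists"}


def count_spf_lookups(
    record: str,
    fetch_spf: Optional[Callable[[str], str]] = None,
    _seen: Optional[set] = None,
) -> int:
    """Count lookup-costing terms, following include/redirect when fetch_spf is given."""
    seen = _seen if _seen is not None else set()
    count = 0
    for term in record.split()[1:]:  # skip the leading "v=spf1"
        term = term.lstrip("+-~?")   # drop the qualifier, if any
        if term.lower().startswith("redirect="):
            count += 1
            target = term.split("=", 1)[1]
        else:
            name, _, arg = term.partition(":")
            name = name.split("/", 1)[0].lower()  # "a/24" -> "a"
            if name not in LOOKUP_MECHANISMS:
                continue  # ip4, ip6, all, exp= cost nothing
            count += 1
            target = arg if name == "include" else None
        # Visit each nested domain once so a circular include cannot loop forever.
        if target and fetch_spf is not None and target not in seen:
            seen.add(target)
            count += count_spf_lookups(fetch_spf(target), fetch_spf, seen)
    return count
```

A record such as "v=spf1 mx a:host.example.com include:_spf.mail.example -all" counts as 3 at the top level; whatever the included record costs is added on top once fetch_spf is supplied.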
6. DNSSEC chain (if you use it)
DNSSEC is great when it works and catastrophic when it doesn't: a broken signature makes your domain unresolvable for DNSSEC-validating resolvers. Monitor the chain of trust from the root through your domain's DS records. Vendors like Cloudflare do most of this for you, but you still need an alert if it breaks.
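A simple signal to alert on is the header of a dig query sent to a validating resolver, for example dig +dnssec example.com @1.1.1.1. The Python sketch below classifies that text output: the ad flag means the resolver validated the answer, and SERVFAIL from a validating resolver is the usual symptom of a broken chain. The function name and return labels are made up for illustration, and a SERVFAIL can have other causes, so treat it as a prompt to investigate rather than proof.

```python
import re


def dnssec_state(dig_output: str) -> str:
    """Classify dig's header lines as validated, not-validated, servfail, or unparseable."""
    status = re.search(r"status: (\w+)", dig_output)
    # The first "flags:" match is the message header, which dig prints
    # before the EDNS pseudosection.
    flags = re.search(r"flags: ([a-z ]+);", dig_output)
    if not status or not flags:
        return "unparseable"
    if status.group(1) == "SERVFAIL":
        return "servfail"
    if "ad" in flags.group(1).split():
        return "validated"
    return "not-validated"
```

For a signed production domain, anything other than "validated" from a major validating resolver is worth an alert; for an unsigned domain, "not-validated" is the normal result.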
Propagation: the part that confuses everyone
"DNS hasn't propagated yet" is the most overloaded sentence in operations. There are actually three different things people mean by it:
- Authoritative propagation: how long until all of your nameservers serve the new record. Usually seconds; if it's slow your DNS provider has a replication problem.
- Resolver cache expiry: how long until upstream caching resolvers (ISPs, public resolvers) refresh. Driven by the TTL on the old record. This is where the wait happens.
- Client cache expiry: browsers and operating systems also cache. Chrome caches for up to about 60 seconds; some JVM configurations cache successful lookups forever. This is the longest tail.
Practical implication: querying your authoritative nameservers directly catches authoritative propagation, and monitoring from multiple public resolvers catches resolver cache expiry. To catch the long tail of client caches, you need synthetic checks from real-world locations (ideally on real broadband connections, not just from data centers).
The DNS incidents to prepare for
The TTL mismatch
You point www at a new load balancer. Most of the world updates within an hour. A long-tail 5% sees the old IP for 24 hours. If the old IP is dead, those users see a flat failure with no obvious explanation. Lower TTLs before any change you might roll back.
The wildcard takeover
You delete a subdomain's CNAME pointing at a SaaS vendor — but the wildcard *.example.com still resolves to the same vendor. An attacker registers the subdomain on that vendor's platform and now they serve content from your domain. Monitor for unexpected resolution of wildcard subdomains; never leave dangling CNAMEs.
The registrar lock
A misclick in the registrar UI removes the transfer lock. Six months later the domain is hijacked. Monitor registrar status: many registrars expose a status field via API or RDAP that should always read clientTransferProhibited for production domains.
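A sketch of the status check in Python, assuming you already have the parsed JSON of an RDAP domain response as a dict. RDAP writes statuses with spaces ("client transfer prohibited") while registrar APIs and WHOIS often use the EPP spelling (clientTransferProhibited), so the helper accepts both. The function names are illustrative.

```python
import re

REQUIRED_STATUS = "client transfer prohibited"


def _normalize(status: str) -> str:
    """Turn EPP camelCase into the lowercase, space-separated RDAP form."""
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", status).lower().strip()


def transfer_lock_present(rdap_domain: dict) -> bool:
    """True when the domain's status list includes the transfer lock."""
    statuses = {_normalize(s) for s in rdap_domain.get("status", [])}
    return REQUIRED_STATUS in statuses
```

Run it on a schedule against each production domain and alert when it returns False.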
The single-provider failure
If all four of your authoritative nameservers are at one provider, a provider-level outage takes you out. Use a secondary DNS provider (or at least split nameservers between two providers) for any production domain. Monitor that both sets of nameservers serve the same records.
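The parity check between providers comes down to a set comparison. A minimal Python sketch, assuming you have already collected each provider's answers into a dict keyed by (name, record type); how you collect them is up to your tooling, and the names and addresses in the usage note are placeholders.

```python
RecordSets = dict[tuple[str, str], frozenset[str]]


def provider_drift(primary: RecordSets, secondary: RecordSets) -> dict:
    """Return every (name, type) whose record data differs between providers.

    Each value is a (primary_data, secondary_data) pair; a record missing
    at one provider shows up with an empty frozenset on that side.
    """
    drift = {}
    for key in primary.keys() | secondary.keys():
        ours = primary.get(key, frozenset())
        theirs = secondary.get(key, frozenset())
        if ours != theirs:
            drift[key] = (ours, theirs)
    return drift
```

An empty result means both nameserver sets agree; any entry, such as ("www.example.com", "A") carrying two different IP sets, is a sync failure to chase before it turns into split answers for users.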
Tools you'll reach for
- dig: the workhorse. dig @1.1.1.1 example.com asks a specific resolver. dig +trace example.com walks from root to authoritative.
- nslookup: Windows' default; mostly the same as dig with a worse interface.
- kdig: modernized dig with DNSSEC and TLS support; worth installing on a debug box.
- DNSViz (dnsviz.net): a visual chain-of-trust tool; the fastest way to find a DNSSEC break.
- Hosted monitors: Uptimera, DNSPerf, Catchpoint, and others run continuous queries against your records from many vantage points.
What to alert on
High-signal alerts (page someone immediately):
- Any A or AAAA record returns a different IP than the expected set.
- Two or more authoritative nameservers fail to respond.
- DNSSEC validation fails from any major resolver.
- TLS certificate hostname mismatch (your DNS now points at a host that doesn't serve the right cert).
Lower-priority alerts (warn during business hours):
- TTL on a critical record is above 3600 seconds 24 hours before a scheduled change.
- SPF record reaches 8 lookups (you have 2 of the 10-lookup budget left).
- WHOIS shows registrar lock has been removed.
- Any record changed without a corresponding change-management ticket (only if you have that integration set up).
Where to go from here
Even adding three monitors (A record consistency, nameserver liveness, and DNSSEC validity if you use it) catches the majority of DNS-shaped outages. The rest (CNAME chain depth, TTL hygiene, email records, and registrar status) are quarterly check-ins more than continuous monitors.
Uptimera includes DNS checks in every plan; configure them against any domain you own from the dashboard.