SSL certificate monitoring: never get caught by an expiry again

Uptimera team · 8 min read

TLS certificate expiries should be a solved problem. They aren't. In the past three years a major airline, a video conferencing provider, a national bank, and at least two large telcos have all gone dark because a certificate they owned expired without anyone noticing. Auto-renewal helps; it doesn't eliminate the risk. This post walks through what to monitor at the TLS layer beyond just the expiry date.

Why expiries still happen in 2026

With Let's Encrypt, ACME, and short-lived certs as the default, you'd think the problem is over. The five reasons it isn't:

  • Non-public certs. Internal services, mTLS between microservices, and staging environments often use certs from a private CA rather than a public one, and those only auto-renew if your team built the renewal itself.
  • Legacy long-lived certs. The 2-year EV cert that someone bought before automation existed. It works fine; nobody touches it; it expires on a Saturday.
  • ACME challenge failures. Auto-renewal needs the ACME HTTP-01 or DNS-01 challenge to work. A misconfigured rewrite rule, a CNAME pointing somewhere new, or a DNS provider that doesn't support the right API is enough to break it, and the renewal attempt that starts about 30 days before expiry fails without anyone noticing.
  • Pinned certificates. Mobile apps that pin a specific cert. When you rotate, the pinned app stops trusting you — visible only when users open it.
  • CT log issues. A cert issued without being logged to Certificate Transparency will be rejected by modern browsers. Rare but it happens during outages at smaller CAs.

What to monitor (beyond the expiry date)

A robust TLS monitor checks six things:

1. Days until expiry

The obvious one. The standard warning schedule: 30 days, 14 days, 7 days, 3 days, 1 day. Don't collapse this into a single alert — escalate it. If the 30-day alert was missed (it usually is), the 7-day alert should land somewhere harder to ignore than the 30-day one.
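The days-remaining number can be computed with Python's standard library alone. The following is a minimal sketch, not a production monitor: the function names are ours, and fetch_not_after assumes the host is reachable and still presents a certificate that passes default validation (an already-expired cert raises during the handshake instead of returning a date).

```python
import socket
import ssl
import time
from typing import Optional

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after: str, now: Optional[float] = None) -> int:
    """Whole days between `now` (epoch seconds) and a cert's notAfter string.

    `not_after` uses the format returned by getpeercert(),
    e.g. "Jan  1 00:00:00 2030 GMT". Negative means already expired.
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return int((expires_at - now) // SECONDS_PER_DAY)


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Connect with SNI and return the leaf certificate's notAfter string."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A monitor would call days_until_expiry(fetch_not_after("example.com")) on a schedule and feed the result into whatever escalation logic it uses.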

2. Hostname coverage

The cert's Subject Alternative Names (SANs) must match the hostname you're serving from. A common failure: you add www.example.com in production but the cert only covers example.com. Chrome rejects the connection with NET::ERR_CERT_COMMON_NAME_INVALID; other browsers show an equivalent hostname-mismatch error.
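A coverage check boils down to comparing the hostname against each SAN entry. The sketch below is a simplified illustration with a function name of our choosing: it handles exact matches and a wildcard in the leftmost label only, and it ignores IP-address SANs and partial wildcards such as f*.example.com.

```python
from typing import Iterable


def san_covers(hostname: str, san_dns_names: Iterable[str]) -> bool:
    """Return True if any SAN DNS entry covers `hostname`.

    A wildcard entry ("*.example.com") covers exactly one extra label,
    so it matches www.example.com but not example.com or a.b.example.com.
    """
    host_labels = hostname.lower().rstrip(".").split(".")
    for pattern in san_dns_names:
        pattern_labels = pattern.lower().rstrip(".").split(".")
        if len(pattern_labels) != len(host_labels):
            continue
        if pattern_labels[0] == "*":
            # Compare everything to the right of the wildcard label.
            if pattern_labels[1:] == host_labels[1:]:
                return True
        elif pattern_labels == host_labels:
            return True
    return False
```

With Python's ssl module, the DNS entries can be taken from getpeercert()["subjectAltName"], which is a sequence of (type, value) pairs; keep the values whose type is "DNS".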

3. Chain completeness

Your server must serve the full chain (leaf + intermediates). Missing intermediates is the single most common misconfiguration, and the most insidious, because desktop browsers often cache intermediates or fetch missing ones, which hides the problem. curl and many mobile clients don't. Monitor with openssl s_client -connect host:443 -showcerts and assert the chain has at least 2 certs.
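The "at least 2 certs" assertion can be automated by counting PEM blocks in the chain section of the s_client output. This is a rough sketch with hypothetical function names; it assumes the openssl binary is on the PATH and that the output contains the "Certificate chain" and "Server certificate" section markers, which is what the OpenSSL versions we have seen print.

```python
import subprocess

PEM_HEADER = "-----BEGIN CERTIFICATE-----"


def count_served_certs(s_client_output: str) -> int:
    """Count PEM certificates inside the 'Certificate chain' section."""
    _, found, rest = s_client_output.partition("Certificate chain")
    if not found:
        return 0
    # Stop at the next section so a repeated leaf is never counted twice.
    chain_section = rest.partition("Server certificate")[0]
    return chain_section.count(PEM_HEADER)


def served_chain_length(host: str, port: int = 443, timeout: float = 15.0) -> int:
    """Run openssl s_client -showcerts and count the certs the server sent."""
    result = subprocess.run(
        ["openssl", "s_client", "-connect", f"{host}:{port}",
         "-servername", host, "-showcerts"],
        input="",  # empty stdin so s_client exits after the handshake
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return count_served_certs(result.stdout)
```

A result of 1 means the server is sending only the leaf; 0 means the handshake did not get far enough to show a chain at all.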

4. Signature validity

Each cert in the chain must validate against the next. An intermediate or cross-sign that expires or is revoked breaks the whole chain for every client that relies on it. It is rare, but it happens: the expiry of Let's Encrypt's cross-signed chain in 2024 is one example.
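One way to monitor this is to let the TLS library validate the chain and record why it refused. The sketch below uses Python's default trust store, so it reflects what that one client would see rather than every client. The labels are our own naming for OpenSSL's numeric verify codes, and the default context does not check revocation, so the "revoked" label only appears if you enable that separately.

```python
import socket
import ssl
from typing import Optional

# Our own labels for OpenSSL X509_V_ERR_* verify codes.
VERIFY_CODE_LABELS = {
    7: "bad_signature",
    10: "expired_cert_in_chain",
    18: "self_signed_leaf",
    19: "self_signed_in_chain",
    20: "issuer_not_found",
    21: "leaf_not_verifiable",
    23: "revoked",
    62: "hostname_mismatch",
}


def label_verify_code(code: int) -> str:
    """Map an OpenSSL verify code to a short label for alerting."""
    return VERIFY_CODE_LABELS.get(code, f"verify_error_{code}")


def chain_problem(host: str, port: int = 443, timeout: float = 10.0) -> Optional[str]:
    """Return None if the chain validates, else a label for the failure."""
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                return None
    except ssl.SSLCertVerificationError as exc:
        return label_verify_code(exc.verify_code)
```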

5. Protocol and cipher health

TLS 1.0 and 1.1 are deprecated and disabled in modern browsers. Your server still accepting them is a configuration drift signal even if not an outright failure. Monitor for: TLS 1.2 minimum, no NULL or anonymous ciphers, no export-grade ciphers, no RC4.
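The policy above can be expressed as two small predicates over the negotiated protocol version and cipher name. This is a sketch under assumptions: the token list is our own heuristic over OpenSSL and IANA cipher names, not an exhaustive registry. Note that a modern client only reveals what the server prefers; proving that TLS 1.0 and 1.1 are actually refused needs a probe that offers only those versions, which your local OpenSSL build may itself refuse to do.

```python
import re
import socket
import ssl
from typing import Optional, Tuple

ACCEPTABLE_VERSIONS = {"TLSv1.2", "TLSv1.3"}
# Name fragments that mark NULL, anonymous, export-grade, or RC4 suites.
WEAK_TOKENS = {"NULL", "RC4", "EXP", "EXPORT", "ADH", "AECDH", "ANON"}


def version_ok(version: Optional[str]) -> bool:
    """True if the negotiated protocol meets the TLS 1.2 minimum."""
    return version in ACCEPTABLE_VERSIONS


def is_weak_cipher(cipher_name: str) -> bool:
    """True if any '-' or '_' separated token of the name is in WEAK_TOKENS."""
    tokens = set(re.split(r"[-_]", cipher_name.upper()))
    return bool(tokens & WEAK_TOKENS)


def negotiated(host: str, port: int = 443, timeout: float = 10.0) -> Tuple[Optional[str], str]:
    """Handshake with default settings; return (protocol version, cipher name)."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.version(), tls.cipher()[0]
```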

6. OCSP / OCSP stapling

OCSP (Online Certificate Status Protocol) tells the client whether a cert has been revoked. OCSP stapling lets your server include that proof in the handshake so the client doesn't have to call the CA. If stapling is configured but failing, clients fall back to direct OCSP queries — slower, sometimes blocked. Monitor for successful stapled OCSP responses.
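Python's standard ssl module does not expose the stapled response, so one workable approach is to run openssl s_client with the -status flag and read its output. The sketch below assumes the marker strings printed by the OpenSSL versions we have seen ("OCSP Response Status: successful", "Cert Status: good", "OCSP response: no response sent"); treat them as assumptions and adjust if your build words them differently. Function names are ours.

```python
import subprocess


def stapling_status(s_client_output: str) -> str:
    """Classify the OCSP stapling section of `openssl s_client -status` output."""
    if "OCSP Response Status: successful" in s_client_output:
        if "Cert Status: good" in s_client_output:
            return "stapled_good"
        return "stapled_not_good"
    if "OCSP response: no response sent" in s_client_output:
        return "not_stapled"
    return "unknown"


def check_stapling(host: str, port: int = 443, timeout: float = 15.0) -> str:
    """Run openssl s_client -status against the host and classify the result."""
    result = subprocess.run(
        ["openssl", "s_client", "-connect", f"{host}:{port}",
         "-servername", host, "-status"],
        input="",  # empty stdin so s_client exits after the handshake
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return stapling_status(result.stdout)
```

If stapling is supposed to be on, anything other than "stapled_good" is worth an alert.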

A reasonable warning schedule

Different teams use different cadences. A defensive default that catches the things that go wrong:

  • 30 days: ticket created in tracker, assigned to the cert's owner. Low priority.
  • 14 days: Slack message to the on-call channel. Medium priority. If auto-renewal is on, this is where you confirm it actually renewed.
  • 7 days: page the cert owner (or whoever owns the service). High priority. At this point you have one week of business hours to renew manually if automation has failed.
  • 3 days: page the on-call. Critical. The on-call may not know the renewal process, but they can escalate.
  • 1 day: page everyone who could plausibly help. There is no time for politeness at this point.
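The schedule above maps cleanly onto a threshold table. Here is a minimal sketch; the action names are placeholders for whatever your ticketing, chat, and paging integrations are called.

```python
from typing import Optional

# (days remaining threshold, action), most urgent first.
ESCALATION = [
    (1, "page_everyone"),
    (3, "page_on_call"),
    (7, "page_cert_owner"),
    (14, "slack_on_call_channel"),
    (30, "ticket_for_cert_owner"),
]


def escalation_for(days_left: int) -> Optional[str]:
    """Return the most urgent action whose threshold has been reached."""
    for threshold, action in ESCALATION:
        if days_left <= threshold:
            return action
    return None  # more than 30 days left: nothing to do yet
```

Because the function returns the current tier rather than firing once at each boundary, a missed 30-day ticket still turns into a Slack message at 14 days and a page at 7.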

Auto-renewal failure modes to watch

If you use Let's Encrypt with cert-manager, certbot, or any ACME client, these are the patterns that silently break:

  • Rate limits. Let's Encrypt caps how many certificates it will issue per registered domain per week. A deploy that creates many similar certs (one per preview branch) can hit the cap; new orders for that domain are then refused until the window resets.
  • HTTP-01 challenge blocked. The challenge path /.well-known/acme-challenge/ must be reachable over plain HTTP on port 80, without auth and without rewriting. Let's Encrypt follows redirects, so a blanket HTTP-to-HTTPS redirect is safe only if the HTTPS side serves the same challenge file; a CDN rule that redirects to a different host or intercepts the path will fail the challenge.
  • DNS-01 challenge stale. DNS-01 requires inserting a TXT record. If the cleanup step from a previous renewal failed, you have stale records lingering; some CAs reject orders with conflicting records.
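  • Permissions drift. The ACME client's API key for your DNS provider expires or gets rotated. Every renewal attempt then fails quietly, from the start of the renewal window (typically the last 30 days of a 90-day cert) until the cert expires.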
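All of these failures share one observable symptom: the served cert is deeper into its renewal window than a working ACME client would allow. A sketch of that outcome-based check follows. The 30-day window and 3-day grace period are assumptions; set them to match your client's configuration.

```python
import ssl
import time
from typing import Optional

SECONDS_PER_DAY = 86400


def renewal_overdue(
    not_after: str,
    renew_before_days: int = 30,  # assumed: client renews 30 days before expiry
    grace_days: int = 3,          # assumed: slack for retries and propagation
    now: Optional[float] = None,
) -> bool:
    """True if the served cert should already have been replaced.

    `not_after` uses the getpeercert() format, e.g. "Jan  1 00:00:00 2030 GMT".
    """
    if now is None:
        now = time.time()
    days_left = (ssl.cert_time_to_seconds(not_after) - now) / SECONDS_PER_DAY
    return days_left < renew_before_days - grace_days
```

This does not say which failure mode hit, only that one did, which is usually enough to get someone to read the ACME client's logs while there is still time.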

Tools to do this

  • Quick check from a terminal: openssl s_client -connect example.com:443 -servername example.com shows the cert and chain. echo | openssl s_client ... 2>/dev/null | openssl x509 -noout -dates isolates the validity dates.
  • In-browser: click the padlock; modern browsers show the cert subject, issuer, validity, and SANs.
  • SSL Labs: the canonical third-party scanner. Run it before any major TLS change. Slow, but produces the most complete report.
  • Uptimera SSL checker: our free SSL checker covers all six items above and runs from multiple regions.
  • Continuous monitoring: configure SSL monitors as part of the same monitoring system that watches your uptime. The cert check should be as first-class as the HTTP check.

Where to go from here

List every TLS certificate your team is responsible for. Public sites, internal services, mTLS between services, mobile app pins. For each one: who renews it, how, and what monitoring is in place. Half the items on that list usually don't have all three answers. That's where this work pays for itself.
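If the inventory lives in code or a spreadsheet export, finding the gaps can be mechanical. The following is a small sketch with hypothetical field names and made-up example entries; adapt the record shape to however your team tracks certs.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class CertRecord:
    name: str                             # hostname or service the cert protects
    owner: Optional[str] = None           # who renews it
    renewal_method: Optional[str] = None  # how it gets renewed
    monitoring: Optional[str] = None      # what watches it


def missing_answers(record: CertRecord) -> List[str]:
    """List which of the three questions have no answer for this cert."""
    gaps = []
    if not record.owner:
        gaps.append("owner")
    if not record.renewal_method:
        gaps.append("renewal_method")
    if not record.monitoring:
        gaps.append("monitoring")
    return gaps


def audit(inventory: List[CertRecord]) -> Dict[str, List[str]]:
    """Map each incomplete cert's name to its missing answers."""
    return {r.name: missing_answers(r) for r in inventory if missing_answers(r)}
```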