March 14th, 2026

New

SRE - Uptime Monitoring, Status Boards & Incident Management

Uptime monitoring

  • HTTP/HTTPS checks against endpoints with configurable intervals, timeouts, and expected status codes.

  • Regional or multi-location checks where supported, with latency and availability history.

  • Alerting when targets become unreachable or return unexpected responses.
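As a rough illustration of what a single HTTP check does, here is a minimal sketch. It is not the product's implementation; the names `HttpCheck`, `CheckResult`, `classify`, and `probe` are hypothetical, and the defaults (60 s interval, 10 s timeout, status 200) are assumptions chosen for the example.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HttpCheck:
    """Hypothetical check configuration; field names are illustrative only."""
    url: str
    interval_s: float = 60.0   # how often a scheduler would run the probe
    timeout_s: float = 10.0    # give up on the request after this long
    expected_statuses: frozenset = frozenset({200})


@dataclass(frozen=True)
class CheckResult:
    up: bool
    status: Optional[int]
    latency_s: Optional[float]
    reason: str


def classify(check: HttpCheck, status: Optional[int],
             latency_s: Optional[float]) -> CheckResult:
    """Turn a raw observation into an up/down verdict."""
    if status is None:
        # No HTTP response at all: DNS failure, refused connection, timeout.
        return CheckResult(False, None, latency_s, "unreachable")
    if status not in check.expected_statuses:
        return CheckResult(False, status, latency_s, f"unexpected status {status}")
    return CheckResult(True, status, latency_s, "ok")


def probe(check: HttpCheck) -> CheckResult:
    """Perform one request (urllib follows redirects) and classify the outcome."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(check.url, timeout=check.timeout_s) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:
        status = exc.code          # the server answered, but with an error status
    except OSError:
        status = None              # URLError and timeouts are OSError subclasses
    return classify(check, status, time.monotonic() - start)
```

Keeping `classify` separate from `probe` means the alerting decision can be exercised without network access.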

Heartbeat monitoring

  • Cron and job-style heartbeats: services report success on a schedule; missed beats raise incidents.

  • Grace periods and recovery when jobs resume after transient failures.
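The heartbeat logic above can be sketched as a small pure function. This is an assumption about how such a check might be evaluated, not a description of the actual service; `heartbeat_status` and its state names ("ok", "grace", "missed") are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def heartbeat_status(last_beat: datetime, now: datetime,
                     period: timedelta, grace: timedelta) -> str:
    """Classify a job by how overdue its next expected beat is.

    "ok"     : the next beat is not due yet
    "grace"  : the beat is late but still inside the grace period
    "missed" : the grace period has elapsed; an incident would be raised
    """
    overdue = now - (last_beat + period)
    if overdue <= timedelta(0):
        return "ok"
    if overdue <= grace:
        return "grace"
    return "missed"


# Example: an hourly job with a five-minute grace period.
last = datetime(2026, 3, 14, 12, 0, tzinfo=timezone.utc)
hourly = timedelta(hours=1)
grace = timedelta(minutes=5)
state = heartbeat_status(last, last + timedelta(minutes=63), hourly, grace)
```

Recovery falls out of the same function: once a new beat arrives, `last_beat` moves forward and the status returns to "ok".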

SSL monitoring

  • Certificate expiry tracking with advance warnings before renewal deadlines.

  • Certificate chain validation and hostname mismatch detection where applicable.
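For readers who want to see the idea in code, the following sketch uses Python's standard `ssl` module. It is illustrative only; `fetch_not_after`, `days_until_expiry`, and the 10-second timeout are hypothetical names and values. The default SSL context verifies the certificate chain and the hostname during the handshake, so a broken chain or a mismatch surfaces as `ssl.SSLCertVerificationError`.

```python
import socket
import ssl
import time
from typing import Optional


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Return the peer certificate's notAfter string, e.g. 'Jan  5 09:34:43 2018 GMT'.

    Raises ssl.SSLCertVerificationError if the chain or hostname fails validation.
    """
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days remaining before the certificate expires (negative if already expired)."""
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires - current) / 86400
```

A warning threshold (for example 30 days, again an assumption) would then be a simple comparison against the value returned by `days_until_expiry`.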

ICMP / ping monitoring

  • Reachability checks via ICMP echo (ping) for hosts and infrastructure that expose ICMP.

  • Packet loss and round-trip time trends for network health visibility.
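To make the packet-loss and round-trip figures concrete, here is a small sketch that summarizes a batch of ping samples. Collecting the samples is left out on purpose, since sending ICMP echo requests usually needs raw sockets or the system `ping` tool. The function name `summarize_pings` and the convention that `None` marks a lost packet are assumptions made for this example.

```python
from statistics import mean
from typing import Dict, List, Optional


def summarize_pings(rtts_ms: List[Optional[float]]) -> Dict[str, Optional[float]]:
    """Summarize one batch of echo attempts.

    Each entry is a round-trip time in milliseconds, or None if no reply arrived.
    """
    if not rtts_ms:
        raise ValueError("at least one sample is required")
    replies = [r for r in rtts_ms if r is not None]
    lost = len(rtts_ms) - len(replies)
    return {
        "loss_pct": 100.0 * lost / len(rtts_ms),
        # RTT statistics are undefined when every packet was lost.
        "min_ms": min(replies) if replies else None,
        "avg_ms": mean(replies) if replies else None,
        "max_ms": max(replies) if replies else None,
    }


summary = summarize_pings([10.0, None, 20.0, 30.0])
```

Storing one such summary per interval is enough to plot loss and latency trends over time.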

Incident management / Maintenance updates

  • Incident lifecycle: open, acknowledge, resolve, with timelines and ownership.

  • Correlation of alerts into incidents, with post-incident context for outages.
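The open, acknowledge, resolve lifecycle can be pictured as a small state machine. The sketch below is one possible model, not the product's data model: the `Incident` class, the allowed transitions (including resolving directly from open), and the rule that acknowledging assigns ownership are all assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import List, Optional, Tuple


class State(Enum):
    OPEN = "open"
    ACKNOWLEDGED = "acknowledged"
    RESOLVED = "resolved"


# Assumed transition table: an incident only moves forward.
ALLOWED = {
    State.OPEN: {State.ACKNOWLEDGED, State.RESOLVED},
    State.ACKNOWLEDGED: {State.RESOLVED},
    State.RESOLVED: set(),
}


@dataclass
class Incident:
    title: str
    owner: Optional[str] = None
    state: State = State.OPEN
    # Each timeline entry: (timestamp, previous state, new state, actor).
    timeline: List[Tuple[datetime, State, State, str]] = field(default_factory=list)

    def transition(self, new_state: State, actor: str,
                   at: Optional[datetime] = None) -> None:
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"cannot go from {self.state.value} to {new_state.value}")
        when = at if at is not None else datetime.now(timezone.utc)
        self.timeline.append((when, self.state, new_state, actor))
        self.state = new_state
        if new_state is State.ACKNOWLEDGED:
            self.owner = actor  # assumption: whoever acknowledges takes ownership


incident = Incident(title="API returning 5xx")
incident.transition(State.ACKNOWLEDGED, actor="alice")
incident.transition(State.RESOLVED, actor="alice")
```

The timeline list doubles as the post-incident record of who did what and when.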

Status pages / boards

  • Public or internal status views reflecting current service health and ongoing incidents.

  • Historical uptime and incident communication in one place for stakeholders.

  • Aggregated public vendor status and incident feeds in one place: see when AWS, GitHub, Datadog, or other dependencies are down without tab-hopping.
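One common way to roll many component and vendor states into a single board headline is to show the worst one. The sketch below illustrates that idea under stated assumptions: the status names, their severity ordering, and the `overall_status` function are hypothetical and are not taken from the product.

```python
from typing import Dict

# Assumed severity ordering, from healthy to worst.
SEVERITY = {
    "operational": 0,
    "degraded": 1,
    "partial_outage": 2,
    "major_outage": 3,
}


def overall_status(components: Dict[str, str]) -> str:
    """Return the most severe status among all components.

    An empty board is reported as "operational"; an unknown status name
    raises KeyError so that bad feed data is not silently ignored.
    """
    if not components:
        return "operational"
    return max(components.values(), key=SEVERITY.__getitem__)


board = {
    "api": "operational",
    "dashboard": "degraded",
    "vendor:github": "major_outage",
}
headline = overall_status(board)
```

Internal services and third-party vendor feeds can share the same mapping, which is what lets a single board cover both.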