Enable paging for Gerrit (was: Gerrit outage didn't page until 4.5 hours after the first alert)
Closed, ResolvedPublic

Description

T423027: 2026-04-12 Gerrit Outage (was: DiskSpace) was filed at 10:08 UTC. The first user report came in, and the problem became an outage, at around 14:15 UTC. It didn't trigger a paging alert until:

14:38:51 <+jinxer-wm> FIRING: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from gerrit.discovery.wmnet in eqiad #_page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh

Note: disk usage had also been rising rapidly since around midnight UTC, so there was a 10-hour window in which the problem could have been detected even earlier.

The page came around 15 minutes after the first user report and 4.5 hours after automated monitoring detected a problem. It probably should have gone off a bit louder, given that Gerrit is a fairly critical part of the infrastructure and the outage caused secondary alerting from at least authdns-update failing and CI issues.
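For reference, the gaps quoted above can be checked with a few lines of Python. This is only a sketch: it assumes the IRC timestamp is in UTC, takes the date from the title of T423027, and treats "around midnight" as exactly 00:00. The variable names are illustrative.

```python
from datetime import datetime

# Timestamps taken from this task's description (assumed to be UTC).
disk_growth_start = datetime(2026, 4, 12, 0, 0, 0)   # "around midnight UTC", approximate
first_alert = datetime(2026, 4, 12, 10, 8, 0)        # T423027 filed automatically
page_fired = datetime(2026, 4, 12, 14, 38, 51)       # ATSBackendErrorsHigh #_page on IRC

# Time between the first automated alert and the first page.
alert_to_page = page_fired - first_alert

# Time between disk usage starting to rise and the first automated alert.
growth_to_alert = first_alert - disk_growth_start

print(alert_to_page)    # 4:30:51
print(growth_to_alert)  # 10:08:00
```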

Event Timeline

For the record, T423027 was originally automatically opened to report:

* **summary**: Disk space gerrit2003:9100:/ 3.822% free

Thank you for opening this task. For additional context: paging alerts for Gerrit were also previously discussed in T365148.

(rm'ing as a subtask, as iiuc this is a follow-up task rather than something that needs to be done before the incident can be resolved)

LSobanski closed this task as a duplicate of Restricted Task. Apr 13 2026, 3:52 PM

Reopening this task after all as it's more relevant at this time.

LSobanski merged a task: Restricted Task.
LSobanski added subscribers: ssingh, ABran-WMF, ArielGlenn and 7 others.
LSobanski triaged this task as Medium priority. Thu, Apr 16, 4:23 PM

Quoting @Jelto from T365148:

I'm adding the Sustainability (Incident Followup) tag because of the last incident T423027 and raising the priority. We should decide if we want to page for Gerrit (in general/outside of office hours).

Need technical work:
Moving the service behind the CDN (T365259)

Gerrit is behind the CDN now.

Because of that, there is also a built-in alert: ATSBackendErrorsHigh cache_text sre (gerrit.discovery.wmnet). So we are kind of paging for Gerrit already.

Document a common set of remediation steps that can be safely done by a responder (together with RelEng)

This is also a follow-up of T423027. We need runbooks, and we need to update the existing documentation.

LSobanski renamed this task from Gerrit outage didn't page until 4.5 hours after the first alert to Enable paging for Gerrit (was: Gerrit outage didn't page until 4.5 hours after the first alert). Mon, Apr 20, 3:58 PM

Next steps

  • Create a separate blackbox check with severity:page (in addition to the existing IRC one)
  • Create an alert to fire after the above check is firing for a set time (15 minutes?)
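To illustrate the intended behaviour of the two steps above, here is a minimal Python sketch of "page only after the check has been failing continuously for a set time". It is not the actual Puppet or alerting configuration; `PagingState`, `observe`, and `PAGE_AFTER` are hypothetical names, and the 15-minute threshold is the value proposed above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Assumed threshold, taken from the "15 minutes?" proposal in this task.
PAGE_AFTER = timedelta(minutes=15)


@dataclass
class PagingState:
    """Tracks how long a blackbox-style check has been failing."""

    failing_since: Optional[datetime] = None

    def observe(self, check_ok: bool, now: datetime) -> bool:
        """Record one probe result and return True if a page should be firing."""
        if check_ok:
            # Any successful probe resets the timer, so brief blips never page.
            self.failing_since = None
            return False
        if self.failing_since is None:
            # First failure in a row: start the timer.
            self.failing_since = now
        # Page once the check has failed continuously for the full threshold.
        return now - self.failing_since >= PAGE_AFTER


state = PagingState()
t0 = datetime(2026, 4, 12, 14, 15)
print(state.observe(False, t0))                          # False
print(state.observe(False, t0 + timedelta(minutes=15)))  # True
```

The point of the delay is that a short-lived failure still shows up on IRC through the existing check, while only a sustained failure wakes someone up.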

Change #1278238 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] gerrit: add paging blackbox check

https://gerrit.wikimedia.org/r/1278238

ABran-WMF changed the task status from Open to In Progress. Tue, Apr 28, 6:57 AM
ABran-WMF moved this task from Work in Progress to Awaiting Input on the collaboration-services board.

Change #1278238 merged by Arnaudb:

[operations/puppet@production] gerrit: add paging blackbox check

https://gerrit.wikimedia.org/r/1278238

A page will now be sent if the blackbox monitoring check fails for more than 15 minutes.