
Icinga: page in case all MediaWiki are throwing 5xx
Closed, ResolvedPublic

Description

In https://wikitech.wikimedia.org/wiki/Incident_documentation/20180129-MediaWiki, even though all the wikis were throwing 5xx, we didn't get any page.

Event Timeline

Volans created this task. Jan 30 2018, 11:05 PM
Restricted Application added a subscriber: Aklapper. Jan 30 2018, 11:05 PM
Peachey88 added a project: Wikimedia-Incident.
Peachey88 moved this task from On-going to Follow-up on the Wikimedia-Incident board.
Dzahn added a subscriber: Dzahn. Feb 1 2018, 11:29 PM

There are existing checks for this, one for each data center. It would be easy to just flip those to "critical => true".

There are 2 different ones, reqstats_threshold and reqstats_anomaly.

modules/monitoring/manifests/graphite_threshold.pp:# metric => 'reqstats.5xx',
modules/monitoring/manifests/graphite_anomaly.pp:# description => 'Anomaly in number of 5xx responses',

But the way the ticket is phrased sounds like a logical AND: we only want pages if all 4 checks per group, one for each DC, are critical at the same time, and not individual pages for, say, just "upload in esams". Is that right?
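To make the proposed aggregation explicit, here is a minimal sketch of the "logical AND" semantics in question (illustrative only; the real Icinga configuration would express this through check and notification definitions, and the DC names below are just examples):

```python
def should_page(dc_states):
    """Return True only when every per-DC check is critical at the
    same time (logical AND), rather than paging on any single DC."""
    return all(state == "CRITICAL" for state in dc_states.values())

# All four DCs critical at once -> page
print(should_page({"eqiad": "CRITICAL", "codfw": "CRITICAL",
                   "esams": "CRITICAL", "ulsfo": "CRITICAL"}))  # True

# Only esams critical -> no page
print(should_page({"eqiad": "OK", "codfw": "OK",
                   "esams": "CRITICAL", "ulsfo": "OK"}))  # False
```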

Dzahn triaged this task as High priority. Feb 1 2018, 11:30 PM

@Dzahn my understanding is that those checks have some defects: first, they are a bit delayed, and second, they have produced false positives in the past. So either we improve the heuristic alarms to be reliable, quick, and as free of false positives as possible, or we could consider a different alarm for the "it's all down" case.

Joe added a subscriber: Joe. Feb 6 2018, 12:25 PM

We do have functional checks for all services but MediaWiki. There is an old ticket (T136839) about enabling swagger specs for MediaWiki, thus allowing us to use service-checker to perform functional checks.

There is even a patch to mediawiki/core about that, https://gerrit.wikimedia.org/r/#/c/307913/, but I lost track of where we stopped working on it. It's probably even my fault.

I would probably revive that work.

Change 408785 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] alerts: add varnish HTTP availability

https://gerrit.wikimedia.org/r/408785

Tangentially related to this, and something I wanted to experiment with: the straw man in https://gerrit.wikimedia.org/r/408785 calculates an availability figure and alerts on that instead. I'll also work on fixing https://phabricator.wikimedia.org/T181410, which we'd need for this.
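For context, a minimal sketch of how an availability figure can be derived from per-status-class request counts (an illustration of the general idea, not the actual query used by the patch; the counter names here are hypothetical):

```python
def availability(status_counts):
    """Compute an availability ratio from a mapping of HTTP status
    class (e.g. "2xx", "5xx") to request counts: the fraction of
    responses that were not server errors."""
    total = sum(status_counts.values())
    if total == 0:
        return 1.0  # no traffic: treat as fully available
    return 1.0 - status_counts.get("5xx", 0) / total

# Example: 500 server errors out of 100,000 requests -> 99.5% availability
print(availability({"2xx": 99000, "4xx": 500, "5xx": 500}))  # 0.995
```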

Change 408785 merged by Filippo Giunchedi:
[operations/puppet@production] alerts: add varnish/nginx HTTP availability

https://gerrit.wikimedia.org/r/408785

Change 428305 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] Restrict HTTP availability alerts to Varnish/Nginx text/upload

https://gerrit.wikimedia.org/r/428305

Change 428305 merged by Filippo Giunchedi:
[operations/puppet@production] Restrict HTTP availability alerts to Varnish/Nginx text/upload

https://gerrit.wikimedia.org/r/428305

fgiunchedi added a comment (edited). Apr 23 2018, 3:23 PM

Alerted today on a real, short-lived issue. Note that the alert is a single one even though its text can change over time (e.g. as more sites alert), so Icinga needs to be instructed to re-alert whenever the text changes. Other improvements include printing the "worst" value found among all metrics that match the query.

15:01:47 PROBLEM - HTTP availability for Nginx -SSL terminators- on einsteinium is CRITICAL: cluster=cache_text site={codfw,ulsfo} 
15:03:37 PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
15:03:47 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
15:04:18 PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
15:05:07 PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
15:05:47 RECOVERY - HTTP availability for Nginx -SSL terminators- on einsteinium is OK: (No output returned from plugin)
15:11:38 RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
15:12:27 RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
15:13:07 RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
15:13:48 RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
ema added a subscriber: ema. Apr 23 2018, 3:34 PM

> Alerted today, real short-lived issue. [...]

The actual 503 spike lasted from 14:59 till 15:02. The new alert is much more accurate than the graphite-based one, which only reported a full recovery 11 minutes after the actual problem was gone.

There was a spike of 500s yesterday in codfw, apparently from search.wikimedia.org (tracked at T193600):

19:34 -icinga-wm:#wikimedia-operations- PROBLEM - HTTP availability for Nginx -SSL terminators- on einsteinium is 
          CRITICAL: cluster=cache_text site=codfw 
          https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
19:38 -icinga-wm:#wikimedia-operations- PROBLEM - HTTP availability for Nginx -SSL terminators- on einsteinium is 
          CRITICAL: cluster=cache_text site=codfw 
          https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
19:39 -icinga-wm:#wikimedia-operations- PROBLEM - HTTP availability for Varnish on einsteinium is CRITICAL: 
          job=varnish-text site=codfw 
          https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
19:50 -icinga-wm:#wikimedia-operations- RECOVERY - HTTP availability for Nginx -SSL terminators- on einsteinium is OK: 
          (No output returned from plugin) 
          https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1

Change 431542 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: alert on per-site HTTP availability

https://gerrit.wikimedia.org/r/431542

Change 431542 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: alert on per-site HTTP availability

https://gerrit.wikimedia.org/r/431542

We now have availability-based alerts (i.e. 5xx / all status codes) for Varnish and ATS. I believe those can be made paging now, as we haven't seen false positives with the 99.5% (warn) and 99% (crit) thresholds.
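As a sketch of the warn/crit logic described above (the thresholds come from this comment; the function itself is illustrative, not the actual Icinga/Prometheus check):

```python
def alert_level(availability, warn=0.995, crit=0.99):
    """Map an availability ratio to an alert state using the
    thresholds above: below 99% is critical, below 99.5% is a
    warning, otherwise OK."""
    if availability < crit:
        return "CRITICAL"
    if availability < warn:
        return "WARNING"
    return "OK"

print(alert_level(0.999))  # OK
print(alert_level(0.992))  # WARNING
print(alert_level(0.985))  # CRITICAL
```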

fgiunchedi moved this task from Inbox to Up next on the observability board. Dec 9 2019, 12:11 PM

Change 555987 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] monitoring: page on low HTTP global availability

https://gerrit.wikimedia.org/r/555987

Change 555987 merged by Filippo Giunchedi:
[operations/puppet@production] monitoring: page on low HTTP global availability

https://gerrit.wikimedia.org/r/555987

fgiunchedi closed this task as Resolved. Dec 11 2019, 11:34 AM
fgiunchedi claimed this task.

Tentatively resolving; we'll be paging if >= 1% of global traffic is 5xx.