
Icinga: page in case all MediaWiki are throwing 5xx
Closed, ResolvedPublic

Description

In https://wikitech.wikimedia.org/wiki/Incident_documentation/20180129-MediaWiki, even though all the wikis were throwing 5xx, we didn't get any page.

Event Timeline

Volans created this task. Jan 30 2018, 11:05 PM
Restricted Application added a subscriber: Aklapper. Jan 30 2018, 11:05 PM
Peachey88 added a project: Wikimedia-Incident.
Peachey88 moved this task from On-going to Follow-up on the Wikimedia-Incident board.
Dzahn added a subscriber: Dzahn. Feb 1 2018, 11:29 PM

There are existing checks for this, one for each data center. It would be easy to just flip those to "critical => true".

There are 2 different ones, reqstats_threshold and reqstats_anomaly.

modules/monitoring/manifests/graphite_threshold.pp:# metric => 'reqstats.5xx',
modules/monitoring/manifests/graphite_anomaly.pp:# description => 'Anomaly in number of 5xx responses',

But the way the ticket is phrased sounds like a logical AND: we only want pages if all 4 checks per group, one for each DC, are critical at the same time, and not individual pages for, say, just "upload in esams". Is that right?
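To make the proposed aggregation explicit, here is a minimal sketch of the "logical AND" semantics in question (illustrative only; the real Icinga configuration would express this through check and notification definitions, and the DC names below are just examples):

```python
def should_page(dc_states):
    """Return True only when every per-DC check is critical at the
    same time (logical AND), rather than paging on any single DC."""
    return all(state == "CRITICAL" for state in dc_states.values())

# All four DCs critical at once -> page
print(should_page({"eqiad": "CRITICAL", "codfw": "CRITICAL",
                   "esams": "CRITICAL", "ulsfo": "CRITICAL"}))  # True

# Only esams critical -> no page
print(should_page({"eqiad": "OK", "codfw": "OK",
                   "esams": "CRITICAL", "ulsfo": "OK"}))  # False
```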

Dzahn triaged this task as High priority. Feb 1 2018, 11:30 PM

@Dzahn my understanding is that those checks have some defects: first, they are a bit delayed, and second, they have produced false positives in the past. So either we improve the heuristic alarms to be reliable, quick, and as free of false positives as possible, or we could consider a different alarm for the "it's all down" case.

Joe added a subscriber: Joe. Feb 6 2018, 12:25 PM

We do have functional checks for all services but MediaWiki. There is an old ticket (T136839) about enabling swagger specs for MediaWiki, thus allowing us to use service-checker to perform functional checks.

There is even a patch to mediawiki/core about that, https://gerrit.wikimedia.org/r/#/c/307913/, but I lost track of where we stopped working on it. It's probably even my fault.

I would probably revive that work.

Change 408785 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] alerts: add varnish HTTP availability

https://gerrit.wikimedia.org/r/408785

Tangentially related to this, and something I wanted to experiment with: the straw man in https://gerrit.wikimedia.org/r/408785 calculates an availability figure and alerts on that instead. I'll also work on fixing https://phabricator.wikimedia.org/T181410, which we'd need for this.
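For context, a minimal sketch of how an availability figure can be derived from per-status-class request counts (an illustration of the general idea, not the actual query used by the patch; the counter names here are hypothetical):

```python
def availability(status_counts):
    """Compute an availability ratio from a mapping of HTTP status
    class (e.g. "2xx", "5xx") to request counts: the fraction of
    responses that were not server errors."""
    total = sum(status_counts.values())
    if total == 0:
        return 1.0  # no traffic: treat as fully available
    return 1.0 - status_counts.get("5xx", 0) / total

# Example: 500 server errors out of 100,000 requests -> 99.5% availability
print(availability({"2xx": 99000, "4xx": 500, "5xx": 500}))  # 0.995
```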

Change 408785 merged by Filippo Giunchedi:
[operations/puppet@production] alerts: add varnish/nginx HTTP availability

https://gerrit.wikimedia.org/r/408785

Change 428305 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] Restrict HTTP availability alerts to Varnish/Nginx text/upload

https://gerrit.wikimedia.org/r/428305

Change 428305 merged by Filippo Giunchedi:
[operations/puppet@production] Restrict HTTP availability alerts to Varnish/Nginx text/upload

https://gerrit.wikimedia.org/r/428305

fgiunchedi added a comment (edited). Apr 23 2018, 3:23 PM

Alerted today on a real, short-lived issue. Note that the alert is a single one even though its text can change over time (e.g. as more sites alert), so Icinga needs to be instructed to re-alert whenever the text changes. Other improvements include printing the "worst" value found among all metrics that match the query.

15:01:47 PROBLEM - HTTP availability for Nginx -SSL terminators- on einsteinium is CRITICAL: cluster=cache_text site={codfw,ulsfo} 
15:03:37 PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
15:03:47 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
15:04:18 PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
15:05:07 PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
15:05:47 RECOVERY - HTTP availability for Nginx -SSL terminators- on einsteinium is OK: (No output returned from plugin)
15:11:38 RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
15:12:27 RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
15:13:07 RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
15:13:48 RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
ema added a subscriber: ema. Apr 23 2018, 3:34 PM

> Alerted today, real short-lived issue. [...]

The actual 503 spike lasted from 14:59 till 15:02. The new alert is much more accurate than the graphite-based one, which only reported a full recovery 11 minutes after the actual problem was gone.

There was a spike of 500s yesterday in codfw, apparently from search.wikimedia.org (tracked at T193600):

19:34 -icinga-wm:#wikimedia-operations- PROBLEM - HTTP availability for Nginx -SSL terminators- on einsteinium is 
          CRITICAL: cluster=cache_text site=codfw 
          https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
19:38 -icinga-wm:#wikimedia-operations- PROBLEM - HTTP availability for Nginx -SSL terminators- on einsteinium is 
          CRITICAL: cluster=cache_text site=codfw 
          https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
19:39 -icinga-wm:#wikimedia-operations- PROBLEM - HTTP availability for Varnish on einsteinium is CRITICAL: 
          job=varnish-text site=codfw 
          https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
19:50 -icinga-wm:#wikimedia-operations- RECOVERY - HTTP availability for Nginx -SSL terminators- on einsteinium is OK: 
          (No output returned from plugin) 
          https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1

Change 431542 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: alert on per-site HTTP availability

https://gerrit.wikimedia.org/r/431542

Change 431542 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: alert on per-site HTTP availability

https://gerrit.wikimedia.org/r/431542

We now have availability-based alerts (i.e. 5xx / all status codes) for Varnish and ATS. I believe those can be made paging now, as we haven't seen false positives with the 99.5% (warn) and 99% (crit) thresholds.
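As a sketch of the warn/crit logic described above (the thresholds come from this comment; the function itself is illustrative, not the actual Icinga/Prometheus check):

```python
def alert_level(availability, warn=0.995, crit=0.99):
    """Map an availability ratio to an alert state using the
    thresholds above: below 99% is critical, below 99.5% is a
    warning, otherwise OK."""
    if availability < crit:
        return "CRITICAL"
    if availability < warn:
        return "WARNING"
    return "OK"

print(alert_level(0.999))  # OK
print(alert_level(0.992))  # WARNING
print(alert_level(0.985))  # CRITICAL
```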

fgiunchedi moved this task from Inbox to Up next on the observability board. Dec 9 2019, 12:11 PM

Change 555987 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] monitoring: page on low HTTP global availability

https://gerrit.wikimedia.org/r/555987

Change 555987 merged by Filippo Giunchedi:
[operations/puppet@production] monitoring: page on low HTTP global availability

https://gerrit.wikimedia.org/r/555987

fgiunchedi closed this task as Resolved. Dec 11 2019, 11:34 AM
fgiunchedi claimed this task.

Tentatively resolving; we'll be paging if >= 1% of global traffic is 5xx.