
Introduce alerting to monitor mediawiki databases QPS rate of change
Open, Medium, Public

Description

On 14 April, a refactor of mediawiki-BagOStuff was deployed that introduced a bug: revision text blobs were no longer cached in Memcached. Over time, the traffic sent to the External Stores (the wiki content databases) grew until it nearly broke the database infrastructure:

Screenshot from 2021-04-29 18-54-36.png (733×1 px, 227 KB)

https://grafana.wikimedia.org/d/000000278/mysql-aggregated?viewPanel=1&orgId=1&var-site=eqiad&var-group=core&var-shard=es1&var-shard=es2&var-shard=es3&var-shard=es4&var-shard=es5&var-role=All&from=1618370189989&to=1619816362678

Screenshot from 2021-05-14 09-10-43.png (1×2 px, 740 KB)

https://grafana.wikimedia.org/d/000000278/mysql-aggregated?viewPanel=1&orgId=1&from=now-30d&to=now&var-site=eqiad&var-group=core&var-shard=All&var-role=All

More details on: https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-04-29_db_and_memc_load and T281480

The issue could have been detected earlier if there had been some kind of monitoring or alerting on QPS: rate of change, week-over-week comparison, or prediction-based alarming.

Identify the best way to monitor this so that alarms are meaningful and false positives (alert spam for non-impacting changes) are avoided. If there is a reasonable solution, implement it in production.
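To make the week-over-week idea concrete, here is a minimal sketch, not an agreed design. It compares the current QPS of a section with the median QPS seen at the same time slot in previous weeks. The function names, the use of a median baseline, and the 1.5x threshold are all assumptions for illustration; real values would have to come from the Grafana/Prometheus data and be tuned per section.

```python
from statistics import median


def week_over_week_ratio(current_qps, previous_weeks_qps):
    """Return current QPS divided by the median QPS of the same
    time slot in previous weeks (hypothetical baseline choice)."""
    baseline = median(previous_weeks_qps)
    if baseline <= 0:
        raise ValueError("baseline QPS must be positive")
    return current_qps / baseline


def should_alert(current_qps, previous_weeks_qps, max_ratio=1.5):
    """True when QPS exceeds the baseline by more than max_ratio.
    The 1.5 default is an illustrative assumption, not a tuned value."""
    return week_over_week_ratio(current_qps, previous_weeks_qps) > max_ratio


# Example with made-up numbers: the same hour in the three previous weeks.
history = [1000, 1100, 900]
print(should_alert(1200, history))  # modest growth
print(should_alert(3000, history))  # traffic tripled
```

A median baseline is used here so that a single unusual week does not shift the reference point much; a mean or a forecast model would be equally valid starting points.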

Event Timeline

LSobanski triaged this task as Medium priority. May 4 2021, 12:25 PM
LSobanski created this task.
LSobanski moved this task from Triage to Backlog on the DBA board.
jcrespo renamed this task from QPS rate of change alarming to Introduce alarming to monitor mediawiki databases QPS rate of change. May 14 2021, 7:17 AM
jcrespo updated the task description.

Adding @Krinkle as he was involved in the ES issue, and has helped us a lot in the past with MySQL graphs.

This is very much WIP, but I wanted to share it with you too, as you may find it interesting or useful. There is still a lot of work to do to refine it (axes, visualization, labels, etc.). We will only add alerting if we can reliably detect anomalies, and sadly right now we have normal "spikes" due to backups and dumps running, especially on low-traffic sections.
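One possible way to keep such expected spikes from firing an alert is to require the deviation to persist for several consecutive samples. The sketch below is only an illustration under the assumption that backup and dump spikes are shorter than the chosen window. The function name, the threshold, and the window length are hypothetical.

```python
def sustained_breach(ratios, threshold=1.5, min_consecutive=6):
    """Return True if `ratios` (QPS divided by its baseline, one value
    per sample) stays above `threshold` for at least `min_consecutive`
    samples in a row. Both defaults are illustrative assumptions."""
    run = 0
    for ratio in ratios:
        # Extend the current run on a breach, otherwise reset it.
        run = run + 1 if ratio > threshold else 0
        if run >= min_consecutive:
            return True
    return False


# Made-up samples: a short spike versus a steady climb.
short_spike = [1.0, 3.0, 3.0, 1.0, 1.1]
steady_climb = [1.0, 1.6, 1.7, 1.8, 2.0]
print(sustained_breach(short_spike, min_consecutive=3))
print(sustained_breach(steady_climb, min_consecutive=3))
```

The trade-off is detection delay: a longer window suppresses more short spikes but also postpones the alert for a genuine, slowly growing regression.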

Krinkle renamed this task from Introduce alarming to monitor mediawiki databases QPS rate of change to Introduce alerting to monitor mediawiki databases QPS rate of change. May 14 2021, 8:42 PM
Krinkle added a project: Performance-Team.