For all databases reachable from MediaWiki (core DBs, external store, etc.), and all tables, determine the maximum value that each auto-increment field's type can hold (int, bigint, etc.), and compare that against the latest/highest value currently in use.
An alert should fire if the calculated percentage reaches a certain threshold.
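The core check above can be sketched as a small pure function. This is a sketch only: the type ceilings are the standard MariaDB/MySQL integer maxima, and the function name `id_space_used` is hypothetical. Candidate columns could be discovered via `information_schema` (e.g. `SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, COLUMN_TYPE FROM information_schema.COLUMNS WHERE EXTRA LIKE '%auto_increment%';`), with the current highest value read per table.

```python
# Sketch: "percentage used" of an auto-increment id space.
# The ceiling is fixed by the column's declared type, so the ratio is
# simply (current max id) / (type ceiling). All names here are illustrative.

SIGNED_MAX = {
    "tinyint": 2**7 - 1,
    "smallint": 2**15 - 1,
    "mediumint": 2**23 - 1,
    "int": 2**31 - 1,
    "bigint": 2**63 - 1,
}

def id_space_used(current_max_id: int, column_type: str, unsigned: bool) -> float:
    """Fraction of the auto-increment id space already consumed."""
    ceiling = SIGNED_MAX[column_type]
    if unsigned:
        # An unsigned column doubles the positive range: 2*max + 1.
        ceiling = 2 * ceiling + 1
    return current_max_id / ceiling

# e.g. rc_id at 1.9 billion in a signed int column is ~88% consumed:
print(round(id_space_used(1_900_000_000, "int", unsigned=False), 2))  # 0.88
```

This also illustrates why T62962/T63111 matter: converting a signed int to unsigned, or int to bigint, moves the ceiling rather than the current value.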
- Reduce lack of confidence from not knowing, and the feeling of stress from checking manually just in case.
- Avoid surprises.
- T62962: The primary key of recentchanges (rc_id) table should be unsigned
- T63111: Convert primary key integers and references thereto from int to bigint (unsigned)
- T283069: cu_changes auto_increment will hit its max value in wikidatawiki soon
- Some kind of script that runs on a regular basis. Perhaps a maintenance script in WikimediaMaintenance that runs once a day. Or a standalone Python script maintained in Puppet.
- Send data to Prometheus, possibly skipping any results below a minimum telemetry threshold so as to leave out data about most tables on most of the smaller wikis. Most tables on most wikis share the same schema, and most wikis will not be the first to reach the alert threshold for a given logical table; we only need to know about the threshold being reached once in order to change the schema for all wikis.
- Define an alert query somewhere.
- Have it notify something/someone.
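The export step with a minimum telemetry threshold could look roughly like this. Assumptions to flag: the metric name `mediawiki_autoincrement_saturation`, the 0.25 cut-off, and the use of the node_exporter textfile-collector format are all placeholders, not decisions.

```python
# Sketch: emit Prometheus textfile-collector output, dropping rows below a
# minimum telemetry threshold to keep metric cardinality down.

MIN_REPORT = 0.25  # hypothetical minimum telemetry threshold

def to_textfile(rows):
    """rows: iterable of (wiki, table, fraction_used) tuples."""
    lines = ["# TYPE mediawiki_autoincrement_saturation gauge"]
    for wiki, table, used in rows:
        if used < MIN_REPORT:
            continue  # most tables on most wikis stay far below the threshold
        lines.append(
            f'mediawiki_autoincrement_saturation{{wiki="{wiki}",table="{table}"}} {used:.4f}'
        )
    return "\n".join(lines) + "\n"

print(to_textfile([
    ("wikidatawiki", "cu_changes", 0.82),
    ("testwiki", "recentchanges", 0.01),  # below MIN_REPORT, skipped
]))
```

Writing a textfile is one option for a daily cronjob; a Pushgateway or a long-lived exporter would work equally well, and the choice can follow from where the script ends up running.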
Note that something like this already exists for Wikidata specifically, as created by @Ladsgroup (T274419). It currently works through a cronjob that runs on stat* machines against the analytics replicas, and sends daily metrics to Graphite.
- Where to run and what to run against? (strawman based on Amir's work would be: stat machine, against analytics replicas).
- What should the minimum telemetry threshold be, to reduce metric cardinality?
- How do we want to handle notification and alerting?
- Do we want a critical paging/alerting threshold? Where should it be set? What is the runbook for the emergency case, and how many hours/days would the fix take for a large table?
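For concreteness, if the metrics end up in Prometheus, the alert query and warning threshold could look roughly like this (the metric name `mediawiki_autoincrement_saturation`, the 0.75 threshold, and the severity label are illustrative placeholders, not decisions):

```yaml
groups:
  - name: autoincrement_saturation
    rules:
      - alert: AutoIncrementNearingMax
        # Warn early enough to leave months, not days, for a schema change.
        expr: max by (table) (mediawiki_autoincrement_saturation) > 0.75
        for: 1d
        labels:
          severity: warning
        annotations:
          summary: 'auto_increment id space for {{ $labels.table }} is over 75% used'
```

Aggregating with `max by (table)` reflects the point above: we only need to hear about a logical table once, from whichever wiki reaches the threshold first.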
I think the ideal is to find out about these things at least three months in advance so that they can be done without rushing/stress/urgency, as part of regular maintenance work. For that to happen consistently, it somewhat paradoxically also requires the system not to be too passive, or to rely on someone kindly checking it all the time for us.
Perhaps the ideal would be the automatic one-time creation of a task: that seems reliable, doesn't page anyone, and doesn't rely on someone reviewing a stream of warnings by hand.
Another way might be a "warning" threshold in Icinga that perhaps shows up in AlertManager and IRC without emailing/paging anyone. I believe we now have rotating duties to triage AlertManager so maybe that would suffice.