
Alert when auto-increment fields on any MW-related databases reach a threshold
Open, Medium, Public

Description

What

For all databases reachable from MediaWiki (core DBs, external store, etc.), and all tables, determine the maximum possible value of each auto-increment field (based on its type: int, bigint, etc.), and compare that against the latest/highest value currently in use.

An alert should fire if the calculated percentage reaches a certain threshold.

Why
  • Reduce the lack of confidence that comes from not knowing, and the stress of checking manually just in case.
  • Avoid surprises.


How
  • Some kind of script that runs on a regular basis. Perhaps a maintenance script in WikimediaMaintenance that runs once a day, or a standalone Python script maintained in Puppet (a rough sketch of the core check follows this list).
  • Send data to Prometheus, possibly skipping any results that are below a minimum telemetry threshold so as to leave out most of the data about most tables from most of the smaller wikis. Most tables on most wikis have the same schema, and most wikis will not be the first to reach the alert threshold for a given logical table; we only need to know about the alerting threshold being reached once in order to change the schema for all wikis.
  • Define an alert query somewhere.
  • Have it notify something/someone.
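
A rough sketch of the core check mentioned above, assuming a standalone Python script with read access to a replica via pymysql; the host, credentials and the 10% cut-off are placeholders, and a production version would still need to iterate over all sections/wikis and decide where to send the results:

import pymysql

# Maximum values for MySQL integer types, keyed by (data_type, is_unsigned).
TYPE_MAX = {
    ('tinyint', False): 2**7 - 1,    ('tinyint', True): 2**8 - 1,
    ('smallint', False): 2**15 - 1,  ('smallint', True): 2**16 - 1,
    ('mediumint', False): 2**23 - 1, ('mediumint', True): 2**24 - 1,
    ('int', False): 2**31 - 1,       ('int', True): 2**32 - 1,
    ('bigint', False): 2**63 - 1,    ('bigint', True): 2**64 - 1,
}

# Find every auto-increment column and the next value the server will hand out.
QUERY = """
SELECT c.TABLE_SCHEMA, c.TABLE_NAME, c.DATA_TYPE, c.COLUMN_TYPE, t.AUTO_INCREMENT
FROM information_schema.COLUMNS c
JOIN information_schema.TABLES t
  ON t.TABLE_SCHEMA = c.TABLE_SCHEMA AND t.TABLE_NAME = c.TABLE_NAME
WHERE c.EXTRA LIKE '%auto_increment%' AND t.AUTO_INCREMENT IS NOT NULL
"""

def auto_increment_usage(conn, min_pct):
    """Yield (schema, table, pct) for auto-increment columns whose next value
    is at least min_pct percent of the column type's maximum (the minimum
    telemetry threshold that keeps small wikis out of the metrics)."""
    with conn.cursor() as cur:
        cur.execute(QUERY)
        for schema, table, data_type, column_type, next_id in cur.fetchall():
            unsigned = 'unsigned' in column_type.lower()
            pct = 100.0 * next_id / TYPE_MAX[(data_type.lower(), unsigned)]
            if pct >= min_pct:
                yield schema, table, round(pct, 3)

if __name__ == '__main__':
    # Placeholder connection details; a real run would loop over all sections.
    conn = pymysql.connect(host='db-replica.example.org', user='watchdog',
                           password='...', charset='utf8mb4')
    for schema, table, pct in auto_increment_usage(conn, min_pct=10.0):
        print(f'{schema} {table}: {pct}%')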

Note that something like this already exists for Wikidata specifically, as created by @Ladsgroup (T274419). It currently works through a cronjob that runs on stat* machines against the analytics replicas, and sends daily metrics to Graphite.
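
If the "send data to Prometheus" option is taken, the reporting side could be as small as the sketch below, assuming a Pushgateway is reachable from wherever the script runs; the gateway address, job name and metric name are illustrative, not existing conventions:

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

def report(results, gateway='prometheus-pushgateway.example.org:9091'):
    """Push (wiki, table, pct) tuples, already filtered by the minimum
    telemetry threshold, as a labelled gauge."""
    registry = CollectorRegistry()
    gauge = Gauge(
        'mysql_auto_increment_saturation_percent',
        'Highest auto-increment value as a percentage of the column type maximum',
        ['wiki', 'table'],
        registry=registry,
    )
    for wiki, table, pct in results:
        gauge.labels(wiki=wiki, table=table).set(pct)
    push_to_gateway(gateway, job='auto_increment_check', registry=registry)

An alert rule over such a gauge (for example, firing above 25%) would then cover the "define an alert query somewhere" step.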

Open questions
  • Where to run and what to run against? (strawman based on Amir's work would be: stat machine, against analytics replicas).
  • What should the minimum telemetry threshold be, to reduce metric cardinality?
  • How do we want to handle notification and alerting?
  • Do we want a critical paging/alerting threshold? What should it be? What is the runbook for the emergency case, and how many hours/days would the fix take for a large table?

I think the ideal is to find out about these things at least three months in advance, so that they can be handled without rushing/stress/urgency as part of regular maintenance work. For that to happen consistently, it somewhat paradoxically also requires the system not to be too passive or to rely on someone being nice and looking at it all the time for us.

Perhaps the ideal would be the automatic one-time creation of a task. That seems reliable, doesn't page anyone, and doesn't rely on someone reviewing a source of warnings by hand.

Another way might be a "warning" threshold in Icinga that perhaps shows up in AlertManager and IRC without emailing/paging anyone. I believe we now have rotating duties to triage AlertManager so maybe that would suffice.

Event Timeline

Marostegui moved this task from Triage to Refine on the DBA board.
Marostegui subscribed.

I think this can be useful. I am not sure how to implement it, though; probably through a MW script that would iterate over all sections/wikis/tables. This is likely to take a lot more than 24h to run, so running it daily is probably impossible. A weekly run is probably enough.

Also not sure if this needs an Icinga alert; just an email (if the thresholds are good) is probably sufficient.

We need to keep in mind that this MW script should probably pick a vslow host, rather than a random replica.

> We need to keep in mind that this MW script should probably pick a vslow host, rather than a random replica.

We can use analytics dbs, that's how we did it for Wikidata.

I am going to move this into the "Ready" column, as I think we are ready to implement this (how to implement it is a different story).

So, I haven't had time to decide where to implement this. I honestly think this entangles with the broader discussion around how o11y for databases should look. There are many options, like a MW script in the WikimediaMaintenance extension.

In the meantime, to avoid any outages caused by this, I wrote a script to go through every wiki and check.
This is the list of tables that are above 10% of their maximum value:

{
	"dewiki geo_tags": 55.998,
	"commonswiki recentchanges": 42.301,
	"wikidatawiki recentchanges": 36.827,
	"wikidatawiki cu_changes": 36.543,
	"wikidatawiki text": 36.184,
	"wikidatawiki revision": 35.702,
	"wikidatawiki wb_changes": 35.021,
	"enwiki recentchanges": 33.674,
	"commonswiki externallinks": 27.342,
	"enwiki cu_changes": 25.303,
	"enwiki text": 24.915,
	"enwiki revision": 24.61,
	"enwiki geo_tags": 18.243,
	"wikidatawiki wbt_term_in_lang": 16.79,
	"wikidatawiki wbt_text_in_lang": 16.697,
	"enwiki externallinks": 16.611,
	"commonswiki geo_tags": 15.977,
	"wikidatawiki logging": 15.5,
	"commonswiki cu_changes": 14.555,
	"commonswiki text": 14.256,
	"commonswiki revision": 14.195,
	"commonswiki uw_campaigns": 13.735,
	"wikidatawiki change_tag": 13.193,
	"frwiki geo_tags": 11.096,
	"enwiki watchlist": 10.739,
	"frwiki recentchanges": 10.564
}
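
For a rough sense of scale (which limit applies depends on whether the column is currently signed or unsigned on that wiki): a signed INT tops out at 2^31 - 1 ≈ 2.15 billion and an unsigned INT at 2^32 - 1 ≈ 4.29 billion, so converting a PK to unsigned roughly halves its percentage here, and a table at ~35% of the unsigned limit has already handed out more than 1.5 billion ids.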

Adding this here, in case someone else was confused about how some of those values could shrink (e.g. compared to T63111#5782953): apparently, some PKs were doubled by being converted to unsigned (e.g. wikidata.rcs).

> I honestly think this entangles with the broader discussion around how o11y for databases should look.

Indeed.

I was about to suggest graphite as a temporary location (in order not to add another storage solution until a more permanent decision is taken), but I just realized it has the same issue as prometheus/grafana - it is a public store (only the gui is under login). :-(

> Adding this here, in case someone else was confused about how some of those values could shrink (e.g. compared to T63111#5782953): apparently, some PKs were doubled by being converted to unsigned (e.g. wikidata.rcs).

Yes. For example, after building this for wikidatawiki, we found a dangerously high one in the checkuser extension and fixed it: T283093: Schema change for making cuc_id in cu_changes unsigned.

>> I honestly think this entangles with the broader discussion around how o11y for databases should look.

> Indeed.

> I was about to suggest graphite as a temporary location (in order not to add another storage solution until a more permanent decision is taken), but I just realized it has the same issue as prometheus/grafana - it is a public store (only the gui is under login). :-(

That part is sorta fine, we can only send the value if it's higher than a percentage, but the problem is that I need a place to run some scripts and send the results to prometheus/graphite.

> if it's higher than a percentage

Ah, that would fix the issue with skipping small wikis.

> I need a place to run some scripts

I think if it is for mw, mwmaintenance is the right place? If it is for all dbs with root (may be useful in the long term for all apps), probably profile::wmf_root_client aka cumin?

>> I need a place to run some scripts

> I think if it is for mw, mwmaintenance is the right place? If it is for all dbs with root (may be useful in the long term for all apps), probably profile::wmf_root_client aka cumin?

It's a bit offtopic here, but the question is: should we run it for non-mw dbs? If so, what should the discovery look like? etc.

And another point is that this is not the first script, nor the last. e.g. I want to have the drift tracker get updated automatically, and there are other scripts I have been thinking of writing. I'm a bit hesitant to add it to puppet directly; I think it would bloat the repo and discovery would be hard. Not that I'm 100% sure, but I think it needs way more thinking before jumping to solutioning.

>>> I need a place to run some scripts

>> I think if it is for mw, mwmaintenance is the right place? If it is for all dbs with root (may be useful in the long term for all apps), probably profile::wmf_root_client aka cumin?

> It's a bit offtopic here, but the question is: should we run it for non-mw dbs? If so, what should the discovery look like? etc.

Non MW is a different issue and I am not fully sure they should be covered by this, to be honest. Some of them highly depend on the application, and even if we wanted, we could not change their schemas. So I would leave that aside for now.

> And another point is that this is not the first script, nor the last. e.g. I want to have the drift tracker get updated automatically, and there are other scripts I have been thinking of writing. I'm a bit hesitant to add it to puppet directly; I think it would bloat the repo and discovery would be hard. Not that I'm 100% sure, but I think it needs way more thinking before jumping to solutioning.

Yes, it needs a bit more discussion. I see pros and cons for both mwmaint hosts and cumin hosts, so we'd need to evaluate that once we are in a position to start running this without (much) human intervention.

I ran the script again, so far nothing major has changed. Two notes though:

  • If we want to have alerting, geo_tags will alert for dewiki even though it's already fixed. Maybe we can reset the auto_increment value back to 1? The lowest gt_id there is 13M; we could probably remove that one row (and reinsert it), and then the lowest gt_id would be 45M, which is quite good.
  • The rate at which rc_id, etc. are increasing for wikidatawiki is a bit concerning. We might have to switch them to bigint sooner than expected.