Audit database usage of GlobalBlocking extension
Closed, ResolvedPublic

Description

A schema change on this extension (T307501: Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis) immediately caused a major outage (T307647: 2022-05-05 Wikimedia full site outage).

As a follow-up:

  • Audit the schema
  • Audit the read patterns
  • Audit the write patterns.

Event Timeline

Ladsgroup moved this task from Triage to In progress on the DBA board.

Slow queries of the outage: https://logstash.wikimedia.org/goto/638e9565350a420a7eb8db2ed1a09dcb

The schema

This extension has only one table: globalblocks.

  • Its schema is not optimal: the block reason and actor could be normalized.
    • You could normalize the actor name and comment to the actor id in metawiki but that would couple this database to metawiki's database.
    • You could probably instead normalize to the global user id of the actor in central auth and make comment a set of pre-defined values (1 = 'Open proxy', or something like that)
  • That said, the table holds only 22MB in total in production, so normalizing it is clearly not worth the work.

Write Patterns

Nothing out of the ordinary has been seen in the write patterns. There was an issue with purging expired blocks, but it has been fixed for a while now: T301641: GlobalBlocking purge expired must have a limit

Read Patterns

This is where things get interesting.

  • This table gets queried whenever any user tries to edit, log in, create an account, check user contributions, and more, on any wiki. This is a massive read load, and it basically ties the write scalability of every wiki to this table.
    • I have seen that in some cases even querying the API leads to querying this table. One case is this API call (the log document), which probably needs fixing, as at first glance it doesn't look like it needs globalblocking data.
      • Not a major issue though.
    • This would also explain why a small slowdown on this table had such a large-scale domino effect.
  • The natural response to any high read load is to cache (see my phabricator profile picture) but unfortunately I don't think this could work here:
    • By design, most queries to this table yield an empty response ("Is this IP globally blocked? No"), so you can't directly cache the value and have to get creative, e.g. store a "miss" (= "not blocked") as the cached value and only actually look it up if it's a "hit".
    • Cache invalidation can get really complex: if you block a range, you have to invalidate the cached "miss" values of every IP inside that range. Alternatively, you can skip invalidation and rely on a rather short TTL, but that would seriously undermine our ability to fight LTAs and serious widespread vandalism.
    • Another problem with caching (especially with a short TTL) is that the cache keys (= IPs of people who want to edit, etc.) don't follow the "hot data and long tail" pattern, so they won't benefit much from caching in the first place. Sure, we have power users and bots, but there aren't enough of them to make a large-scale difference.
      • We could add the WMCS network ranges as a configuration of IP ranges that should never be looked up and always get a free pass. That would be easy, low-hanging fruit.
  • One other way to avoid cases like this in the future is to increase the capacity of s7. This won't prevent all issues if all replicas slow down at the same time (which is what happened in the outage), but it improves our resilience to absorb the effects.
  • Moving centralauth to a dedicated section won't help much either: we don't have a problem with cache locality. I'm sure the whole table is loaded in the memory of all s7 dbs and we don't have disk lookups here.
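
To make the invalidation problem from the caching bullets concrete, here is a minimal sketch of a "negative" cache. It is only an illustration: the class, its method names, and the in-memory list standing in for the globalblocks table are all hypothetical and not part of the extension.

```python
import ipaddress


class NegativeBlockCache:
    """Toy in-memory cache that only remembers "not blocked" answers."""

    def __init__(self):
        self.blocked_ranges = []  # stand-in for the globalblocks table
        self.not_blocked = set()  # cached "miss" (= "not blocked") values
        self.db_lookups = 0       # counts simulated database queries

    def is_blocked(self, ip):
        addr = ipaddress.ip_address(ip)
        if addr in self.not_blocked:
            return False  # answered from the cache, no lookup
        self.db_lookups += 1  # stand-in for the real database query
        blocked = any(addr in net for net in self.blocked_ranges)
        if not blocked:
            self.not_blocked.add(addr)
        return blocked

    def block_range(self, cidr):
        # Adds the block but, on purpose, does not touch the cache.
        self.blocked_ranges.append(ipaddress.ip_network(cidr))

    def invalidate_range(self, cidr):
        # Every cached "miss" inside the range has to be dropped.
        net = ipaddress.ip_network(cidr)
        self.not_blocked = {a for a in self.not_blocked if a not in net}


cache = NegativeBlockCache()
first = cache.is_blocked("198.51.100.7")    # lookup, caches "not blocked"
cache.block_range("198.51.100.0/24")
stale = cache.is_blocked("198.51.100.7")    # cache still says "not blocked"
cache.invalidate_range("198.51.100.0/24")
fresh = cache.is_blocked("198.51.100.7")    # forced back to the lookup
```

Between `block_range` and `invalidate_range` the cached answer is wrong, which is the window a short TTL would leave open. In a shared cache, the invalidation step would also have to find every cached key inside the range, which is what makes it expensive.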

Here is what I've got so far; I will add more if I find anything else.

  • Its schema is not optimal: the block reason and actor could be normalized.
    • You could normalize the actor name and comment to the actor id in metawiki but that would couple this database to metawiki's database.
    • You could probably instead normalize to the global user id of the actor in central auth and make comment a set of pre-defined values (1 = 'Open proxy', or something like that)

Using CA IDs is already a work-in-progress in T299371: Migrate globalblocks table to use central ids instead of usernames. Reasons are being discussed in T243863: Templates used in global block summaries should only reference Meta templates..

Thanks for the analysis @Ladsgroup - very helpful!
Crazy idea: how difficult/impossible would it be for each wiki to have its own blocking table and not have to rely on a global one?

Crazy idea: how difficult/impossible would it be for each wiki to have its own blocking table and not have to rely on a global one?

Unfortunately, we have LTAs who engage in vandalism, harassment, doxxing and other problematic behavior across many wikis. Having only local blocks on each wiki is not feasible in that regard.

We could alternatively have a copy of this table per section or wiki but getting those updated is going to be challenging. Not impossible though.

  • I have seen that in some cases even querying the API leads to querying this table. One case is this API call (the log document), which probably needs fixing, as at first glance it doesn't look like it needs globalblocking data.

Your example contains &intestactions=edit&intestactionsdetail=full, which is a permission check and needs to check blocks. Maybe the caller should use quick instead, but that is up to the caller and depends on what the result is used for.
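
For reference, a rough sketch of how the two variants of such a request differ. The endpoint URL, page title, and helper name are placeholders, not taken from the logged request. Whether quick actually avoids the globalblocks lookup is not established in this task, so treat that part as an assumption to verify.

```python
from urllib.parse import urlencode

# Placeholder endpoint; the real request in the log went to a production wiki.
API_ENDPOINT = "https://example.org/w/api.php"


def info_query_url(title, detail):
    """Build a prop=info query that tests the 'edit' action for the caller."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "info",
        "titles": title,
        "intestactions": "edit",
        "intestactionsdetail": detail,  # "full" as in the log, or "quick"
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


full_url = info_query_url("Main Page", "full")
quick_url = info_query_url("Main Page", "quick")
```

The only difference between the two URLs is the intestactionsdetail value, so switching is a one-parameter change on the caller's side.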

Change 791032 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/extensions/GlobalBlocking@master] Add configuration to bypass db queries looking up for block

https://gerrit.wikimedia.org/r/791032

Change 791032 merged by jenkins-bot:

[mediawiki/extensions/GlobalBlocking@master] Add configuration to bypass db queries looking up for block

https://gerrit.wikimedia.org/r/791032

Change 810055 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] Set GlobalBlockingAllowedRanges for testwiki

https://gerrit.wikimedia.org/r/810055

Change 810857 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/extensions/GlobalBlocking@master] Add statsd metric collection on db calls

https://gerrit.wikimedia.org/r/810857

Change 810857 merged by jenkins-bot:

[mediawiki/extensions/GlobalBlocking@master] Add statsd metric collection on db calls

https://gerrit.wikimedia.org/r/810857
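
As an illustration of what counting db calls with statsd amounts to, here is a minimal sketch of the plain statsd counter wire format (`name:value|c` sent over UDP). It is not the code from the patch above: the metric name, host, and port are assumptions, and the extension itself uses MediaWiki's own stats facilities rather than raw sockets.

```python
import socket


def statsd_counter_packet(metric, value=1):
    """Encode a statsd counter increment, e.g. b'some.metric:1|c'."""
    return f"{metric}:{value}|c".encode("ascii")


def send_counter(metric, host="127.0.0.1", port=8125, value=1):
    """Fire-and-forget UDP send; statsd does not acknowledge packets."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(statsd_counter_packet(metric, value), (host, port))


# Hypothetical metric name, incremented once per database lookup.
packet = statsd_counter_packet("global_blocking.db_query")
```

Counting one increment per lookup (and, separately, one per bypassed lookup) is enough to graph how many queries an exemption removes.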

Change 810518 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/extensions/GlobalBlocking@wmf/1.39.0-wmf.18] Add statsd metric collection on db calls

https://gerrit.wikimedia.org/r/810518

Change 810518 merged by jenkins-bot:

[mediawiki/extensions/GlobalBlocking@wmf/1.39.0-wmf.18] Add statsd metric collection on db calls

https://gerrit.wikimedia.org/r/810518

Mentioned in SAL (#wikimedia-operations) [2022-07-04T11:55:22Z] <ladsgroup@deploy1002> Synchronized php-1.39.0-wmf.18/extensions/GlobalBlocking/includes/GlobalBlocking.php: Backport: [[gerrit:810518|Add statsd metric collection on db calls (T307648)]] (duration: 03m 26s)

Change 810055 merged by jenkins-bot:

[operations/mediawiki-config@master] Set GlobalBlockingAllowedRanges for testwiki

https://gerrit.wikimedia.org/r/810055

Mentioned in SAL (#wikimedia-operations) [2022-07-04T14:10:42Z] <ladsgroup@deploy1002> Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:810055|Set GlobalBlockingAllowedRanges for testwiki (T307648)]] (duration: 03m 39s)

Change 810932 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] Excempt WMCS ranges from globalblocking everywhere

https://gerrit.wikimedia.org/r/810932

Change 810932 merged by jenkins-bot:

[operations/mediawiki-config@master] Exempt WMCS ranges from globalblocking everywhere

https://gerrit.wikimedia.org/r/810932

Mentioned in SAL (#wikimedia-operations) [2022-07-04T14:27:16Z] <ladsgroup@deploy1002> Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:810932|Exempt WMCS ranges from globalblocking everywhere (T307648)]] (duration: 03m 26s)
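
The idea behind the GlobalBlockingAllowedRanges setting can be sketched as follows. This is a simplified illustration and not the extension's code: the function names are hypothetical, and the ranges are documentation placeholders rather than the real WMCS ranges.

```python
import ipaddress

# Placeholder ranges (reserved for documentation), not the real WMCS ranges.
ALLOWED_RANGES = [
    ipaddress.ip_network(cidr) for cidr in ("192.0.2.0/24", "2001:db8::/32")
]


def find_global_block(ip, query_db, allowed_ranges=ALLOWED_RANGES):
    """Return the block for ip, skipping the database for exempt ranges."""
    addr = ipaddress.ip_address(ip)
    if any(addr in net for net in allowed_ranges):
        return None  # exempt: treated as not blocked, no query issued
    return query_db(ip)


queries = []


def fake_query_db(ip):
    """Stand-in for the real lookup; records which IPs reached the db."""
    queries.append(ip)
    return None


exempt_result = find_global_block("192.0.2.10", fake_query_db)
normal_result = find_global_block("198.51.100.7", fake_query_db)
```

Only the second call reaches `fake_query_db`. The trade-off is that an exempt range can never be globally blocked, so the list should only contain networks that are trusted by design.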

So I left it for a day, and here is the result:

image.png (309×1 px, 67 KB)

At first it removed only 2-3% of the db queries, which wasn't a great improvement, but it slowly got better and better. Now it does two things:

  • It removes about 10% of db queries at all times, around 1K per minute.
  • It absorbs most spikes. A spike of an extra 3k edits/minute was not felt by the db.

With this, I think we made a rather good improvement to the stability of global blocking and anything else would be too much work for too little gain. If we get another outage, we can revisit and do more work like adding more replicas to s7, etc.