
WikiPage::updateCategoryCounts caused 14 minutes of lag on enwiki
Closed, Resolved · Public

Description

I am not asking for this to be fixed; I am asking if someone can take a quick look at the job queue logic for
WikiPage::updateCategoryCounts

https://grafana.wikimedia.org/dashboard/db/mysql-replication-lag?orgId=1&from=1493134211223&to=1493139170151&panelId=1&fullscreen

https://logstash.wikimedia.org/goto/68f5faa6991ecec5f111f9bbabd8aeda

Maybe the code is OK and this is either a misuse of the trigger for the job, or a one-time overhead, but it caused generalized lag on most of the enwiki slaves. Maybe it just needs to reduce its concurrency or its "speed" of updates? I would love it if you had time to look, mainly to rule out a codfw-site or code/config regression.

Event Timeline

jcrespo created this task. · Apr 25 2017, 4:57 PM
Restricted Application added a subscriber: Aklapper. · Apr 25 2017, 4:57 PM

Oh, I just realized it is the same category (or two categories) enqueued hundreds of times, all trying to update the same row. I wonder why that happened? But that would explain why it created lag: contention on a single row, serialized both on the master and through replication.

Gilles assigned this task to Krinkle. · Apr 26 2017, 6:24 PM
Gilles moved this task from Inbox to Doing on the Performance-Team board.
jcrespo added a comment (edited). · Apr 26 2017, 7:01 PM

I commented on IRC that this may be expected: if a bot changes the category on 1000 articles almost instantly, there will be contention on the single row holding the count. Maybe the functionality should be rethought? As a fast workaround, we could fail faster by reducing the time a blocked row is waited for, but of course that will produce bad counts. If this were core functionality there are ways to avoid this (sharding or count-caching), but I do not believe such a simple feature merits that.
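For the record, the "fail faster" workaround would just mean lowering the InnoDB lock wait on the affected connections, along these lines (illustrative only; 5 is an arbitrary value, the MySQL/MariaDB default is 50 seconds, and the category name is taken from the log samples in this task):

  -- Give up sooner when the hot category row is locked by another job:
  SET SESSION innodb_lock_wait_timeout = 5;
  UPDATE category SET cat_pages = cat_pages - 1
  WHERE cat_title = 'Draft-Class_biography_articles';
  -- A job that hits the timeout fails with ERROR 1205 and its decrement is
  -- lost, which is why this trades lag for incorrect counts.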

https://logstash.wikimedia.org/goto/c4246962b12d5cbfed4285f4ad43b49c

Mostly coming from 10.192.16.4 (db2016). Presumably just due to it being the enwiki (s1) master, and as such having the most traffic.

Log samples:

timestamp | sql (fname and error are identical for every row; see below)
10:29:35 | UPDATE category SET cat_pages = cat_pages - 1 WHERE cat_title IN ('Draft-Class_Norway_articles','Incorrectly_tagged_WikiProject_Biography_articles','Stub-Class_biography_articles','WikiProject_Norway_articles')
10:28:35 | UPDATE category SET cat_pages = cat_pages - 1 WHERE cat_title IN ('Draft-Class_Indonesia_articles','Draft-Class_biography_articles','NA-importance_Indonesia_articles','WikiProject_Indonesia_articles')
10:27:09 | UPDATE category SET cat_pages = cat_pages - 1 WHERE cat_title = 'Draft-Class_biography_articles'
10:27:08 | UPDATE category SET cat_pages = cat_pages - 1 WHERE cat_title = 'Draft-Class_biography_articles'
10:27:08 | UPDATE category SET cat_pages = cat_pages - 1 WHERE cat_title = 'Draft-Class_biography_articles'
10:27:08 | UPDATE category SET cat_pages = cat_pages - 1 WHERE cat_title = 'Draft-Class_biography_articles'

All have:

  • fname: WikiPage::updateCategoryCounts
  • error: Lock wait timeout exceeded; try restarting transaction
  • url: /rpc/RunJobs.php?wiki=enwiki&type=deleteLinks&maxtime=60&maxmem=300M

Since these come from the job queue, this is most likely not related to T162121; that task was about count refreshes that happen during a page view when a stale value is encountered on a cache miss.

The query is also different. Category::refreshCounts (from T162121) essentially does a SELECT COUNT(*) followed by SET cat_pages = x. It also runs as a deferred update, so we already use the stale value as-is for the current request. That made the congestion very easy to fix: take a lock key and simply skip the update if another request is already going to do it.

The queries from WikiPage::updateCategoryCounts that this task is about are quite different. They perform a relative update, which cannot be as easily de-duplicated at run time.
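Roughly, the two query shapes look like this (a simplified sketch, not the literal code; the count value and category name are placeholders):

  -- Category::refreshCounts (T162121): recompute the absolute value.
  -- Idempotent, so concurrent runs can be collapsed: if another request is
  -- already going to do this, ours can simply be skipped.
  SELECT COUNT(*) FROM categorylinks WHERE cl_to = 'Draft-Class_biography_articles';
  UPDATE category SET cat_pages = 42  -- the count just read
  WHERE cat_title = 'Draft-Class_biography_articles';

  -- WikiPage::updateCategoryCounts (this task): apply a relative delta.
  -- Not idempotent: dropping one of these silently loses a decrement, so they
  -- cannot be de-duplicated the same way, only serialized or merged.
  UPDATE category SET cat_pages = cat_pages - 1
  WHERE cat_title = 'Draft-Class_biography_articles';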

We should figure out the following:

  1. Are these queries genuine?

In other words, is it really by design that there are this many jobs scheduled and executing at the same time that all perform the same decrement/increment for the same category name? Perhaps this is because the category is being removed en masse from lots of different articles, possibly by a bot (e.g. the category got renamed, merged, split or otherwise became obsolete, or it was added through a template that no longer includes it). Whatever the case, the changes themselves seem genuine. That is, we are not talking about duplicate jobs for the same change; these are separate changes that should in fact all be applied.

  2. How do we run them properly?

Do we have other jobs like this? How do we schedule those in a way that naturally waits for the previous one to complete?

  • Does the job queue have a built-in mechanism for this? (e.g. something to indicate that only one of these should run at a given time)
  • Or would it make sense to merge these jobs in a way that changes -1 and -1 into -2? (de-duplication that combines the data; see the sketch after this list)
  • Or would a lock suffice, so that the job idles until the previous one completes (if that is quick enough)?
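For the merge option, this is what the combined write would look like at the SQL level (the accumulation itself would have to happen in the job queue or in PHP, which is the hard part; this only shows the intended effect):

  -- Instead of N separate single-row decrements queued by N jobs ...
  UPDATE category SET cat_pages = cat_pages - 1 WHERE cat_title = 'Draft-Class_biography_articles';
  UPDATE category SET cat_pages = cat_pages - 1 WHERE cat_title = 'Draft-Class_biography_articles';
  -- ... a runner that de-duplicates with merging would issue one statement per
  -- category with the accumulated delta, touching the hot row only once:
  UPDATE category SET cat_pages = cat_pages - 2 WHERE cat_title = 'Draft-Class_biography_articles';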

The lock approach seems to be what we usually do, but it appears we already have locks in place and they are timing out. Which locks are those? Perhaps a single lock spans too much code? That seems possible given that these are not dedicated "category count" jobs; they come from refreshLinks jobs, which do much more work. If we have a single lock for the entire LinksUpdate of one page, then I can imagine the following scenario:

  • Template X used by 100+ articles is updated. Part of its change is the addition or removal of a category.
  • MediaWiki spawns recursive update job for Template X.
  • Job runners start running. We naturally deduplicate the jobs for the 100+ articles with any other jobs for the same articles. But that doesn't help us because the jobs are about the articles, not the categories. They merely happen to share a common category.
  • Job runners are processing pages 1-10 concurrently.
    • Each LinksUpdate acquires a lock by page id (LinksUpdate::acquirePageLock), which should work fine.
    • Each LinksUpdate eventually reaches updateCategoryCounts(). At this point each has to wait for the previous ones to complete first. The UPDATE itself does not strike me as slow; I guess the later ones time out because of the sheer number of them. Each has to run serially, waiting natively in MySQL, without MediaWiki/PHP orchestrating it in a better way.
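At the MySQL level that scenario would look roughly like this (a sketch; the exact transaction boundaries depend on where each job commits):

  -- Job for page 1: it decrements the shared category row inside a
  -- transaction that is not committed right away (other writes, replication
  -- waits, etc.), so the exclusive row lock stays held.
  BEGIN;
  UPDATE category SET cat_pages = cat_pages - 1
  WHERE cat_title = 'Draft-Class_biography_articles';
  -- ... remaining work in the same transaction ...
  COMMIT;

  -- Jobs for pages 2-10 queue up behind that lock. The UPDATE is trivially
  -- fast, but each waiter can only proceed once the previous transaction
  -- commits, and the ones at the back of the queue eventually hit:
  --   ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction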

Does that sound right @aaron?

Not totally on-topic, but the Wikidata team used to use the Wikidata database master (s5) to coordinate updates to the same rows (by taking a lock), and I asked them to stop doing that because it held very long open connections to a SPOF. Since then they have been working on Redis-based locking for change dispatching; probably not useful for this, but I wanted to mention it: T151681. Feel free to ignore this comment.

This is not an Unbreak Now for me, knowing it is not codfw-switchover related (which was my biggest fear).

aaron added a comment. · Apr 27 2017, 5:29 AM

The various commitAndWaitForReplication() calls in LinksUpdate will release row locks throughout the job. Also, updateCategoryCounts() occurs near the end of the job and normally only touches the category table (not categorylinks); hooks might change that, and updateCategoryCounts() can also trigger refreshCounts() in some cases (which takes LOCK IN SHARE MODE locks on categorylinks, though that case should be quick).

An initial commit in updateCategoryCounts() couldn't hurt though.
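In SQL terms the intent of that change is roughly the following (a sketch of the idea, not the actual PHP in the patch):

  -- Commit first, so any row locks taken earlier in the LinksUpdate job are
  -- released before we start queueing on the hot category row:
  COMMIT;
  -- The decrement then runs in a fresh transaction; since
  -- updateCategoryCounts() is near the end of the job, the category row lock
  -- is only held briefly before the job's final commit.
  UPDATE category SET cat_pages = cat_pages - 1
  WHERE cat_title = 'Draft-Class_biography_articles';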

Change 350509 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Release prior row locks beforehand in LinksUpdate::updateCategoryCounts

https://gerrit.wikimedia.org/r/350509

Change 350641 had a related patch set uploaded (by Krinkle; owner: Aaron Schulz):
[mediawiki/core@wmf/1.29.0-wmf.21] Release prior row locks beforehand in LinksUpdate::updateCategoryCounts

https://gerrit.wikimedia.org/r/350641

Change 350509 merged by jenkins-bot:
[mediawiki/core@master] Release prior row locks beforehand in LinksUpdate::updateCategoryCounts

https://gerrit.wikimedia.org/r/350509

Change 350641 merged by jenkins-bot:
[mediawiki/core@wmf/1.29.0-wmf.21] Release prior row locks beforehand in LinksUpdate::updateCategoryCounts

https://gerrit.wikimedia.org/r/350641

Mentioned in SAL (#wikimedia-operations) [2017-04-27T23:16:58Z] <catrope@naos> Synchronized php-1.29.0-wmf.21/includes/deferred/LinksUpdate.php: Release prior row locks beforehand in LinksUpdate::updateCategoryCounts (T163801) (duration: 01m 01s)

Change 351014 had a related patch set uploaded (by Paladox; owner: Aaron Schulz):
[mediawiki/core@REL1_29] Release prior row locks beforehand in LinksUpdate::updateCategoryCounts

https://gerrit.wikimedia.org/r/351014

Change 351014 abandoned by Krinkle:
Release prior row locks beforehand in LinksUpdate::updateCategoryCounts

https://gerrit.wikimedia.org/r/351014