Page MenuHomePhabricator

WikiPage::updateCategoryCounts causing replication lag due to long-running writes on commonswiki
Closed, ResolvedPublic

Description

s4 today was doing ~6.5M rps of reads for ~20 minutes, causing database lag and "database locked" errors on attempting to use uploadwizard.

The query at fault looks to be:

UPDATE /* WikiPage::updateCategoryCounts */ `category` SET cat_pages = cat_pages - 1, cat_subcats = cat_subcats - 1 WHERE cat_title = 11

This should probably instead be WHERE cat_title = '11'. This bug is possibly similar to T239877 ?

https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&from=1576015480395&to=1576018128623&fullscreen&panelId=8

https://tendril.wikimedia.org/report/slow_queries?host=%5Edb&user=wikiuser&schema=wik&qmode=eq&query=&hours=.5

https://tendril.wikimedia.org/report/slow_queries_checksum?checksum=d7b5c8b8f1f86eee492b22926627f0b2&host=%5Edb&user=wikiuser&schema=wik&hours=.5

Event Timeline

CDanis created this task.Dec 11 2019, 12:24 AM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptDec 11 2019, 12:24 AM
Reedy updated the task description. (Show Details)Dec 11 2019, 12:28 AM
Reedy updated the task description. (Show Details)
Jdforrester-WMF triaged this task as High priority.Dec 11 2019, 12:35 AM
Jdforrester-WMF added a subscriber: Jdforrester-WMF.

This is at least High and possibly UBN. Had the train followed the normal cadence we'd probably have declared this a train blocker…

Marostegui raised the priority of this task from High to Unbreak Now!.Dec 11 2019, 5:50 AM
Marostegui removed a project: DBA.
Marostegui added a subscriber: Marostegui.

I think this is definitely a UBN

Restricted Application added a subscriber: Liuxinyu970226. · View Herald TranscriptDec 11 2019, 5:50 AM

Please note the tendril results are rolling (the .5 is a relative range that is now meaningless). There was not much information there except what is already written above, with up to 40 simultaneous queries like the one mentioned taking up to 22 seconds (with an average of 10 seconds) causing replication lag, and some extra ones blocked on "waiting on gtid", as expected due to lag.

jcrespo renamed this task from WikiPage::updateCategoryCounts putting heavy load on commonswiki to WikiPage::updateCategoryCounts causing replication lag due to long-running writes on commonswiki.Dec 11 2019, 12:45 PM
jcrespo added a comment.EditedDec 11 2019, 12:48 PM

updating title as load was not really high, in fact there was lower load than normal, the problem was the long running writes due to reading too many rows and causing replication lag, which blocks many other queries, and things go to read only, etc. https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=s4&var-role=All&from=1576014107691&to=1576020151110

See:

root@db1138.eqiad.wmnet[commonswiki]> EXPLAIN UPDATE /* WikiPage::updateCategoryCounts */ `category` SET cat_pages = cat_pages - 1, cat_subcats = cat_subcats - 1 WHERE cat_title = 11\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: category
         type: index
possible_keys: NULL
          key: PRIMARY
      key_len: 4
          ref: NULL
         rows: 9009575
        Extra: Using where
1 row in set (0.04 sec)

vs

root@db1138.eqiad.wmnet[commonswiki]> EXPLAIN UPDATE /* WikiPage::updateCategoryCounts */ `category` SET cat_pages = cat_pages - 1, cat_subcats = cat_subcats - 1 WHERE cat_title = '11'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: category
         type: range
possible_keys: cat_title
          key: cat_title
      key_len: 257
          ref: NULL
         rows: 1
        Extra: Using where
1 row in set (0.04 sec)
Anomie added a subscriber: Anomie.Dec 11 2019, 2:45 PM

I think that rMW8acea5491dce: Handle database error with LinksUpdate and numeric category names fixes this. That patch is in wmf.10, but wasn't backported to wmf.8.

I note this bug report is about s4, which is commonswiki and testcommonswiki. Since the latter does not have a Category:11 while the former does, it's probably the former. SAL indicates that commonswiki has not yet been moved on from wmf.8.

I wish we could have MariaDB just throw an error in type-mismatch situations like this, instead of being dumb in various ways.

jcrespo added a comment.EditedDec 11 2019, 3:15 PM

@Anomie I can confim it was commonswiki, that was what the monitoring output.

Offtopic: Sorry for the non-persistent statistics, there are plans to offer those but we are blocked on feedback at T195578 to offer better tooling.

I wish we could have MariaDB just throw an error in type-mismatch situations like this

I believe some later versions or alternative vendors may offer better config, as well as enabling a stricter sql_mode T108255 (I have not checked both cases are true for this specific case). Sadly, the push I started for being more strict-standard on that ticket was not well received by some people. I would greatly thank some allies both on technology and product on this push against more lazy, but also looser standards, to prevent and catch problems earlier.

PS: I actually checked and newest versions/strict mode produces at the very least a WARNING:

root@db1138.eqiad.wmnet[commonswiki]> set sql_mode='TRADITIONAL';
Query OK, 0 rows affected (0.04 sec)

root@db1138.eqiad.wmnet[commonswiki]> select * FROM category WHERE cat_title = 11;
...
1949 rows in set, 65535 warnings (4.55 sec)

root@db1138.eqiad.wmnet[commonswiki]> show warnings;
+---------+------+-----------------------------------------------------------------------------------------------+
| Level   | Code | Message                                                                                       |
+---------+------+-----------------------------------------------------------------------------------------------+
| Warning | 1292 | Truncated incorrect DOUBLE value: 'PD_US_Government'                                          |

I believe it also creates (non-irreparable) data loss.

jcrespo moved this task from Backlog to Acknowledged on the Operations board.Dec 11 2019, 5:33 PM

Production Commons is now running wmf.10 (as of 23:19 UTC, and also briefly from 20:1121:17); did this get fixed by wmf.10?

Note the issue will only happen on a very specific query with specific parameters, if someone knows how to trigger its code, please try and share results here. :-)

The other pending issue to engage after confirmed fixed but before closing this, is to fix the category count of a subset of commons categories (I believe those that start with '11...'?, but I am not 100% sure and I haven't looked closely into it (is there a "manual category recount/check script"?).

Note the issue will only happen on a very specific query with specific parameters, if someone knows how to trigger its code, please try and share results here. :-)

I believe that editing a page to either add or remove [[Category:11]] will do it.

I believe those that start with '11...'?

For this query, yes. Was "cat_title = 11" the only title that appeared in the logs?

is there a "manual category recount/check script"?

maintenance/recountCategories.php exists, but looks like it could use a little work:

  • Ability to recount all three counts
  • Ability to supply a list of categories or a category name prefix.

maintenance/populateCategories.php also exists. This does recount all three counts, but still lacks the ability to supply a list or prefix.

Note also bugs like T221795 and T224321 and the various bugs in their "mentions" lists. Category counts in general seem to be a bit flaky.

Was "cat_title = 11" the only title that appeared in the logs?

It was the most frequent observed, but I didn't check every appearance because the tendril logs are just samples. There could be a more complete record on slow log from mediawiki. Sadly, even beyond those, there could be additional queries that got hit with the issue that didn't happen to be slow.

I can propose you to check that query pattern on logs in the last 30 days through the binlog, but that will take some time. I have 2 counter-questions:

  • Are you aware of a starting point in time of when this could have started (if relatively short, to trim the search)? Also could other wikis be affected?
  • If not a short period (based on your answer), maybe it would be easier to check and/or reparse every category starting with a numeral? (which would not be a short list, but easier and shorter than all categories). I am not asking you to do that (I can work on this), just asking your thoughts.

Category counts in general seem to be a bit flaky.

Understood, I am not too worried because this is non-canonical data, I just want to detect/mitigate the largest issue with a 99% effectiveness in a fast way, even if that means doing it at "DBA layer" rather than "MW layer" :-D.

I'd also prefer to focus long term efforts on prevention of larger issues in the future rather than fixing 0:-)

  • Are you aware of a starting point in time of when this could have started (if relatively short, to trim the search)?

It was only present in wmf.8; wmf.5 and earlier were not affected, and wmf.10 had the fix. So it looks like (all times UTC):

  • testwiki: 2019-11-26 19:06 to 2019-12-10 21:16
  • rest of group 0: 2019-11-26 19:46 to 2019-12-10 21:23
  • group 1: 2019-12-04 20:12 to 2019-12-04 23:38, 2019-12-09 17:08 to 2019-12-11 23:18
  • group 2: 2019-12-09 17:52 to 2019-12-12 20:18

Also could other wikis be affected?

Yes.

  • If not a short period (based on your answer), maybe it would be easier to check and/or reparse every category starting with a numeral? (which would not be a short list, but easier and shorter than all categories). I am not asking you to do that (I can work on this), just asking your thoughts.

Note that if "cat_title = 0" was ever run on a wiki, then every category not starting with a numeral would also have been affected.

Nothing here seems hard as far as the recounting goes, just potentially time-consuming. I could easily kick off a recount of every category on every wiki as requested in T224321, it'd just take a while (weeks, probably) to run on the bigger ones.

I'd also prefer to focus long term efforts on prevention of larger issues in the future rather than fixing 0:-)

T221795 recommends recounting a category on purge of the category page (via the job queue, possibly limited to wiki-admins), so wiki users can self-serve fix individual categories as needed. That seems the most straightforward long-term solution.

jcrespo lowered the priority of this task from Unbreak Now! to Medium.Dec 13 2019, 5:31 PM

I believe that editing a page to either add or remove [[Category:11]] will do it.

I did https://commons.wikimedia.org/w/index.php?title=User:JCrespo_(WMF)/test&diff=380126204&oldid=368216289 and https://commons.wikimedia.org/w/index.php?title=User:JCrespo_(WMF)/test&diff=380126482&oldid=380126204

No lag, and I can confirm on binlog queries are as expected, a few seconds after the edits, on master's binlog:

UPDATE /* WikiPage::updateCategoryCounts  */  `category` SET cat_pages = cat_pages + 1 WHERE cat_title = '11'
UPDATE /* WikiPage::updateCategoryCounts  */  `category` SET cat_pages = cat_pages - 1 WHERE cat_title = '11'

Leaving open for the mitigations mentioned at T240405#5739756

jcrespo added a comment.EditedDec 13 2019, 5:34 PM

I like T221795 as a follow up, then, if we can increase it from low to medium ("will be done, just not yet scheduled") + help increasing importance of strict sql_mode T108255, we can close this and assign this to the person that did the patch.

I've created https://wikitech.wikimedia.org/wiki/Incident_documentation/20191210-updateCategoryCounts to document the issue, I think the followups are more important than the actual impact. I am probably biased on my first draft, please help me complete it.

Krinkle added a subscriber: Krinkle.

(re-tagging since it's not a bug or feature request for the rdbms lib itself.)

As of today, there are 3,082 categories in [[:commons:Category:Non-empty category redirects]], which is populated by a template that uses {{PAGESINCAT}} to determine whether a category is empty. The vast majority of these categories (a) have a title that starts with a digit and (b) are actually empty, even though {{PAGESINCAT}} returns a non-zero value for them. (There are a couple of hundred categories whose names don't start with a digit, but most of these appear to be legitimately non-empty.) I can't help but wonder whether this has something to do with the non-quoted category title issue identified by CDanis in the original post ("This should probably instead be WHERE cat_title = '11'.")

I can't help but wonder whether this has something to do with the non-quoted category title issue identified by CDanis in the original post ("This should probably instead be WHERE cat_title = '11'.")

Unlikely. It's much more likely to be related to T221795.

CCicalese_WMF closed this task as Resolved.Apr 28 2020, 8:49 PM
CCicalese_WMF claimed this task.
CCicalese_WMF added a subscriber: CCicalese_WMF.

Marking as Resolved as it is in the Done column. Feel free to reopen if there is remaining work.

@CCicalese_WMF: Does that mean this issue is resolved? I'm asking as I don't see how, and T221795 is also still open...

Per T240405#5746932, this task resulted in the creation of documentation. T221795 was created as a follow up task but is not a blocker to resolving this task as far as we understand it.

Aklapper removed a subscriber: Anomie.Oct 16 2020, 5:41 PM