Page MenuHomePhabricator

Move update of category members count to CategoryMembershipChangeJob
Closed, ResolvedPublicPRODUCTION ERROR

Description

Currently, we get a lot of row lock contention and issues afterwards because many edits trigger an update on category table at the same time on the same row:
Here is an example 641 updates happening around the same time: https://logstash.wikimedia.org/goto/aa95c0c4edb48c37c6c72aa065dc1966

It's usually because of mass upload or something like that. The query in question is:

UPDATE  `category` SET cat_pages = cat_pages + 1,cat_files = cat_files + 1 WHERE cat_id = 359162936

and here is the stack trace:

from /srv/mediawiki/php-1.43.0-wmf.5/includes/libs/rdbms/TransactionProfiler.php(525)
#0 /srv/mediawiki/php-1.43.0-wmf.5/includes/libs/rdbms/TransactionProfiler.php(337): Wikimedia\Rdbms\TransactionProfiler->reportExpectationViolated(string, Wikimedia\Rdbms\Query, double, string, string)
#1 /srv/mediawiki/php-1.43.0-wmf.5/includes/libs/rdbms/database/TransactionManager.php(615): Wikimedia\Rdbms\TransactionProfiler->recordQueryCompletion(Wikimedia\Rdbms\Query, double, boolean, integer, string, string)
#2 /srv/mediawiki/php-1.43.0-wmf.5/includes/libs/rdbms/database/Database.php(828): Wikimedia\Rdbms\TransactionManager->recordQueryCompletion(Wikimedia\Rdbms\Query, double, boolean, integer, string)
#3 /srv/mediawiki/php-1.43.0-wmf.5/includes/libs/rdbms/database/Database.php(716): Wikimedia\Rdbms\Database->attemptQuery(Wikimedia\Rdbms\Query, string, boolean)
#4 /srv/mediawiki/php-1.43.0-wmf.5/includes/libs/rdbms/database/Database.php(643): Wikimedia\Rdbms\Database->executeQuery(Wikimedia\Rdbms\Query, string, integer)
#5 /srv/mediawiki/php-1.43.0-wmf.5/includes/libs/rdbms/database/Database.php(1501): Wikimedia\Rdbms\Database->query(Wikimedia\Rdbms\Query, string)
#6 /srv/mediawiki/php-1.43.0-wmf.5/includes/libs/rdbms/database/DBConnRef.php(126): Wikimedia\Rdbms\Database->update(string, array, array, string, array)
#7 /srv/mediawiki/php-1.43.0-wmf.5/includes/libs/rdbms/database/DBConnRef.php(420): Wikimedia\Rdbms\DBConnRef->__call(string, array)
#8 /srv/mediawiki/php-1.43.0-wmf.5/includes/libs/rdbms/querybuilder/UpdateQueryBuilder.php(332): Wikimedia\Rdbms\DBConnRef->update(string, array, array, string, array)
#9 /srv/mediawiki/php-1.43.0-wmf.5/includes/page/WikiPage.php(2924): Wikimedia\Rdbms\UpdateQueryBuilder->execute()
#10 /srv/mediawiki/php-1.43.0-wmf.5/includes/deferred/LinksUpdate/CategoryLinksTable.php(320): WikiPage->updateCategoryCounts(array, array, integer)
#11 /srv/mediawiki/php-1.43.0-wmf.5/includes/deferred/LinksUpdate/CategoryLinksTable.php(287): MediaWiki\Deferred\LinksUpdate\CategoryLinksTable->updateCategoryCounts(array, array)
#12 /srv/mediawiki/php-1.43.0-wmf.5/includes/deferred/LinksUpdate/LinksTable.php(392): MediaWiki\Deferred\LinksUpdate\CategoryLinksTable->finishUpdate()
#13 /srv/mediawiki/php-1.43.0-wmf.5/includes/deferred/LinksUpdate/LinksUpdate.php(225): MediaWiki\Deferred\LinksUpdate\LinksTable->update()
#14 /srv/mediawiki/php-1.43.0-wmf.5/includes/deferred/LinksUpdate/LinksUpdate.php(185): MediaWiki\Deferred\LinksUpdate\LinksUpdate->doIncrementalUpdate()
#15 /srv/mediawiki/php-1.43.0-wmf.5/includes/deferred/DeferredUpdates.php(486): MediaWiki\Deferred\LinksUpdate\LinksUpdate->doUpdate()
#16 /srv/mediawiki/php-1.43.0-wmf.5/includes/deferred/RefreshSecondaryDataUpdate.php(110): MediaWiki\Deferred\DeferredUpdates::attemptUpdate(MediaWiki\Deferred\LinksUpdate\LinksUpdate)
#17 /srv/mediawiki/php-1.43.0-wmf.5/includes/deferred/DeferredUpdates.php(486): RefreshSecondaryDataUpdate->doUpdate()
#18 /srv/mediawiki/php-1.43.0-wmf.5/includes/deferred/DeferredUpdates.php(198): MediaWiki\Deferred\DeferredUpdates::attemptUpdate(RefreshSecondaryDataUpdate)
#19 /srv/mediawiki/php-1.43.0-wmf.5/includes/deferred/DeferredUpdates.php(285): MediaWiki\Deferred\DeferredUpdates::run(RefreshSecondaryDataUpdate)
#20 /srv/mediawiki/php-1.43.0-wmf.5/includes/deferred/DeferredUpdatesScope.php(266): MediaWiki\Deferred\DeferredUpdates::MediaWiki\Deferred\{closure}(RefreshSecondaryDataUpdate, integer)
#21 /srv/mediawiki/php-1.43.0-wmf.5/includes/deferred/DeferredUpdatesScope.php(198): MediaWiki\Deferred\DeferredUpdatesScope->processStageQueue(integer, integer, Closure)
#22 /srv/mediawiki/php-1.43.0-wmf.5/includes/deferred/DeferredUpdates.php(304): MediaWiki\Deferred\DeferredUpdatesScope->processUpdates(integer, Closure)
#23 /srv/mediawiki/php-1.43.0-wmf.5/includes/MediaWikiEntryPoint.php(672): MediaWiki\Deferred\DeferredUpdates::doUpdates()
#24 /srv/mediawiki/php-1.43.0-wmf.5/includes/MediaWikiEntryPoint.php(495): MediaWiki\MediaWikiEntryPoint->restInPeace()
#25 /srv/mediawiki/php-1.43.0-wmf.5/includes/MediaWikiEntryPoint.php(453): MediaWiki\MediaWikiEntryPoint->doPostOutputShutdown()
#26 /srv/mediawiki/php-1.43.0-wmf.5/includes/MediaWikiEntryPoint.php(208): MediaWiki\MediaWikiEntryPoint->postOutputShutdown()
#27 /srv/mediawiki/php-1.43.0-wmf.5/api.php(44): MediaWiki\MediaWikiEntryPoint->run()
#28 /srv/mediawiki/w/api.php(3): require(string)
#29 {main}

It is not really time sensitive to update this counter, I'm not sure why we are updating this in deferred update instead of a job (that we already seem to trigger regardless?). We do also trigger self-heal of that number under many conditions (such as action=purge if number of members is low). Not to mention that this number is not anything critical.

Maybe I'm missing something obvious. Please let me know.

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Yeah, this would fix that bug automatically but maybe it's not a good solution. Depending on what mediawiki team thinks about this.

Change #1052305 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] Move update of category members count to CategoryMembershipChangeJob

https://gerrit.wikimedia.org/r/1052305

Ladsgroup added a project: DBA.
Ladsgroup moved this task from Triage to In progress on the DBA board.

Change #1052305 abandoned by Ladsgroup:

[mediawiki/core@master] [WIP] Move update of category members count to CategoryMembershipChangeJob

Reason:

I don't have the capacity to work on it right now :(

https://gerrit.wikimedia.org/r/1052305

Ladsgroup removed projects: DBA, Patch-For-Review.

I don't have the capacity to work on it, the cat-a-lot fix removed the immediate problem.

Restricted Application changed the subtype of this task from "Task" to "Production Error". · View Herald TranscriptNov 29 2024, 11:34 AM

It is not really time sensitive to update this counter, I'm not sure why we are updating this in deferred update instead of a job (that we already seem to trigger regardless?). We do also trigger self-heal of that number under many conditions (such as action=purge if number of members is low). Not to mention that this number is not anything critical.

Maybe I'm missing something obvious. Please let me know.

As a user, I appreciate immediately seeing the change in the number after my edit, so I think there is value in it being a deferred update. However, if it threatens the stability of the site, then by all means change this.

Change #1052305 restored by Ladsgroup:

[mediawiki/core@master] [WIP] Move update of category members count to CategoryMembershipChangeJob

https://gerrit.wikimedia.org/r/1052305

Ladsgroup added a project: DBA.

The patch is ready for review.

Change #1052305 merged by jenkins-bot:

[mediawiki/core@master] Move update of category members count to a dedicated job

https://gerrit.wikimedia.org/r/1052305

Tgr subscribed.

I've heard from a few people that they rely on immediate count updates in their workflows, and also that tool-assisted category moves are very slow (which will probably improve now) so this is worth a heads-up.

Tech News text could be something like

When an edit changes the categories of a page, the changes to the category membership counts are now happening asynchronously. This improves the speed of saving edits, especially when moving many pages to/from the same category, and reduces the risk of site outages, but it means that the counts can show outdated information for a few minutes.

Would be nice to find a way that doesn't involve significant delays in category size changes, but that's beyond the scope of this task and maybe worth doing on a more generic level. I filed T402854: Add a high-throughput counter system to MediaWiki.

Change #1181786 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@wmf/1.45.0-wmf.15] Move update of category members count to a dedicated job

https://gerrit.wikimedia.org/r/1181786

Change #1181786 merged by jenkins-bot:

[mediawiki/core@wmf/1.45.0-wmf.15] Move update of category members count to a dedicated job

https://gerrit.wikimedia.org/r/1181786

Mentioned in SAL (#wikimedia-operations) [2025-08-25T21:47:34Z] <ladsgroup@deploy1003> Started scap sync-world: Backport for [[gerrit:1181786|Move update of category members count to a dedicated job (T365303)]]

Mentioned in SAL (#wikimedia-operations) [2025-08-25T21:53:36Z] <ladsgroup@deploy1003> ladsgroup: Backport for [[gerrit:1181786|Move update of category members count to a dedicated job (T365303)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-08-25T22:08:28Z] <ladsgroup@deploy1003> Started scap sync-world: Backport for [[gerrit:1181786|Move update of category members count to a dedicated job (T365303)]]

Mentioned in SAL (#wikimedia-operations) [2025-08-25T22:14:08Z] <ladsgroup@deploy1003> ladsgroup: Backport for [[gerrit:1181786|Move update of category members count to a dedicated job (T365303)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-08-25T22:20:55Z] <ladsgroup@deploy1003> Finished scap sync-world: Backport for [[gerrit:1181786|Move update of category members count to a dedicated job (T365303)]] (duration: 12m 26s)

Ladsgroup reassigned this task from Ladsgroup to Zabe.
Ladsgroup moved this task from In progress to Done on the DBA board.

@Ladsgroup Just to confirm that I understand the effect of the change. As this just moves updating the category counters from deferred updates to jobs, it will smooth out spikes because there is better control over how jobs are executed. Jobs are throttled, they follow replag, and as they are executed asynchronously in the background, they do not block HTTP requests.

However, since the jobs are still per-edit based, category changes will still generate UPDATE category SET cat_pages = cat_pages + 1, cat_files = cat_files + 1 WHERE cat_id = 360156659 for every single change, which will cause row locks, replication lag, and index updates etc. The generated database load itself is still there, but it is just distributed more evenly.

Yeah, that is a very accurate description. I think it's worth emphasizing that the write queries themselves aren't problematic or at least you need a lot more of them to become problematic. The problem is that these writes are competing to hold a lock on the same row, this would force them to be a lot more serialized which has the added benefit of preventing deadlocks (one transaction waiting for lock in some other row, the other row is locked because it's waiting for the lock on the first row).

The next bottleneck here is actually update user_editcount which has the same problem (deferred update, on the same row). I want to eventually move that column out of user table (T237043) which would also to shard them, e.g. each user would have ten rows being picked at random so it they wouldn't compete on lock for same row. If anyone is feeling like doing it, that'll help a lot in both commons and wikidata.

The next bottleneck here is actually update user_editcount which has the same problem (deferred update, on the same row). I want to eventually move that column out of user table (T237043) which would also to shard them, e.g. each user would have ten rows being picked at random so it they wouldn't compete on lock for same row. If anyone is feeling like doing it, that'll help a lot in both commons and wikidata.

T237043: REST API - page history counts - Keep edit counts in separate database table and update on edit doesn't look like quite the right ticket. Neither does T402854: Add a high-throughput counter system to MediaWiki or T352823: Split user table to multiple tables (although that one can be a parent ticket). I went ahead and created T402952: Move user_editcount into its own table, and shard it so that there is a place to hammer out an engineering specification.

Change #1182577 had a related patch set uploaded (by Zabe; author: Zabe):

[mediawiki/core@master] ParserTestRunner: Update category counts for articles

https://gerrit.wikimedia.org/r/1182577

Change #1182577 merged by jenkins-bot:

[mediawiki/core@master] ParserTestRunner: Update category counts for articles

https://gerrit.wikimedia.org/r/1182577

Change #1183632 had a related patch set uploaded (by Ladsgroup; author: Zabe):

[mediawiki/core@wmf/1.45.0-wmf.16] ParserTestRunner: Update category counts for articles

https://gerrit.wikimedia.org/r/1183632

Change #1183632 merged by jenkins-bot:

[mediawiki/core@wmf/1.45.0-wmf.16] ParserTestRunner: Update category counts for articles

https://gerrit.wikimedia.org/r/1183632

Mentioned in SAL (#wikimedia-operations) [2025-09-01T11:03:56Z] <ladsgroup@deploy1003> Started scap sync-world: Backport for [[gerrit:1183632|ParserTestRunner: Update category counts for articles (T365303)]], [[gerrit:1183633|CategoryCacheTest: Update category count]], [[gerrit:1183269|Drop support for categorylinks read old (T299951 T403147 T403337)]]

Mentioned in SAL (#wikimedia-operations) [2025-09-01T11:09:48Z] <ladsgroup@deploy1003> ladsgroup: Backport for [[gerrit:1183632|ParserTestRunner: Update category counts for articles (T365303)]], [[gerrit:1183633|CategoryCacheTest: Update category count]], [[gerrit:1183269|Drop support for categorylinks read old (T299951 T403147 T403337)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-09-01T11:16:24Z] <ladsgroup@deploy1003> Finished scap sync-world: Backport for [[gerrit:1183632|ParserTestRunner: Update category counts for articles (T365303)]], [[gerrit:1183633|CategoryCacheTest: Update category count]], [[gerrit:1183269|Drop support for categorylinks read old (T299951 T403147 T403337)]] (duration: 12m 28s)

@Ladsgroup Another question, did the wait lock time drop to 0 after moving the update to job or do i look the stats from wrong place? If the drop is real it is kind of impressive change.

https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&from=2025-08-25T14:00:00.000Z&to=2025-09-02T23:59:59.000Z&timezone=utc&var-job=$__all&var-server=db1244&var-port=9104&refresh=1m&viewPanel=panel-24

That confused me too but it's actually unrelated, it was because db1244 stopped being the primary (I did a switchover) and another host became the primary database taking over the locks. They still are and there are still spikes but I haven't seen a major issue so far. I'll try to dig and find the impact but maybe someone else can dig better than I can (and my plate is full these days 😭)

Thanks, I am happy with the answer and no need dig it for me. Is the db1160 the current primary (per T402871)?