Page MenuHomePhabricator

[WB-Client] [TECH] Wikibase\Client\Usage\Sql\EntityUsageTable::addUsages Deadlock
Closed, ResolvedPublicPRODUCTION ERROR

Description

Error message
Function: Wikibase\Client\Usage\Sql\EntityUsageTable::addUsages
Error: 1213 Deadlock found when trying to get lock; try restarting transaction
Stack Trace
	#0 /srv/mediawiki/php-1.35.0-wmf.37/includes/libs/rdbms/database/Database.php(1663): Wikimedia\Rdbms\Database->getQueryException(string, integer, string, string)
#1 /srv/mediawiki/php-1.35.0-wmf.37/includes/libs/rdbms/database/Database.php(1640): Wikimedia\Rdbms\Database->getQueryExceptionAndLog(string, integer, string, string)
#2 /srv/mediawiki/php-1.35.0-wmf.37/includes/libs/rdbms/database/Database.php(1215): Wikimedia\Rdbms\Database->reportQueryError(string, integer, string, string, boolean)
#3 /srv/mediawiki/php-1.35.0-wmf.37/includes/libs/rdbms/database/Database.php(2319): Wikimedia\Rdbms\Database->query(string, string, integer)
#4 /srv/mediawiki/php-1.35.0-wmf.37/includes/libs/rdbms/database/Database.php(2281): Wikimedia\Rdbms\Database->doInsertNonConflicting(string, array, string)
#5 /srv/mediawiki/php-1.35.0-wmf.37/extensions/Wikibase/client/includes/Usage/Sql/EntityUsageTable.php(189): Wikimedia\Rdbms\Database->insert(string, array, string, array)
#6 /srv/mediawiki/php-1.35.0-wmf.37/extensions/Wikibase/client/includes/Usage/Sql/SqlUsageTracker.php(184): Wikibase\Client\Usage\Sql\EntityUsageTable->addUsages(integer, array)
#7 /srv/mediawiki/php-1.35.0-wmf.37/extensions/Wikibase/client/includes/Store/UsageUpdater.php(82): Wikibase\Client\Usage\Sql\SqlUsageTracker->addUsedEntities(integer, array)
#8 /srv/mediawiki/php-1.35.0-wmf.37/extensions/Wikibase/client/includes/Store/AddUsagesForPageJob.php(122): Wikibase\Client\Store\UsageUpdater->addUsagesForPage(integer, array)
#9 /srv/mediawiki/php-1.35.0-wmf.37/extensions/EventBus/includes/JobExecutor.php(79): Wikibase\Client\Store\AddUsagesForPageJob->run()
#10 /srv/mediawiki/rpc/RunSingleJob.php(76): MediaWiki\Extension\EventBus\JobExecutor->execute(array)
#11 {main}
Impact
    1. Notes
  • Happening intermittently with low frequency.

Details

Request ID
a95f3d48-2165-4e0f-8060-013b24f55b47
Request URL
/rpc/RunSingleJob.php
Stack Trace

#0 /srv/mediawiki/php-1.35.0-wmf.37/includes/libs/rdbms/database/Database.php(1663): Wikimedia\Rdbms\Database->getQueryException(string, integer, string, string)
#1 /srv/mediawiki/php-1.35.0-wmf.37/includes/libs/rdbms/database/Database.php(1640): Wikimedia\Rdbms\Database->getQueryExceptionAndLog(string, integer, string, string)
#2 /srv/mediawiki/php-1.35.0-wmf.37/includes/libs/rdbms/database/Database.php(1215): Wikimedia\Rdbms\Database->reportQueryError(string, integer, string, string, boolean)
#3 /srv/mediawiki/php-1.35.0-wmf.37/includes/libs/rdbms/database/Database.php(2319): Wikimedia\Rdbms\Database->query(string, string, integer)
#4 /srv/mediawiki/php-1.35.0-wmf.37/includes/libs/rdbms/database/Database.php(2281): Wikimedia\Rdbms\Database->doInsertNonConflicting(string, array, string)
#5 /srv/mediawiki/php-1.35.0-wmf.37/extensions/Wikibase/client/includes/Usage/Sql/EntityUsageTable.php(189): Wikimedia\Rdbms\Database->insert(string, array, string, array)
#6 /srv/mediawiki/php-1.35.0-wmf.37/extensions/Wikibase/client/includes/Usage/Sql/SqlUsageTracker.php(184): Wikibase\Client\Usage\Sql\EntityUsageTable->addUsages(integer, array)
#7 /srv/mediawiki/php-1.35.0-wmf.37/extensions/Wikibase/client/includes/Store/UsageUpdater.php(82): Wikibase\Client\Usage\Sql\SqlUsageTracker->addUsedEntities(integer, array)
#8 /srv/mediawiki/php-1.35.0-wmf.37/extensions/Wikibase/client/includes/Store/AddUsagesForPageJob.php(122): Wikibase\Client\Store\UsageUpdater->addUsagesForPage(integer, array)
#9 /srv/mediawiki/php-1.35.0-wmf.37/extensions/EventBus/includes/JobExecutor.php(79): Wikibase\Client\Store\AddUsagesForPageJob->run()
#10 /srv/mediawiki/rpc/RunSingleJob.php(76): MediaWiki\Extension\EventBus\JobExecutor->execute(array)
#11 {main}

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Task Prio Notes:

  • Does affect end users / production
  • Does not affect monitoring
  • Does not affect development efforts
  • Might affect onboarding efforts (if we include documentation improvements as an A/C of this task)
  • Does affect additional stakeholders (Logspam)
ItamarWMDE renamed this task from [WB-Client] [TECH] Wikibase\Client\Usage\Sql\EntityUsageTable::addUsages Deadlock to [SW] [WB-Client] [TECH] Wikibase\Client\Usage\Sql\EntityUsageTable::addUsages Deadlock.Sep 26 2023, 2:02 PM

I noticed that addUsages() uses $this->db->replication()->wait(); which uses LBFactory::waitForReplication(). Doesn't that mean it's waiting for replicas without committing each batch? It seems like it would just hold more and more locks while waiting for unrelated events to replicate. Maybe AddUsagesForPageJob could use self::JOB_NO_EXPLICIT_TRX_ROUND (like RefreshLinksJob) and LBFactory::commitAndWaitForReplication() could be used instead. Perhaps addUsagesForPage() could wrap the update in a named lock based on the page ID, similar to LinksUpdate::acquirePageLock. Maybe addUsedEntities() could also do a SELECT first to see which rows are already there.

I see the job spec for AddUsagesForPageJob has "removeDuplicates" but the parameters includes "usages". Maybe successive edits result in slight differences in params that bypass de-duplication?

I noticed that addUsages() uses $this->db->replication()->wait(); which uses LBFactory::waitForReplication(). Doesn't that mean it's waiting for replicas without committing each batch? It seems like it would just hold more and more locks while waiting for unrelated events to replicate.

Hm, that sounds like a good point, but I’m skeptical – could we really have missed this when the waiting was introduced and in the five years since then? (Though admittedly this task is also three years old already.) Shouldn’t waiting for replication in an uncommitted transaction cause visible warnings or something?

The waiting was added, with explicit startAtomic()+endAtomic(), in Make batches in EntityUsageTable::addUsages atomic, to address T192349: deadlocks on INSERT IGNORE INTO wbc_entity_usage – and even though AFAICT jobs already ran with an outer transaction at the time, and usages were also already added via jobs at the time, this apparently still helped with the deadlocks, somehow. (Not long after, EntityUsageTable: Remove useless atomic section removed the startAtomic()+endAtomic() calls.)

There’s also this:

SqlUsageTracker::addUsedEntities()
		// NOTE: while logically we'd like the below to be atomic, we don't wrap it in a
		// transaction to prevent long lock retention during big updates.
		$db = $this->connectionManager->getWriteConnection();
		$usageTable = $this->newUsageTable( $db );

This sounds a lot like the code is expecting to run without a surrounding transaction…

AFAICT, “Job::run() has outer transaction scope” currently means “it is called after a transaction has been started (unless the JOB_NO_EXPLICIT_TRX_ROUND flag is set), but the transaction is ‘registered’ to the job’s __METHOD__, so the job is free to commit this transaction or start other ones as much as it wants” – and I wonder if this was at some point mistaken for “the job has no open transaction at all unless it starts one”? (That’s what I first assumed “outer transaction scope” meant, at least.)

Please take everything I write here with a grain of salt, I'm really not firm with databases. This is mainly me thinking aloud.

Reading the section in the wiki about transaction scope was helpful for me to understand the mediawiki specific context: https://www.mediawiki.org/wiki/Database_transactions#Transaction_scope

From reading the comments on the related changes on Gerrit and linked Phabricator tasks, it seems to me that the current code expects to not run in a transaction at all and to directly write to the database and then wait for replication after each batch.

And AddUsagesForPageJob seems to indeed not set the JOB_NO_EXPLICIT_TRX_ROUND flag.

So, what @aaron writes seems to be good avenues to explore.

I might have another one:

\Wikibase\Client\Usage\Sql\EntityUsageTable::addUsages
		$writeConnection = $this->getWriteConnection();
		foreach ( $batches as $rows ) {
			$writeConnection->newInsertQueryBuilder()
				->insertInto( $this->tableName )
				->ignore()
				->rows( $rows )
				->caller( __METHOD__ )->execute();
			$c += $writeConnection->affectedRows();

			// Wait for all database replicas to be updated, but only for the affected client wiki.
			$this->db->replication()->wait();
		}

The waiting seems to happen between batches, but is it also important to happen after the last batch? Or, if there is only one batch, then must it also happen after that one batch?

If there are many pages that have only a few usages that do not fill a whole badge, then that might lead to a lot of needless waiting in transactions. That might amount to more than the actual transactions with multiple batches?

Also, I noticed @hoo in many of the changes touching this code.

I still have to understand the implications of named locks better.
Also for the suggestion of checking what rows already exist needs more thinking. I could see that making a difference if we have many pages with a lot of usages that rarely change, and so pruning them might reduce the number of batches.

ItamarWMDE renamed this task from [SW] [WB-Client] [TECH] Wikibase\Client\Usage\Sql\EntityUsageTable::addUsages Deadlock to [WB-Client] [TECH] Wikibase\Client\Usage\Sql\EntityUsageTable::addUsages Deadlock.Oct 11 2023, 10:39 AM

The waiting was added, with explicit startAtomic()+endAtomic(), in Make batches in EntityUsageTable::addUsages atomic, to address T192349: deadlocks on INSERT IGNORE INTO wbc_entity_usage – and even though AFAICT jobs already ran with an outer transaction at the time, and usages were also already added via jobs at the time, this apparently still helped with the deadlocks, somehow. (Not long after, EntityUsageTable: Remove useless atomic section removed the startAtomic()+endAtomic() calls.)

@Ladsgroup and @hoo, can you shed any light on this?

Since jobs also have implicit transaction, the start and end atomic doesn't matter much (I know...)

You could potentially:

  • make a TRX_AUTOCOMMIT connection or
  • remove the waitForReplication between each batch or
  • make the job, queue another job instead of doing batches all in one job (the refreshlinks way).

I have removed a log filter from Kibana MediaWiki New Errors dashboard which pointed back to this task. There is no hit over the last eight weeks.

The filter was T255706 EntityUsageTable::addUsages deadlock:

{
  "query": {
    "match_phrase": {
      "labels.normalized_message": "(*EntityUsageTable::addUsages*)"
    }
  }
}
  • remove the waitForReplication between each batch or
  • make the job, queue another job instead of doing batches all in one job (the refreshlinks way).

I think I’d go for both of these. In EntityUsageTable, remove the wait for replication; but in the job, only send a limited number of rows to EntityUsageTable to begin with and then queue another job for the rest.

@Michael, @ArthurTaylor and I looked into this issue some more and started noticing some things that haven’t been discussed here yet AFAICT:

  • The queries that deadlock always try to insert a bunch of C.P% entity usages, for the same entity, on the same page. Example:
INSERT IGNORE INTO `wbc_entity_usage` (eu_page_id,eu_aspect,eu_entity_id) VALUES (1918734,'C.P486','Q1059818'),(1918734,'C.P4394','Q1059818'),(1918734,'C.P696','Q1059818'),(1918734,'C.P1323','Q1059818'),(1918734,'C.P1693','Q1059818'),(1918734,'C.P1694','Q1059818'),(1918734,'C.P1402','Q1059818'),(1918734,'C.P7173','Q1059818')
  • (cont.) – that’s on enwiki, Dec 15, 2023 @ 05:53:37.303, reqId 55f9a1a6-f93e-440f-ae08-ad1ac111ba27; all the values are for page ID 1918734 and entity ID Q1059818.
  • The affected pages (e.g. page ID 1918734 on enwiki) always have both a C usage and these C.P% usages for the same entity. Semantically, this shouldn’t happen – “C” covers all C.P% – but there might be some logic error in the code.
    • Using the analytics replicas, and preferably only on a small wiki, you can count such pages using this query: SELECT COUNT(DISTINCT eu1.eu_row_id) FROM wbc_entity_usage eu1 JOIN wbc_entity_usage eu2 ON eu1.eu_entity_id = eu2.eu_entity_id AND eu1.eu_page_id = eu2.eu_page_id WHERE eu1.eu_aspect = 'C' AND eu2.eu_aspect LIKE 'C.%';
  • The C.P% rows apparently get re-inserted when the affected page is purged via the API (with forced links update) – observe the different eu_row_id when the below query is run for the second time:
mysql:research@dbstore1003.eqiad.wmnet [enwiki]> SELECT * FROM wbc_entity_usage WHERE eu_page_id = 64091671 AND eu_entity_id = 'Q94595271';
+-----------+--------------+-----------+------------+
| eu_row_id | eu_entity_id | eu_aspect | eu_page_id |
+-----------+--------------+-----------+------------+
|  68680169 | Q94595271    | C         |   64091671 |
| 297239401 | Q94595271    | C.P18     |   64091671 |
| 297239402 | Q94595271    | C.P1960   |   64091671 |
| 297239403 | Q94595271    | C.P2038   |   64091671 |
| 297239399 | Q94595271    | C.P31     |   64091671 |
| 297239400 | Q94595271    | C.P569    |   64091671 |
|  70607769 | Q94595271    | D.en      |   64091671 |
|  68680144 | Q94595271    | O         |   64091671 |
|  68680148 | Q94595271    | S         |   64091671 |
|  68680168 | Q94595271    | T         |   64091671 |
+-----------+--------------+-----------+------------+
10 rows in set (0.001 sec)

mysql:research@dbstore1003.eqiad.wmnet [enwiki]> SELECT * FROM wbc_entity_usage WHERE eu_page_id = 64091671 AND eu_entity_id = 'Q94595271';
+-----------+--------------+-----------+------------+
| eu_row_id | eu_entity_id | eu_aspect | eu_page_id |
+-----------+--------------+-----------+------------+
|  68680169 | Q94595271    | C         |   64091671 |
| 297239474 | Q94595271    | C.P18     |   64091671 |
| 297239475 | Q94595271    | C.P1960   |   64091671 |
| 297239476 | Q94595271    | C.P2038   |   64091671 |
| 297239472 | Q94595271    | C.P31     |   64091671 |
| 297239473 | Q94595271    | C.P569    |   64091671 |
|  70607769 | Q94595271    | D.en      |   64091671 |
|  68680144 | Q94595271    | O         |   64091671 |
|  68680148 | Q94595271    | S         |   64091671 |
|  68680168 | Q94595271    | T         |   64091671 |
+-----------+--------------+-----------+------------+
10 rows in set (0.001 sec)
  • The C row, on the other hand, kept its old row ID. But really, it should have been removed.

We haven’t managed to reproduce the issue elsewhere yet. Locally, an extra C usage seems to be removed correctly when the page isn’t actually above the entityUsageModifierLimits, and changing those limits seems to work as we would expect. On Beta, pages with both C and C.P% usages seem to get their C.P% usages removed instead – we’re not sure if this is because the affected Beta pages (there are only about a dozen of them) genuinely use enough different statements to exceed the limit, or if some other bug is going on. (The connected items don’t have that many statements, but that doesn’t mean much – statement usage is still tracked even if no such statement exists. So it’s possible that the affected Beta pages use some templates and modules imported from enwiki that genuinely ask for tons of statements and therefore exceed the limit. But we’re not sure. A previously affected Beta page, which now only has the C usage left after we issued a purge, is Berlin.)

Relevant places in the code include DataUpdateHookHandler::doLinksUpdateComplete() and DataUpdateHookHandler::onParserCacheSaveComplete(); the former replaces all entity usages after a proper edit, while the latter only adds extra usages (e.g. when a page is viewed in a different language on a multilingual wiki like Commons). The latter is also what pushes the job that we see in this task (AddUsagesForPageJob); the former doesn’t involve jobs.

Notably, it seems that onParserCacheSaveComplete() can be called twice per edit, even locally. It’s also not clear how its calls relate to the call of doLinksUpdateComplete(), timing-wise. The multiple calls might be a bug somewhere. (Redundant parse? cf. T288707)

Given that we don’t really understand the deadlock yet, we’ll see if we can fix this logic error before trying the deadlock improvements from T255706#9302521.

Change 984136 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/extensions/Wikibase@master] Add debug code for entity usage logic issue

https://gerrit.wikimedia.org/r/984136

We @Michael, @Lucas_Werkmeister_WMDE continued our investigations.

There are two apparently related issues in production:

  • There are pages that have both specific C.XXX statements and a generic C statement.
    • When we purge these pages (action=purge), both are still present
    • When we make a null edit (action=submit), we see that the C.XXX rows are all regenerated (but not the C statement)
  • Locally we are unable to reproduce either issue
    • We had previously been able to observe double invocations of the DataUpdateHookHandler::onParserCacheSaveComplete function after page purges - today we have not been able to reproduce that.

Investigation steps:

  • We added a debug usages function EntityUsagesTable::tmpDebugUsages to warn in the case that the new list of entities contains both C and C.XXX statements.
    • Called the function from SqlUsageTracker::replaceEntityUsages and SqlUsageTracker::addUsages
    • Hope to identify if the writes of the C and C.XXX statements in production are coming from a single process or (by elimination) from two separate processes.
  • We temporarily deployed this patch to one of the debug servers (mwdebug1001) and attempted to reproduce the issue.
    • We were able to observe (in two separate SQL requests) that the C.XXX statements were completely absent and then were inserted again
    • In this process, no log warnings were emitted.
    • We are not able to exclude that the mwdebug1001 server was not correctly sending log events.

We were able to observe (in two separate SQL requests) that the C.XXX statements were completely absent and then were inserted again

Specifically, we saw for eowiki:Perseo (arbitrary test page):

mysql:research@dbstore1007.eqiad.wmnet [eowiki]> SELECT * FROM wbc_entity_usage WHERE eu_page_id = 115136 AND eu_entity_id = 'Q130832';
+-----------+--------------+-----------+------------+
| eu_row_id | eu_entity_id | eu_aspect | eu_page_id |
+-----------+--------------+-----------+------------+
|   9821865 | Q130832      | C         |     115136 |
|    506488 | Q130832      | S         |     115136 |
|   7737086 | Q130832      | T         |     115136 |
+-----------+--------------+-----------+------------+
3 rows in set (0.001 sec)

mysql:research@dbstore1007.eqiad.wmnet [eowiki]> SELECT * FROM wbc_entity_usage WHERE eu_page_id = 115136 AND eu_entity_id = 'Q130832';
+-----------+--------------+-----------+------------+
| eu_row_id | eu_entity_id | eu_aspect | eu_page_id |
+-----------+--------------+-----------+------------+
|   9821865 | Q130832      | C         |     115136 |
|  13242857 | Q130832      | C.P1015   |     115136 |
|  13242858 | Q130832      | C.P1053   |     115136 |
|  13242832 | Q130832      | C.P1559   |     115136 |
|  13242833 | Q130832      | C.P21     |     115136 |
|  13242847 | Q130832      | C.P213    |     115136 |
|  13242846 | Q130832      | C.P214    |     115136 |
|  13242836 | Q130832      | C.P22     |     115136 |
|  13242843 | Q130832      | C.P225    |     115136 |
|  13242844 | Q130832      | C.P227    |     115136 |
|  13242845 | Q130832      | C.P244    |     115136 |
|  13242853 | Q130832      | C.P245    |     115136 |
|  13242837 | Q130832      | C.P25     |     115136 |
|  13242838 | Q130832      | C.P26     |     115136 |
|  13242834 | Q130832      | C.P263    |     115136 |
|  13242850 | Q130832      | C.P268    |     115136 |
|  13242849 | Q130832      | C.P269    |     115136 |
|  13242856 | Q130832      | C.P349    |     115136 |
|  13242842 | Q130832      | C.P373    |     115136 |
|  13242861 | Q130832      | C.P396    |     115136 |
|  13242839 | Q130832      | C.P40     |     115136 |
|  13242855 | Q130832      | C.P409    |     115136 |
|  13242854 | Q130832      | C.P434    |     115136 |
|  13242840 | Q130832      | C.P451    |     115136 |
|  13242835 | Q130832      | C.P460    |     115136 |
|  13242851 | Q130832      | C.P496    |     115136 |
|  13242859 | Q130832      | C.P650    |     115136 |
|  13242852 | Q130832      | C.P651    |     115136 |
|  13242860 | Q130832      | C.P691    |     115136 |
|  13242848 | Q130832      | C.P906    |     115136 |
|  13242841 | Q130832      | C.P935    |     115136 |
|  13242862 | Q130832      | C.P950    |     115136 |
|    506488 | Q130832      | S         |     115136 |
|   7737086 | Q130832      | T         |     115136 |
+-----------+--------------+-----------+------------+
34 rows in set (0.001 sec)

We are not able to exclude that the mwdebug1001 server was not correctly sending log events.

I’m also not entirely sure if mwdebug ensures that jobs queued by it only run on mwdebug as well; if it doesn’t, then a related job might have run on a non-mwdebug host that didn’t have the logging code.

I tinkered a bit more with this in production (mwscript shell eowiki on mwdebug1001). Specifically, I was curious what the entity usages for eowiki:Perseo looked like before they were deduplicated / combined, so I hacked together some code to render the revision from scratch.

$s = mws()
$rr = $s->getRevisionRenderer()
$rl = $s->getRevisionLookup()
$rev = $rl->getRevisionByPageId( 115136 )
$rendered = $rr->getRenderedRevision( $rev )
$po = $rendered->getSlotParserOutput( 'main' )
$usages = $po->getExtensionData( 'wikibase-entity-usage' )

And then fed them into the UsageDeduplicator (getting it from the UsageAccumulatorFactory’s private field out of laziness – sudo is a special PsySh command to bypass access checks):

$uaf = wbc::getUsageAccumulatorFactory()
sudo $ud = $uaf->usageDeduplicator
$euf = new Wikibase\Client\Usage\EntityUsageFactory( wbc::getEntityIdParser() )
$usageObjects = array_map( fn ( $str ) => $euf->newFromIdentity( $str ), array_keys( $usages ) )

It turns out the original usages have 52 different statement usages for Q130832 (the connected item). This is well above the configured $wgWBClientSettings['entityUsageModifierLimits']['C'] in production (33), so they correctly get collapsed into a single “C” usage:

> count( $usageObjects )
= 93

> count( $ud->deduplicate( $usageObjects ) )
= 42

So at this point, it makes sense that the individual C.P% usages are removed and only the C usage is kept. But then at some later point, presumably in DataUpdateHookHandler::onParserCacheSaveComplete() / AddUsagesForPageJob, we somehow have fewer than 33 statement usages left, so they get (re)added individually.

I think we have two possible paths to continue on from here:

  • Understand where this second, shorter list of usages comes from.
  • Fix DataUpdateHookHandler::onParserCacheSaveComplete() to not re-add C.P% usages if a C usage already exists (and likewise for other aspects), i.e. replace the current $newUsages = array_diff_key( $usages, $currentUsages ) with something a bit smarter.

I think the second part is something we should do sooner or later, but arguably it’s not the root cause of the problem, so maybe we should look at the first part first.

I put together a few Wikidata queries to show the different properties, as well as whether the properties are present on the Perseus item:

(The 31 properties re-added later thankfully don’t include any new properties that weren’t in the original set of 52.)

Looking at the results, it certainly seems like there’s some logic behind them; the 31 properties re-added later include a lot of external identifiers, which probably come from the authority control box at the bottom of the article:

image.png (55×948 px, 15 KB)

As well as some basic “biographical” properties which probably come from one of these two infoboxes (no idea why the article has two):
image.png (354×260 px, 22 KB)image.png (485×280 px, 80 KB)
On the other hand, the other 21 properties don’t include any external identifiers, generally feel like less-basic “biographical” properties, and most of them aren’t used on the Perseus item (with the exception of “image” and “present in work”).

Change 984848 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/extensions/Wikibase@wmf/1.42.0-wmf.9] Add debug code for entity usage logic issue

https://gerrit.wikimedia.org/r/984848

Update: We (@Lucas_Werkmeister_WMDE, @Michael)have the theory that there are two different parse results with different counts of statements, and that is causing this conflict.

We believe the two codepaths are:

  • doLinksUpdateComplete -> replaceUsagesForPage
  • onParserCashSaveComplete->addUsagesForPage

They should have the same usages list of usages, but they seem not to for some wikis in production (hence the conflicting database writes). Since the usage-counting code is the same in both cases, we suspect that there must be a difference in the parser output. @Lucas_Werkmeister_WMDE is going to try deploy some more debugging code to mwdebug1001 to instrument these paths and test our assumption.

We expect that both calls happen on the same host - we instrumented both code paths with logging and deployed the changed code to mwdebug.

Change 984136 abandoned by Lucas Werkmeister (WMDE):

[mediawiki/extensions/Wikibase@master] Add debug code for entity usage logic issue

Reason:

this should really have been based on wmf.9

https://gerrit.wikimedia.org/r/984136

I temporarily deployed this debug change on mwdebug2001 and null-edited eowiki:Perseo there. (Twice, because the first time I didn’t look at the usage table.) The second time, the C.P% rows were definitely rewritten (their row IDs changed; I assume this happened the first time too but I didn’t check it then), but in logstash, both requests logged the same 42 usages (what would be expected from the original 93 usages after deduplication).

Dec 21, 2023 @ 15:45:07.621 – T255706 debug: doLinksUpdateComplete with 42 usages
Dec 21, 2023 @ 15:45:07.764 – T255706 debug: onParserCacheSaveComplete with 42 usages

There is also no additional debug call with another list of usages on eowiki. (There were a few more on enwiki – presumably someone was also using mwdebug for some unrelated reason at the same time.)

I think we want to check whether there are any other places that add usages, because this little experiment speaks against doLinksUpdateComplete() and onParserCacheSaveComplete() (though I wouldn’t rule them out completely yet).

[…] I’m also not entirely sure if mwdebug ensures that jobs queued by it only run on mwdebug as well; […]

Not only do we not ensure this, jobs never run somewhere besides jobrunners. Not on regular appservers, api_appservers, mwmaint, or indeed mwdebug servers.

Queuing a job adds an event to Kafka, after which the jobrunner (changeprop-jobqueue at WMF) executes them over RPC on jobrunner servers. Origin hostname is not part of job metadata. And given the public nature of mwdebug, and the impact/longevity of some jobs (e.g. video transcode, and recursive refreshLinks) we might not want to (always) run jobs on mwdebug. Doing so might make inconsistencies on production sites a bit too easy to cause. Having a way to construct and then run a single concrete job on mwmaint or mwdebug shell would definitely be nice though, i.e. without queuing it. This is currently possible via eval.php, but we don't have documented examples of this afaik. Doing so would allow you to capture the verbose debug output of it (e.g. eval.php -d 2).

The other option you have for jobs, is the AdhocDebug channel, which is enabled even in production. You'd want to make sure the job is low frequency, or make the logger conditional in a wmf-branch patch to certain page titles only (or other job params).

We did a bit more digging this morning. Based on our previous investigation, it seems that we get the same list of statements from the parser cache, with the same (deduplicated) set of usages (i.e. without individual C:XXX statements).

We had a dig around to see if we could spot any addUsagesForPage jobs being queued in Kafka, checking the queues on both the eqiad and codfw brokers because we see in grafana that the jobs run in both datacenters.

Ultimately, we weren't able to detect any jobs being queued when we made null edits to our target page. This makes sense because there is an early-return in the onParserCacheSaveComplete function for the case where no new usages are being added.

Our next investigation step will be to instrument the addUsages function of EntityUsageTable directly to watch for any writes to the target page, and deploy the instrumented code to all hosts. But we'll do that after the holidays.

[…] I’m also not entirely sure if mwdebug ensures that jobs queued by it only run on mwdebug as well; […]

Not only do we not ensure this, jobs never run somewhere besides jobrunners. Not on regular appservers, api_appservers, mwmaint, or indeed mwdebug servers.

Thanks, that’s good to know. I added it to WikimediaDebug § Limitations for future reference.

We had a dig around to see if we could spot any addUsagesForPage jobs being queued in Kafka, checking the queues on both the eqiad and codfw brokers because we see in grafana that the jobs run in both datacenters.

We tried various commands along the lines of

kafkacat -C -b kafka-jumbo1013.eqiad.wmnet:9092 -t codfw.mediawiki.job.wikibase-addUsagesForPage
kafkacat -C -b kafka-jumbo1013.eqiad.wmnet:9092 -t eqiad.mediawiki.job.wikibase-addUsagesForPage

We saw plenty of jobs being queued (roughly a few million per ten seconds or so), including some on eowiki, but as Arthur said, none for our target page (Perseo).

Our next investigation step will be to instrument the addUsages function of EntityUsageTable directly to watch for any writes to the target page, and deploy the instrumented code to all hosts. But we'll do that after the holidays.

Proposed debug code on PS4 of the debug code Gerrit change.

We tried various commands along the lines of

kafkacat -C -b kafka-jumbo1013.eqiad.wmnet:9092 -t codfw.mediawiki.job.wikibase-addUsagesForPage
kafkacat -C -b kafka-jumbo1013.eqiad.wmnet:9092 -t eqiad.mediawiki.job.wikibase-addUsagesForPage

It turns out kafkacat behavior differs from what Wikitech used to say – without an explicit -o option, it replays the entire message buffer, so we were seeing jobs from December 15: apparently the buffer for this topic spans 7 days. (I’ve updated Wikitech accordingly now.) This means that the job was probably being queued after all, we just didn’t see it. (We should check that again later.)

without an explicit -o option, it replays the entire message buffer, so we were seeing jobs from December 15: apparently the buffer for this topic spans 7 days.

This is actually useful because we can still see the job that was enqueued as a result of the experiment yesterday (see T255706#9421575 above; link to Gerrit diff that was deployed). We can see that this logstash event, with the logged usages

Q130832#T, Q130832#C, Q130832#S, Q6581097#T, Q6581097#L.eo, Q3476559#T, Q3476559#L.eo, Q3476559#L.en, Q34201#T, Q34201#L.eo, Q190582#T, Q190582#L.eo, Q161582#T, Q161582#L.eo, Q950523#T, Q950523#L.eo, Q1351992#T, Q1351992#L.eo, Q142739#T, Q142739#L.eo, Q26296#T, Q26296#L.eo, Q1705660#T, Q1705660#L.eo, Q166155#T, Q166155#L.eo, Q4826203#T, Q4826203#L.eo, Q4826203#L.en, Q5200183#T, Q5200183#L.eo, Q5200183#L.en, Q1062299#T, Q1062299#L.eo, Q60619064#T, Q60619064#L.eo, Q60619064#L.en, Q31841846#T, Q31841846#L.eo, Q31841846#L.en, Q184742#T, Q184742#L.eo

eventually enqueued the following job:

{"$schema":"/mediawiki/job/1.0.0","meta":{"uri":"https://placeholder.invalid/wiki/Special:Badtitle","request_id":"2e0f9342-6b87-4131-8624-12c0159e5d73","id":"aa8e7ae3-a233-4e1f-9d82-81189e09367c","dt":"2023-12-21T15:45:08Z","domain":"eo.wikipedia.org","stream":"mediawiki.job.wikibase-addUsagesForPage"},"database":"eowiki","type":"wikibase-addUsagesForPage","sha1":"8ad2a549a2258f3c839f81639c8f42eb6537b789","params":{"namespace":0,"title":"Perseo","pageId":115136,"usages":{"Q130832#C.P1559":{"entityId":"Q130832","aspect":"C","modifier":"P1559"},"Q130832#C.P21":{"entityId":"Q130832","aspect":"C","modifier":"P21"},"Q130832#C.P263":{"entityId":"Q130832","aspect":"C","modifier":"P263"},"Q130832#C.P460":{"entityId":"Q130832","aspect":"C","modifier":"P460"},"Q130832#C.P22":{"entityId":"Q130832","aspect":"C","modifier":"P22"},"Q130832#C.P25":{"entityId":"Q130832","aspect":"C","modifier":"P25"},"Q130832#C.P26":{"entityId":"Q130832","aspect":"C","modifier":"P26"},"Q130832#C.P40":{"entityId":"Q130832","aspect":"C","modifier":"P40"},"Q130832#C.P451":{"entityId":"Q130832","aspect":"C","modifier":"P451"},"Q130832#C.P935":{"entityId":"Q130832","aspect":"C","modifier":"P935"},"Q130832#C.P373":{"entityId":"Q130832","aspect":"C","modifier":"P373"},"Q130832#C.P225":{"entityId":"Q130832","aspect":"C","modifier":"P225"},"Q130832#C.P227":{"entityId":"Q130832","aspect":"C","modifier":"P227"},"Q130832#C.P244":{"entityId":"Q130832","aspect":"C","modifier":"P244"},"Q130832#C.P214":{"entityId":"Q130832","aspect":"C","modifier":"P214"},"Q130832#C.P213":{"entityId":"Q130832","aspect":"C","modifier":"P213"},"Q130832#C.P906":{"entityId":"Q130832","aspect":"C","modifier":"P906"},"Q130832#C.P269":{"entityId":"Q130832","aspect":"C","modifier":"P269"},"Q130832#C.P268":{"entityId":"Q130832","aspect":"C","modifier":"P268"},"Q130832#C.P496":{"entityId":"Q130832","aspect":"C","modifier":"P496"},"Q130832#C.P651":{"entityId":"Q130832","aspect":"C","modifier":"P651"},"Q130832#C.P245":{"entityId":"Q130832","aspect":"C","modifier":"P245"},"Q130832#C.P434":{"entityId":"Q130832","aspect":"C","modifier":"P434"},"Q130832#C.P409":{"entityId":"Q130832","aspect":"C","modifier":"P409"},"Q130832#C.P349":{"entityId":"Q130832","aspect":"C","modifier":"P349"},"Q130832#C.P1015":{"entityId":"Q130832","aspect":"C","modifier":"P1015"},"Q130832#C.P1053":{"entityId":"Q130832","aspect":"C","modifier":"P1053"},"Q130832#C.P650":{"entityId":"Q130832","aspect":"C","modifier":"P650"},"Q130832#C.P691":{"entityId":"Q130832","aspect":"C","modifier":"P691"},"Q130832#C.P396":{"entityId":"Q130832","aspect":"C","modifier":"P396"},"Q130832#C.P950":{"entityId":"Q130832","aspect":"C","modifier":"P950"}},"requestId":"2e0f9342-6b87-4131-8624-12c0159e5d73"},"mediawiki_signature":"5147de7e7f9535880c822cd1aadbe0cf20fa42c1"}

As you can see, by the time of the log message, $usages contained only one Q130832#C usage; but the $newUsages in the job somehow contained 31 individual Q130832#C.P% usages again. Looking at the code, we have no idea where these could come from; we want to add some more logging there, to see the $usages, $currentUsages and $newUsages.

Proposed debug code on PS4 of the debug code Gerrit change.

(This is now obsolete. We know where the database insert comes from – we’ve seen the corresponding AddUsagesForPage job. The mystery now is why it’s being created the way it is, and logging EntityUsageTable won’t help with that.)

Change 987962 had a related patch set uploaded (by Arthur taylor; author: Arthur taylor):

[mediawiki/extensions/Wikibase@master] Make addUsedEntities aware of statement grouping

https://gerrit.wikimedia.org/r/987962

In some further investigation today, I was able to reproduce at least part of the issue on my local setup by doing the following, with $wgWBClientSettings['entityUsageModifierLimits']['C'] = 2

  • Start, as expected, with generic C statement in database - in this case I have 3 statements referring to the same item, against a limit of 2.
  • Remove the third statement, to bring us under the limit.
    • onParserCacheSaveComplete fires, followed by doLinksUpdateComplete
    • 2 Specific C's appear in the database, and a job is queued to add 2 usages
  • Add the third statement back again, to take us over the limit
    • onParserCacheSaveComplete fires, followed by doLinksUpdateComplete
    • 1 generic C is in the database, and a job is queued to add a generic C.
  • Run the jobs
    • The first jobs adds the 2 specific C statements to the database
    • The second job would add the generic statement, but it is already present

Subsequently making a null edit clears the issue out and leaves me with the generic C in my usages table.

I dug around to see where the different functions were getting triggered, and everything is happening via DeferredUpdates (besides the job runs themselves). onLinksUpdateComplete is scheduled as a deferred AutoCommitUpdate in LinksUpdate.php, and onParserCacheSaveComplete is triggered by the doModify function of PageUpdater.php It's not clear to me that the order of the execution of the DeferredUpdates is 100% deterministic (based on the log messages captured above (T255706#9421575), it appears that at least on one machine the jobs run in the other order to the way they run on my machine.

The patch I've drafted above changes the semantics of addUsedEntities to make it aware of statement grouping - so that it won't add a Q1#C.P12 where a Q1#C is present, and will remove all Q1#C.PX statements when adding a Q1#C statement. My hope is that this makes the tasks more serialisable / less order-dependent. But I want to confer with @Michael and @Lucas_Werkmeister_WMDE about whether the approach makes sense before moving forward.

Change 989125 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/extensions/Wikibase@wmf/1.42.0-wmf.12] Add debug code for entity usage logic issue

https://gerrit.wikimedia.org/r/989125

Change 984848 abandoned by Lucas Werkmeister (WMDE):

[mediawiki/extensions/Wikibase@wmf/1.42.0-wmf.9] Add debug code for entity usage logic issue

Reason:

wmf.9 is no longer deployed

https://gerrit.wikimedia.org/r/984848

We deployed PS1 of the debug code on mwdebug2001 and null-edited eowiki:Perseo. The result was this logstash event, with:

currentUsages
Q1062299#T, Q1062299#L.eo, Q12502038#T, Q12502038#L.eo, Q12502038#L.en, Q130832#S, Q130832#T, Q130832#C, Q130832#C.P1559, Q130832#C.P21, Q130832#C.P263, Q130832#C.P460, Q130832#C.P22, Q130832#C.P25, Q130832#C.P26, Q130832#C.P40, Q130832#C.P451, Q130832#C.P935, Q130832#C.P373, Q130832#C.P225, Q130832#C.P227, Q130832#C.P244, Q130832#C.P214, Q130832#C.P213, Q130832#C.P906, Q130832#C.P269, Q130832#C.P268, Q130832#C.P496, Q130832#C.P651, Q130832#C.P245, Q130832#C.P434, Q130832#C.P409, Q130832#C.P349, Q130832#C.P1015, Q130832#C.P1053, Q130832#C.P650, Q130832#C.P691, Q130832#C.P396, Q130832#C.P950, Q1351992#T, Q1351992#L.eo, Q142739#T, Q142739#L.eo, Q161582#T, Q161582#L.eo, Q166155#T, Q166155#L.eo, Q1705660#T, Q1705660#L.eo, Q184742#T, Q184742#L.eo, Q190582#T, Q190582#L.eo, Q23015925#L.eo, Q23015925#T, Q26296#T, Q26296#L.eo, Q31841846#T, Q31841846#L.eo, Q31841846#L.en, Q34201#T, Q34201#L.eo, Q3476559#T, Q3476559#L.eo, Q3476559#L.en, Q4826203#T, Q4826203#L.eo, Q4826203#L.en, Q5200183#T, Q5200183#L.eo, Q5200183#L.en, Q60619064#T, Q60619064#L.eo, Q60619064#L.en, Q6581097#T, Q6581097#L.eo, Q950523#T, Q950523#L.eo, Q130832#D.eo
usages
Q130832#T, Q130832#C, Q130832#S, Q6581097#T, Q6581097#L.eo, Q3476559#T, Q3476559#L.eo, Q3476559#L.en, Q34201#T, Q34201#L.eo, Q190582#T, Q190582#L.eo, Q161582#T, Q161582#L.eo, Q950523#T, Q950523#L.eo, Q1351992#T, Q1351992#L.eo, Q142739#T, Q142739#L.eo, Q26296#T, Q26296#L.eo, Q1705660#T, Q1705660#L.eo, Q166155#T, Q166155#L.eo, Q4826203#T, Q4826203#L.eo, Q4826203#L.en, Q5200183#T, Q5200183#L.eo, Q5200183#L.en, Q1062299#T, Q1062299#L.eo, Q60619064#T, Q60619064#L.eo, Q60619064#L.en, Q31841846#T, Q31841846#L.eo, Q31841846#L.en, Q23015925#L.eo, Q23015925#T, Q12502038#T, Q12502038#L.eo, Q12502038#L.en, Q184742#T, Q184742#L.eo

and newUsages empty (presumably the empty array, though logstash doesn’t quite show that) and numNewUsages=0. So given that we believe $newUsages to be the empty array, the code should have returned early and not enqueued a job at all:

DataUpdateHookHandler::onParserCacheSaveComplete()
if ( $newUsages === [] ) {
	return;
}
$addUsagesForPageJob = AddUsagesForPageJob::newSpec( $title, $newUsages );
$this->jobScheduler->push( $addUsagesForPageJob );

And yet, we saw the job created in kafkacat:

kafkacat -C -b kafka-jumbo1013.eqiad.wmnet:9092 -t codfw.mediawiki.job.wikibase-addUsagesForPage -o -100000 -e | grep -E 'eowiki|eo.wikipedia.org' | less
{"$schema":"/mediawiki/job/1.0.0","meta":{"uri":"https://placeholder.invalid/wiki/Special:Badtitle","request_id":"9e0e2560-bae7-4857-beb4-32d1276062c2","id":"db1a7a9e-f511-405e-a8de-f5e55659af0f","dt":"2024-01-09T11:16:42Z","domain":"eo.wikipedia.org","stream":"mediawiki.job.wikibase-addUsagesForPage"},"database":"eowiki","type":"wikibase-addUsagesForPage","sha1":"8ad2a549a2258f3c839f81639c8f42eb6537b789","params":{"namespace":0,"title":"Perseo","pageId":115136,"usages":{"Q130832#C.P1559":{"entityId":"Q130832","aspect":"C","modifier":"P1559"},"Q130832#C.P21":{"entityId":"Q130832","aspect":"C","modifier":"P21"},"Q130832#C.P263":{"entityId":"Q130832","aspect":"C","modifier":"P263"},"Q130832#C.P460":{"entityId":"Q130832","aspect":"C","modifier":"P460"},"Q130832#C.P22":{"entityId":"Q130832","aspect":"C","modifier":"P22"},"Q130832#C.P25":{"entityId":"Q130832","aspect":"C","modifier":"P25"},"Q130832#C.P26":{"entityId":"Q130832","aspect":"C","modifier":"P26"},"Q130832#C.P40":{"entityId":"Q130832","aspect":"C","modifier":"P40"},"Q130832#C.P451":{"entityId":"Q130832","aspect":"C","modifier":"P451"},"Q130832#C.P935":{"entityId":"Q130832","aspect":"C","modifier":"P935"},"Q130832#C.P373":{"entityId":"Q130832","aspect":"C","modifier":"P373"},"Q130832#C.P225":{"entityId":"Q130832","aspect":"C","modifier":"P225"},"Q130832#C.P227":{"entityId":"Q130832","aspect":"C","modifier":"P227"},"Q130832#C.P244":{"entityId":"Q130832","aspect":"C","modifier":"P244"},"Q130832#C.P214":{"entityId":"Q130832","aspect":"C","modifier":"P214"},"Q130832#C.P213":{"entityId":"Q130832","aspect":"C","modifier":"P213"},"Q130832#C.P906":{"entityId":"Q130832","aspect":"C","modifier":"P906"},"Q130832#C.P269":{"entityId":"Q130832","aspect":"C","modifier":"P269"},"Q130832#C.P268":{"entityId":"Q130832","aspect":"C","modifier":"P268"},"Q130832#C.P496":{"entityId":"Q130832","aspect":"C","modifier":"P496"},"Q130832#C.P651":{"entityId":"Q130832","aspect":"C","modifier":"P651"},"Q130832#C.P245":{"entityId":"Q130832","aspect":"C","modifier":"P245"},"Q130832#C.P434":{"entityId":"Q130832","aspect":"C","modifier":"P434"},"Q130832#C.P409":{"entityId":"Q130832","aspect":"C","modifier":"P409"},"Q130832#C.P349":{"entityId":"Q130832","aspect":"C","modifier":"P349"},"Q130832#C.P1015":{"entityId":"Q130832","aspect":"C","modifier":"P1015"},"Q130832#C.P1053":{"entityId":"Q130832","aspect":"C","modifier":"P1053"},"Q130832#C.P650":{"entityId":"Q130832","aspect":"C","modifier":"P650"},"Q130832#C.P691":{"entityId":"Q130832","aspect":"C","modifier":"P691"},"Q130832#C.P396":{"entityId":"Q130832","aspect":"C","modifier":"P396"},"Q130832#C.P950":{"entityId":"Q130832","aspect":"C","modifier":"P950"}},"requestId":"9e0e2560-bae7-4857-beb4-32d1276062c2"},"mediawiki_signature":"3e62429ff1d741b8a7ab994f97dd86ba141dea50"}

We have no idea how the job could have been created, and will add even more debug logging around that early return.

We deployed PS2 of the debug code on mwdebug2001, and only saw two logstash messages:

This demonstrates that the early return was indeed taken, and no job should have been enqueued – and yet it was, with the same request ID as those logstash messages (reqId 80a7e163-2dc6-4202-a5f8-ff0ac7b910ac):

{"$schema":"/mediawiki/job/1.0.0","meta":{"uri":"https://placeholder.invalid/wiki/Special:Badtitle","request_id":"80a7e163-2dc6-4202-a5f8-ff0ac7b910ac","id":"a31c55a1-a716-4286-b57a-0dcdddf583eb","dt":"2024-01-09T11:38:46Z","domain":"eo.wikipedia.org","stream":"mediawiki.job.wikibase-addUsagesForPage"},"database":"eowiki","type":"wikibase-addUsagesForPage","sha1":"8ad2a549a2258f3c839f81639c8f42eb6537b789","params":{"namespace":0,"title":"Perseo","pageId":115136,"usages":{"Q130832#C.P1559":{"entityId":"Q130832","aspect":"C","modifier":"P1559"},"Q130832#C.P21":{"entityId":"Q130832","aspect":"C","modifier":"P21"},"Q130832#C.P263":{"entityId":"Q130832","aspect":"C","modifier":"P263"},"Q130832#C.P460":{"entityId":"Q130832","aspect":"C","modifier":"P460"},"Q130832#C.P22":{"entityId":"Q130832","aspect":"C","modifier":"P22"},"Q130832#C.P25":{"entityId":"Q130832","aspect":"C","modifier":"P25"},"Q130832#C.P26":{"entityId":"Q130832","aspect":"C","modifier":"P26"},"Q130832#C.P40":{"entityId":"Q130832","aspect":"C","modifier":"P40"},"Q130832#C.P451":{"entityId":"Q130832","aspect":"C","modifier":"P451"},"Q130832#C.P935":{"entityId":"Q130832","aspect":"C","modifier":"P935"},"Q130832#C.P373":{"entityId":"Q130832","aspect":"C","modifier":"P373"},"Q130832#C.P225":{"entityId":"Q130832","aspect":"C","modifier":"P225"},"Q130832#C.P227":{"entityId":"Q130832","aspect":"C","modifier":"P227"},"Q130832#C.P244":{"entityId":"Q130832","aspect":"C","modifier":"P244"},"Q130832#C.P214":{"entityId":"Q130832","aspect":"C","modifier":"P214"},"Q130832#C.P213":{"entityId":"Q130832","aspect":"C","modifier":"P213"},"Q130832#C.P906":{"entityId":"Q130832","aspect":"C","modifier":"P906"},"Q130832#C.P269":{"entityId":"Q130832","aspect":"C","modifier":"P269"},"Q130832#C.P268":{"entityId":"Q130832","aspect":"C","modifier":"P268"},"Q130832#C.P496":{"entityId":"Q130832","aspect":"C","modifier":"P496"},"Q130832#C.P651":{"entityId":"Q130832","aspect":"C","modifier":"P651"},"Q130832#C.P245":{"entityId":"Q130832","aspect":"C","modifier":"P245"},"Q130832#C.P434":{"entityId":"Q130832","aspect":"C","modifier":"P434"},"Q130832#C.P409":{"entityId":"Q130832","aspect":"C","modifier":"P409"},"Q130832#C.P349":{"entityId":"Q130832","aspect":"C","modifier":"P349"},"Q130832#C.P1015":{"entityId":"Q130832","aspect":"C","modifier":"P1015"},"Q130832#C.P1053":{"entityId":"Q130832","aspect":"C","modifier":"P1053"},"Q130832#C.P650":{"entityId":"Q130832","aspect":"C","modifier":"P650"},"Q130832#C.P691":{"entityId":"Q130832","aspect":"C","modifier":"P691"},"Q130832#C.P396":{"entityId":"Q130832","aspect":"C","modifier":"P396"},"Q130832#C.P950":{"entityId":"Q130832","aspect":"C","modifier":"P950"}},"requestId":"80a7e163-2dc6-4202-a5f8-ff0ac7b910ac"},"mediawiki_signature":"2c7097c68efae5751591caed7b353906151683cd"}

We were quite baffled by this, but eventually looked through some recent EventBus log messages (dd ranges found by trial and error based on the timestamps):

dd if=/srv/mw-log/EventBus.log of=/dev/stdout bs=1M skip=17000 count=100 | grep 80a7e163-2dc6-4202-a5f8-ff0ac7b910ac
2024-01-09 11:38:45.788103 [80a7e163-2dc6-4202-a5f8-ff0ac7b910ac] mwdebug2001 eowiki 1.42.0-wmf.12 EventBus DEBUG: Using event intake service eventgate-main for stream resource_change.  
2024-01-09 11:38:45.923180 [80a7e163-2dc6-4202-a5f8-ff0ac7b910ac] mwdebug2001 eowiki 1.42.0-wmf.12 EventBus DEBUG: Using event intake service eventgate-main for stream mediawiki.job.cirrusSearchLinksUpdate.  
2024-01-09 11:38:45.996112 [80a7e163-2dc6-4202-a5f8-ff0ac7b910ac] mwdebug2001 eowiki 1.42.0-wmf.12 EventBus DEBUG: Using event intake service eventgate-analytics-external for stream eventlogging_EditAttemptStep.  
2024-01-09 11:38:46.018591 [80a7e163-2dc6-4202-a5f8-ff0ac7b910ac] mwdebug2001 eowiki 1.42.0-wmf.12 EventBus DEBUG: Using event intake service eventgate-analytics-external for stream eventlogging_VisualEditorFeatureUse.  
2024-01-09 11:38:46.024510 [80a7e163-2dc6-4202-a5f8-ff0ac7b910ac] mwdebug2001 eowiki 1.42.0-wmf.12 EventBus DEBUG: Using event intake service eventgate-analytics-external for stream eventlogging_EditAttemptStep.  
2024-01-09 11:38:46.046193 [80a7e163-2dc6-4202-a5f8-ff0ac7b910ac] mwdebug2001 eowiki 1.42.0-wmf.12 EventBus DEBUG: Using event intake service eventgate-main for stream mediawiki.job.parsoidCachePrewarm.  
2024-01-09 11:38:46.107792 [80a7e163-2dc6-4202-a5f8-ff0ac7b910ac] mw2278 eowiki 1.42.0-wmf.12 EventBus DEBUG: Using event intake service eventgate-main for stream mediawiki.job.cirrusSearchElasticaWrite.  
2024-01-09 11:38:46.777306 [80a7e163-2dc6-4202-a5f8-ff0ac7b910ac] mw2279 eowiki 1.42.0-wmf.12 EventBus DEBUG: Using event intake service eventgate-main for stream mediawiki.job.RecordLintJob.  
2024-01-09 11:38:46.874375 [80a7e163-2dc6-4202-a5f8-ff0ac7b910ac] mw2279 eowiki 1.42.0-wmf.12 EventBus DEBUG: Using event intake service eventgate-main for stream mediawiki.job.wikibase-addUsagesForPage.

We can see that the first few log messages here come from mwdebug2001, but then there are some from regular MediaWiki app servers! This would explain why we didn’t see a log message for the job being enqueued: we only deployed the logging code to mwdebug2001, after all.

Our current theory is that the parsoidCachePrewarm job, scheduled on mwdebug2001, ran on mw2279 and caused a new parser cache write, which then created the wikibase-addUsagesForPage job. To test this, we would have to deploy debug logging to all of production, not just mwdebug (similar to the debug code at the end of T255706#9423170, which we didn’t end up deploying after all).

We looked around this a bit more; you can get Parsoid output with useparsoid=1, and non-Parsoid output with useparsoid=0 (you can see the difference in the “Saved in parser cache” HTML comment). We also pulled the parser output for both variants out of the parser cache, with something like:

mwscript shell eowiki
$services = mws()
$pc = $services->getParserCache()
$wpf = $services->getWikiPageFactory()
$wp = $wpf->newFromID( 115136 ) # page ID of Perseo
$parserOutputCanonical = $pc->get( $wp, ParserOptions::newFromAnon() )
$parsoidAccess = $services->getParsoidOutputAccess()
$parserOutputParsoid = $parsoidAccess->getCachedParserOutput($wp, ParserOptions::newFromAnon())

We noticed that a lot of the missing usage tracking seems to be “clustered” in the upper half of the second infobox – everything up to and including the Verko:

image.png (568×280 px, 84 KB)

Perhaps Parsoid is reusing this from a different cache somehow? And not merging the usage tracking information with what it got from the cache?

I did some more digging today. Per the comments from Isabelle in this slack thread, when Parsoid processes Wikitext, it spins up an isolated sub-parse with its own ParserOutput. Specifically for Templates, it has a TemplateHandler that spawns a wikitext preprocessor when it encounters a template tag.

The preprocessor runs the Lua correctly, and the usages are always registered in the ExtensionData of the parserOutput object present in the context of the Lua parse. But this parserOutput object is not the object that is eventually passed up to the output of the ParsoidCachePrewarmJob.

There is code to collect the usages from the internal parse, which often / mostly works. The collectMetadata function transfers the ExtensionData to the active parserOutput object.

What I suspect is happening is that for some recursive templates, whatever is doing the recursion is at some point just replacing the template text with the rendered text and throwing away the parserOutput. Specifically we know that in this template the addition of P509 in the lua code:

{{Informkesto/Opcia linio2
 |priskribo=Mortokialo|teksto={{#invoke:Wikidata|claim|P509|parameter=link|list=, }}
}}

does not get propagated up. I will have a look again tomorrow to see if I can spot why.

So after a lot more staring at stack traces and with some help from base banana, I have a very concrete theory about the cause of the issue, and I am experimentally able to resolve the issue on my local machine.

As part of the initialization of the singleton Scribunto_LuaWikibaseEntityLibrary object, multiple UsageAccumulators are instantiated which store their usage records to the ParserOutput that is active when Scribunto_LuaWikibaseEntityLibrary::getImplementation is called for the first time:

	public function getUsageAccumulator(): UsageAccumulator {
		return WikibaseClient::getUsageAccumulatorFactory()
			->newFromParserOutput( $this->getParser()->getOutput() );
	}

During Parsoid parsing of templates and wikitext fragments, repeated calls to parser->resetOutput create new ParserOutput objects, which Parsoid correctly tracks and correctly gathers the usages from with matching calls to collectMetadata.

Unfortunately, the singleton Scribunto_LuaWikibaseEntityLibrary retains the reference to the original ParserOutput, which Parsoid no longer has a reference to, causing some subsequent calls to addUsage to store their usages in the wrong ParserOutput object:

	public function addUsage( EntityUsage $usage ) {
		$this->parserOutput->appendExtensionData(
			self::EXTENSION_DATA_KEY, $usage->getIdentityString()
		);
	}

This results in the Parsoid jobs tracking fewer usages than the legacy parses. In the case where the Parsoid usage count is below the entityUsageModifierLimits and the legacy usage count is above, we end up with the situation that doLinksUpdateComplete will try and delete the C.PXXX statements while Parsoid's onParserCacheSaveComplete wants to add them back. I expect that this triggers the deadlock in production.

I still need a little more time to reproduce the issue in a unit test locally, and then to reduce that to a minimal reproduction. My quick fix involves checking in the singleton getter to see if the ParserOutput object reference has changed:

	private function getImplementation(): WikibaseLuaEntityBindings {
		if ( !$this->wbLibrary ) {
			$this->wbLibrary = $this->newImplementation();
		}
		if ( $this->wbLibrary->getActiveParserOutput() !==
			$this->getEngine()->getParser()->getOutput() )
		{
			/* The cached parser output needs to be thrown away */
			$this->wbLibrary = $this->newImplementation();
		}

		return $this->wbLibrary;
	}

but this is obviously a little error prone. I'm not sure if the best approach would be to eliminate the singleton entirely, or to introduce lifecycle hooks so that the UsageAccumulators can observe when the active ParserOutput changes.

Nice investigation, thank you! Now it’s making more sense.

My quick fix involves checking in the singleton getter to see if the ParserOutput object reference has changed

I think this is actually a pretty good approach – the identity of the ParserOutput instance should be a fairly good proxy for “are we still in the same parse or not?” (which matters not just for the ParserOutput, but also for things like the Language, which we get from the ParserOptions and which in theory can probably change as well) – and I haven’t found a much better one so far. (The ParserOutput::mParseStartTime looks interesting, but it’s both deprecated and not directly accessible; its recommended replacement, getTimeProfile(), is useless during an ongoing parse IIUC. I don’t think the parser output or parse call really have an “ID” of their own, and many of the associated IDs – page, revision, whatever – are going to be the same for multiple parses triggered by Parsoid.)

An alternative might be the ParserClearState hook, which is called in (guess what) Parser::clearState(). We could perhaps subscribe to that hook and reset the singletons in Scribunto_LuaWikibaseEntityLibrary and friends there. But at the moment I think I almost like the ParserOutput identity check better.

An alternative might be the ParserClearState hook, which is called in (guess what) Parser::clearState(). We could perhaps subscribe to that hook and reset the singletons in Scribunto_LuaWikibaseEntityLibrary and friends there. But at the moment I think I almost like the ParserOutput identity check better.

Nevermind, Scribunto already resets the parser engine on this hook. T343229 also feels relevant.

Change 992135 had a related patch set uploaded (by Arthur taylor; author: Arthur taylor):

[mediawiki/extensions/Wikibase@master] Fix bug in usage tracking for Parsoid parses

https://gerrit.wikimedia.org/r/992135

Change 992135 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] Fix bug in usage tracking for Parsoid parses

https://gerrit.wikimedia.org/r/992135

Looking at Logstash, there is still an issue here, but it comes up much less often now (reduced by around 90%). I still have a patch that we haven't applied that makes the addUsages job check for the presence of collective C statements before adding individual C statements. This addresses a race which I know is still present and I can reproduce locally, that occurs when the number of statements transitions from below the entityUsageModifierLimits to above. But I haven't done the investigation to see whether that's what's happening in these cases. @Lucas_Werkmeister_WMDE - should we take some more time to dig into the remaining log messages?

Arthur and I looked at some of the occurrences in Logstash, and two affected pages on itwiki are Ungheria and Ridley Scott; in both we see the same pattern that Parsoid parses yield much less entity usage than legacy Parser parses:

function debugT255706( $pageId ) { 
$services = mws();
$pc = $services->getParserCache();
$wpf = $services->getWikiPageFactory();
$wp = $wpf->newFromID( $pageId );
$parserOutputParser = $pc->get( $wp, ParserOptions::newFromAnon() );
$parsoidAccess = $services->getParsoidOutputAccess();
$parserOutputParsoid = $parsoidAccess->getCachedParserOutput($wp, ParserOptions::newFromAnon());
$usagesParser = $parserOutputParser->getExtensionData( 'wikibase-entity-usage' );
$usagesParsoid = $parserOutputParsoid->getExtensionData( 'wikibase-entity-usage' );
return [ count( $usagesParser ), count( $usagesParsoid ) ];
}
> debugT255706(1604654)
= [
    641,
    28,
  ]

> debugT255706(3787)
= [
    661,
    21,
  ]

We also saw that Modulo:Wikidata intentionally avoids mw.wikibase.getEntity() for performance reasons, so it would not have benefited from the previous fix in Scribunto_LuaWikibaseEntityLibrary; we suspect that some fixes in T355877 would help there.

I still have a patch that we haven't applied that makes the addUsages job check for the presence of collective C statements before adding individual C statements. This addresses a race which I know is still present and I can reproduce locally, that occurs when the number of statements transitions from below the entityUsageModifierLimits to above.

I would put that patch slightly elsewhere, e.g. in DataUpdateHookHandler:

diff --git a/client/includes/Hooks/DataUpdateHookHandler.php b/client/includes/Hooks/DataUpdateHookHandler.php
index 587bea29f5..030ef0b8ff 100644
--- a/client/includes/Hooks/DataUpdateHookHandler.php
+++ b/client/includes/Hooks/DataUpdateHookHandler.php
@@ -188,7 +188,8 @@ public function onParserCacheSaveComplete( $parserCache, $parserOutput, $title,
             // during a GET request.
 
             $currentUsages = $this->usageLookup->getUsagesForPage( $title->getArticleID() );
-            $newUsages = array_diff_key( $usages, $currentUsages );
+            $allUsages = $this->usageDeduplicator->deduplicate( array_merge( $usages, $currentUsages ) );
+            $newUsages = array_diff_key( $allUsages, $currentUsages );
             if ( $newUsages === [] ) {
                 return;
             }

But it could also make sense in the job itself (AddUsagesForPageJob::run()). This would probably remove the deadlocks, but not (on its own) fix the problem of bad usage tracking in Parsoid. So let’s fix the usage tracking first and see if the deadlocks go down.

Arthur and I looked at some of the occurrences in Logstash, and two affected pages on itwiki are Ungheria and Ridley Scott; in both we see the same pattern that Parsoid parses yield much less entity usage than legacy Parser parses:

As itwiki is in group1, and the changes from T355877 were just deployed there, I tried this out again (in mw-debug-repl instead of mwscript shell, just for fun), and after null-editing Ungheria, its numbers cleaned up:

> debugT255706(1604654)
= [
    644,
    644,
  ]

So that sounds great. Let’s see if the number of deadlocks goes down over the next week or two. (Note: I didn’t null-edit Ridley Scott just yet, in case we want to “keep” that page for further investigation.)

The deadlock errors have certainly gone down again since the middle of March (when we merged a fix):

image.png (389×678 px, 33 KB)

I think this is a good indication that we’ve fixed the bad usage tracking in Lua under Parsoid; IMHO we can now go ahead with the other fix from Arthur’s patch / T255706#9529086.

Change #989125 abandoned by Lucas Werkmeister (WMDE):

[mediawiki/extensions/Wikibase@wmf/1.42.0-wmf.12] Add debug code for entity usage logic issue

Reason:

apparently I didn’t abandon this one, but this train is long gone ^^

https://gerrit.wikimedia.org/r/989125

Change #987962 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] Make addUsedEntities aware of statement grouping

https://gerrit.wikimedia.org/r/987962

@Lucas_Werkmeister_WMDE and I started looking into the error frequency on Logstash: https://logstash.wikimedia.org/goto/720963e5a9f568276a1f056346da1bb9

A few open questions remain:

  • Why did the frequency of errors go up at around the 25th of May?
  • Was it due to some code change in Wikibase at that time?

It appears we were looking at a too broad of a chart, in fact the addUsages deadlocks have indeed dcreased. However, there are some remaining deadlock messages from termstore updates which are being investigated by @hoo in T655055.

Logstash link for non-termstore deadlocks (I think they’re roughly evenly split between addUsages and removeUsages): https://logstash.wikimedia.org/goto/20ade51dc3a72b8b8234467babe021cf