
Investigation of remaining page_props update issue [timebox: 2 days]
Closed, ResolvedPublic

Description

We want to investigate the cause of the remaining issues with updating page_props on the client, as outlined in T233520. For context, steps to reproduce, and previous discussion, see T233520.

Goal of the investigation: Figure out the cause of the remaining update issue and propose one or more solutions.

Event Timeline

incremental progress update:
I managed to connect the tentative starting point ChangeHandler (you can see ChangeHandler's bigger context in the new arch docs) with the likely endpoint ClientParserOutputDataUpdater. The assumption is that the problem is somewhere between those two places in the code.

  • \Wikibase\Client\Changes\ChangeHandler::handleChange is where we start
  • One of that method's main actions is to call \Wikibase\Client\Changes\WikiPageUpdater::scheduleRefreshLinks
  • which creates a new \RefreshLinksJob and adds it to the job queue via \JobQueueGroup::lazyPush
    • In that constructor, the field removeDuplicates is probably set to true in our case - maybe our job gets deduplicated with one that doesn't do the trick?
  • at some point \RefreshLinksJob::run is called on that job, which executes the private method \RefreshLinksJob::runForTitle
  • that runForTitle method then requests \RefreshLinksJob::getParserOutput
  • This method eventually gets fresh parser output, but checks a couple of conditions first, one looks particularly suspect:
		$cachedOutput = $this->getParserOutputFromCache( $parserCache, $page, $revision, $stats );
		if ( $cachedOutput ) {
			return $cachedOutput;
		}
  • however,
    1. we should have dealt with that by unsetting rootJobTimestamp
    2. This should not have been an issue in the first place, because if the timestamp at which this job was generated is older than the time when the ParserOutput was last cached, then the ParserOutput has been regenerated in the meantime and the changes should have been picked up at that moment anyway
  • That being said, if there is seemingly no cached output, then it should be generated by \MediaWiki\Revision\RenderedRevision::getRevisionParserOutput, which is too complex for me to dive into until I have a good reason to suspect that the issue lies somewhere inside
  • but I assume at some point \AbstractContent::getParserOutput gets called
  • which runs the hook onContentAlterParserOutput (cf. the official docs for the hook ContentAlterParserOutput)
  • And we have a handler registered for that hook: \Wikibase\Client\Hooks\ParserOutputUpdateHookHandler::onContentAlterParserOutput
  • That handler ultimately calls the last step \Wikibase\Client\ParserOutput\ClientParserOutputDataUpdater::updateItemIdProperty which is supposed to modify the page props

Something in that chain above is probably broken under some conditions.
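The suspect cache check above can be sketched roughly as follows (illustrative python only; the function shape and names are assumptions, not the actual MediaWiki code):

```python
# Rough sketch (assumed names, not the real MediaWiki code) of the suspect
# check in RefreshLinksJob::getParserOutput: a cached ParserOutput that is
# newer than the job's rootJobTimestamp gets reused, so the page is never
# re-parsed and a stale set of page props survives.

def get_parser_output(cache, page, root_job_timestamp, render):
    cached = cache.get(page)
    # Reuse the cache if it was generated after the job was scheduled.
    if cached is not None and cached["cache_time"] >= root_job_timestamp:
        return cached
    fresh = render(page)
    cache[page] = fresh
    return fresh

# A job scheduled at t=10 against a cache entry from t=20 is a no-op:
cache = {"PagePropsies": {"cache_time": 20, "props": {}}}
out = get_parser_output(
    cache, "PagePropsies", 10,
    lambda p: {"cache_time": 30, "props": {"wikibase_item": "Q1"}},
)
assert out["props"] == {}  # the stale (empty) props are returned
```

If that branch is taken, the hook that writes the wikibase_item page prop never runs again for this job, which is why it looked suspicious.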

Preliminary learning/confusion: We don't actually care about the main functionality of RefreshLinksJob, we care about the ParserOutput being recreated?!?

So after the look that we (@Jakob_WMDE, @Michael & I) took yesterday, we think we figured out some things.
The reproduction case in T233520#6898755 half works, but we seem to consistently see the page_props table eventually get populated with the needed value.

Seemingly the following is happening:

  • Create a sitelink
    • Dispatching process
    • ParserOutput generation after dispatch
      • We see the page reparse, as the timestamp for ParserOutput generation changes to the current time.
      • Using shell.php we can see that the page_props added by wikibase hook are immediately available in the new ParserOutput object stored in the cache.
        • This can be checked with something like (MediaWiki\MediaWikiServices::getInstance()->getParserCache())->get(Title::newFromText('PagePropsies')->toPageRecord(),ParserOptions::newCanonical())->getProperty('wikibase_item');
      • Using the API or looking directly in the DB we do not see the page property for example https://test.wikipedia.org/w/api.php?action=query&format=json&prop=pageprops&titles=PagePropsies
    • Some time later the value seems to appear in the page_props table (the parser output timestamp does not change here)

I have asked @daniel whether this gap between generating the ParserOutput with our property in it and the population of the page_props table is normal or to be expected (to try and fully understand the flow here).
@Krinkle also popped in on IRC and noted that we could be seeing a user view cause the initial parse, which would lead to the page_props population occurring in a job afterwards (todo: check this)

Another thing that we then discussed was the fact that this dispatching process now only takes a few seconds.
It could be a reasonable assumption that the issue seen in T233520: Wikidata changes do not get sent to client sites on initial sitelink addition (in some cases), leading to things such as missing page props in page_props table is now caused by a race condition due to this fast process.
The ticket was created in 2019 and dispatching got fast in 2018 (see T205865: Investigate decrease in wikidata dispatch times due to eqiad -> codfw DC switch). Throughout multiple investigations we have not really found any other reason for this change in behaviour.

The possible race condition here would be:

  • Edit adding the sitelink to an Item happens on wikidata
  • wb_changes table is initially populated via notifyOnPageModified during the main request (see https://www.mediawiki.org/wiki/Manual:Hooks/RevisionFromEditComplete)
  • In ItemHandler::getSecondaryDataUpdates we add updates to call saveLinksOfItem or deleteLinksOfItem which can be executed in a deferred fashion (post request), and I believe this can also land in a job?

This can lead to wb_changes being written some period of time before the sitelinks end up in the secondary store, which is where the client reads from to populate the page_props of the client page.

A similar race condition probably leads to what we see in T248984: Client recentchanges entries sometimes don't have their wb_changes.change_id reference set
$changeStore->saveChange is called in onRecentChangeSave, which adds recent change info to the already stored wb_changes row.
But this row may already have been read by the client due to fast dispatching, thus this data ends up missing from the client.
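The suspected ordering can be illustrated with a toy timeline (this is a sketch of the reasoning above, not traced MediaWiki code):

```python
# Toy timeline of the suspected race: wb_changes is written during the main
# request, the sitelink reaches the secondary store only in a deferred
# update/job, and a fast dispatcher can read the change in between.

events = [
    "repo: write wb_changes",                # during the main request
    "client: dispatch change",               # dispatching now takes seconds
    "repo: write secondary sitelink store",  # deferred, post-request
]

secondary_store_ready = False
dispatched_with_sitelink = None
for event in events:
    if event == "repo: write secondary sitelink store":
        secondary_store_ready = True
    if event == "client: dispatch change":
        # the client reads sitelinks from the secondary store at dispatch time
        dispatched_with_sitelink = secondary_store_ready

assert dispatched_with_sitelink is False  # the client saw no sitelink yet
```

With the old, slow dispatching, the deferred write would essentially always land before the dispatch step, which would hide this ordering problem.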

I chatted a little with @Krinkle and @daniel and ...

Firstly:

  • The purgeWebCache correctly triggers an htmlCacheUpdate job which runs quickly and "dirties" the parser cache for the page being dispatched to
  • This leads to cache misses when using a debug line like (MediaWiki\MediaWikiServices::getInstance()->getParserCache())->get(Title::newFromText('Category:Interior of Chapelle des Anges de Neunhoffen à Dambach (Bas-Rhin)')->toPageRecord(),ParserOptions::newCanonical())->getCacheTime();
  • A user view will parse the page correctly, and the page property for wikibase_item will be visible in the parser output with (MediaWiki\MediaWikiServices::getInstance()->getParserCache())->get(Title::newFromText('Category:Interior of Chapelle des Anges de Neunhoffen à Dambach (Bas-Rhin)')->toPageRecord(),ParserOptions::newCanonical())->getProperty('wikibase_item');
  • At this stage the page_prop will not be in the page_props table.

Secondly:

  • The scheduleRefreshLinks should trigger a RefreshLinksJob. (These take a little longer to run in most cases)
  • At some unknown time in the future the page_props table should be populated with the wikibase_item page property. (we saw this happen a few times)
  • You'd be able to see this in the DB, or using something like PageProps::getInstance()->getProperties([Title::newFromText('Category:Interior of Chapelle des Anges de Neunhoffen à Dambach (Bas-Rhin)')],'wikibase_item');

Now trying to apply this again to the random failures seen in T233520: Wikidata changes do not get sent to client sites on initial sitelink addition (in some cases), leading to things such as missing page props in page_props table where page_props is seemingly never populated....

One can make a WikiPageUpdater for a client with:

$services = MediaWiki\MediaWikiServices::getInstance();
$pu = new \Wikibase\Client\Changes\WikiPageUpdater( JobQueueGroup::singleton(), \Wikibase\Client\WikibaseClient::getLogger( $services ), $services->getStatsdDataFactory());

And then trigger a web cache job with something like:

$pu->purgeWebCache( [ Title::newFromText('Category:Interior of Chapelle des Anges de Neunhoffen à Dambach (Bas-Rhin)') ], ['rootJobTimestamp'=>wfTimestampNow()], 'update', 'uid:addshore' );

As with the bullet point lists above, we see the cache start missing at this point, and a web view will re-parse the page.

When trying to do the same with the page for the refreshlinks job, I can do the following:

$pu->scheduleRefreshLinks( [ Title::newFromText('Category:Interior of Chapelle des Anges de Neunhoffen à Dambach (Bas-Rhin)') ], ['rootJobTimestamp'=>wfTimestampNow()], 'update', 'uid:addshore' );

In theory this should schedule the job, and no error occurs.
But in both of these job cases I do not see anything matching the job I scheduled in kafka (perhaps I'm looking in the wrong place).
I waited 10-30 minutes and did not see the page_props for a category page on commons end up updating. (could be the job was still in the queue?)

I then manually ran the job on the debug server:

(new RefreshLinksJob( Title::newFromText('Category:Interior of Chapelle des Anges de Neunhoffen à Dambach (Bas-Rhin)'), ['rootJobTimestamp'=>wfTimestampNow()] ))->run();

And the page_prop now instantly appears in the page_props table:

>>> PageProps::getInstance()->getProperties([Title::newFromText('Category:Interior of Chapelle des Anges de Neunhoffen à Dambach (Bas-Rhin)')],'wikibase_item');
=> [
     103128055 => "Q106684122",
   ]

So either:

  1. The call to scheduleRefreshLinks is not actually scheduling a job? and thus the page_props table is never being updated?
  2. The jobs for the pages that appear to not have page_props just have not executed yet?
  3. something else?

Looking at https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-job=refreshLinks though, these jobs should only take a few hours at most?
But really this is ~15 mins currently.
This lines up with what @Michael @Jakob_WMDE and I were seeing between testwiki and testwikidata, but not with what is seen in T233520 on commonswiki?

We can use page.page_links_updated to see when the page links / a refresh links job has last been run on the page!
(thanks again @Krinkle)
In PHP we can do WikiPage::factory(Title::newFromText('Category:Interior of Chapelle des Anges de Neunhoffen à Dambach (Bas-Rhin)'))->getLinksTimestamp()

If we run the job synchronously on a debug server, we can see the timestamp get bumped correctly.

>>> WikiPage::factory(Title::newFromText('Category:Interior of Chapelle des Anges de Neunhoffen à Dambach (Bas-Rhin)'))->getLinksTimestamp();
=> "20210505193823"
>>> (new RefreshLinksJob( Title::newFromText('Category:Interior of Chapelle des Anges de Neunhoffen à Dambach (Bas-Rhin)'), ['rootJobTimestamp'=>wfTimestampNow()] ))->run();
=> true
>>> WikiPage::factory(Title::newFromText('Category:Interior of Chapelle des Anges de Neunhoffen à Dambach (Bas-Rhin)'))->getLinksTimestamp();
=> "20210505195510"

Trying to schedule a job in the queue manually (and wait for it to execute) for the commons category:

>>> WikiPage::factory(Title::newFromText('Category:Interior of Chapelle des Anges de Neunhoffen à Dambach (Bas-Rhin)'))->getLinksTimestamp();
=> "20210505195510"
>>> wfTimestampNow();
=> "20210505195834"
>>> $pu->scheduleRefreshLinks( [ Title::newFromText('Category:Interior of Chapelle des Anges de Neunhoffen à Dambach (Bas-Rhin)') ], ['rootJobTimestamp'=>wfTimestampNow()], 'update', 'uid:addshore' );
=> null
....
>>> WikiPage::factory(Title::newFromText('Category:Interior of Chapelle des Anges de Neunhoffen à Dambach (Bas-Rhin)'))->getLinksTimestamp();
=> "20210506050112"

It takes a while and at the time of writing this comment I am still waiting for it to execute...

We could see the job enter the kafka queue:

addshore@stat1007:~$ kafkacat -b kafka-main1001.eqiad.wmnet -p 0 -t 'eqiad.mediawiki.job.refreshLinks' -o -1000000 | grep commons.wikimedia.org | grep Neunhoffen
{"$schema":"/mediawiki/job/1.0.0","meta":{"uri":"https://placeholder.invalid/wiki/Special:Badtitle","request_id":"766bd8c164f5e0761ffc3a89","id":"9f61fcb8-85db-4e66-906c-155a16fddb16","dt":"2021-05-05T19:58:17Z","domain":"commons.wikimedia.org","stream":"mediawiki.job.refreshLinks"},"database":"commonswiki","type":"refreshLinks","sha1":"3a14ea46843e2f7b3b687bdf45112fbca14ac89a","params":{"rootJobTimestamp":"20210505195817","causeAction":"update","causeAgent":"uid:addshore","namespace":14,"title":"Interior_of_Chapelle_des_Anges_de_Neunhoffen_à_Dambach_(Bas-Rhin)","requestId":"766bd8c164f5e0761ffc3a89"},"mediawiki_signature":"32c72f98713180f9aefc8059c33016bb221d6950"}

It took a while so at the same time I tried on testwiki:

>>> WikiPage::factory(Title::newFromText('PagePropsies'))->getLinksTimestamp();
=> "20210505124631"
>>> wfTimestampNow();
=> "20210505201007"
>>> $pu->scheduleRefreshLinks( [ Title::newFromText('PagePropsies') ], ['rootJobTimestamp'=>wfTimestampNow()], 'update', 'uid:addshore' );
=> null
...
>>> WikiPage::factory(Title::newFromText('PagePropsies'))->getLinksTimestamp();
=> "20210505202436"

So on testwiki this took 14 mins 30 seconds (roughly what @Jakob_WMDE @Michael & I saw testing).
And a commonswiki category took ~9 hours to get executed from the queue.


Some more relevant conversation from IRC

9:37 PM <Pchelolo> addshore: is your job refreshLinks or refreshLinksPrioritized?
9:38 PM <addshore> How do I know that? I was assuming refreshLinks, but perhaps that's where I went wrong?
9:38 PM <Pchelolo> since those 2 have VERY different guarantees.. for prioritize one mean delay is a few seconds, while for regular it's 6 hours
9:45 PM <Pchelolo> kafkacat -b kafka-main1001.equad.wmnet
9:45 PM <Pchelolo> kafka-jumbo is an analytics cluster where jobs were replicated at some point, but ottomata is in the process of stopping the mirroring
9:47 PM <Pchelolo> so, with kafka-main you should see you job in non-prioritized non-partitioned topic. e.g kafkacat -b kafka-main1001.eqiad.wmnet:9092 -p 0 -t eqiad.mediawiki.job.refreshLinks
9:47 PM <Pchelolo> if you are sending it via wikibase WikiPageUpdater like in your paste on phab


So applying this to investigate another one of these seemingly more "random" failures:

  • I grabbed a commons category page from https://www.wikidata.org/w/index.php?title=Special:Contributions&offset=20210503061116&limit=500&target=Pi+bot&namespace=all&tagfilter=&newOnly=1&start=&end=
  • 2 May 18:48 the sitelink was added https://www.wikidata.org/w/index.php?title=Q106686114&diff=1413255927&oldid=1413255916
  • I confirm with (MediaWiki\MediaWikiServices::getInstance()->getParserCache())->get(Title::newFromText('Category:Views_from_Palacio_Municipal_(Montevideo)')->toPageRecord(),ParserOptions::newCanonical())->getProperty('wikibase_item'); that the property exists in the parser output
  • I confirm with PageProps::getInstance()->getProperties([Title::newFromText('Category:Views_from_Palacio_Municipal_(Montevideo)')],'wikibase_item'); that no page_prop is currently in the db.
  • I look for the job from 3-4 days ago...
    • For refreshLinks
      • An offset of -50000000 starts the topic with dates from 2021-05-01T19:54:24Z which is enough time
      • So kafkacat -C -b kafka-main1001.eqiad.wmnet -p 0 -t 'eqiad.mediawiki.job.refreshLinks' -o -50000000 | grep commons.wikimedia.org | grep Montevideo | grep Palacio | grep Views should show us any refreshLinks jobs for this page that were triggered after the edits before reaching the end of the topic
      • I didn't find anything P15790
    • I should also find a html cache update job....
      • An offset of -17000000 gave me timestamps of around 2021-05-02T10:50:26Z, so that should do as a starting point
      • So kafkacat -C -b kafka-main1001.eqiad.wmnet -p 0 -t 'eqiad.mediawiki.job.htmlCacheUpdate' -o -17000000| grep commons.wikimedia.org | grep Montevideo | grep Palacio | grep Views should show us any jobs for this page that were triggered...
      • I did find this job P15791

So it looks like the refreshLinks job is not being scheduled? (which would explain the missing page_props in the table even after waiting days...)


Checking the job situation on another commons category that was recently edited...

https://www.wikidata.org/wiki/Q106685791
Sitelink was added on the 2nd of May after 17:00 https://www.wikidata.org/w/index.php?title=Q106685791&diff=1413223762&oldid=1413223751
To the commons category https://commons.wikimedia.org/wiki/Category:Views_from_the_IND_Fulton_Street_Line

$t = "Category:Views_from_the_IND_Fulton_Street_Line";
(MediaWiki\MediaWikiServices::getInstance()->getParserCache())->get(Title::newFromText($t)->toPageRecord(),ParserOptions::newCanonical())->getCacheTime();
(MediaWiki\MediaWikiServices::getInstance()->getParserCache())->get(Title::newFromText($t)->toPageRecord(),ParserOptions::newCanonical())->getProperty('wikibase_item');
PageProps::getInstance()->getProperties([Title::newFromText($t)],'wikibase_item');

Showed that apparently the parser cache entry exists from before the time of the edit?
And also that the page_props table was not populated...

>>> (MediaWiki\MediaWikiServices::getInstance()->getParserCache())->get(Title::newFromText($t)->toPageRecord(),ParserOptions::newCanonical())->getCacheTime();
=> "20210424003324"
>>> (MediaWiki\MediaWikiServices::getInstance()->getParserCache())->get(Title::newFromText($t)->toPageRecord(),ParserOptions::newCanonical())->getProperty('wikibase_item');
=> false
>>> PageProps::getInstance()->getProperties([Title::newFromText($t)],'wikibase_item');
=> []

Looking for the relevant jobs...

addshore@stat1007:~$ kafkacat -C -b kafka-main1001.eqiad.wmnet -p 0 -t 'eqiad.mediawiki.job.htmlCacheUpdate' -o -17000000 | grep commons.wikimedia.org | grep Views_from_the_IND_Fulton_Street_Line
% Reached end of topic eqiad.mediawiki.job.htmlCacheUpdate [0] at offset 5360231743
addshore@stat1007:~$ kafkacat -C -b kafka-main1001.eqiad.wmnet -p 0 -t 'eqiad.mediawiki.job.refreshLinks' -o -50000000 | grep commons.wikimedia.org | grep Views_from_the_IND_Fulton_Street_Line
% Reached end of topic eqiad.mediawiki.job.refreshLinks [0] at offset 16377843525

In this case, neither of the jobs was scheduled or run?

And then manually making these jobs from Wikibase PHP code on a debug server...

$services = MediaWiki\MediaWikiServices::getInstance();
$pu = new \Wikibase\Client\Changes\WikiPageUpdater( JobQueueGroup::singleton(), \Wikibase\Client\WikibaseClient::getLogger( $services ), $services->getStatsdDataFactory());
$pu->purgeWebCache( [ Title::newFromText('Category:Views_from_the_IND_Fulton_Street_Line') ], ['rootJobTimestamp'=>wfTimestampNow()], 'update', 'uid:addshore' );
$pu->scheduleRefreshLinks( [ Title::newFromText('Category:Views_from_the_IND_Fulton_Street_Line') ], ['rootJobTimestamp'=>wfTimestampNow()], 'update', 'uid:addshore' );

Looking for the by hand jobs

addshore@stat1007:~$ kafkacat -C -b kafka-main1001.eqiad.wmnet -p 0 -t 'eqiad.mediawiki.job.htmlCacheUpdate' -o -1000000 | grep commons.wikimedia.org | grep Views_from_the_IND_Fulton_Street_Line

{"$schema":"/mediawiki/job/1.0.0","meta":{"uri":"https://placeholder.invalid/wiki/Special:Badtitle","request_id":"5eea20a0a6fd2a89a842fde5","id":"619b6d62-ce04-473d-ac61-695ca21ee627","dt":"2021-05-05T22:48:47Z","domain":"commons.wikimedia.org","stream":"mediawiki.job.htmlCacheUpdate"},"database":"commonswiki","type":"htmlCacheUpdate","sha1":"6b6626bec7f564a9945e696aa5f92c93f49e27b1","root_event":{"signature":"params:ee961b0b6ee828a2b414c3d895df758dea65ec94","dt":"2021-05-05T22:48:47Z"},"params":{"pages":{"104243054":[14,"Views_from_the_IND_Fulton_Street_Line"]},"rootJobSignature":"params:ee961b0b6ee828a2b414c3d895df758dea65ec94","rootJobTimestamp":"20210505224847","causeAction":"update","causeAgent":"uid:addshore","namespace":-1,"title":"Badtitle/Wikibase\\Client\\Changes\\WikiPageUpdater","requestId":"5eea20a0a6fd2a89a842fde5"},"mediawiki_signature":"42cd76073d5b45e84255a2460d2bd4fb262556bc"}
% Reached end of topic eqiad.mediawiki.job.htmlCacheUpdate [0] at offset 5360234541
addshore@stat1007:~$ kafkacat -C -b kafka-main1001.eqiad.wmnet -p 0 -t 'eqiad.mediawiki.job.refreshLinks' -o -1000000 | grep commons.wikimedia.org | grep Views_from_the_IND_Fulton_Street_Line

{"$schema":"/mediawiki/job/1.0.0","meta":{"uri":"https://placeholder.invalid/wiki/Special:Badtitle","request_id":"5eea20a0a6fd2a89a842fde5","id":"f32e87fd-202b-493f-bc57-8b2edb857de1","dt":"2021-05-05T22:48:48Z","domain":"commons.wikimedia.org","stream":"mediawiki.job.refreshLinks"},"database":"commonswiki","type":"refreshLinks","sha1":"1bf24cbffca3ae3120f434f9167099eb1bea7ec3","params":{"rootJobTimestamp":"20210505224848","causeAction":"update","causeAgent":"uid:addshore","namespace":14,"title":"Views_from_the_IND_Fulton_Street_Line","requestId":"5eea20a0a6fd2a89a842fde5"},"mediawiki_signature":"88689ad9d37babfe1e5408b46a1620c315ddf645"}
% Reached end of topic eqiad.mediawiki.job.refreshLinks [0] at offset 16377856891

I was then going to check the wb_changes table for this edit, but unfortunately it seems to have happened just before the current cutoff...

mysql:research@dbstore1005.eqiad.wmnet [wikidatawiki]> select change_time from wb_changes ORDER BY change_time ASC limit 2;
+----------------+
| change_time    |
+----------------+
| 20210502224506 |
| 20210502224506 |
+----------------+
2 rows in set (0.001 sec)

But I could also look for the ChangeNotificationJob which would happen if the change did exist.... (in the topic eqiad.mediawiki.job.ChangeNotification in kafka)
An offset of -2000000 currently gives us timestamps of 20210430041324, which is plenty far back enough to start looking.
But these jobs only include the change ids.... not other info we can look for...


So these jobs (htmlCacheUpdate & refreshLinks) are sometimes not getting into kafka.
Sometimes it's one job, sometimes it's both, and seemingly sometimes both get scheduled?
This may be an issue in Wikibase or in MediaWiki, but the issue seems to happen between the edit occurring and the job landing in kafka, so this could be:

  • wb_changes table entries failing?
  • ChangeNotificationJob failing?
  • htmlCache / linksUpdate jobs failing?

It would probably be beneficial to add some more logging in Wikibase code around this area to see if we can get to the bottom of this...

Reproduction case:

  • Come up with some easy grepable string, such as "IMGREPABLE87654"
  • Create a new item on testwikidatawiki with this string as the label
  • Create a new page on testcommonswiki with this string as the title
  • Close the page on testcommonswiki and do not re-open it or query the API etc. about this page at all
  • Add a sitelink to testcommonswiki and the page you created
  • Wait and do not touch anything for some time.
  • You have reproduced the issue

Reproduction verification:

  • You can see that a wb_changes entry was made for the item using something like select * from wb_changes where change_object_id = "Q215220";
  • You can assume that the dispatch process did look at this wb_changes entry (no code snippet provided, but I tested this separately)
  • You can see that no ChangeNotificationJob was scheduled for the entry in wb_changes using something like kafkacat -C -b kafka-main1001.eqiad.wmnet -p 0 -t 'eqiad.mediawiki.job.ChangeNotification' -o -10000 | grep test | grep 547009 where the last grep is the ID of the wb_changes row.
  • As this job was not scheduled you would also not expect to see the html cache update job or the links update job.
  • You can confirm that a html cache update job has not run, as the page still has a parser cache entry (MediaWiki\MediaWikiServices::getInstance()->getParserCache())->get(Title::newFromText($t)->toPageRecord(),ParserOptions::newCanonical())->getCacheTime(); where $t is the page title.
  • You can also confirm nothing is in the page_props table with PageProps::getInstance()->getProperties([Title::newFromText($t)],'wikibase_item');
  • You can also confirm that the page on testcommonswiki is NOT subscribed to the Item select * from wb_changes_subscription where cs_subscriber_id = "testcommonswiki" AND cs_entity_id = "Q215220";

The issue
The dispatch process relies on clients being subscribed to entities in order for the ChangeNotificationJobs to be sent.
This can be seen in https://github.com/wikimedia/Wikibase/blob/master/repo/includes/ChangeDispatcher.php#L400
Which ultimately ends up querying the wb_changes_subscription table
https://github.com/wikimedia/Wikibase/blob/33c8fb8a6fd09d15cc7a3c47bdc6c60e398ce0e2/client/includes/Usage/Sql/SqlSubscriptionManager.php#L100
This is ONLY written to by a job on the client, on parser output save
https://github.com/wikimedia/Wikibase/blob/33c8fb8a6fd09d15cc7a3c47bdc6c60e398ce0e2/client/includes/Hooks/DataUpdateHookHandler.php#L195
Thus the dispatch process will not happen for the client page if either of the following holds:

  • The page is not already subscribed for some reason
  • The page does not get its parser output generated between the wikidata edit and the dispatch process processing the change
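The chicken-and-egg above can be sketched as follows (illustrative python with assumed names; not the Wikibase code):

```python
# Minimal sketch of the dispatch gating described above: the dispatcher only
# notifies clients subscribed to the changed entity, but the subscription
# itself is only written when the client page gets parsed while it already
# knows about the item.

subscriptions = set()  # wb_changes_subscription: (client, entity) pairs

def clients_to_notify(change_entity):
    # analogue of ChangeDispatcher filtering on wb_changes_subscription
    return [c for (c, e) in subscriptions if e == change_entity]

# Initial sitelink addition: the client has never parsed a ParserOutput that
# uses the item, so it is not subscribed, so it is never notified (and thus
# never re-parses, and thus never subscribes).
assert clients_to_notify("Q215220") == []

# After a manual purge/edit on the client, the subscription-adding job runs:
subscriptions.add(("testcommonswiki", "Q215220"))
assert clients_to_notify("Q215220") == ["testcommonswiki"]
```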

Making the process work with 1 more edit:

  • It's now time to load the commons page again and purge the page with action=purge
  • This will trigger the job adding the subscription, which can be seen with something like select * from wb_changes_subscription where cs_subscriber_id = "testcommonswiki" AND cs_entity_id = "Q215220";
  • We can now make another edit to the item (adding a description for example) https://test.wikidata.org/w/index.php?title=Q215220&diff=540854&oldid=540853
  • Which should make a new change that we can grep for a job for and see such as kafkacat -C -b kafka-main1001.eqiad.wmnet -p 0 -t 'eqiad.mediawiki.job.ChangeNotification' -o -100000 | grep 547010
  • This will in turn schedule the 2 jobs we expect (htmlcacheupdate and links update)
  • We can see that these have happened, as the parsercache timestamp will now be bumped again, and the page_props table will be updated shortly (a few mins at most)

Suspicion
This code has existed in this way forever (ish)
At some point Wikibase used to fully purge the pages on the client, maybe even in a slightly different place in the code? At some point this changed, and this flow and its assumptions about ordering were broken somehow.

How I got here
I came across this reproduction after finding another commons category to look at that was recently added as a sitelink to an item on wikidata, using select * from wb_items_per_site where ips_site_id = "commonswiki" order by ips_row_id DESC limit 1000;
I noticed in this case that the ChangeNotificationJob for commonswiki was not being triggered, but there was a job for another site.
So the dispatch process was looking at the change, and deciding not to send a job to commons.

Wikidata Edit: https://www.wikidata.org/w/index.php?title=Q104008364&diff=1416556742&oldid=1320773972
Page on commons https://commons.wikimedia.org/wiki/Category:Neo-Baroque_architecture_in_Prague
Api on commons https://commons.wikimedia.org/w/api.php?action=query&format=json&prop=pageprops&titles=Category:Neo-Baroque%20architecture%20in%20Prague
I did not find htmlCacheUpdate or refreshLinks for the page
I found the wb_changes ID was 1409977147
And I found 1 ChangeNotificationJob for the edit, but to a different client only (not commons)
And I then found that the wikibase-addUsagesForPage job that added the entry to wb_changes_subscription ran many hours after the dispatch process (probably when I loaded the page to see what was going on)

addshore@stat1007:~$ kafkacat -C -b kafka-main1001.eqiad.wmnet -p 0 -t 'eqiad.mediawiki.job.ChangeNotification' -o -1000000 | grep 1409977147
{"$schema":"/mediawiki/job/1.0.0","meta":{"uri":"https://placeholder.invalid/wiki/Special:Badtitle","request_id":"a5a3d4be3e34917eeead065e","id":"24c12a48-2087-41b9-869b-f4dd3768a9fb","dt":"2021-05-07T09:02:47Z","domain":"cs.wikipedia.org","stream":"mediawiki.job.ChangeNotification"},"database":"cswiki","type":"ChangeNotification","params":{"repo":false,"changeIds":[1409977147,1409977151],"rootJobTimestamp":"20210507090247"},"mediawiki_signature":"0302de8979c4b7e71ecfaa84feaa01dbb7129d3d"}
% Reached end of topic eqiad.mediawiki.job.ChangeNotification [0] at offset 319689297
% Reached end of topic eqiad.mediawiki.job.ChangeNotification [0] at offset 319689298

Will wait until next week for @Jakob_WMDE and/or @Michael to confirm this reproduction; then I think we could close the investigation and come up with a plan of attack.

Another learning.
Chained greps on kafkacat seem to break and give different results, so should be avoided...

See below where I would expect the job to be shown in both greps

addshore@stat1007:~$ kafkacat -C -b kafka-main1001.eqiad.wmnet -p 0 -t 'eqiad.mediawiki.job.ChangeNotification' -o -100000 | grep 858d2fa00061557d1f5f64a2b650c7b61b604994
{"$schema":"/mediawiki/job/1.0.0","meta":{"uri":"https://placeholder.invalid/wiki/Special:Badtitle","request_id":"59d70f2de736400dbaf3d953","id":"58fd9760-325d-426e-9f39-a0d3b20d9e88","dt":"2021-05-07T19:44:22Z","domain":"test-commons.wikimedia.org","stream":"mediawiki.job.ChangeNotification"},"database":"testcommonswiki","type":"ChangeNotification","params":{"repo":false,"changeIds":[547010],"rootJobTimestamp":"20210507194421"},"mediawiki_signature":"858d2fa00061557d1f5f64a2b650c7b61b604994"}
% Reached end of topic eqiad.mediawiki.job.ChangeNotification [0] at offset 319708891
^C
addshore@stat1007:~$ kafkacat -C -b kafka-main1001.eqiad.wmnet -p 0 -t 'eqiad.mediawiki.job.ChangeNotification' -o -100000 | grep 858d2fa00061557d1f5f64a2b650c7b61b604994 | grep test
% Reached end of topic eqiad.mediawiki.job.ChangeNotification [0] at offset 319708898

I couldn't resist poking @daniel with the findings.

9:34 PM <duesen> I'm trying to remember whether saving the wikidata page would add subscriptions for all sites that have sitelinks.
9:35 PM <addshore> onParserCacheSaveComplete
9:36 PM <addshore> another possibility is that we used to dispatch to all sites that had sitelinks or something?
9:36 PM <duesen> I think we forced dispatching to the site that was just added, if the event was "add a sitelink".

We went digging around and found exactly what daniel was referring to

9:44 PM <addshore> So that code path depends on this change being an ItemChange and i dont see where that is made right now
9:45 PM <duesen> I think it'S because only ItemChanges have site links. Or at least, that's how it used to be.
9:46 PM <duesen> Wiring for 'WikibaseClient.EntityChangeFactory' puts a mapping into EntityChangeFactory, which has Item::ENTITY_TYPE => ItemChange::class,
9:47 PM <addshore> so the only other thing I see that could go wrong is this diff condition
9:47 PM <addshore> https://github.com/wikimedia/Wikibase/blob/33c8fb8a6fd09d15cc7a3c47bdc6c60e398ce0e2/repo/includes/ChangeDispatcher.php#L362
9:54 PM <addshore> i got it
9:55 PM <addshore> >>> MediaWiki\MediaWikiServices::getInstance()->getService('WikibaseRepo.Store')->getEntityChangeLookup()->loadByChangeIds([547009])[0]
9:55 PM <addshore> => Wikibase\Repo\Notifications\RepoItemChange {#5576}
9:55 PM <addshore> But the condition is if ( $change instanceof ItemChange ) {

The object being returned by the factory was changed in https://github.com/wikimedia/Wikibase/commit/d05a0c7ee31f62dd00de583646c45470891f5204#diff-ee6dbb8537c6e9d555c439b164d9ccc2e4d40aedcff12ec86331e5c5637bf7ebR748,
but this condition in the dispatching code was not changed.
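A rough python analogue of the bug (assuming, as the shell.php output above implies, that the refactored class no longer passes the old instanceof check):

```python
# The dispatcher's `$change instanceof ItemChange` branch forces dispatch to
# the site a sitelink was just added to. If the factory now returns a class
# that no longer satisfies that check, the branch is silently skipped.

class ItemChange:
    pass

class RepoItemChange:  # stand-in for the refactored class from the factory
    pass

def should_force_dispatch_to_sitelink_site(change):
    return isinstance(change, ItemChange)

assert should_force_dispatch_to_sitelink_site(ItemChange()) is True
# The new class fails the check, so the sitelink site is never notified:
assert should_force_dispatch_to_sitelink_site(RepoItemChange()) is False
```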

Change 686768 had a related patch set uploaded (by Addshore; author: Addshore):

[mediawiki/extensions/Wikibase@master] Fix: When a sitelink change happens, always dispatch to the site

https://gerrit.wikimedia.org/r/686768

Moving the investigation to peer review.
Though I think the parent ticket should be picked up "in full" now for a proper fix and tests / cleanup

Another learning.
Chained greps on kafkacat seem to break and give different results, so should be avoided...

9:23 PM <addshore> Is there something I should know about grepping the output of kafkacat twice? I seem to not get matches if I pipe it through 2 greps? https://phabricator.wikimedia.org/T280627#7070564
9:39 PM <ebernhardson> addshore: i'd guess something about flushing, try --line-buffered on the first grep
10:15 PM <addshore> ebernhardson: yup, that solves it!

Looks like this investigation can probably be moved to done then? Or are we waiting for a second reproduction on test? @Jakob_WMDE / @Michael

Jakob on Mattermost:

amazing, I can even reproduce it locally:

create client page
create repo item
add sitelink from repo to the client page. make sure not to touch the client page after this.
wb_changes entries appear
dispatchChanges.php does nothing, runJobs.php --wiki client does nothing
edit the client page
dispatching suddenly works

Change 686768 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] Fix: When a sitelink change happens, always dispatch to the site

https://gerrit.wikimedia.org/r/686768