
Wikidata changes do not get sent to client sites on initial sitelink addition (in some cases), leading to things such as missing page props in page_props table
Closed, Resolved, Public

Description

https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/501444/ seems to have changed the behaviour of the links update job to use the parser cache when a page has an entry.
As a result, when a wikibase repo (Wikidata) edit is dispatched to a client to update its pages, the ContentAlterParserOutput hook sometimes will not run; that hook is currently what adds the wikibase_item page prop.

After a little more digging, it looks like the behaviour may actually have changed at the end of 2018 in https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/465157/
In that patch, refreshLinks requests the parser output from RevisionRenderer with generate-html set to the result of shouldCheckParserCache, which I believe would be false for a client page having a links update triggered by Wikidata dispatching.

This does not happen to pages that get updated but have no parser cache entry.
Any affected page can be fixed with a null edit.
I also assume that when a parser cache entry expires, the new entry will update the page props correctly (TODO: verify / confirm).
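
In the meantime, a minimal sketch of the null-edit-equivalent workaround (run from e.g. eval.php on the client wiki). It assumes that a purge plus a refreshLinks job enqueued without a rootJobTimestamp is enough, since the cached-ParserOutput shortcut (quoted later in this task) is only taken when that timestamp is present:

	$title = Title::newFromText( 'Category:Broadway_East,_Baltimore' );
	$page = WikiPage::factory( $title );
	$page->doPurge(); // bumps page_touched via Title::invalidateCache()
	// No rootJobTimestamp in the params, so the job cannot take the cached path
	JobQueueGroup::singleton()->push( new RefreshLinksJob( $title, [] ) );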

Possible solutions

  • Change where this update (and potentially others) happens.
  • Allow the Wikibase update process to ignore the parser cache in the job (see the sketch after this list).
  • Something else?
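
A hypothetical sketch of the second option above; skipParserCache is an invented job parameter (RefreshLinksJob has no such option), shown only to illustrate the shape of the fix:

	// Client side: Wikibase would enqueue its links update with the flag set
	JobQueueGroup::singleton()->push(
		new RefreshLinksJob( $title, [ 'skipParserCache' => true ] )
	);

	// Job side: RefreshLinksJob would bail out of its cache lookup, forcing a
	// full reparse (which runs ContentAlterParserOutput and writes the prop)
	if ( !empty( $this->params['skipParserCache'] ) ) {
		return null;
	}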

Bug report

Title: page_props missing links for some Commons category <-> Wikidata sitelinks

Some Commons categories are linked to Wikidata, but are missing from the page_props table in the database. Examples:

https://commons.wikimedia.org/wiki/Category:Broadway_East,_Baltimore
https://commons.wikimedia.org/wiki/Category:Buddhist_temples_in_Lamphun_Province
https://commons.wikimedia.org/wiki/Category:Buddhist_temples_in_Ubon_Ratchathani_Province
https://commons.wikimedia.org/wiki/Category:Civil_law_notaries
https://commons.wikimedia.org/wiki/Category:Climate_change_conferences
https://commons.wikimedia.org/wiki/Category:Former_components_of_the_Dow_Jones_Industrial_Average
https://commons.wikimedia.org/wiki/Category:Dukes_of_the_Archipelago
https://commons.wikimedia.org/wiki/Category:Eastern_Catholic_orders_and_societies
https://commons.wikimedia.org/wiki/Category:English_people_of_Turkish_descent

Lucas confirmed this on Twitter (https://twitter.com/LucasWerkmeistr/status/1175747434208727040) and there's discussion at https://www.wikidata.org/wiki/User_talk:Jheald#Quarry_oddity . It's not clear why this is happening; the examples I've given are from categories I found while looking through Commons links from enwp (in alphabetical order, hence B-E).

Acceptance criteria

  • page_props are always updated for the client page when a sitelink is added on the repository

Related Objects

Event Timeline


Okay, I spent several hours debugging and trying to figure out what's going on and how I0cd13c424 broke it.

Spoiler alert: It's not logic in the hook or issues with virtual usages.

Let's break down the steps to reproduce it:

  • Create a page on Wikipedia that includes some data from a statement on a Wikidata item
    • This creates a ParserCache entry without the Wikidata data (the stuff that goes into page_props), because the page is not yet connected to Wikidata
  • Add a sitelink from the item to the page
    • This happens on Wikidata; it triggers a row in the wb_changes table that later gets dispatched to the client wiki
    • The dispatch script sees a need to run refreshLinks on that page and thus enqueues a ChangeNotification job
    • The ChangeNotification job enqueues a refreshLinks job
    • The great refreshLinks job needs a ParserOutput (PO) object; with the change merged, it tries to load the PO from the ParserCache (PC):
      • Oh, there's already a PC object from when the page was originally created, and my rootJobTimestamp + 10 seconds is still younger than the original PC entry
      • Look, the revision id hasn't changed since then, same as page_touched
      • Cool, let's just use the cached PO instead

As a result:

  • The page_props will not be updated for the page

The code at fault is this:

	/**
	 * Get the parser output from cache if it reflects the change that triggered this job
	 *
	 * @param ParserCache $parserCache
	 * @param WikiPage $page
	 * @param RevisionRecord $currentRevision
	 * @param StatsdDataFactoryInterface $stats
	 * @return ParserOutput|null
	 */
	private function getParserOutputFromCache(
		ParserCache $parserCache,
		WikiPage $page,
		RevisionRecord $currentRevision,
		StatsdDataFactoryInterface $stats
	) {
		$cachedOutput = null;
		// If page_touched changed after this root job, then it is likely that
		// any views of the pages already resulted in re-parses which are now in
		// cache. The cache can be reused to avoid expensive parsing in some cases.
		$rootTimestamp = $this->params['rootJobTimestamp'] ?? null;
		if ( $rootTimestamp !== null ) {
			$opportunistic = !empty( $this->params['isOpportunistic'] );
			if ( $opportunistic ) {
				// Neither clock skew nor DB snapshot/replica DB lag matter much for
				// such updates; focus on reusing the (often recently updated) cache
				$lagAwareTimestamp = $rootTimestamp;
			} else {
				// For transclusion updates, the template changes must be reflected
				$lagAwareTimestamp = wfTimestamp(
					TS_MW,
					wfTimestamp( TS_UNIX, $rootTimestamp ) + self::NORMAL_MAX_LAG
				);
			}

			if ( $page->getTouched() >= $rootTimestamp || $opportunistic ) {
				// Cache is suspected to be up-to-date so it's worth the I/O of checking.
				// As long as the cache rev ID matches the current rev ID and it reflects
				// the job's triggering change, then it is usable.
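				// NOTE (annotation per the analysis above, not a comment from core):
				// for a Wikibase-dispatched sitelink change the revision ID is
				// unchanged and HTMLCacheUpdateJob has already bumped page_touched
				// past rootJobTimestamp, so this branch reuses the stale cached
				// output and skips the reparse that would run ContentAlterParserOutput.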
				$parserOptions = $page->makeParserOptions( 'canonical' );
				$output = $parserCache->getDirty( $page, $parserOptions );
				if (
					$output &&
					$output->getCacheRevisionId() == $currentRevision->getId() &&
					$output->getCacheTime() >= $lagAwareTimestamp
				) {
					$cachedOutput = $output;
				}
			}
		}

		if ( $cachedOutput ) {
			$stats->increment( 'refreshlinks.parser_cached' );
		} else {
			$stats->increment( 'refreshlinks.parser_uncached' );
		}

		return $cachedOutput;
	}

I was able to confirm that this patch causes the problem by making the cache function return null all the time, after which the records appeared in page_props.
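
Concretely, that temporary hack was along these lines (local verification only, not a proposed fix):

	private function getParserOutputFromCache(
		ParserCache $parserCache,
		WikiPage $page,
		RevisionRecord $currentRevision,
		StatsdDataFactoryInterface $stats
	) {
		// Debugging hack: never reuse the cache, so every job reparses the page
		// and page_props gets repopulated
		return null;
	}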

Options:

  • Introduce an opposite of the isOpportunistic option and make the refreshLinks jobs enqueued by Wikibase set it.
    • Side note: isOpportunistic is a pretty vague name; what does an "opportunistic" job do? It's also completely undocumented, and it's a string flag, very easily typo-able.
  • Make the refreshLinks job run the hook when it gets the cached PO (see the sketch after this list). You might need to move the hook call to a public function and invoke it there. It's rather complicated; you need the base content object there.
  • Completely drop the whole "let's grab the cached version first" logic. The stats say it's used quite a lot, but I think most of those hits are indeed Wikibase jobs, which makes the first option and this option the same performance-wise.
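
A hedged sketch of the second option; the ContentAlterParserOutput hook and its ( $content, $title, $parserOutput ) signature are MediaWiki's, but invoking it from the cached path like this is the speculative part:

	use MediaWiki\Revision\SlotRecord;

	if ( $cachedOutput ) {
		// Re-fire the hook Wikibase uses to inject page_props, since the cached
		// path skipped the parse that would normally run it
		$content = $currentRevision->getContent( SlotRecord::MAIN );
		Hooks::run( 'ContentAlterParserOutput',
			[ $content, $page->getTitle(), $cachedOutput ] );
	}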

Once again, the code base seems extremely convoluted and complex; I'm not sure if this complexity is the only option given our complex infra, or if it can be improved.

cc @aaron @tstarling

Updating page touched would also work?

  • updating page_touched would probably trigger a reparse of the page (and currently that doesn't happen for the case described)

In practical terms, WikiPageUpdater should probably just call WikiPage::doPurge(), instead of trying to duplicate the functionality.

This would update page touched via Title::invalidateCache as far as I can see.

> Updating page touched would also work?
>
> • updating page_touched would probably trigger a reparse of the page (and currently that doesn't happen for the case described)
>
> In practical terms, WikiPageUpdater should probably just call WikiPage::doPurge(), instead of trying to duplicate the functionality.
>
> This would update page touched via Title::invalidateCache as far as I can see.

This already happens (changing page_touched) through HTMLCacheUpdateJob, and that's one of the reasons the problem exists: the try_to_get_cached_parser_output method (the aforementioned method I pasted) checks whether page_touched is newer than the rootJobTimestamp, which by definition it is (the root job is the change notification job that spawns HTMLCacheUpdateJob).

Moving this to waiting; waiting on core platform again now, I guess.

> This already happens (changing page_touched) through HTMLCacheUpdateJob, and that's one of the reasons the problem exists: the try_to_get_cached_parser_output method (the aforementioned method I pasted) checks whether page_touched is newer than the rootJobTimestamp, which by definition it is (the root job is the change notification job that spawns HTMLCacheUpdateJob).

So the solution would be for the LinksUpdate job to not inherit the root timestamp?
If that's not possible for some reason, LinksUpdate could have a flag for suppressing the opportunistic behavior, or for providing a different timestamp.

Change 574868 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/extensions/Wikibase@master] Remove rootJobTimestamp from refreshLinks jobs in client

https://gerrit.wikimedia.org/r/574868

> > This already happens (changing page_touched) through HTMLCacheUpdateJob, and that's one of the reasons the problem exists: the try_to_get_cached_parser_output method (the aforementioned method I pasted) checks whether page_touched is newer than the rootJobTimestamp, which by definition it is (the root job is the change notification job that spawns HTMLCacheUpdateJob).
>
> So the solution would be for the LinksUpdate job to not inherit the root timestamp?
> If that's not possible for some reason, LinksUpdate could have a flag for suppressing the opportunistic behavior, or for providing a different timestamp.

That's a rather easy way to fix it, yes.
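
Conceptually, change 574868 amounts to something like this on the Wikibase client side (a sketch, not the literal diff; $titles and the surrounding wiring are assumed):

	$jobs = [];
	foreach ( $titles as $title ) {
		// Omit rootJobTimestamp entirely: getParserOutputFromCache() returns
		// null when the timestamp is absent, so the page is always reparsed
		$jobs[] = new RefreshLinksJob( $title, [] );
	}
	JobQueueGroup::singleton()->push( $jobs );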

Change 574868 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Remove rootJobTimestamp from refreshLinks jobs in client

https://gerrit.wikimedia.org/r/574868

Can someone confirm this is working for you now please?

> Can someone confirm this is working for you now please?

It will go live tomorrow evening. Then it can be checked.

Possibly this is still happening, see https://commons.wikimedia.org/wiki/Category:Serangoon_Secondary_School - sitelink was added on Wikidata on 14 August, but the query in https://bitbucket.org/mikepeel/wikicode/src/master/query_infobox_candidates.py is not retrieving it. Could someone check this example please?

Interestingly, for one of the pages you link to the page_prop now appears; for the other it does not.

# https://commons.wikimedia.org/w/index.php?title=Category:Serangoon_Secondary_School&action=info
mysql:research@dbstore1004.eqiad.wmnet [commonswiki]> select * from page_props where pp_page = 5982384;
Empty set (0.001 sec)

# https://commons.wikimedia.org/w/index.php?title=Category:J%C3%BCdenstra%C3%9Fe_(Gotha)&action=info
mysql:research@dbstore1004.eqiad.wmnet [commonswiki]> select * from page_props where pp_page = 39602411;
+----------+---------------+------------+------------+
| pp_page  | pp_propname   | pp_value   | pp_sortkey |
+----------+---------------+------------+------------+
| 39602411 | wikibase_item | Q100535082 |       NULL |
+----------+---------------+------------+------------+

It looks like the one that does now have a page_prop was edited 2 days after your comment.

I've been adding infoboxes on Commons by finding the categories a different way (walking the category tree); when that edit is done, it seems to fix the problem. However, I can't always do that, since I'm using page_props to find the candidates to add infoboxes to! See related discussion at https://commons.wikimedia.org/w/index.php?title=User_talk:Mike_Peel&oldid=506602467

I think this is another issue than the one originally reported (the end-user issue is the same though). I think we should open a new ticket.

> I think this is another issue than the one originally reported (the end-user issue is the same though). I think we should open a new ticket.

I'm happy for this task to be split if that helps solve it! Another example seems to be https://commons.wikimedia.org/wiki/Category:Serra_da_Paulista

Just to note that this is still happening a lot - seemingly when the sitelink is added on Wikidata and the article has not been edited since. It is easily solved by null edits, but I am having to do quite a lot of them!

> Just to note that this is still happening a lot - seemingly when the sitelink is added on Wikidata and the article has not been edited since. It is easily solved by null edits, but I am having to do quite a lot of them!

Thanks for checking back, @Mike_Peel.
Have you got any links to recent titles / edits so that I can have another dig through the logs etc.?

> Thanks for checking back, @Mike_Peel.
> Have you got any links to recent titles / edits so that I can have another dig through the logs etc.?

How many links would you like? :-) One example is https://www.wikidata.org/wiki/Q105740303 (bot-created by pi bot/pywikibot a few days ago) vs. https://commons.wikimedia.org/wiki/Category:Views_of_Castillo_San_Felipe_de_Barajas (the infobox would have been added by the bot if page_props was set). It created quite a few similar items in that series of edits, see https://www.wikidata.org/w/index.php?title=Special:Contributions&offset=20210303003736&limit=500&target=Pi+bot&namespace=all&tagfilter=&newOnly=1&start=&end= - only a few have page_props set.

I can give examples for sitelinks created by others on commons/enwp/dewp/elsewhere if you want, but I'll need to turn off the null edit part of my scripts to preserve the examples...

Addshore added a project: wdwb-tech.
Addshore moved this task from Inbox to Research on the wdwb-tech board.

So this can still be reproduced on testwiki and testwikidata using the suggested reproduction path in T233520#5805154

Thus I'm not convinced that the change in https://gerrit.wikimedia.org/r/c/574868 fixes the whole problem and we probably need to take another dive into this.

@Lydia_Pintscher not sure how this one should get prioritized?

Can we timebox an investigation to say 2 days?

@Mike_Peel I'm wondering if you by chance have any links to recent examples again (ideally with no null edit having been done)

(in that second link, look at items from 'Category:Views from 10 Upper Bank Street (Q106685873)' and earlier.)

Change 686768 had a related patch set uploaded (by Addshore; author: Addshore):

[mediawiki/extensions/Wikibase@master] Fix: When a sitelink change happens, always dispatch to the site

https://gerrit.wikimedia.org/r/686768
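
Roughly, the idea behind this patch (a sketch, not the literal diff): derive the set of client sites to dispatch to from the sitelink diff itself, instead of relying only on existing usage subscriptions. ItemChange::getSiteLinkDiff() exists in Wikibase; $subscribedSites and the surrounding wiring are assumed:

	// A page that only just received its first sitelink has no usage
	// subscription yet, so derive affected wikis from the sitelink diff too
	$siteLinkDiff = $change->getSiteLinkDiff();
	$sitesFromDiff = $siteLinkDiff !== null
		? array_keys( $siteLinkDiff->getOperations() )
		: [];
	$affectedSites = array_unique( array_merge( $subscribedSites, $sitesFromDiff ) );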

Addshore renamed this task from page_props wikibase_item is sometimes not added to client pages when a sitelink is added on a repo to Wikidata changes do not get sent to client sites on initial sitelink addition (in some cases), leading to things such as missing page props in page_props table. May 7 2021, 9:27 PM
Addshore added subscribers: jeblad, Nirmos.
Addshore added subscribers: Tagishsimon, PeterBowman.

The investigation in T280627 seems to have got to the bottom of a cause for this.
Now that a cause has been identified, I think we can safely merge these 2 other tickets into this one, as it seems this is not a Commons-only issue, and it has been broken for ~10 months.

Change 686691 had a related patch set uploaded (by Addshore; author: Addshore):

[mediawiki/extensions/Wikibase@master] Revert "Remove rootJobTimestamp from refreshLinks jobs in client"

https://gerrit.wikimedia.org/r/686691

Change 686768 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] Fix: When a sitelink change happens, always dispatch to the site

https://gerrit.wikimedia.org/r/686768

Addshore claimed this task.

I'm sure happy to say that this should no longer be happening :)
Thanks to all of those involved in investigating and figuring this out.
And thanks to @Mike_Peel for helping us stay on top of production cases of the issue!

Thanks for fixing it! From some initial checks, all looks good to me. Pi bot will next bulk-create items on the 1st June, so I'll be able to check then whether everything is fixed (if you don't hear me complaining around then, it's good news. ;-) )

@Addshore OK, this is odd now. Take for example https://commons.wikimedia.org/wiki/Category:IMO_9869162 - for which Pi bot created https://www.wikidata.org/wiki/Q107073476 on 2 June. On Commons you can see the Wikidata link, which you couldn't before. However, Pi bot is still not seeing this, and so is not adding the infobox, using the query in:
https://bitbucket.org/mikepeel/wikicode/src/master/query_infobox_candidates.py
which I run nightly on toolforge.

So it seems that the situation has *improved* but is not yet *fixed*. Unless I'm doing something wrong in my query?

Change 721083 had a related patch set uploaded (by Addshore; author: Addshore):

[mediawiki/extensions/Wikibase@REL1_36] Fix: When a sitelink change happens, always dispatch to the site

https://gerrit.wikimedia.org/r/721083

Change 721083 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@REL1_36] Fix: When a sitelink change happens, always dispatch to the site

https://gerrit.wikimedia.org/r/721083