
page_props wikibase_item is sometimes not added to client pages when a sitelink is added on a repo
Open, High, Public

Description

https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/501444/ seems to have changed the behaviour of the links update job so that it reuses the parser cache when a page already has an entry.
As a result, when the wikibase repo (wikidata) has an edit that it dispatches to a client to update its pages, the ContentAlterParserOutput hook will sometimes not run, and that hook is currently what adds the wikibase_item page prop.
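For orientation, this is roughly the shape of the handler involved (a simplified sketch, not the actual Wikibase client code; the sitelink lookup helper here is made up):

// Simplified sketch only - the real handler lives in Wikibase client.
$wgHooks['ContentAlterParserOutput'][] = function (
	Content $content,
	Title $title,
	ParserOutput $parserOutput
) {
	// Hypothetical helper: find the item the client page is sitelinked to.
	$itemId = lookupSitelinkedItemId( $title );
	if ( $itemId !== null ) {
		// This page prop is what LinksUpdate later writes to page_props,
		// so if the hook does not run, the row never appears.
		$parserOutput->setProperty( 'wikibase_item', $itemId );
	}
};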

After a little more digging, it looks like the behaviour may actually have changed at the end of 2018 in https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/465157/
At least in that patch, refreshlinks requests the parser output from the RevisionRenderer with generate-html set to the result of shouldCheckParserCache, which I believe would be false for a client page having a links update triggered by wikidata dispatching.
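Roughly, the code path in question looks like this (paraphrased, not the exact core lines; $shouldCheckParserCache stands in for whatever the job computes):

$renderer = MediaWiki\MediaWikiServices::getInstance()->getRevisionRenderer();
$renderedRevision = $renderer->getRenderedRevision( $revision, $parserOptions );
$output = $renderedRevision->getRevisionParserOutput( [
	// If this hint is false, HTML generation can be skipped for the output,
	// which is the behaviour suspected above.
	'generate-html' => $shouldCheckParserCache,
] );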

This does not happen to pages that get updated but do not have a parser cache entry.
All affected pages can be fixed with a null edit.
I also assume that when a parser cache entry expires, the new entry will update the page props correctly (TODO: verify / confirm?)

Possible solutions

  • Change where this update (and potentially others) happens.
  • Allow the wikibase update process to ignore the parser cache in the job.
  • Something else?

Bug report

Title: page_props missing links for some Commons category <-> Wikidata sitelinks

Some Commons categories are linked to Wikidata, but are missing from the page_props table in the database. Examples:

https://commons.wikimedia.org/wiki/Category:Broadway_East,_Baltimore
https://commons.wikimedia.org/wiki/Category:Buddhist_temples_in_Lamphun_Province
https://commons.wikimedia.org/wiki/Category:Buddhist_temples_in_Ubon_Ratchathani_Province
https://commons.wikimedia.org/wiki/Category:Civil_law_notaries
https://commons.wikimedia.org/wiki/Category:Climate_change_conferences
https://commons.wikimedia.org/wiki/Category:Former_components_of_the_Dow_Jones_Industrial_Average
https://commons.wikimedia.org/wiki/Category:Dukes_of_the_Archipelago
https://commons.wikimedia.org/wiki/Category:Eastern_Catholic_orders_and_societies
https://commons.wikimedia.org/wiki/Category:English_people_of_Turkish_descent

Lucas confirmed this on Twitter https://twitter.com/LucasWerkmeistr/status/1175747434208727040 and there's discussion at https://www.wikidata.org/wiki/User_talk:Jheald#Quarry_oddity . It's not clear why this is happening - the examples I've given are from categories I've found when looking through commons links from enwp (in alphabetical order, hence B-E).

Event Timeline

Restricted Application added a subscriber: Aklapper. · Sep 22 2019, 1:07 PM
Jheald renamed this task from page_props missing for some Commons categories to page_props missing links for some Commons category <-> Wikidata sitelinks. · Sep 27 2019, 12:04 PM

From the tweet:

mysql:research@dbstore1004.eqiad.wmnet [commonswiki]> SELECT *
    -> FROM page_props
    -> WHERE pp_page = (
    ->   SELECT page_id
    ->   FROM page
    ->   WHERE page_title = 'Broadway_East,_Baltimore'
    ->   AND page_namespace = 14
    -> );

Empty set (0.00 sec)

mysql:research@dbstore1004.eqiad.wmnet [commonswiki]>
mysql:research@dbstore1004.eqiad.wmnet [commonswiki]> -- sanity check
mysql:research@dbstore1004.eqiad.wmnet [commonswiki]> SELECT page_id
    -> FROM page
    -> WHERE page_title = 'Broadway_East,_Baltimore'
    -> AND page_namespace = 14;
+----------+
| page_id  |
+----------+
| 62358461 |
+----------+
1 row in set (0.00 sec)

Performing null edits on the pages will fix them.

If null edits work, and this isn't a problem where new cases will be created in the future, then I could run a bot through them to make null edits - is there a list of all affected pages?
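For a bot run, a null edit is just a re-save of the unchanged content; a minimal sketch in MediaWiki PHP, assuming the pre-1.36 WikiPage::factory entry point (newer code would use PageUpdater):

$title = Title::newFromText( 'Category:Broadway_East,_Baltimore' );
$page = WikiPage::factory( $title );
$content = $page->getContent();
// Re-saving identical content creates no new revision, but it re-runs the
// secondary data updates, which repopulates page_props.
$page->doEditContent( $content, 'Null edit to repopulate page_props', EDIT_UPDATE );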

I have fixed all of the examples in the description.
I'll do a little bit more digging into this.

Addshore claimed this task. · Jan 15 2020, 9:56 AM
Addshore moved this task from incoming to needs discussion or investigation on the Wikidata board.

and this isn't a problem where new cases will be created in the future

That's what I want to do a little more investigation on, and also to see if I can come up with a query to find other cases that we haven't found yet.

Restricted Application added a project: User-Addshore. · Jan 15 2020, 9:56 AM
Addshore triaged this task as Low priority. · Jan 15 2020, 9:56 AM
Addshore raised the priority of this task from Low to High. · Jan 15 2020, 10:49 AM
Addshore moved this task from Incoming to Needs Work on the Wikidata-Campsite board.

Looks like this might be a regression for Wikidata after https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/501444/ was done.

That commit states:

  • Make the logic to use cached output actually work. This was also broken since DerivedPageDataUpdater was added. In order to pass the output, add a known-revision-output parameter to both WikiPage::doSecondaryUpdates() and DerivedPageDataUpdater::prepareUpdate().

Which probably means that while edits happen on Wikidata, the parser output on clients is not updated when an entry in the parser cache exists for the page, because the `ContentAlterParserOutput` hook will not run.
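Roughly what that parameter enables (parameter name taken from the commit message quoted above; exact method names and signatures may differ between core versions):

// Reuse an already-available ParserOutput (e.g. one pulled from the parser
// cache) instead of re-rendering the revision, so no new parse - and hence
// no ContentAlterParserOutput run - happens for the update.
$page->doSecondaryDataUpdates( [
	'known-revision-output' => $cachedParserOutput,
] );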

Addshore renamed this task from page_props missing links for some Commons category <-> Wikidata sitelinks to page_props wikibase_item is sometimes not added to client pages when a sitelink is added on a repo. · Jan 15 2020, 11:01 AM
Addshore updated the task description.
Addshore updated the task description.
Addshore added subscribers: aaron, tstarling, daniel, Krinkle.
daniel added a comment (edited). · Jan 15 2020, 11:35 AM

I haven't done a full analysis, but I'm wondering why this is using the ContentAlterParserOutput hook, instead of RevisionDataUpdates? Conceptually, ContentAlterParserOutput is a filter-style hook that shouldn't have side effects. RevisionDataUpdates is specifically designed for recording information into the database after an edit or re-parse.

EDIT: I think I got this wrong. RevisionDataUpdates is for transferring information from ParserOutput into the database. But this issue is about the opposite - transferring information (namely, the linked item) from the database into the ParserOutput, right?
ContentAlterParserOutput sounds right for that. I'll have to look again.
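For reference, the two hooks being contrasted (signatures paraphrased from core; see docs/hooks.txt for the authoritative form):

// Runs while the ParserOutput is being built, so it can add data to it:
$wgHooks['ContentAlterParserOutput'][] = function ( Content $content, Title $title, ParserOutput $po ) {
	// e.g. $po->setProperty( 'wikibase_item', ... );
};

// Runs when deferred updates are collected after an edit or re-parse, so it
// can write data derived from the ParserOutput back to the database:
$wgHooks['RevisionDataUpdates'][] = function ( Title $title, $renderedRevision, array &$updates ) {
	// e.g. $updates[] = new SomeDeferrableUpdate( ... ); // hypothetical update class
};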

Addshore updated the task description. · Jan 15 2020, 12:14 PM
Addshore updated the task description. · Jan 15 2020, 12:19 PM

So, after a little more investigation and reproduction on test wikipedia and test wikidata (a small verification sketch follows the list):

  • Create a page on wikipedia that includes some data from a statement on a wikidata item
  • The wikipedia page will be parsed and rendered with the data from wikidata
  • Add a sitelink from the item to the page
  • The page_props will not be updated for the page
  • Alter the statement that is being included on the page by editing wikidata
  • Keep refreshing the page on the wikipedia client; the data included on the page will be updated
  • page_props will still not be updated
  • Perform a null edit on the page
  • page_props are finally updated.
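A small verification sketch for the steps above (assuming an eval.php / maintenance context; the page name is a placeholder):

$title = Title::newFromText( 'Some client page' );
$props = PageProps::getInstance()->getProperties( [ $title ], 'wikibase_item' );
// An empty array here, while the sitelink exists on the repo, is the bug;
// after the null edit the page ID maps to the item ID as expected.
var_dump( $props );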

Adding core platform team to get some feedback and figure out:

  • When exactly did this change in behaviour happen?
    • I have looked through various gerrit changes and code paths, but I'm sure someone that has worked on this code would be able to figure out exactly when this happened.
    • It's slightly confusing, as some stages look like a partial implementation in some places that may not have worked as expected.
  • Does it make sense currently that our property parser function is updated on the page, but the stuff in the ContentAlterParserOutput is not done?
  • Is RevisionDataUpdates the right way to go for the page_props update?

Thank you, @Addshore, for investigating this issue!

daniel added a subscriber: Anomie. · Jan 16 2020, 9:39 AM

Adding core platform team to get some feedback and figure out:

  • When exactly did this change in behaviour happen?
    • I have looked through various gerrit changes and code paths, but I'm sure someone that has worked on this code would be able to figure out exactly when this happened.
    • It's slightly confusing, as some stages look like a partial implementation in some places that may not have worked as expected.

To be honest, I find this code rather confusing myself. Maybe @Anomie or @aaron can help.

  • Does it make sense currently that our property parser function is updated on the page, but the stuff in the ContentAlterParserOutput is not done?

That seems wrong. If a new ParserOutput is generated and cached, ContentAlterParserOutput should run. Since it is triggered by Content::getParserOutput, I can't imagine how the PO would get into the ParserCache without this hook running.

Maybe it does run, but doesn't do the deed?

  • Is RevisionDataUpdates the right way to go for the page_props update?

No. RevisionDataUpdates is the right place to pull things from ParserOutput and write it to the database. But for page_props, that's already done by the standard update. If the data is in ParserOutput (via setProperty), it should be recorded in the database.

*poke* @Anomie and @aaron to take a look at this.

I’ve just read a (ca. three month old) report mentioning that Yair rand’s WikidataInfo.js reports, after connecting, that no Wikidata item was found. This script queries the Wikidata API directly, so the bug may be that the information gets lost in the Wikibase repo before the client has any chance to process it. Is that possible?

Poke @CCicalese_WMF - this is a pretty large bug on Wikidata.org that is currently still waiting for feedback from the Core Platform Team.
I will make sure to add this to SOS (scrum of scrums) again this week.

Thanks for the reminder. I will check on this.

Addshore removed Addshore as the assignee of this task. · Mon, Jan 27, 3:46 PM
Addshore moved this task from Needs Work to Needs Tech Work on the Wikidata-Campsite board.

@Addshore Investigating this requires a dev setup with Wikibase. I don't have one handy... I think it would be best if we could spend an hour or so investigating this together. Maybe via hangout next week? Send me an invite...

@Addshore Investigating this requires a dev setup with Wikibase. I don't have one handy... I think it would be best if we could spend an hour or so investigating this together. Maybe via hangout next week? Send me an invite...

Scheduled :)

So we had a call and determined that:

  • We really don't know how long this has been broken for
  • updating page_touched would probably trigger a reparse of the page (and currently that doesn't happen for the case described)

Other notes:

  • Should check that AffectedPagesFinder actually lists the page when adding or removing a sitelink.
  • Check that the compact diff used for change propagation actually has the sitelink change in it (it probably does)
  • Virtual usages come into play here.
  • updating page_touched would probably trigger a reparse of the page (and currently that doesn't happen for the case described)

In practical terms, WikiPageUpdater should probably just call WikiPage::doPurge(), instead of trying to duplicate the functionality.
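A minimal sketch of that suggestion (assuming WikiPageUpdater has a Title for the client page at hand):

$page = WikiPage::factory( $title );
// doPurge() drops the parser cache entry and bumps page_touched (via
// Title::invalidateCache), so the next parse of the page is a fresh one
// and the hook-derived page props get rewritten.
$page->doPurge();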

  • Virtual usages come into play here.

Being connected via sitelink is the original "virtual usage" considered by AffectedPagesFinder. We discussed also treating the description from wikidata as a virtual usage, since it's visible in the mobile view. Don't remember how that played out.

  • Virtual usages come into play here.

Being connected via sitelink is the original "virtual usage" considered by AffectedPagesFinder. We discussed also treating the description from wikidata as a virtual usage, since it's visible in the mobile view. Don't remember how that played out.

T191831: Track implicit use of non-overridden wikidata description is still open, at least.

I was able to reproduce this locally.
Let's have fun.

Restricted Application added a project: User-Ladsgroup. · Mon, Feb 17, 9:08 PM

Okay, I spent several hours debugging and trying to figure out what's going on and how I0cd13c424 broke it.

Spoiler alert: It's not logic in the hook or issues with virtual usages.

Let's break down the steps to reproduce it:

  • Create a page on wikipedia that includes some data from a statement on a wikidata item
    • This creates a ParserCache entry without the wikidata data (stuff that goes into page_props) because it's not yet connected to Wikidata
  • Add a sitelink from the item to the page
    • This happens on Wikidata; it triggers a row in the wb_changes table that later gets dispatched to the client wiki
    • The dispatch script sees a need to run refreshLinks on that page and thus enqueues a ChangeNotification job
    • The ChangeNotification job enqueues a refreshLinks job
    • The great refreshLinks job needs a ParserOutput object; with the change merged, it first tries to load the PO object from the ParserCache:
      • Oh there's one PC object already for when the page was made originally and my rootJobTimestamp + 10 seconds is still younger than the original PC
      • Look, the revision id hasn't changed since then, same as page_touched
      • Cool, let's just use the cached PO instead

As the result:

  • The page_props will not be updated for the page

The code at fault is this:

	/**
	 * Get the parser output from cache if it reflects the change that triggered this job
	 *
	 * @param ParserCache $parserCache
	 * @param WikiPage $page
	 * @param RevisionRecord $currentRevision
	 * @param StatsdDataFactoryInterface $stats
	 * @return ParserOutput|null
	 */
	private function getParserOutputFromCache(
		ParserCache $parserCache,
		WikiPage $page,
		RevisionRecord $currentRevision,
		StatsdDataFactoryInterface $stats
	) {
		$cachedOutput = null;
		// If page_touched changed after this root job, then it is likely that
		// any views of the pages already resulted in re-parses which are now in
		// cache. The cache can be reused to avoid expensive parsing in some cases.
		$rootTimestamp = $this->params['rootJobTimestamp'] ?? null;
		if ( $rootTimestamp !== null ) {
			$opportunistic = !empty( $this->params['isOpportunistic'] );
			if ( $opportunistic ) {
				// Neither clock skew nor DB snapshot/replica DB lag matter much for
				// such updates; focus on reusing the (often recently updated) cache
				$lagAwareTimestamp = $rootTimestamp;
			} else {
				// For transclusion updates, the template changes must be reflected
				$lagAwareTimestamp = wfTimestamp(
					TS_MW,
					wfTimestamp( TS_UNIX, $rootTimestamp ) + self::NORMAL_MAX_LAG
				);
			}

			if ( $page->getTouched() >= $rootTimestamp || $opportunistic ) {
				// Cache is suspected to be up-to-date so it's worth the I/O of checking.
				// As long as the cache rev ID matches the current rev ID and it reflects
				// the job's triggering change, then it is usable.
				$parserOptions = $page->makeParserOptions( 'canonical' );
				$output = $parserCache->getDirty( $page, $parserOptions );
				if (
					$output &&
					$output->getCacheRevisionId() == $currentRevision->getId() &&
					$output->getCacheTime() >= $lagAwareTimestamp
				) {
					$cachedOutput = $output;
				}
			}
		}

		if ( $cachedOutput ) {
			$stats->increment( 'refreshlinks.parser_cached' );
		} else {
			$stats->increment( 'refreshlinks.parser_uncached' );
		}

		return $cachedOutput;
	}

I was able to confirm that this patch causes it by making the cache function always return null, after which the records appeared in page_props.

Options:

  • Introduce an opposite of the isOpportunistic option and make the refreshlinks jobs enqueued by wikibase set it (a rough sketch of this follows the list).
    • Side note: isOpportunistic is a pretty vague name - what does an "opportunistic" job do? It's also completely undocumented, and it's a string flag, so very easy to typo.
  • Make the refreshlinks job run the hook when it gets the cached PO. You might need to move it to a public function and invoke it there. It's rather complicated; you need the base content object there.
  • Completely drop the whole logic of "let's grab the cached version first". The stats say it's used quite a lot, but I think most of that is indeed Wikibase jobs, which makes the first option and this option the same performance-wise.
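A very rough sketch of the first option (the forceParse name is invented here; it is not an existing RefreshLinksJob parameter):

// Where Wikibase schedules the job (conceptually):
$job = new RefreshLinksJob( $title, [
	'rootJobTimestamp' => wfTimestampNow(),
	'forceParse' => true, // hypothetical flag: never reuse the parser cache
] );
JobQueueGroup::singleton()->push( $job );

// And in RefreshLinksJob::getParserOutputFromCache(), bail out early:
if ( !empty( $this->params['forceParse'] ) ) {
	return null; // forces a fresh parse, so ContentAlterParserOutput runs
}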

Once again, the code base seems extremely convoluted and complex; I'm not sure whether this complexity is the only option given our complex infra or whether it can be improved.

cc @aaron @tstarling

Updating page touched would also work?

  • updating page_touched would probably trigger a reparse of the page (and currently that doesn't happen for the case described)

In practical terms, WikiPageUpdater should probably just call WikiPage::doPurge(), instead of trying to duplicate the functionality.

This would update page touched via Title::invalidateCache as far as I can see.

Updating page touched would also work?

  • updating page_touched would probably trigger a reparse of the page (and currently that doesn't happen for the case described)

In practical terms, WikiPageUpdater should probably just call WikiPage::doPurge(), instead of trying to duplicate the functionality.

This would update page touched via Title::invalidateCache as far as I can see.

This already happens (changing page_touched) through HTMLCacheUpdateJob, and that's one of the reasons the problem exists: the try_to_get_cached_parser_output method (the aforementioned method I pasted) checks whether page_touched is newer than the rootTimestamp, which by definition it is (the root job is the change notification one that spawns HTMLCacheUpdateJob).

Moving to waiting, waiting on core platform again now I guess