Page MenuHomePhabricator

Changes to Structured Data on Commons should trigger page refresh
Open, Needs TriagePublic

Description

After T223792 allowed lua access to Structured data, so now we do have categories which compare data stored as Structured data and in wikitext, one of them is c:Category:Artworks_with_mismatching_structured_data_P6243_property. The system works fine, except that if information in Structured data is corrected, in such a way that the page should disappear from the category, that edit does not trigger the page refresh the way edits to wikitext of the same file does. Null edit to the page is needed to fix the issue.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

This is related to T173339, where changes to Wikidata should trigger page refresh of pages which rely on that data.

Change 602053 had a related patch set uploaded (by Matthias Mullie; owner: Matthias Mullie):
[mediawiki/core@master] Remove unwanted parse step

https://gerrit.wikimedia.org/r/602053

Change 602053 merged by jenkins-bot:
[mediawiki/core@master] Remove unwanted parse step

https://gerrit.wikimedia.org/r/602053

Can you confirm whether this has been fixed?

@Ramsey-WMF I just did bunch of SDC edits which should have removed a tracking category for files missing SDC statement. Afterwards the category should have been empty, but the number of files did not change. I run a quick touch operation (add and empty line to the end of the file) and afterwards the category is empty. So no it is not working for me.

The title of the task made me think it's about refreshing the edited page (ie. the page itself containing outdated/inconsistent data after the edit), but I guess it's actually about refreshing the category page (ie. recursive links update)? Those are different things. I think the patch should have fixed the first, I am unsure about the second.

Disregard that, I was confused. This has nothing to do with recursive updates and as far as I can see the patch should have helped.

There are 4 tickets about what is likely the same thing - 1 Wikidata, 3 Commons. Since we're looking track, I'll merge the other 2 Commons-specific tickets into this one.

Current status was already being tracked in 1 of these other tickets, but I'll recap here:

  • the patch did resolve the problem on my less complex local setup
  • it does not, however, seem to have much of an effect in production
  • I suspect that an extension subscribes to a pre-save hook (PageContentSave?) and triggers a premature render (similar to what core used to do)
  • I don't know what extension
  • I don't know what layer ends up being cached (I suspect one of Wikibase's lookups)
  • even if we know what's caching it, it probably isn't possible to circumvent it in this particular instance

So yeah, not currently fixed, will need significant additional debugging...

Trying to get a view of this ticket and the issue described in the description without much context and I'm finding it hard to follow.
Would someone be able to write out some more fleshed out steps? not just the order but also the content that is changing where, and what exactly is the expected outcome of this?

I try to highlight the bits I'm not following below:

    • categories which compare data stored as Structured data and in wikitext
      • What exactly does this mean? an example of how things are added to the category etc would be appreciated
  • if information in Structured data is corrected
    • Corrected how? what is the edit / change that happens?
  • in such a way that the page should disappear from the category
    • What is expected to cause the page to be removed from the category? I not the structured data edit directly? but some intermediate thing on a wikitext page somewhere? Is it the file description page of the media info entity?
  • does not trigger the page refresh the way edits to wikitext of the same file does
    • the page here being the file page? or the category page listing pages?
  • Null edit to the page is needed to fix the issue.
    • null edit of the category page? or a file page?

Example category: https://commons.wikimedia.org/wiki/Category:Artworks_with_mismatching_structured_data_P6243_property. This category is set by {{artwork}} when |wikidata= is set but it does not equal P6243 (digital representation of) in SDC.

I've set P6243 to the wrong value for https://commons.wikimedia.org/wiki/File:Self-portrait_by_P%C3%A2rvu_Mutu.jpg. The category appears on the file page and the file in the category, as expected.

image.png (678×781 px, 65 KB)
image.png (847×727 px, 325 KB)

When I change P6243 to the correct value, there is no visible change outside of the changed property, as the page does not refresh in the browser.

image.png (616×770 px, 65 KB)

If I press F5 and refresh the page, however, the category still appears on the file page. There's no change compared to the previous screenshot. Ctrl-F5 also has no effect. The file also still appears in the category.

I then used the UTCLiveClock gadget to purge the file page. This causes the following POST request to https://commons.wikimedia.org/w/api.php (form encoded data changed to JSON for readability):

Request data
{
    "action": "purge",
    "format": "json",
    "titles": "File:Self-portrait_by_Pârvu_Mutu.jpg"
}
Response data
{
	"batchcomplete": "",
	"purge": [
		{
			"ns": 6,
			"title": "File:Self-portrait by Pârvu Mutu.jpg",
			"purged": ""
		}
	],
	"normalized": [
		{
			"from": "File:Self-portrait_by_Pârvu_Mutu.jpg",
			"to": "File:Self-portrait by Pârvu Mutu.jpg"
		}
	]
}

After the purge, the file description page appears correct (the category is gone).

image.png (585×775 px, 61 KB)

However, the file still appears in the category on the category page. Purging the category page has no effect.

If I modify the purge to use forcelinkupdate

Request data
{
    "action": "purge",
    "format": "json",
    "titles": "File:Self-portrait_by_Pârvu_Mutu.jpg",
    "forcelinkupdate": true
}
Response data
{
	"batchcomplete": "",
	"purge": [
		{
			"ns": 6,
			"title": "File:Self-portrait by Pârvu Mutu.jpg",
			"purged": "",
			"linkupdate": ""
		}
	],
	"normalized": [
		{
			"from": "File:Self-portrait_by_Pârvu_Mutu.jpg",
			"to": "File:Self-portrait by Pârvu Mutu.jpg"
		}
	]
}

The file no longer appears in the category.

image.png (798×742 px, 270 KB)

Here's a simplified example:

Add {{Sandbox/Mmullie_(WMF)/CheckIfHasStatement}} to a file (e.g. https://commons.wikimedia.org/wiki/File:2019-09-16-tenerife-icod.jpg)
That will add a category ([[Category:No statements]]) if there are no statements (see Lua at https://commons.wikimedia.org/wiki/Module:Sandbox/Mmullie_(WMF)/my_module), or remove that category once there are.
Now try editing that file (removing all statements, adding new one) and you'll find that the categorylinks aren't immediately updated - not until the page is (null-)edited once more.

Trying to get a view of this ticket and the issue described in the description without much context and I'm finding it hard to follow.
Would someone be able to write out some more fleshed out steps? not just the order but also the content that is changing where, and what exactly is the expected outcome of this?

These are usually categories that get created via a Lua script evaluating a condition.
E.g. if a file has (or lacks) a certain statement, it'll add a certain category for tracking purposes.
When a property is added/removed, this prompts the script to re-evaluate and update categories accordingly.
What currently ends up happening is that these pages (where the Lua script ends up running and is supposed to update the categorylinks for) need another (null) edit to reflect the change - they're always one behind.

What happened is that the process of storing a new revision used to parse the old page once more right before the data actually ends up being stored.
Because the data is not yet in storage at that point, things are rendered under the pre-save conditions (with the *old* categorylinks)
The render itself is not necessarily a problem, but anything related to rendering the page might end up caching bits, which will be based off of outdated information.
Later on, post-save, categorylinks etc. are updated as well based off of (newly) rendered content, but parts of that content may have been filled in by incorrect data in cache, as a result of the earlier render.

There already was a patch to get rid of core's pre-save render. It resolved the problem on my local machine, but not on Commons.
There are pre-save hooks in this area as well, though, so I suspect that an extension hooks into the process and also triggers a render.
(I need to look into this in more detail, but I suspect PageContentSave is the one, and it's deprecated, so the remaining problem might eventually end up going away automatically)

To replicate this on a local setup, try subscribing to PageContentSave and call $mainContent->getParserOutput(); - that causes the same behavior we see on prod.

The only extension in https://www.mediawiki.org/wiki/Category:PageContentSave_extensions that is on WMF sites is TemplateData. The hook calls rETDA includes/TemplateDataHooks.php:46, which then intentionally parses the page early.

/**
 * @param WikiPage &$page
 * @param User &$user
 * @param Content &$content
 * @param string &$summary
 * @param bool $minor
 * @param bool|null $watchthis
 * @param string $sectionanchor
 * @param int &$flags
 * @param Status &$status
 * @return bool
 */
public static function onPageContentSave( WikiPage &$page, &$user, &$content, &$summary, $minor,
	$watchthis, $sectionanchor, &$flags, &$status
) {
	// The PageContentSave hook provides raw $text, but not $parser because at this stage
	// the page is not actually parsed yet. Which means we can't know whether self::render()
	// got a valid tag or not. Looking at $text directly is not a solution either as
	// it may not be in the current page (it can be transcluded).
	// Since there is no later hook that allows aborting the save and showing an error,
	// we will have to trigger the parser ourselves.
	// Fortunately this causes no overhead since the below (copied from WikiPage::doEditContent,
	// right after this hook is ran) has guards that lazy-init and return early if called again
	// later by the real WikiPage.

	// Specify format the same way the API and EditPage do to avoid extra parsing
	$format = $content->getContentHandler()->getDefaultFormat();
	$editInfo = $page->prepareContentForEdit( $content, null, $user, $format );

	$templateDataStatus = self::getStatusFromParserOutput( $editInfo->output );
	if ( $templateDataStatus instanceof Status && !$templateDataStatus->isOK() ) {
		// Abort edit, show error message from TemplateDataBlob::getStatus
		$status->merge( $templateDataStatus );
		return false;
	}
	return true;
}
  1. rETDA includes/TemplateDataHooks.php:75 in TemplateDataHooks::onPageContentSave
  2. rMW includes/edit/PreparedEdit.php:119
  3. rMW includes/edit/PreparedEdit.php:104 in PreparedEdit::getOutput
  4. rMW includes/Storage/DerivedPageDataUpdater.php:1337 in DerivedPageDataUpdater::getCanonicalParserOutput
  5. rMW includes/Revision/RenderedRevision.php:197 in RenderedRevision::getRevisionParserOutput
  6. rMW includes/Revision/RevisionRenderer.php:152 in RevisionRenderer::getRenderedRevision
  7. rMW includes/Revision/RevisionRenderer.php:230 in RevisionRenderer::combineSlotOutput
  8. rMW includes/Revision/RenderedRevision.php:235 in RenderedRevision::getSlotParserOutput
  9. return $content->getParserOutput(... at rMW includes/Revision/RenderedRevision.php:262 in RenderedRevision::getSlotParserOutputUncached

@matthiasmullie, in case it helps: most of out SDC tracking templates are based on c:Module:SDC_tracking and templates calling it. You probably recreated something similar for testing, but in case it is easier to test with already existing code, here it is.

Schlurcher subscribed.

Note that this behaviour has resulted in T280232 'Uncached wiki requests partially unavailable due to excessive request rates from a bot'

matmarex moved this task from Backlog to External on the Editing-team (Tracking) board.
matmarex subscribed.

We (Editing team) got tagged because of the TemplateData tag, but this doesn't look like a bug in TemplateData to me, so I'm moving this to external. Let us know if there's actually something we should do.

I have discovered this bug too, while testing SDC edits with OpenRefine. See this discussion on Multichill's talk page.

As I understand correctly, to circumvent this issue, we would need to introduce null edits into OpenRefine that trigger a page update.

Maintenance categories like these are very helpful to discover SDC batch editing related tasks, and I think it would be great if it can be fixed on the Commons side.

I also found this bug. and reported it as T313185