Wikidata changes do not get sent to client sites on initial sitelink addition (in some cases), leading to things such as missing page props in page_props table
Closed, Resolved (Public)

Description

https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/501444/ seems to have changed the behaviour of the links update job to use the parser cache when a page has an entry.
As a result, when a Wikibase repo (Wikidata) dispatches an edit to a client so that the client updates its pages, ContentAlterParserOutput sometimes does not run, and that hook is currently what adds the wikibase_item page prop.

After a little more digging, it looks like perhaps the behaviour actually changed at the end of 2018 in https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/465157/
At least in this patch it looks as though refreshLinks requests the parser output from RevisionRenderer with generate-html set to the result of shouldCheckParserCache, which I believe would be false for a client page having a links update triggered by Wikidata dispatching.

This will not happen to pages that get updated that do not have a parser cache entry.
All pages can be fixed with a null edit.
I also assume when a parser cache entry expires the new entry will update the page props correctly (TODO verify / confirm?)

Possible solutions

  • Change where this (and potentially other) updates happen.
  • Allow the wikibase update process to ignore the parser cache in the job
  • Something else?

Bug report

Title: page_props missing links for some Commons category <-> Wikidata sitelinks

Some Commons categories are linked to Wikidata, but are missing from the page_props table in the database. Examples:

https://commons.wikimedia.org/wiki/Category:Broadway_East,_Baltimore
https://commons.wikimedia.org/wiki/Category:Buddhist_temples_in_Lamphun_Province
https://commons.wikimedia.org/wiki/Category:Buddhist_temples_in_Ubon_Ratchathani_Province
https://commons.wikimedia.org/wiki/Category:Civil_law_notaries
https://commons.wikimedia.org/wiki/Category:Climate_change_conferences
https://commons.wikimedia.org/wiki/Category:Former_components_of_the_Dow_Jones_Industrial_Average
https://commons.wikimedia.org/wiki/Category:Dukes_of_the_Archipelago
https://commons.wikimedia.org/wiki/Category:Eastern_Catholic_orders_and_societies
https://commons.wikimedia.org/wiki/Category:English_people_of_Turkish_descent

Lucas confirmed this on Twitter https://twitter.com/LucasWerkmeistr/status/1175747434208727040 and there's discussion at https://www.wikidata.org/wiki/User_talk:Jheald#Quarry_oddity . It's not clear why this is happening - the examples I've given are from categories I've found when looking through commons links from enwp (in alphabetical order, hence B-E).
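For anyone wanting to spot-check pages like the ones listed above without database access, here is a minimal sketch using the Action API's `prop=pageprops` query. The helper names and the sample response are made up for illustration; only the API parameters (`action=query`, `prop=pageprops`, `ppprop`, `formatversion=2`) are real. It only builds the URL and inspects an already-parsed response, so no request is sent.

```python
from urllib.parse import urlencode

API_URL = "https://commons.wikimedia.org/w/api.php"


def build_pageprops_query(titles):
    """Build an action=query URL that asks only for the wikibase_item page prop."""
    params = {
        "action": "query",
        "format": "json",
        "formatversion": "2",  # pages come back as a list rather than a dict
        "prop": "pageprops",
        "ppprop": "wikibase_item",
        "titles": "|".join(titles),
    }
    return API_URL + "?" + urlencode(params)


def titles_missing_wikibase_item(response):
    """Return titles of existing pages whose response has no wikibase_item prop."""
    missing = []
    for page in response.get("query", {}).get("pages", []):
        if page.get("missing"):
            continue  # the page itself does not exist, so skip it
        if "wikibase_item" not in page.get("pageprops", {}):
            missing.append(page["title"])
    return missing


# Hypothetical, hand-written response in formatversion=2 shape (not real data).
sample_response = {
    "query": {
        "pages": [
            {"pageid": 1, "ns": 14, "title": "Category:Example A"},
            {
                "pageid": 2,
                "ns": 14,
                "title": "Category:Example B",
                "pageprops": {"wikibase_item": "Q1"},
            },
        ]
    }
}

print(build_pageprops_query(["Category:Example A", "Category:Example B"]))
print(titles_missing_wikibase_item(sample_response))
```

A page that shows a Wikidata link in the sidebar but appears in the "missing" list would be a candidate for this bug.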

Acceptance criteria

  • page_props are always updated for client page when sitelink is added to the repository

Event Timeline

@Mike_Peel I'm wondering if you by chance have any links to recent examples again (ideally ones where no null edit has been done yet)

(in that second link, look at items from 'Category:Views from 10 Upper Bank Street (Q106685873)' and earlier.)

Change 686768 had a related patch set uploaded (by Addshore; author: Addshore):

[mediawiki/extensions/Wikibase@master] Fix: When a sitelink change happens, always dispatch to the site

https://gerrit.wikimedia.org/r/686768

• Addshore renamed this task from "page_props wikibase_item is sometimes not added to client pages when a sitelink is added on a repo" to "Wikidata changes do not get sent to client sites on initial sitelink addition (in some cases), leading to things such as missing page props in page_props table". May 7 2021, 9:27 PM
• Addshore added subscribers: jeblad, Nirmos.

The investigation in T280627 seems to have got to the bottom of a cause for this.
Now that a cause seems to have been identified, I think we can safely merge these 2 other tickets into here, as it seems this is not a Commons-only issue and it has been broken for ~10 months.

Change 686691 had a related patch set uploaded (by Addshore; author: Addshore):

[mediawiki/extensions/Wikibase@master] Revert "Remove rootJobTimestamp from refreshLinks jobs in client"

https://gerrit.wikimedia.org/r/686691

Change 686768 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] Fix: When a sitelink change happens, always dispatch to the site

https://gerrit.wikimedia.org/r/686768

• Addshore claimed this task.

I'm sure happy to say that this should no longer be happening :)
Thanks to all of those involved in investigating and figuring this out.
And thanks to @Mike_Peel for helping us stay on top of production cases of the issue!

Thanks for fixing it! From some initial checks, all looks good to me. Pi bot will next bulk-create items on the 1st June, so I'll be able to check then whether everything is fixed (if you don't hear me complaining around then, it's good news. ;-) )

@Addshore OK, this is odd now. Take for example https://commons.wikimedia.org/wiki/Category:IMO_9869162 - for which Pi bot created https://www.wikidata.org/wiki/Q107073476 on 2 June. On Commons you can see the Wikidata link, which you couldn't before. However, Pi bot is still not seeing this and adding the infobox, using the query in:
https://bitbucket.org/mikepeel/wikicode/src/master/query_infobox_candidates.py
which I run nightly on toolforge.

So it seems that the situation has *improved* but is not yet *fixed*. Unless I'm doing something wrong in my query?

Change 721083 had a related patch set uploaded (by Addshore; author: Addshore):

[mediawiki/extensions/Wikibase@REL1_36] Fix: When a sitelink change happens, always dispatch to the site

https://gerrit.wikimedia.org/r/721083

Change 721083 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@REL1_36] Fix: When a sitelink change happens, always dispatch to the site

https://gerrit.wikimedia.org/r/721083

Hi everyone! Does T295219 mean that we didn't solve the problem after all?

It definitely doesn't work. If we look at https://sv.wikipedia.org/w/index.php?title=Special:UnconnectedPages&limit=500&offset=0&namespace=0 we see 278 pages but most of them are already connected to Wikidata. Some of the connections were made a month ago: https://www.wikidata.org/w/index.php?diff=1516495246

Thank you @Nirmos for confirming this! We'll look into it.

Did something change in the last few days? There was a huge number of backlogged Wikidata-linked Commons categories that seem to now be correct in the database, so the Infobox is being added to them today - which is really great!

Hello, from my point of view the problem still exists, see

https://phabricator.wikimedia.org/T298288

Also, a lot of Commons categories that were connected weeks ago still do not have infoboxes; only a few of them got an infobox from Pi bot.

Pi bot is still running through Commons categories - normally it starts and finishes in the early morning, staying running this late in the day is unusual. (I've also modified it this evening to go through date categories, which is how I spotted this). Maybe still a backlog, but at least part of it seems to have been sorted out now.

FWIW, I looked at how dispatching works; we changed its infrastructure as part of a project before I departed WMDE. We did it together with Michael, who also knows a lot about this topic. I have some ideas on how to improve this, which I wrote up in detail in T297238#7556375. Each of the ideas can be resourced and picked up by the team if they think it's doable. I'd be more than happy to talk about any of them.

Michael added a subscriber: Michael.

I'll have a look. Our monitoring didn't pick up anything unusual, so we're missing something there at the very least.

Change 751971 had a related patch set uploaded (by Michael Große; author: Michael Große):

[mediawiki/extensions/Wikibase@master] Fix not fully dispatching Changes that added a sitelink

https://gerrit.wikimedia.org/r/751971

Change 751971 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] Fix not fully dispatching Changes that added a sitelink

https://gerrit.wikimedia.org/r/751971

This patch should roll out with the train deployment next week. Due to dealing with jobs, it is very hard to test on a maintenance host, so it isn't really suitable for a backport.

This should be fully fixed, and the fix has been deployed since about January 20th. Old pages might still need a manual purge with forceLinksUpdate, though.

Please let me know if you still see issues.
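For the manual purge mentioned above, a minimal sketch of building the request with the Action API's `action=purge` and its `forcelinkupdate` parameter. The helper name and User-Agent string are made up for illustration, and the request is only constructed here, not sent; purge requests need to be POSTed, and any authentication or batching is left out.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_purge_request(api_url, titles):
    """Build (but do not send) a POST request that purges pages and forces a links update."""
    params = {
        "action": "purge",
        "format": "json",
        "forcelinkupdate": "1",  # re-run the links update, which rewrites page_props
        "titles": "|".join(titles),
    }
    return Request(
        api_url,
        data=urlencode(params).encode("utf-8"),
        method="POST",
        # Hypothetical User-Agent; replace with one that identifies your tool.
        headers={"User-Agent": "page-props-repair-sketch/0.1 (hypothetical)"},
    )


request = build_purge_request(
    "https://commons.wikimedia.org/w/api.php",
    ["Category:Civil law notaries"],
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Sending it would be a matter of passing `request` to `urllib.request.urlopen`; that step is omitted so the sketch has no side effects.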

@Michael I am running Wikibase REL1_38 (from source, commit 33c2c9226c) and am still seeing what I believe is this issue.

I think the subtle difference in my case is, from what I can tell, the fix you merged makes an adjustment to the DispatchChanges job, so that it will push the changes to new sitelinks. But that requires there to be a DispatchChanges job, which requires something to change somewhere. My rate of change is very low, so I do not always have triggers for the DispatchChanges job.

I think the reason this worked before (REL1_37) for me is that the old dispatchChanges.php maintenance script essentially forced a DispatchChanges job every time it ran. Under the new system, if there are no changes made, there is no DispatchChanges job created.

Again, the above is if my understanding is correct, which it may not be. I'm still digging into everything.

For more information, I think I'm encountering a race condition between making changes and adding the subscription. If the recent change is processed before the refreshLinks job runs, then there is no new recent change processing to trigger dispatch when the refreshLinks job creates the sitelink/subscription.

Consider the case where the subscribing page is created first. I.e.

  1. Create a new property (let's call it P1)
  2. Create a new Article (let's call it A)
  3. Add the following wikitext to the Article:
"Connected to the item: {{#statements:P1}}"
  4. Add a sitelink to the page (creating a new Item, let's call it Q1)
  5. Add a statement for property P1 to the new Item.

If the DispatchChanges for (5) runs before the refreshLinks job that will create the subscription from A to Q1 (I'm unclear on whether that job is scheduled by (3) or (5)), then it is unlikely that there will be a RecentChange happening after the refreshLinks job to dispatch changes to A, and A will continue to say "Connected to the item: " without having the value from P1 on Q1 added.
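The ordering problem described above can be illustrated with a toy model. This is not Wikibase code: the event names and the `replay` function are invented, and the model assumes (as the scenario above does) that dispatch only reaches pages that are already subscribed, that a change is dispatched once, and that refreshLinks records the subscription without re-rendering the page.

```python
def replay(events):
    """Replay events in order and return what article A ends up rendering for P1."""
    subscribed = False      # has refreshLinks recorded A's subscription to Q1?
    item_value = None       # current P1 value on Q1
    pending_change = False  # is there an undispatched change on Q1?
    rendered = None         # the P1 value A currently shows (None means empty)

    for event in events:
        if event == "add_statement":
            item_value = "value"
            pending_change = True
        elif event == "dispatch":
            # Dispatch only reaches subscribed pages; the change is consumed either way.
            if pending_change and subscribed:
                rendered = item_value
            pending_change = False
        elif event == "refresh_links":
            # Assumption: this only records the subscription, nothing is re-rendered.
            subscribed = True
        else:
            raise ValueError(f"unknown event: {event}")
    return rendered


# Dispatch wins the race: A never receives the value.
print(replay(["add_statement", "dispatch", "refresh_links"]))
# Subscription exists before dispatch: A is updated.
print(replay(["refresh_links", "add_statement", "dispatch"]))
```

In this model, the first ordering leaves A stale until some later change to Q1 triggers another dispatch, which matches the "no further RecentChange" situation described above.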

CtrlZvi reopened this task as Open. Sep 23 2022, 4:09 PM

Reopening because after digging further, I'm more convinced that my above analysis is correct.
tl;dr: There is a race condition between change dispatch and subscription creation that used to be resolvable via manual dispatch running in a loop but no longer has a resolution.

I have managed to mitigate this in my use case by pulling the refreshLinks job out of normal job queue processing and running it much more often. But that only makes the race condition window smaller; it does not eliminate it.

Upon even more digging, my issue was similar, but actually subtly different. I've opened a separate issue (T318501) for it.