
Badges for sitelinks not getting updated in query service after a move
Open, Medium, Public, Bug Report

Description

There are currently five items (wd:Q108861497 wd:Q113867071 wd:Q6390796 wd:Q17129703 wd:Q112621799; query) which the Query Service reports as having featured-article badges on their enwiki sitelinks, although none of them actually has one. All five sitelinks *used* to point to an item with an FA badge, but have since been moved.

For example:

It appears that the query service kept thinking the "Eric Brewer" sitelink was associated with a badge, despite it having been removed and then recreated on a new item. Perhaps it is somehow not deleting the badge records when a sitelink is removed, and then the new sitelink inherits that record when created?

I can't work out how common this is - moving badged articles and then reusing the old title for a disambiguation page is a relatively unusual thing to do, so the problem may be rare overall even if it is common in this particular situation.

Event Timeline

I believe this problem is similar to what was reported here

My understanding of this problem is as follows:

The Wikibase RDF model for sitelinks uses the URL of the link as the subject:

<https://en.wikipedia.org/wiki/Eric_Brewer> a schema:Article ;
	schema:about wd:Q1342539 ;
	schema:inLanguage "en" ;
	schema:isPartOf <https://en.wikipedia.org/> ;
	schema:name "Eric Brewer"@en ;
	wikibase:badge wd:Q17437796 .

When the sitelinks of an entity are altered, only the link between the Wikidata entity and the sitelink subject is removed. In the example above that is only the triple <https://en.wikipedia.org/wiki/Eric_Brewer> schema:about wd:Q1342539, meaning that the following data:

<https://en.wikipedia.org/wiki/Eric_Brewer> a schema:Article ;
	schema:inLanguage "en" ;
	schema:isPartOf <https://en.wikipedia.org/> ;
	schema:name "Eric Brewer"@en ;
	wikibase:badge wd:Q17437796 .

will remain in Blazegraph as orphaned values. Orphaned values are a known problem discussed in T302189 (TL;DR: removing orphaned values in real time might be costly).

Here the additional problem is that this orphaned sitelink gets re-attached to another entity as soon as the same sitelink is used on another Wikidata entity.
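
The re-attachment can be reproduced with a toy model. This is only a sketch under stated assumptions, not the updater's actual code: the triple store is modelled as a Python set of (subject, predicate, object) tuples, the helper names `add_sitelink`, `remove_sitelink` and `badges_for` are hypothetical, and `wd:Q_NEW` is a placeholder for whichever item later receives the sitelink.

```python
# Toy model of the behaviour described above; not the real updater code.
SITELINK = "<https://en.wikipedia.org/wiki/Eric_Brewer>"
FA_BADGE = "wd:Q17437796"

def add_sitelink(store, url, item, badges=()):
    """Add a subset of the triples emitted for a sitelink."""
    store.add((url, "schema:about", item))
    store.add((url, "schema:isPartOf", "<https://en.wikipedia.org/>"))
    for badge in badges:
        store.add((url, "wikibase:badge", badge))

def remove_sitelink(store, url, item):
    """Assumed behaviour: only the schema:about link is removed."""
    store.discard((url, "schema:about", item))

def badges_for(store, item):
    """Badges reachable from an item through its sitelink subjects."""
    urls = {s for (s, p, o) in store if p == "schema:about" and o == item}
    return {o for (s, p, o) in store if s in urls and p == "wikibase:badge"}

store = set()
add_sitelink(store, SITELINK, "wd:Q1342539", badges=[FA_BADGE])
remove_sitelink(store, SITELINK, "wd:Q1342539")  # article moved away
add_sitelink(store, SITELINK, "wd:Q_NEW")        # placeholder item, no badge given
ghost = badges_for(store, "wd:Q_NEW")
print(ghost)
```

In this model the badge triple survives the removal because its subject is the URL rather than the item, so the placeholder item picks it up as soon as it links to the same URL.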

I could see three options here:

  1. Change the Wikibase RDF model so that orphaned sitelinks are less likely to be reused (use reification and never promote the sitelink to a subject, similar to what is done for complex values and references). This is a breaking change that is unlikely to be worthwhile.
  2. Consider that this problem is not common enough and accept it as a known limitation of the update process. Rely on more regular data reloads to fix these discrepancies.
  3. Attempt to clean up orphaned sitelinks in real time. This might not be entirely trivial but seems doable; the main concern is performance: what if the updater slows down too much?
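
As a rough illustration of option 3, here is a minimal sketch on a toy triple store (a Python set of tuples). It is an assumption about what such a cleanup could look like, not a description of how the updater works; `remove_sitelink_with_cleanup` is a hypothetical name.

```python
# Sketch of option 3 on a toy triple store; not the real updater code.
ABOUT, BADGE = "schema:about", "wikibase:badge"

def remove_sitelink_with_cleanup(store, url, item):
    """Remove the item link, then purge the subject if no item still uses it."""
    store.discard((url, ABOUT, item))
    still_linked = any(s == url and p == ABOUT for (s, p, o) in store)
    if not still_linked:
        # This extra lookup and delete on every sitelink change is the
        # potential performance cost of doing the cleanup in real time.
        orphaned = {t for t in store if t[0] == url}
        store.difference_update(orphaned)

url = "<https://en.wikipedia.org/wiki/Eric_Brewer>"
store = {
    (url, ABOUT, "wd:Q1342539"),
    (url, "schema:name", '"Eric Brewer"@en'),
    (url, BADGE, "wd:Q17437796"),
}
remove_sitelink_with_cleanup(store, url, "wd:Q1342539")
leftover = {t for t in store if t[0] == url}
print(leftover)
```

With the subject purged, a later sitelink to the same URL starts from a clean slate and cannot inherit the old badge.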

Let me just point to https://www.wikidata.org/w/index.php?title=Wikidata:Project_chat&oldid=1777735426#Why_are_these_items_getting_corrupted_like_this? which covers what might be a similar failure mode, and which you may wish to consider whilst contemplating the above.

For convenience, here's a cut & paste of the short thread:

START
I recently used my wikidata model (discussed above) to identify items that are misclassified. The most suspicious items are listed at User:BrokenSegue/PsychiqConflicts. But one really common pattern is the following:

  • James Lesslie (Q11775583) is linked to en:James Lesslie which is a disambiguation page
  • en:James Lesslie (publisher) is moved on top of en:James Lesslie and changed to be about a specific person
  • James Lesslie (Q6138007) is left orphaned to a redirect page.

Now I thought that when pages moved the reference gets changed on Wikidata. But this isn't happening? Is there something we can change? Can we ask Wikipedia users to be more careful? There are probably hundreds or thousands of errors like this. BrokenSegue (talk) 01:21, 23 November 2022 (UTC)

Probably worth raising at Wikidata:Report a technical problem. The pattern /seems/ to be, pagemove to a hitherto unused name -> WD sitelink is updated; pagemove over an existing page -> WD sitelink is not updated. If so: pagemove over is a routine event; why is it not being recognised / acted on? --Tagishsimon (talk) 02:26, 23 November 2022 (UTC)
END

@dcausse For the sample here (enwiki FAs) it affects ~0.1% over the course of a year, and presumably older ones are being swept up by the reloads. I think that's reasonable enough to file as "a rare problem" if the alternative would cause performance issues.

It might potentially be more of an issue in future as more sitelinks get noted as redirects - these are presumably a bit more likely to be moved around, deleted, replaced etc than the existing pool of badged articles, which are presumably generally quite stable. So might be worth keeping an eye on.

From the comments on the linked thread, I did a test using Q6390796, one of the items with a "ghost FA badge".

  • Adding a GA badge (ie a different badge) to the enwiki sitelink: the query service reports both badges.
  • Removing the GA badge again: it correctly removes that one, but retains the original ghost FA badge
  • Deleting the enwiki sitelink, then re-adding it: the ghost badge returns
  • Manually adding a FA badge: it shows up on search as expected
  • Removing the new FA badge again: it finally disappears

So I can confirm that "add the ghost badge then remove" is a functioning solution for individual items here; editing other badges or simply deleting/readding the sitelink won't do it.
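
These observations fit a model in which the updater only applies the difference between two revisions of an item, so a triple that appears in neither revision is never touched. The sketch below replays the test under that assumption. It is a simplified toy, not the real updater: the sitelink URL is hypothetical, `wd:GA_BADGE` is a placeholder rather than a real item ID, and `revision`, `apply_diff` and `visible_badges` are made-up helper names.

```python
URL = "<https://en.wikipedia.org/wiki/Example>"  # hypothetical sitelink URL
ITEM = "wd:Q6390796"
FA = "wd:Q17437796"
GA = "wd:GA_BADGE"                               # placeholder, not a real item ID

def revision(badges=(), linked=True):
    """Triples a revision of ITEM would emit for its enwiki sitelink (simplified)."""
    if not linked:
        return set()
    return {(URL, "schema:about", ITEM)} | {(URL, "wikibase:badge", b) for b in badges}

def apply_diff(store, old, new):
    """Assumed update rule: delete what left the revision, add what joined it."""
    store.difference_update(old - new)
    store.update(new - old)

def visible_badges(store):
    """Badges a query would report for ITEM's sitelink."""
    if (URL, "schema:about", ITEM) not in store:
        return set()
    return {o for (s, p, o) in store if s == URL and p == "wikibase:badge"}

current = revision()
store = current | {(URL, "wikibase:badge", FA)}  # ghost triple left by an earlier move
steps = [
    ([GA], True),   # add GA badge
    ([], True),     # remove GA badge
    ([], False),    # delete the sitelink
    ([], True),     # re-add the sitelink
    ([FA], True),   # manually add FA badge
    ([], True),     # remove FA badge
]
history = []
for badges, linked in steps:
    nxt = revision(badges, linked)
    apply_diff(store, current, nxt)
    current = nxt
    history.append(visible_badges(store))
print(history)
```

In this model the ghost triple only disappears at the last step, because that is the first time the FA badge is part of a revision diff.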

@Tagishsimon I *think* that is a subtly different problem, connected to the deletion not being done quickly enough for the moved sitelink to catch up?

There's an existing bug logged for it as T233435 (also there's T143486 & T143485 though those two seem specific to the case where the local user doesn't have a WD account)

MPhamWMF moved this task from Incoming to Tech Debt on the Wikidata-Query-Service board.