
Wikidata item ID changes caused by merges do not update entities on Structured data on Commons
Open, High · Public

Description

Merging two items on Wikidata results in one item redirecting to the other. There is a process on Wikidata that updates statements in other items linking to redirected items, replacing them with the target item ID. The same process does not update entities in Structured Data on Commons. For example, I merged Q18689466 and Q30007380. File:Borovikovsky_portrait_of_Kurakine_A_1802.jpg has structured data with P6243 set to Q30007380. There should be some process to replace that item ID with Q18689466.
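Such a process would first need to resolve each item ID to its redirect target. As a minimal sketch (hypothetical helper names; the API shape is the standard MediaWiki `action=query&redirects` response):

```python
# Sketch: resolve a possibly-redirected Wikidata item ID using the standard
# MediaWiki API (action=query&redirects=1), which reports redirects as
# {"query": {"redirects": [{"from": ..., "to": ...}]}}.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://www.wikidata.org/w/api.php"

def redirect_map(api_response: dict) -> dict:
    """Map redirected titles to their targets from an action=query response."""
    return {r["from"]: r["to"]
            for r in api_response.get("query", {}).get("redirects", [])}

def resolve(item_id: str) -> str:
    """Return the redirect target of item_id, or item_id itself if not redirected."""
    params = urlencode({"action": "query", "titles": item_id,
                        "redirects": 1, "format": "json"})
    with urlopen(f"{API}?{params}") as resp:
        data = json.load(resp)
    return redirect_map(data).get(item_id, item_id)
```

For the example above, `resolve("Q30007380")` would return `Q18689466` once the merge has taken place.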

Event Timeline

Jarekt created this task. · Nov 11 2019, 3:18 AM
Restricted Application added a subscriber: Aklapper. · Nov 11 2019, 3:18 AM

On Wikidata, this updating is not done by the software itself but by KrBot, like here.

I did not realize that we have a volunteer-run bot to fix redirects. I thought that was a task of the internal Wikidata software, the way page renames are automatically updated by MediaWiki software (I assume?). Perhaps the Wikidata software should take over those tasks, which would help with a proper implementation of un-merge capabilities as discussed in T237262.

Tacsipacsi added a comment. (Edited) · Nov 22 2019, 9:50 PM

Links to renamed pages aren’t updated by the software either (which comes in handy when renaming is necessary due to an ambiguous title). That convenience does not apply when merging Wikibase items, of course.

Keegan added a subscriber: Keegan. · Dec 10 2019, 5:30 PM

It seems that a user-operated bot updates redirects on Wikidata. I proposed creating a similar bot on Commons, see Commons:Bots/Work_requests#update_redirected_wikidata_items_used_by_SDC, but there does not seem to be much response. Part of the issue is that at the moment there does not seem to be a way to even query for such redirects. I tried to write one, like this one using sdcquery.wmflabs.org, but could not get it to work. I tried to get some advice on creating such a query at Wikidata:Request_a_query#finding_redirected_wikidata_items_used_by_SDC but got no replies. So I guess we are blocked by the lack of a reliable querying system that can access both SDC and Wikidata.
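For reference, the kind of federated query that would be needed might look like the sketch below. This assumes WCQS-style prefixes for MediaInfo statements and that WDQS models item redirects as owl:sameAs triples; the property and prefixes are illustrative, not a tested working query:

```python
# Hypothetical federated SPARQL query (as a string constant) that a Commons
# bot could run against WCQS: find files whose P6243 value is an item that
# redirects elsewhere on Wikidata. Assumes wdt:/owl: prefixes as on WDQS
# and owl:sameAs triples for redirected items -- verify before use.
REDIRECTED_SDC_QUERY = """
SELECT ?file ?old ?new WHERE {
  ?file wdt:P6243 ?old .                    # SDC statement on a Commons file
  SERVICE <https://query.wikidata.org/sparql> {
    ?old owl:sameAs ?new .                  # ?old redirects to ?new on Wikidata
  }
}
"""
```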

Not very efficient, but querying the Wikidata web API for all redirects (e.g. https://www.wikidata.org/w/api.php?action=query&generator=allredirects&garfrom=Q) and running them through the SDC query service should do the trick.
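A sketch of that brute-force approach, using the API parameters from the URL above (whether the generator yields redirect sources or targets should be verified against the live API; the SDC-side check is left as a stub):

```python
# Sketch: page through Q-namespace redirects on Wikidata using
# generator=allredirects, following standard API continuation, and collect
# titles batch by batch for a later SDC query-service check.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://www.wikidata.org/w/api.php"

def titles_from_batch(api_response: dict) -> list:
    """Extract page titles from one action=query generator response."""
    pages = api_response.get("query", {}).get("pages", {})
    return sorted(p["title"] for p in pages.values())

def all_redirect_batches(batch_limit: int = 500):
    """Yield lists of titles, one per API page, following 'continue' tokens."""
    params = {"action": "query", "generator": "allredirects",
              "garfrom": "Q", "garlimit": batch_limit, "format": "json"}
    while True:
        with urlopen(f"{API}?{urlencode(params)}") as resp:
            data = json.load(resp)
        yield titles_from_batch(data)
        if "continue" not in data:
            break
        params.update(data["continue"])
```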

Cparle added a subscriber: Cparle. · Dec 16 2019, 3:58 PM

Creating a user operated bot to reflect Wikidata redirects on Commons, similar to the way it's handled on Wikidata per the comment below, may be possible now with the WCQS. I wonder if @LucasWerkmeister has suggestions here?

> It seems that a user-operated bot updates redirects on Wikidata. I proposed creating a similar bot on Commons, see Commons:Bots/Work_requests#update_redirected_wikidata_items_used_by_SDC, but there does not seem to be much response. Part of the issue is that at the moment there does not seem to be a way to even query for such redirects. I tried to write one, like this one using sdcquery.wmflabs.org, but could not get it to work. I tried to get some advice on creating such a query at Wikidata:Request_a_query#finding_redirected_wikidata_items_used_by_SDC but got no replies. So I guess we are blocked by the lack of a reliable querying system that can access both SDC and Wikidata.

WCQS queries that have to look things up on Wikidata go through a very limited connection between WCQS and WDQS, which is so under-powered that it cannot run queries of that complexity. See T261716, where I was just advised that for queries relying on WDQS, "offline processing via the dumps might be a better option" because "we want to have strong limits on resource consumption". I personally have no idea how to use the Commons and Wikidata dumps to stand up a database to run a simple SPARQL query, let alone how to rebuild it weekly.

This seems to be a task better performed by the database itself.
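The dump-based route mentioned above could start with something like the following sketch: given one MediaInfo entity (as found in the Commons structured-data JSON dumps) and a map of known redirects, list the statements whose item value needs updating. The JSON shape assumed here mirrors Wikibase "claims" (with a "statements" key); verify against a real dump before relying on it:

```python
# Offline sketch: scan a MediaInfo entity's statements for item values that
# appear in a precomputed redirect map, and report what should change.
def stale_statements(mediainfo: dict, redirects: dict) -> list:
    """Return (property, old_id, new_id) tuples for statements pointing at redirects."""
    out = []
    for prop, statements in mediainfo.get("statements", {}).items():
        for st in statements:
            value = st.get("mainsnak", {}).get("datavalue", {}).get("value", {})
            old = value.get("id") if isinstance(value, dict) else None
            if old in redirects:
                out.append((prop, old, redirects[old]))
    return out
```

Run weekly over the full dump, this would produce the worklist a bot could then apply on-wiki.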

Note that KrBot only updates statements pointing to redirects after a certain time has passed (a week, I believe). This is by design: otherwise, it would be more difficult to “untangle” bad merges, since after undoing the merge on the item itself, you could not distinguish between statements that should now point back to the original item and statements that always pointed to the other (merge target) item.

This is less of a problem nowadays, since we have edit groups and KrBot assigns an edit group per processed merge (and you can undo the entire edit group); on the other hand, edit groups are a Wikidata-only tool (compare T203557), so on Commons you would still have this problem of being unable to untangle bad merges.

I have trouble imagining Wikibase automatically updating references to redirects after such a long delay, though.

Furthermore, from a technical perspective I doubt it’s even possible for Wikibase to update statements on Commons when items on Wikidata are merged. Within Wikidata, the software could discover all the affected pages through the pagelinks table, but on Commons I don’t know how Wikibase would even find the affected pages that need to be edited. (MediaInfo statements don’t seem to automatically record entity usage for entities used in statements; compare the page information for a Test Commons file with one statement.)
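One way to check the entity-usage claim above is the Wikibase Client API's `prop=wbentityusage`, which lists the Wikidata entities a page records usage for. A hedged sketch (helper names are mine; if entities used only in SDC statements are absent from this list, the usage-tracking route indeed cannot find the affected pages):

```python
# Sketch: ask the Commons API which Wikidata entities a page records usage
# for, via prop=wbentityusage (Wikibase Client). The response maps each page
# to a dict keyed by entity ID.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://commons.wikimedia.org/w/api.php"

def used_entities(api_response: dict) -> list:
    """Extract recorded entity-usage IDs from a prop=wbentityusage response."""
    pages = api_response.get("query", {}).get("pages", {})
    return sorted(eid for p in pages.values()
                  for eid in p.get("wbentityusage", {}))

def fetch_usage(title: str) -> list:
    """Fetch and extract entity usage for one Commons page title."""
    params = urlencode({"action": "query", "prop": "wbentityusage",
                        "titles": title, "format": "json"})
    with urlopen(f"{API}?{params}") as resp:
        return used_entities(json.load(resp))
```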

> Note that KrBot only updates statements pointing to redirects after a certain time has passed (a week, I believe). This is by design: otherwise, it would be more difficult to “untangle” bad merges, since after undoing the merge on the item itself, you could not distinguish between statements that should now point back to the original item and statements that always pointed to the other (merge target) item.

I think it is great that KrBot delays updating redirected items, as we do have a lot of bad merges by either vandals or incompetent editors, and unmerging items is a mess. I did propose creating a tool to make undoing merges as simple as performing them, see T237262. If merge and un-merge operations were built into the Wikibase system, then the system would be able to track all the changes made in the merge (possibly using an edit group) and could undo all of them. In that case there would be no need for the delay KrBot uses.

> Furthermore, from a technical perspective I doubt it’s even possible for Wikibase to update statements on Commons when items on Wikidata are merged. Within Wikidata, the software could discover all the affected pages through the pagelinks table, but on Commons I don’t know how Wikibase would even find the affected pages that need to be edited. (MediaInfo statements don’t seem to automatically record entity usage for entities used in statements; compare the page information for a Test Commons file with one statement.)

Purging pages on Commons affected by changes on Wikidata is an ongoing issue. See for example T173339, where pages on Commons are not purged when the Wikidata items they link to are updated. But you are right that files on Commons do not seem to keep track of entities they are connected to through SDC statements, the way they keep track of the same entities linked through Lua calls originating from the file description wikitext. For example, the File:Seneca_Rocks_climbing_-_13.jpg info page lists Seneca Rocks (Q7450337) in its "Wikidata entities used in this page" section because it is linked from an infobox template, but links to other depicted entities like Traditional climbing (Q2214812) are missing, as they only show up in the SDC tab. That is a problem.

Gehel triaged this task as High priority. · Sep 15 2020, 8:05 AM