
In some cases, moving or deleting pages on a client wiki does not result in sitelink updates / removal on Wikidata.
Open, Needs Triage · Public

Description

Problem:
In some cases, moving or deleting pages on a client wiki does not result in sitelink updates / removal on Wikidata.

Root causes:
The most common causes for this seem to be the following:

Common:

  1. A user performs "move without leaving a redirect behind" on a redirect page on a client wiki. (The corresponding action on Wikidata might fail because redirects are not allowed as sitelinks.)
  2. A user performs "move without leaving a redirect behind" on a client wiki and the page is moved to another namespace.
  3. Editing of an Item is restricted (e.g. semi-blocked) for the user.
  4. Deletions (much smaller number, but maybe only because of interventions from "User:Hoo Bot")
  5. Other reasons: There seem to be plenty of cases that need a different explanation.

Rare:

  1. Page moves and deletions are not processed if you have an SUL and have never visited Wikidata.
  2. User is blocked on Wikidata but not on the client wiki.

(This section was mostly based on research by MisterSynergy.)

Examples:
The page at https://species.wikimedia.org/wiki/T.C._Narendran was deleted, but the sitelink on Q7672638 wasn't removed.

Solution:

  • Let's evaluate possible solutions for the common causes.

Notes:

  • Let's try to think of solutions without dummy users for now.
  • Ignoring editing restrictions (like semi-blocks or Item protections) when applying updates that originate from a client wiki is fair game.

BDD:
GIVEN
AND
WHEN
AND
THEN
AND

Acceptance criteria:

  • []

Open questions:

  • In what cases would it be better to remove the sitelink rather than update it (e.g. when a page is moved from main to user namespace in the client wiki)?
  • What are other explanations? (MisterSynergy: Maybe "Wikidata is read-only" at the moment of the edit can also lead to missed sitelink updates.)

Original:

  • When a page is moved on a client wiki, the sitelink should be updated on Wikidata.
  • When a page is deleted on a client wiki, the sitelink on Wikidata should be removed.

This generally happens except when the user is unknown to Wikidata (or blocked).
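The expected behaviour described above can be written down as a minimal sketch. It only models an Item's sitelinks as a dictionary; `PageEvent` and `apply_page_event` are hypothetical names for illustration and not part of Wikibase.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class PageEvent:
    """Hypothetical description of a change on a client wiki."""
    site: str                        # client wiki ID, e.g. "cawiki"
    kind: str                        # "move" or "delete"
    old_title: str
    new_title: Optional[str] = None  # only set for moves


def apply_page_event(sitelinks: Dict[str, str], event: PageEvent) -> Dict[str, str]:
    """Return a copy of an Item's sitelinks with the event applied.

    `sitelinks` maps a site ID to the linked page title.
    """
    updated = dict(sitelinks)
    # Only act if the Item actually links to the affected page.
    if updated.get(event.site) != event.old_title:
        return updated
    if event.kind == "move" and event.new_title is not None:
        # A moved page keeps its sitelink, pointing at the new title.
        updated[event.site] = event.new_title
    elif event.kind == "delete":
        # A deleted page loses its sitelink.
        del updated[event.site]
    return updated
```

The cases discussed in this task are the ones where this mapping is not carried out on Wikidata, for example because the edit cannot be attributed to the user or the move target is not a valid sitelink.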

The question is how these edits should appear on Wikidata.

Forum discussion: https://www.wikidata.org/w/index.php?title=Wikidata:Contact_the_development_team&oldid=367197823#Deletion_on_WikiSpecies_not_on_Wikidata.3F

Additional discussions:

Event Timeline

The problem here is that accounts sometimes either don't exist on Wikidata or are not allowed to edit Items (or sometimes maybe even both).

Possible solutions:

  • Ignore editing restrictions (like blocks or Item protections)
  • Create accounts on Wikidata when needed (this could become a little ugly given that we would need to use low-level CentralAuth functionality for this)
  • Use a dummy user to do the edits if, for one reason or another, they can't be attributed to the actual user.

A dummy user seems okay as a short-term fix and could probably be worked on regardless of a long-term resolution.

The dummy user solution sounds good to me. Magnus Manske is doing something like this with his QuickStatementsBot, so maybe a special-purpose bot account on Wikidata for this?

I'd say let's go for the dummy user and somehow record the user who initiated it (in a log message?).
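The dummy-user idea with a recorded initiator could look roughly like the sketch below. Everything in it is an assumption for illustration: the account name `FALLBACK_USER`, the `RepoAccount` fields, and the summary wording are hypothetical and do not correspond to existing Wikibase or CentralAuth code.

```python
from dataclasses import dataclass
from typing import Tuple

FALLBACK_USER = "Sitelink update bot"  # hypothetical dummy account name


@dataclass
class RepoAccount:
    """Hypothetical view of the initiating user's status on Wikidata."""
    name: str
    exists_on_repo: bool   # does the account exist on Wikidata?
    can_edit_item: bool    # not blocked, and the Item is not protected for them


def choose_attribution(account: RepoAccount, action: str) -> Tuple[str, str]:
    """Return (acting user, edit summary) for a sitelink change."""
    if account.exists_on_repo and account.can_edit_item:
        # Normal case: attribute the edit to the user who moved/deleted the page.
        return account.name, action
    # Fall back to the dummy user but keep the initiator in the summary.
    return FALLBACK_USER, f"{action} (initiated by {account.name} on the client wiki)"
```

For example, `choose_attribution(RepoAccount("Bob", False, False), "Page deleted")` returns the dummy account together with a summary that still names Bob.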

Please bear in mind when fixing this that there are three Items (Q5268366, Q16503 and Q4026300) that we have fully protected because we deliberately don't want page moves to be reflected in Wikidata. Those are discussion pages that some projects archive by moving them.

I personally think ignoring semi-protection when moving pages on other wikis would make a lot of sense; it's one of the reasons why we avoid protecting pages (see T189412).

Are there plans to address this in 2019? Is any feedback needed?

I think we need some sort of consensus on what is actually wanted, yeah.

I think all the exceptions that @Nikki mentioned can be covered with abuse filters regardless of the implementation solution you choose. For the rest, the important thing is to have the problem solved (have a sitelink updated whenever the linked page is moved and have a sitelink removed whenever the linked page is deleted) and I would personally be okay with any of the options that @hoo mentioned in 2016, including the dummy user.

I'll have a closer look next week and try to figure out how to move forward.

I came across some of these cases and thought the situation could use some tidying, so I wrote a script which lists sitelinks to nonexistent client wiki pages in order to process them. Some patterns that I noticed after looking closely at dewiki, ptwiki, and cawiki:

  • There are several hundred such cases for each of these three wikis; cawiki is even clearly above 2000 cases. These sitelinks to nonexistent pages are still there in Wikidata, but I plan to remove them soon.
  • Both "User does not exist at Wikidata" and "User is blocked at Wikidata" are super rare scenarios that might not even be worth worrying about. In almost all cases, neither of these applies.
  • The clear majority of cases are the result of "move without leaving a redirect behind" actions on client wikis; deletions make up a much smaller number.
  • Two patterns that happen surprisingly often:
    • A user performs "move without leaving a redirect behind" on a redirect page on a client wiki. I suppose the corresponding action on Wikidata fails because redirects are not allowed as sitelinks.
    • A user performs "move without leaving a redirect behind" on a client wiki and the page is moved to another namespace. Is it intentional that the corresponding action on Wikidata fails, or does it happen by accident? In some cases, it would indeed be better to remove the sitelink rather than updating it, e.g. when a page is moved from main to user namespace in the client wiki.

There are still plenty of cases which cannot be explained in any of these ways. Maybe "Wikidata is read-only" at the moment of the edit can also lead to missed sitelink updates.

Manuel renamed this task from "[feature request] remove sitelinks / update sitelinks on Wikidata when pages are deleted/moved on client wikis (all users)" to "In some cases, moving or deleting pages on a client wiki does not result in sitelink updates / removal on Wikidata." (Tue, Jun 21, 9:11 PM)
Manuel updated the task description.

Thank you so much for your research @MisterSynergy, this helped a lot! If by any chance you still have your list of cases (and analysis), please let me know.

@Manuel:

  • I got a bot task approved that allows me to tidy these sitelinks up regularly (i.e. remove them from the Item if the page does not exist on the client wiki). This can be considered a "dirty" workaround for the problem, and clearly not the best solution.
  • However, it has not been executed yet due to a lack of time for Wikidata on my side in recent months.
  • AFAIR, the main issue currently is that the evaluation workflow is fairly demanding in terms of memory usage. While drafting the code on PAWS with its 3 GB memory limit, I offloaded parts of the evaluation for larger wikis to my local machine, which has sufficient memory available. For a fully automated deployment on Toolforge, this is of course not possible. Toolforge may even apply stricter memory limits than PAWS.
  • Why does it need so much memory? My approach queries "all pages per client wiki" (from the client's page table) and "all sitelinks in Wikidata" (from Wikidata's wb_items_per_site table) into separate Pandas DataFrames and subsequently looks for differences using Python. In other words: I avoid checking millions of cases individually by sitelink, and use a pretty quick per-client-wiki approach instead that requires me to hold all information for a given client wiki in memory.
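The per-client-wiki comparison described above is essentially an anti-join: keep every sitelink whose title has no matching page on the client wiki. The following is a minimal sketch of that idea, not the actual bot code. It uses plain Python sets as a stand-in for the Pandas DataFrames, and it assumes that both inputs have already been fetched and carry full page titles (including any namespace prefix) that differ at most in underscores versus spaces. The function names and input shapes are hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def normalize(title: str) -> str:
    """Use spaces instead of underscores so both sources compare equal."""
    return title.replace("_", " ")


def find_dangling_sitelinks(
    client_pages: Iterable[str],
    sitelinks: Iterable[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Return (item_id, title) pairs whose page is missing on the client wiki.

    client_pages: full titles of all pages on one client wiki.
    sitelinks: (item_id, title) pairs pointing to that same wiki.
    """
    # All titles of the client wiki are held in memory at once.
    existing: Set[str] = {normalize(t) for t in client_pages}
    # Sitelinks are checked one by one against that set.
    return [(item, title) for item, title in sitelinks
            if normalize(title) not in existing]
```

In this variant only the set of client wiki titles has to fit in memory; the sitelinks can be passed as any iterable and are checked one at a time, so only the mismatches are collected.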

So, the code itself is pretty much ready to roll, but I need to find a place to run it in a fully automated way. If you are interested, I can try to generate an updated list of cases for further inspection, but it would be helpful to really understand your needs. Do you want to evaluate this further?