In some cases, moving or deleting pages on a client wiki does not result in sitelink updates / removal on Wikidata
Open, Needs Triage, Public

Description

Problem:
In some cases, moving or deleting pages on a client wiki does not result in sitelink updates / removal on Wikidata.

Root causes:
The most common causes for this seem to be the following:

Common:

  1. A user performs "move without leaving a redirect behind" on a redirect page on a client wiki. (The corresponding action on Wikidata might fail because redirects are not allowed as sitelinks.)
  2. A user performs "move without leaving a redirect behind" on a client wiki and the page is moved to another namespace.
  3. Editing of an Item is restricted (e.g. semi-blocked) for the user.
  4. Deletions (much smaller number, but maybe only because of interventions from "User:Hoo Bot")
  5. Other reasons: There seem to be plenty of cases that need a different explanation.

Rare:

  1. Page moves and deletions are not processed if you have an SUL and have never visited Wikidata.
  2. User is blocked on Wikidata but not on the client wiki.

(This section was mostly based on research by MisterSynergy.)

Examples:
The page at https://species.wikimedia.org/wiki/T.C._Narendran was deleted, but the sitelink at Q7672638 wasn't removed.

Solution:

  • Let's evaluate possible solutions for the common causes.

Notes:

  • Let's try to think of solutions without dummy users for now.
  • Ignoring editing restrictions (like semi-blocks or Item protections) when applying updates that originate from client wikis is fair game.

Open questions:

  • In what cases would it be better to remove the sitelink rather than update it (e.g. when a page is moved from main to user namespace in the client wiki)?
  • What are other explanations? (MisterSynergy: Maybe "Wikidata is read-only" at the moment of the edit can also lead to missed sitelink updates.)

Original:

  • When a page is moved on a client wiki, the sitelink should be updated on Wikidata.
  • When a page is deleted on a client wiki, the sitelink on Wikidata should be removed.

This generally happens except when the user is unknown to Wikidata (or blocked).

The question is how these edits should appear on Wikidata.

Forum discussion: https://www.wikidata.org/w/index.php?title=Wikidata:Contact_the_development_team&oldid=367197823#Deletion_on_WikiSpecies_not_on_Wikidata.3F

Event Timeline

The problem here is that accounts sometimes either don't exist on Wikidata or are not allowed to edit items (or sometimes maybe even both).

Possible solutions:

  • Ignore editing restrictions (like blocks or Item protections)
  • Create accounts on Wikidata when needed (this could become a little ugly given that we would need to use low-level CentralAuth functionality for this)
  • Use a dummy user to do the edits if, for one reason or another, they can't be attributed to the actual user.

A dummy user seems okay as a short-term fix and could probably be worked on regardless of a long-term resolution.

The dummy user solution sounds good to me. Magnus Manske is doing something like this with his QuickStatementsBot, so maybe a special-purpose bot account on Wikidata for this?

I'd say let's go for the dummy user and somehow record the user who initiated it (in a log message?).
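
The fallback discussed in the comments above could be sketched roughly as follows. This is only an illustration of the idea, not the Wikibase implementation: `DUMMY_USER`, `RepoAccount`, `choose_editor` and the summary wording are all hypothetical names made up for this sketch.

```python
from dataclasses import dataclass

# Hypothetical name for the special-purpose account; not an existing user.
DUMMY_USER = "Sitelink update bot"


@dataclass
class RepoAccount:
    """What the repo (Wikidata) knows about the client-side user (assumed model)."""
    exists: bool         # the user has a local account on Wikidata
    can_edit_item: bool  # not blocked, and the Item is not protected against them


def choose_editor(initiator: str, account: RepoAccount, action: str):
    """Return (acting user, edit summary) for a sitelink update."""
    if account.exists and account.can_edit_item:
        # Normal case: attribute the edit to the user who moved/deleted the page.
        return initiator, action
    # Fallback: edit as the dummy user and record the initiator in the summary.
    return DUMMY_USER, f"{action} (initiated by {initiator} on the client wiki)"


print(choose_editor("Alice", RepoAccount(True, True), "Page moved"))
print(choose_editor("Bob", RepoAccount(False, False), "Page deleted"))
```

Recording the initiator in the summary keeps the edit traceable even when it is not attributed to the actual user.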

Please bear in mind when fixing this that there are three items (Q5268366, Q16503 and Q4026300) that we have fully protected because we deliberately don't want page moves to be reflected in Wikidata. Those are discussion pages where some projects archive the page by moving the page.

I personally think ignoring semi-protection when moving pages on other wikis would make a lot of sense, it's one of the reasons why we avoid protecting pages (see T189412)

Are there plans to address this in 2019? Is any feedback needed?

I think we need some sort of consensus on what is actually wanted, yeah.

I think all the exceptions that @Nikki mentioned can be covered with abuse filters regardless of the implementation solution you choose. For the rest, the important thing is to have the problem solved (have a sitelink updated whenever the linked page is moved and have a sitelink removed whenever the linked page is deleted) and I would personally be okay with any of the options that @hoo mentioned in 2016, including the dummy user.

I'll have a closer look next week and try to figure out how to move forward.

I came across some of these cases and thought the situation could require some tidying, so I wrote a script which lists sitelinks to inexistent client wiki pages in order to process them. Some patterns that I notice after closely looking at dewiki, ptwiki, and cawiki:

  • There are several hundred such cases for each of these three wikis; cawiki is even clearly above 2000 cases. These sitelinks to inexistent pages are still there in Wikidata, but I plan to remove them soon.
  • Both "User does not exist at Wikidata" and "User is blocked at Wikidata" are super rare scenarios that might not even be worth worrying about. In almost all cases, neither of these is the case.
  • The clear majority of cases are the result of "move without leaving a redirect behind" actions on client wikis; deletions make up a much smaller number.
  • Two patterns happen surprisingly often:
    • A user performs "move without leaving a redirect behind" on a redirect page on a client wiki. I suppose the corresponding action on Wikidata fails because redirects are not allowed as sitelinks.
    • A user performs "move without leaving a redirect behind" on a client wiki and the page is moved to another namespace. Is it intentional that the corresponding action on Wikidata fails, or does it happen by accident? In some cases, it would indeed be better to remove the sitelink rather than updating it, e.g. when a page is moved from main to user namespace in the client wiki.

There are still plenty of cases which cannot be explained in any of these ways. Maybe "Wikidata is read-only" at the moment of the edit can also lead to missed sitelink updates.

Manuel renamed this task from "[feature request] remove sitelinks / update sitelinks on Wikidata when pages are deleted/moved on client wikis (all users)" to "In some cases, moving or deleting pages on a client wiki does not result in sitelink updates / removal on Wikidata." (Jun 21 2022, 9:11 PM)
Manuel updated the task description.

Thank you so much for your research @MisterSynergy, this helped a lot! If by any chance you still have your list of cases (and analysis), please let me know.

@Manuel:

  • I got a bot task approved that allows me to tidy these sitelinks up regularly (i.e. remove them from the item if the page is inexistent on the client wiki). This can itself be considered a "dirty" solution to the problem, though clearly not the best one.
  • However, it has not been executed yet due to a lack of time for Wikidata on my side in recent months.
  • AFAIR, the main issue currently is that the evaluation workflow is kinda demanding regarding memory usage. While drafting the code on PAWS with its 3 GB memory limit, I offloaded parts of the evaluation for larger wikis to my local machine, which has sufficient memory available. For a fully automated deployment on Toolforge, this is of course not possible. Toolforge may even apply stricter memory limits than PAWS.
  • Why does it need so much memory? My approach queries "all pages per client wiki" (from the client's page table) and "all sitelinks in Wikidata" (from Wikidata's wb_items_per_site table) into separate Pandas DataFrames and subsequently looks for differences using Python. In other words: I avoid checking millions of cases individually by sitelink, and use a pretty quick per-client-wiki approach instead that requires me to hold all information for a given client wiki in memory.
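
A minimal sketch of the per-client-wiki comparison described in the last bullet, using plain Python sets as a stand-in for the Pandas DataFrames (in pandas, a left merge with `indicator=True` would play the same role). The `Sitelink` type, the function name and the sample data are assumptions for illustration, and title normalization (namespaces, underscores versus spaces) is deliberately ignored here.

```python
from typing import NamedTuple


class Sitelink(NamedTuple):
    item_id: str  # Item ID, e.g. "Q64"
    title: str    # page title as stored in the sitelink


def find_dangling_sitelinks(client_titles, sitelinks):
    """Return sitelinks whose title is not an existing page on the client wiki."""
    # Holding every title of one client wiki in memory is what makes this
    # approach fast, and also what makes it memory-hungry for large wikis.
    existing = set(client_titles)
    return [link for link in sitelinks if link.title not in existing]


# Illustrative sample data for a single client wiki (not real query results).
client_titles = ["Berlin", "Hamburg"]
sitelinks = [
    Sitelink("Q64", "Berlin"),
    Sitelink("Q1055", "Hamburg"),
    Sitelink("Q999999999", "Deleted page"),
]

dangling = find_dangling_sitelinks(client_titles, sitelinks)
print(dangling)
```

One set lookup per sitelink replaces millions of individual existence checks, at the cost of keeping the whole title list of a wiki in memory.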

So, the code itself is pretty much ready-to-roll, but I need to find a place to run this fully automated. If you are interested, I can try to generate an updated list of cases for further inspection but it would be helpful to really understand your needs. Do you want to further evaluate this?

@Manuel: I have looked into this again. As of now, I have this list of potential reasons for sitelink update failures:

  1. Sitelink configuration-related reasons
    1. A page on the client is "moved without a redirect" to another namespace that is forbidden (?) at Wikidata (such as a page move from main namespace to "User" or "Draft" namespace, e.g. when the page is not fit for the main namespace).
    2. A redirect page on the client is "moved without a redirect". Redirect sitelinks are not permitted, thus the sitelink cannot be updated.
  2. User-based reasons
    1. The user performing a sitelink change on a client does not have a local account at Wikidata
    2. The user performing a sitelink change on a client is not permitted to edit Wikidata due to a block
    3. The user performing a sitelink change on a client exceeds their rate-limit at Wikidata (e.g. when Special:Nuke is used on a client wiki)
  3. Page-based reasons
    1. The item page is protected to a level that the user performing a sitelink change on a client is not allowed to edit it
  4. Wikidata edit capacities limited
    1. Wikidata editing was generally rate-limited when the client sitelink had been changed (e.g. due to high maxlag), and the user made several sitelink changes in a short time (this might be the case with some bots)
    2. Wikidata was read-only when the client sitelink had been changed
  5. Project configuration-related reasons
    1. There are a couple of cases where a namespace has been renamed on client wikis, but the sitelinks to that namespace have not been updated on Wikidata. There seem to be auto-redirects in place, but technically the old titles do not exist any more.

Currently I see roughly 60,000 sitelinks that do not exist as a page on the client. My impression is that 1A is the dominant contributor here, and maybe 4A to some extent as well. I will soon start to repair 1A cases, including some logging for future investigations. Once the backlog is shorter, I think it should become easier to learn something about the other scenarios as well.

Thank you so much @MisterSynergy, you are amazing! \o/

Your analysis helps a lot! I wonder whether "User:Hoo Bot" might skew the analysis or not. @hoo, could you please give a little detail of what cases the bot is fixing and whether that changes MisterSynergy's analysis?

Also, I am wondering whether we should contribute to this effort from the developer side as well, or whether it makes sense to rely on your bot(s) for now. Should we e.g. at least try to fix 1A (and maybe 4A) server-side?

Do you have any thoughts on this?

I don't think "User:Hoo bot" has much influence here as this bot has not edited Wikidata since 2016-10. While many cases are a couple of years old, they are not *that* old in fact. As far as I am aware, nobody has taken care of this for a long time now (but I am determined to do so…)

Just to note that Pi bot was removing some of these links in 2018, focused on tgwiki, but I haven't been running that script recently. Documentation at https://www.wikidata.org/wiki/Wikidata:Requests_for_permissions/Bot/Pi_bot_9 and code at https://github.com/mpeel/wikicode/blob/master/check_tgwiki.py . I'm happy to help with similar bot work if you need, but suspect you've got it well in hand @MisterSynergy!

Yes, you are right @MisterSynergy! I am glad about your repair efforts and will stand down for now. Please let me know if we can contribute something from our side or if you have new insights from your investigations. Thank you, and best of luck with this!

Also, thx for the additional info @Mike_Peel!

Status update: Over the past few days, I have removed deleted sitelinks for the "easy" cases where the reason is relatively obvious. This has reduced the number of open cases from ~60k to ~6k (i.e. a 90% reduction). Findings:

  • Around 6k cases resulted from "move without redirect" scenarios on client wikis. This is far fewer than I anticipated earlier, yet still a substantial amount.
  • Around 40k cases resulted from scenarios where the user batch-deleted plenty of pages on the client wiki at a high rate, either by using Special:Nuke or a custom deletion bot script. Since admins on client wikis usually enjoy noratelimit privileges on the client wiki but not on Wikidata, this causes ratelimit issues when removing the sitelinks from Wikidata items. Since this is by far the most important reason why a deleted page might remain as a sitelink on the Wikidata item, it might be valuable to consider optimizations for this scenario.
  • Another 8k "deleted sitelinks" were not actually deleted, but their namespaces were renamed (on srwikinews and lmowiki only). I have simply updated the sitelinks so that this is not an issue any longer. There are more such cases waiting for a fix within the remaining 6k cases.
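
To illustrate the rate-limit scenario from the second finding, here is a toy simulation. It uses a generic sliding-window limiter, which is an assumption and not necessarily how MediaWiki enforces its limits, and the numbers (100 deletions in 10 seconds, 8 edits per 60 seconds) are made up.

```python
def simulate_sitelink_removals(deletion_times, limit, window):
    """Apply at most `limit` edits per `window` seconds; drop the rest.

    Returns (applied, dropped): lists of the deletion timestamps whose
    sitelink removal went through or was rejected by the limiter.
    """
    applied, dropped = [], []
    for t in deletion_times:
        # Count edits that were applied within the last `window` seconds.
        recent = [a for a in applied if t - a < window]
        if len(recent) < limit:
            applied.append(t)
        else:
            dropped.append(t)  # this sitelink stays on the Item
    return applied, dropped


# Hypothetical batch deletion: 100 pages within 10 seconds on the client wiki,
# where the deleting admin is not rate-limited.
deletion_times = [i * 0.1 for i in range(100)]

# Hypothetical limit for the same user on the repo: 8 edits per 60 seconds.
applied, dropped = simulate_sitelink_removals(deletion_times, limit=8, window=60)
print(len(applied), len(dropped))
```

With these made-up numbers only the first few removals go through, and every later deletion in the batch leaves a sitelink behind.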

Within the next few days, I will have a look at the remaining "deleted sitelinks" in order to fix them as well. I will also set up a bot task that executes regularly, in order to keep the backlog short.

Thank you for the update and for the great work, @MisterSynergy! \o/

Status update: the backlog of sitelinks to inexistent pages is cleared, except for:

  • Sitelinks to wikis that have been closed (their status is undetermined anyways; number of cases is unknown)
  • Sitelinks to Special pages, which appear as inexistent in some contexts but actually exist (these should not happen per guidelines, but there are ~1000 of such sitelinks currently in Wikidata)
  • Sitelinks to User pages where the user has a gendered namespace prefix on the client wiki (these pages appear as inexistent in some scenarios as well; ~10 cases)

I do not plan to touch these at the moment.

Besides that, I was able to clear the backlog with a custom script, except for ~75 really obscure cases which needed manual intervention. This means that my bot script is able to deal with almost everything that has shown up in the past.

The statistics I provided on July 12 above in this task are still valid. In my experience, the main culprit is rate-limit issues when pages are deleted on the client wiki at a high rate (admin bot, Special:Nuke, i.e. not ratelimited) so that the sitelink removal on Wikidata cannot keep up. Since almost everything can be fixed automatically, I do not see an urgent need to change anything in the software.

Another status update:

I have now migrated this job from PAWS to Toolforge (msynbot tool account). Unfortunately, due to memory restrictions on Toolforge, I had to rewrite much of the code. The memory-intensive operation is no longer done with Python/pandas; instead I use a temporary tool database so that the operation runs on database servers that are not subject to k8s memory limits. After several test runs, I am confident that there is no memory issue to be expected in the foreseeable future even with the largest wikis.
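
A rough sketch of what moving the comparison into a database could look like: an anti-join that returns sitelinks without a matching client page. An in-memory SQLite database stands in for the temporary tool database here, and the table and column names (`client_pages`, `sitelinks`, ...) are simplified assumptions rather than the actual schema or the actual bot code.

```python
import sqlite3

# In-memory SQLite as a stand-in for the temporary tool database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE client_pages (title TEXT PRIMARY KEY);
    CREATE TABLE sitelinks (item_id INTEGER, site_id TEXT, site_page TEXT);
""")

# Illustrative sample rows (not real data).
conn.executemany(
    "INSERT INTO client_pages VALUES (?)",
    [("Berlin",), ("Hamburg",)],
)
conn.executemany(
    "INSERT INTO sitelinks VALUES (?, ?, ?)",
    [(1, "dewiki", "Berlin"), (2, "dewiki", "Hamburg"), (3, "dewiki", "Deleted page")],
)

# Anti-join: keep sitelinks for which no client page with that title exists.
dangling = conn.execute("""
    SELECT s.item_id, s.site_page
    FROM sitelinks AS s
    LEFT JOIN client_pages AS p ON p.title = s.site_page
    WHERE s.site_id = 'dewiki' AND p.title IS NULL
    ORDER BY s.item_id
""").fetchall()
conn.close()

print(dangling)
```

The join runs entirely inside the database engine, so the Python process only has to hold the (small) result set in memory.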

There is now a weekly k8s-cronjob that should keep the backlog short. I am also continuing to log edits done by the bot so that I can provide some insight into the situations that lead to inexistent sitelinks on item pages if necessary.

Has the option to automatically create the user account prior to the update if it doesn't exist on Wikidata been actually considered? Is it impossible, or is it just unclear whether we want to do it?

As far as I remember, we deliberately chose not to auto-create users back then when initially implementing this. I don't think it would be very hard to add this (in a nice way), but I haven't checked.

Do you by chance remember why? This seems like a sensible thing to do at first sight to me.

I think this was just a very conservative choice back then (also to make sure that libel account names don't propagate). But this was way back (2014ish), even before the SUL-finalization, so I don't think this still stands.

If we end up implementing this: I just looked this up and MediaWikiServices::getInstance()->getAuthManager()->autoCreateUser should do what we need.

matej_suchanek renamed this task from "In some cases, moving or deleting pages on a client wiki does not result in sitelink updates / removal on Wikidata." to "In some cases, moving or deleting pages on a client wiki does not result in sitelink updates / removal on Wikidata" (Fri, Apr 5, 8:07 AM)