
Wikidata edit did not update the langlinks tables on MediaWiki side
Open, Needs Triage | Public | BUG REPORT

Description

List of steps to reproduce (step by step, including full links if applicable):
Here is a piece of Python code that uses Pywikibot to find which Persian Wikipedia page corresponds to the English Wikipedia page Category:Companies based in Edmonton:

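import pywikibot

# The family is given explicitly so the result does not depend on user-config.py.
site = pywikibot.Site('en', 'wikipedia')
page = pywikibot.Page(site, 'Category:Companies based in Edmonton')
print(page.langlinks())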

What happens?:
Here is the response you get:

[pywikibot.page.Link('شركات مقرها في إدمونتون', APISite("ar", "wikipedia")), pywikibot.page.Link('Unternehmen (Edmonton)', APISite("de", "wikipedia")), pywikibot.page.Link('شرکت\u200cهای مستقر در ادمنتن', APISite("fa", "wikipedia"))]

This indicates that the fa equivalent page would be titled "شرکت‌های مستقر در ادمنتن" (don't be confused by \u200c, which is the zero-width non-joiner character).

What should have happened instead?:
Per the Wikidata page and as shown on the right side of the screenshot below, the expected answer would have been "شرکت‌های ادمنتن" (more specifically, "رده:شرکت‌های ادمنتن").

image.png (349×1 px, 28 KB)

It seems like langlinks() is not returning the links, but rather the labels of the descriptions in different languages. Case in point, the namespace prefix is also missing in the responses. This contradicts what the function claims to do: "Return a list of all inter-language Links on this page"

Software version (if not a Wikimedia wiki), browser information, screenshots, other information, etc:
Tested on 05ce190809b

Event Timeline

Here is my analysis of the issue:

The langlinks() method of the Page class simply calls pagelanglinks() of the Site class, which is defined in _generators.py. Notably, the docstring of pagelanglinks() says that it uses API:Langlinks, and when I call that API directly, I get the same incorrect answer: https://en.wikipedia.org/w/api.php?action=query&prop=langlinks&titles=Category:Companies%20based%20in%20Edmonton&redirects=

{
    "batchcomplete": "",
    "query": {
        "pages": {
            "6951033": {
                "pageid": 6951033,
                "ns": 14,
                "title": "Category:Companies based in Edmonton",
                "langlinks": [
                    {
                        "lang": "ar",
                        "*": "\u062a\u0635\u0646\u064a\u0641:\u0634\u0631\u0643\u0627\u062a \u0645\u0642\u0631\u0647\u0627 \u0641\u064a \u0625\u062f\u0645\u0648\u0646\u062a\u0648\u0646"
                    },
                    {
                        "lang": "de",
                        "*": "Kategorie:Unternehmen (Edmonton)"
                    },
                    {
                        "lang": "fa",
                        "*": "\u0631\u062f\u0647:\u0634\u0631\u06a9\u062a\u200c\u0647\u0627\u06cc \u0645\u0633\u062a\u0642\u0631 \u062f\u0631 \u0627\u062f\u0645\u0646\u062a\u0646"
                    }
                ]
            }
        }
    }
}

(Note that "\u0631\u062f\u0647:\u0634\u0631\u06a9\u062a\u200c\u0647\u0627\u06cc \u0645\u0633\u062a\u0642\u0631 \u062f\u0631 \u0627\u062f\u0645\u0646\u062a\u0646" decodes to "رده:شرکت‌های مستقر در ادمنتن")

In other words, Pywikibot is not doing anything wrong; it is the API response from MediaWiki that doesn't match the Wikidata item.
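To make the comparison easier to repeat, here is a minimal sketch that pulls the language links out of a response shaped like the one above. The function name extract_langlinks is hypothetical, the sample is a trimmed copy of the response quoted in this comment, and it assumes the default JSON format in which the title sits under the "*" key and that only one page is returned.

```python
import json


def extract_langlinks(response: dict) -> dict:
    """Map language code -> linked title from an API:Langlinks response.

    Hypothetical helper. Assumes the default JSON output, where each
    langlinks entry stores the title under the "*" key.
    """
    links = {}
    for page in response.get("query", {}).get("pages", {}).values():
        for entry in page.get("langlinks", []):
            links[entry["lang"]] = entry["*"]
    return links


# Trimmed copy of the response quoted above (raw string keeps the \u escapes
# intact so that json.loads decodes them).
raw = r'''
{"query": {"pages": {"6951033": {"pageid": 6951033, "ns": 14,
  "title": "Category:Companies based in Edmonton",
  "langlinks": [
    {"lang": "de", "*": "Kategorie:Unternehmen (Edmonton)"},
    {"lang": "fa", "*": "\u0631\u062f\u0647:\u0634\u0631\u06a9\u062a\u200c\u0647\u0627\u06cc \u0645\u0633\u062a\u0642\u0631 \u062f\u0631 \u0627\u062f\u0645\u0646\u062a\u0646"}
  ]}}}}
'''

links = extract_langlinks(json.loads(raw))
print(links["fa"])  # the stale fa title, including the namespace prefix
```

Unlike the Pywikibot Link objects in the task description, the raw API titles keep the namespace prefix, which makes them directly comparable to the sitelinks on the Wikidata item.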

Changing tags.

Here is what querying the langlinks table for enwiki shows:

SELECT *
FROM langlinks
WHERE ll_from = 6951033; -- Category:Companies based in Edmonton

image.png (187×718 px, 7 KB)

Clearly, the langlinks table is not consistent with the Wikidata item. Since the MediaWiki API reads from this table, its response doesn't match the Wikidata item either.

Of note, the fa entry for this item on Wikidata was last edited on 25 December 2020. Somehow that did not update the local langlinks table. Adding @Ladsgroup (it happens that his bot updated that page) mainly because he understands Wikidata much better than I do and may have some intuition about this inconsistency.

Huji renamed this task from "Pywikibot returns incorrect result when querying Wikidata" to "Wikidata edit did not update the langlinks tables on MediaWiki side". Dec 7 2021, 9:57 PM

Not to distract from the discussion of this specific case, but it seems like this is not the only case. T190667 from two years ago seems to be about a similar issue. And the answer offered there (namely "don't use the langlinks table") is invalid, because that table is what MediaWiki itself uses for its API (which in turn is what Pywikibot uses to retrieve langlinks for a given page). Even T43387 from nine years ago seems related.

Point being: do we have a process that randomly checks the consistency of MediaWiki langlinks table with Wikidata to detect such anomalies?
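As an illustration of what such a check could compare, here is a minimal sketch. The function name find_mismatches is hypothetical, and so is the mapping from a language code to a Wikidata site ID by appending "wiki", which does not hold for every Wikipedia. The fa titles are the ones from this task; the de sitelink is assumed to be unchanged, purely for illustration.

```python
def find_mismatches(langlinks: dict, sitelinks: dict) -> dict:
    """Return {lang: (local_title, wikidata_title)} for entries that disagree.

    langlinks: language code -> title, as stored in the local langlinks table.
    sitelinks: Wikidata site ID (e.g. "fawiki") -> title on the item.
    """
    mismatches = {}
    for lang, local_title in langlinks.items():
        # ASSUMPTION: the site ID is "<lang>wiki". This is a simplification
        # and is not true for every language code.
        wikidata_title = sitelinks.get(lang + "wiki")
        if wikidata_title != local_title:
            mismatches[lang] = (local_title, wikidata_title)
    return mismatches


# Local values as returned by the API in this task.
local = {
    "de": "Kategorie:Unternehmen (Edmonton)",
    "fa": "رده:شرکت‌های مستقر در ادمنتن",
}
# Wikidata values: fa is the title expected per the item; de is assumed
# unchanged for illustration.
wikidata = {
    "dewiki": "Kategorie:Unternehmen (Edmonton)",
    "fawiki": "رده:شرکت‌های ادمنتن",
}

stale = find_mismatches(local, wikidata)
print(sorted(stale))
```

A reported mismatch is only a candidate: the local wiki may have overridden the link on purpose, or the refresh job may simply not have run yet.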

Since I'm pinged: yes, this is sort of expected, and it follows from the CAP theorem. The more distributed we become as we scale, the bigger the discrepancy gets. Jobs that trigger the refresh might fail, or might not get queued at all (that used to happen about 1% of the time; now it's around 0.006%, which is still non-zero). Most importantly, there will always be some latency between a change and its being reflected in Wikipedias. The key thing is that we should have only one source of truth (the SSOT principle) to avoid corruption.

All of that being said, we can definitely do better. In my last project before departing WMDE, I worked on the dispatching of changes from Wikidata to Wikipedias (and co) and noticed that the delay before changes are reflected in the wikis is quite small, but only for changes that don't require re-parsing a page (e.g. injecting rc entries). The re-parsing goes into the same queue as the re-parsing caused by, for example, template changes (to be a bit more technical: the htmlCacheUpdate queue in the job runners), and that queue is quite busy (to put it mildly). Depending on the load, or on whether a widely used template has been changed, it can take days for the queue to reach that change.

So to mitigate and reduce the latency:

  • Have a dedicated job for sitelink changes and for other sitelinks of that item and put them in a dedicated lane
    • Having dedicated queues for higher-priority work is more aligned with distributed computing principles (look at the Tanenbaum book)
    • This is not as straightforward as it looks due to internals of Wikibase.
    • I'm not in WMDE to help move it forward.
  • Reduce the load on htmlCacheUpdate. The amount of load on it is unbelievable.
    • There are some tickets already. I remember T250205 off the top of my head.
    • I honestly don't know if people have really looked into why it's so massive.
    • A low-hanging fruit would be to avoid re-parsing transclusions when only the content inside <noinclude> has changed in the template. That way, updating documentation would not cause a massive cascade of several million reparses.
    • Fixes in this direction improve the health and reliability of all of our systems, from ParserCache, to DB load, to appservers, to job runners, to edge caches. It's even greener.
    • But it's a big endeavor and will take years at least.
  • As a terrible band-aid, you can write a bot that listens to the rc entries of your wiki whose source is Wikidata and forces a reparse (action=purge) whenever the injected change is a sitelink change.
    • A complicating factor is that an article can subscribe to the sitelink of another item and get notified of that sitelink's changes (it's quite common in Commons, pun intended). Make sure to avoid reparsing because of that.

HTH

Potentially dumb question: is there a reason MediaWiki's API uses its own langlinks table (rather than Wikidata) as the data source? The wiki pages themselves show the interwiki links based on Wikidata; why doesn't the API use the same source?

None of this would have occurred if the API query I linked above would return the same answers as what you see on wiki (which is what you see on Wikidata).

Three reasons I can think of:

  • Loading an item is heavy and requires a lot of network bandwidth inside the dc. Plus it would put a lot of pressure on memcached and databases. See this as a cache layer
  • Client wikis need to override those entries (sometimes) with [[en:Foo]]
  • it allows more complex querying in labs, e.g. articles that have a lot of interwiki links but not in language foo.

Three reasons I can think of:

  • Loading an item is heavy and requires a lot of network bandwidth inside the dc. Plus it would put a lot of pressure on memcached and databases. See this as a cache layer
  • Client wikis need to override those entries (sometimes) with [[en:Foo]]

Wouldn't all those apply to the Web version too? But the Web version correctly shows the interwiki links based on Wikidata. If the web version can do it efficiently, why wouldn't the API do the same?

More explicitly stated: when I go to https://en.wikipedia.org/wiki/Category:Companies_based_in_Edmonton I see an interwiki link to the fa page "رده:شرکت‌های ادمنتن" (which is consistent with Wikidata), but when I go to https://en.wikipedia.org/w/api.php?action=query&prop=langlinks&titles=Category:Companies%20based%20in%20Edmonton&redirects= I get the old fa interwiki link as described above. Is there a reason that the API returns a different response than the Web version? Is it even good practice?

  • it allows more complex querying in labs, e.g. articles that have a lot of interwiki links but not in language foo.

This one is not relevant to the question about the API. I am not proposing that we do away with the langlinks table in MediaWiki altogether. I am saying that the Web and the API versions of MediaWiki should use the same source of data (namely Wikidata).

The web sidebar links come from the content of ParserCache (which is regenerated on edit and purge requests), whereas the API, by way of the langlinks table, is updated only on edit or when a purge is requested with its explicit links-update option. In most cases the two should behave the same, and if purges triggered by Wikidata don't update langlinks, then that's a bug. (Again, I can't help fix it; I can only explain how it works.)

I appreciate your explanation.

Is there a reason the API should not use ParserCache output?

I can't answer that with 100% certainty, but I think it's mostly due to the organic growth of this part of MediaWiki. The only thing to keep in mind is that the ParserCache entry for each page gets evicted after a certain time (currently twenty days, seven days for talk pages), so it can be missing for lots of pages that don't get many views. So I suggest that the code look into PC first, and if the entry is missing (or expired, or not canonical, or otherwise unusable), it should still fall back to langlinks, as generating the ParserCache entry is expensive.
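The lookup order suggested here can be sketched roughly as follows. This is not MediaWiki code: the CachedLinks class, the get_langlinks function, and the two plain dictionaries standing in for ParserCache and the langlinks table are all hypothetical, and the check for a non-canonical entry is left out.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class CachedLinks:
    """Hypothetical stand-in for the language links held in a cache entry."""
    langlinks: dict      # language code -> title
    expires_at: float    # Unix timestamp after which the entry is stale


def get_langlinks(page_id: int,
                  parser_cache: dict,
                  langlinks_table: dict,
                  now: Optional[float] = None) -> dict:
    """Prefer a fresh cache entry; otherwise fall back to the table."""
    now = time.time() if now is None else now
    entry = parser_cache.get(page_id)
    if entry is not None and entry.expires_at > now:
        return entry.langlinks
    # Missing or expired entry: do not trigger an expensive re-parse here,
    # just serve whatever the langlinks table has.
    return langlinks_table.get(page_id, {})


# Toy data: page 1 has a fresh cache entry, page 2 an expired one.
cache = {
    1: CachedLinks({"fa": "new title"}, expires_at=200.0),
    2: CachedLinks({"fa": "new title"}, expires_at=50.0),
}
table = {1: {"fa": "old title"}, 2: {"fa": "old title"}, 3: {"de": "Titel"}}

fresh = get_langlinks(1, cache, table, now=100.0)
expired = get_langlinks(2, cache, table, now=100.0)
missing = get_langlinks(3, cache, table, now=100.0)
print(fresh, expired, missing)
```

With this ordering, the API would agree with the sidebar whenever a usable cache entry exists and would behave as it does today otherwise.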

I like that plan. I find it way above my pay grade (of 0 as a volunteer) to implement.


I looked into this in context of T299828:

Originally, this indeed seems to have been caused by the correct job sometimes not being scheduled when a new sitelink was added. That was fixed, and the fix has been deployed since about January 20th. So for sitelinks added after that, this should not happen anymore.

To fix pages from before that, a purge with a forced links update is needed. This can be done via the API. A null edit probably also works.
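For reference, here is a minimal sketch of what such a purge request could look like. It only builds the request and does not send it. The helper name build_purge_request and the User-Agent string are placeholders; action=purge and its forcelinkupdate parameter are part of the MediaWiki Action API, and the purge has to be sent as a POST.

```python
from urllib.parse import urlencode
from urllib.request import Request

API_URL = "https://en.wikipedia.org/w/api.php"


def build_purge_request(titles: list) -> Request:
    """Build (but do not send) a purge request that also refreshes links tables."""
    params = {
        "action": "purge",
        "titles": "|".join(titles),   # multiple titles are pipe-separated
        "forcelinkupdate": "1",       # refresh links tables, including langlinks
        "format": "json",
    }
    return Request(
        API_URL,
        data=urlencode(params).encode("utf-8"),
        # Placeholder User-Agent; replace with one that identifies your tool.
        headers={"User-Agent": "langlinks-purge-sketch/0.1 (example placeholder)"},
        method="POST",
    )


req = build_purge_request(["Category:Companies based in Edmonton"])
print(req.get_method(), req.data.decode("utf-8"))
```

Passing the request to urllib.request.urlopen would perform the purge; after that, the API:Langlinks query from the analysis above should return the updated fa title.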

While this should work automatically for future sitelinks added to Wikidata, keep in mind that the job which triggers that refresh of links currently has a backlog of over 3 days.

Let me know if you find more recent examples of this, where the langlinks weren't updated after a few days.