
pywikibot's interwikidata.py can't handle projects where the API-reported wikiID differs from the project's globalID
Open, Needs Triage · Public · BUG REPORT

Description

This should probably be tagged as "Pywikibot-interwikidata.py" but there doesn't seem to be an available tag for that item.

The Wikibase extension and the Pywikibot-interwikidata.py script both contain strict hard-coded assumptions which, while likely valid on WMF wikis, may break on third-party wikis:

  • T172076: The code assumes that the GlobalID naming convention will be (language code)+(group name) with any hyphens replaced with underscores. It also hard-codes an assumption that the (group name) will always be "wiki*" or "wiktionary" (as WMF project names) and that removing that trailing group name will yield the local language code.
  • T221550: The API and core code assume the local database name (wikiID) can be reported to API clients as a presumed-standard GlobalID which is consistent in format, unique across that entire project and follows all naming conventions. (This won't be fixed at the API level until GlobalID exists in core MW code and, even then, good luck getting externally-hosted projects to update their configs.)
  • T221556: Furthermore, interwikidata.py assumes there are no individual language wikis in the group which are independently hosted (or which lack access to the common repository). The script takes a list of interwikis from the article and makes an API query for each to see if it's already linked to an item, so that it may treat anything already linked to some other Wikibase Q-item as an interlanguage link conflict. Unfortunately, if the API responds that there is no Wikibase at all behind one language's site, the script does not even attempt to handle this condition and immediately exits - when the proper behaviour should be to treat a "We don't have a lord. We're an autonomous collective." response as there being no conflicting Q-item link on the remote wiki (so OK, no error).
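
To make the assumption in T172076 concrete, here is a minimal Python sketch of the naming convention described above. The helper names and the regular expression are illustrative assumptions, not code taken from Pywikibot or Wikibase.

```python
import re

# Hypothetical helpers illustrating the assumed convention; not Pywikibot code.
# Assumption: the group name is "wiktionary" or anything starting with "wiki".
GROUP_SUFFIX = re.compile(r"(.+?)(wiki\w*|wiktionary)")


def assumed_global_id(lang_code, group):
    """Build a GlobalID as (language code)+(group name), hyphens to underscores."""
    return (lang_code + group).replace("-", "_")


def assumed_lang_code(global_id):
    """Strip a trailing WMF-style group name; return None if there is none."""
    match = GROUP_SUFFIX.fullmatch(global_id)
    return match.group(1) if match else None


print(assumed_global_id("zh-min-nan", "wiki"))  # follows the WMF pattern
print(assumed_lang_code("enwiki"))              # WMF-style ID: prefix is recovered
print(assumed_lang_code("enuncyc"))             # third-party group: no answer
```

A GlobalID such as "enuncyc" has no "wiki*" or "wiktionary" suffix, so a rule of this shape cannot recover a language code from it.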

Even if these issues are fixed locally, one problem remains: any externally-hosted wikis will be returning their local database name as WikiID - and that won't match the GlobalID.

That's happening because interwikidata.py presumes the API is providing a GlobalID, while the API presumes there is no GlobalID support in core and returns the local database name. That's a design flaw. There are workarounds in other places (such as $wgWBRepoSettings['localClientDatabases']['ptuncyc']='uncyc_pt'; in the Wikibase-repo extension config), but pywikibot/site.py has no table to map the API-reported WikiID (the local database name) to the GlobalID; it blindly expects the WikiID to actually be the GlobalID, always.

Steps to Reproduce:

Install Pywikibot and try to run interwikidata.py on Uncyclopedia. (This will require patching code to address T221556 first, which I shall not address here, and the "home wiki" for the bot will need to be set to one of the languages which has access to the repository.)

There's a (somewhat-broken) Wikidata repository on *.uncyclopedia.info but the project is a mess of independently-hosted languages (such as Russian, Polish, Korean), items on external wiki farms (Italian is on Miraheze?) and entire clusters of wikis (*.uncyclopedia.co) which are separate from anything on the repo.

In theory, the Wikibase extension code should be capable of creating an outbound inter-language link to an externally-hosted project if its page and API links are in the sites table. In practice, everything still goes haywire even after the other bugs listed above have been patched (or kludged, or worked around...) as the wikiID being reported by the individual external projects seems to vary widely, depending on who is hosting each individual language.

Actual Results:

Every time a link to the externally-hosted site is found, if the site's API-reported database name doesn't match the expected GlobalID, the script will report "Unknown site:" and the database name reported by the remote API. This prevents the script from creating outbound interlanguage links to that specific externally-hosted site.

Expected Results:

The only easy way to get the desired result (the script can make outbound-only links to externally-hosted languages, even if that doesn't generate a backlink from the external site) is to add a translation table to be consulted in pywikibot/site.py - something like:

def dbName(self):
    """Return the globalID corresponding to this site's internal id."""
    # Map API-reported wikiIDs (local database names) to Wikibase GlobalIDs.
    wikiIDmap = {
        'uncy_cs': 'csuncyc',
        'uncy_de': 'deuncyc',
        'uncy_en': 'enuncyc',
        'uncy_es': 'esuncyc',
        'uncy_fr': 'fruncyc',
        'uncy_he': 'heuncyc',
        'uncy_un': 'en_gbuncyc',
        'engbuncyc': 'en_gbuncyc',
        'zhtwuncyc': 'zh_twuncyc',
        'beidipediawiki': 'aruncyc',
        'nonciclopediawiki': 'ituncyc',
        'uncyclopediawiki': 'zh_cnuncyc',
        'uncyclo_pedia': 'kouncyc',
        'nonsensopedia': 'pluncyc',
        'absurd': 'ruuncyc',
    }
    wikiid = self.siteinfo['wikiid']
    # Fall back to the raw wikiID for any wiki not listed above.
    return wikiIDmap.get(wikiid, wikiid)

instead of the original (pywikibot/site.py, lines 2727-2729):

def dbName(self):
    """Return this site's internal id."""
    return self.siteinfo['wikiid']

This is a kludge. Ultimately, the wikiIDmap needs to exist as part of the configuration, perhaps in user-config.py or as a user-added entry in the generated uncyclopedia-family.py file.
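
As a rough sketch of what a configuration-driven version might look like, the lookup can be separated from the data so that the table lives wherever the configuration ends up. The names resolve_global_id and wikiid_overrides are invented for illustration; neither is an existing Pywikibot function or setting.

```python
# Hypothetical sketch: "wikiid_overrides" is an invented setting name,
# not an existing Pywikibot option.
def resolve_global_id(wikiid, overrides):
    """Return the GlobalID for an API-reported wikiID, defaulting to the wikiID."""
    return overrides.get(wikiid, wikiid)


# Example entries as they might appear in a per-family configuration.
wikiid_overrides = {
    'uncy_en': 'enuncyc',
    'nonsensopedia': 'pluncyc',
}

print(resolve_global_id('uncy_en', wikiid_overrides))  # listed: mapped to its GlobalID
print(resolve_global_id('enwiki', wikiid_overrides))   # unlisted: returned unchanged
```

Falling back to the unmodified wikiID keeps the current behaviour for WMF wikis, where the two identifiers already match.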

The current code relies on the API returning a GlobalID, but the GlobalID concept (per T221550) simply doesn't exist in the API because it doesn't exist in core code. WMF is a closed, controlled environment where the local database names follow one specific, known pattern that matches the GlobalID. A third-party external site? Don't count on anything.

Event Timeline

Carlb created this task. Apr 27 2019, 9:38 PM
Restricted Application added subscribers: pywikibot-bugs-list, revi, Aklapper. Apr 27 2019, 9:38 PM
Carlb updated the task description. Apr 27 2019, 9:42 PM
Carlb updated the task description. Apr 27 2019, 9:46 PM
Carlb added a project: Wikidata.

Okay, I'm not sure I understand this correctly, and I don't really know what GlobalID and wikiID are or how they differ (could you please explain briefly?). From the Pywikibot side, I think we can do two things: a) add an Uncy family file to the Pywikibot library, where you can easily override site.py's dbName with your own, or b) improve generate_family_files.py to allow some dbName corrections.

In T221556 we are still waiting for steps to reproduce, so that we can find where in the code the issue is located.

The GlobalID is the name of the wiki, as it appears in Wikidata. For instance, "enwiki" is the English-language Wikipedia. A Wikidata entry with the individual GlobalID for each wiki looks like https://www.wikidata.org/wiki/Q2736

Wikipedia (204 entries):

  abwiki Ашьапылампыл
  acewiki Sipak bhan
  adywiki Лъэпэеу
  afwiki Sokker
  alswiki Fussball
  amwiki እግር ኳስ
  angwiki Gyldfōtþōðer
  anwiki Fútbol
  arwiki كرة القدم
  arzwiki كورة قدم
  astwiki Fútbol
  [...]

These are the prefixes being submitted to the Wikibase repository whenever an interlanguage link is added. If you view a Q-item on Wikidata (and turn JavaScript off) these tags are visible on all the language links on every record.

If the bot is generating interwikidata links, this is the prefix it needs.
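
As a rough illustration of where that prefix ends up, the sketch below assembles parameters for the Wikibase API's wbsetsitelink module, which takes the site identifier in its linksite parameter. The helper name is hypothetical, the title and token values are placeholders, and nothing here claims that interwikidata.py uses this module rather than another one.

```python
# Hypothetical helper: only builds the request parameters, sends nothing.
def build_sitelink_request(item_id, global_id, title, token):
    """Assemble parameters for a Wikibase wbsetsitelink API call."""
    return {
        'action': 'wbsetsitelink',
        'format': 'json',
        'id': item_id,           # the Q-item receiving the link
        'linksite': global_id,   # must be the GlobalID, e.g. "enwiki"
        'linktitle': title,      # page title on the linked wiki
        'token': token,          # edit token obtained separately
    }


params = build_sitelink_request('Q2736', 'enwiki', 'Placeholder title',
                                'placeholder-token')
print(params['linksite'])
```

If a local database name such as "uncy_en" were passed as global_id instead, the repository would be handed an identifier it does not know.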

The WikiID, on the other hand, is just an item returned to us by the API on each individual wiki giving the internal name on that server. For instance, https://en.uncyclopedia.co/w/api.php?format=json&meta=siteinfo&action=query reports:

server	"//en.uncyclopedia.co"
servername	"en.uncyclopedia.co"
wikiid	"uncy_en"
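
The lookup above can be reproduced with a few lines of Python using only the standard library. This is a minimal sketch: the function names are invented for illustration, and the parsing assumes the usual siteinfo JSON layout in which wikiid sits under query.general.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def siteinfo_url(api_url):
    """Build the siteinfo query URL for a MediaWiki api.php endpoint."""
    query = urlencode({'action': 'query', 'meta': 'siteinfo', 'format': 'json'})
    return f'{api_url}?{query}'


def extract_wikiid(response_text):
    """Pull the wikiid field out of a siteinfo JSON response."""
    return json.loads(response_text)['query']['general']['wikiid']


def fetch_wikiid(api_url):
    """Ask a live wiki for its wikiID (performs a network request)."""
    with urlopen(siteinfo_url(api_url)) as response:
        return extract_wikiid(response.read().decode('utf-8'))


# Offline example using a trimmed response shaped like the output above.
sample = '{"query": {"general": {"servername": "en.uncyclopedia.co", "wikiid": "uncy_en"}}}'
print(extract_wikiid(sample))
```

Calling fetch_wikiid('https://en.uncyclopedia.co/w/api.php') would run the same query against the live site, provided the server accepts the request.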

That WikiID is hardcoded to the database name (or database-prefix name) on the server. There's no way to change it, short of renaming the underlying server's database or changing API/core code. Unless we have a direct database connection (such as the replica databases on wmflabs) it's pretty much meaningless to us. We need the GlobalID, because that's what will be fed as data to Wikibase.

WMF names its server databases to match the GlobalID, but on a third-party site this field might contain anything. Do we care what SQL names the database on the remote server? We just want to know what prefixes to submit to create the Q-item.

Thank you for your explanation. So you are saying that Pywikibot does not distinguish between wikiID and GlobalID?

Yes. Pywikibot blindly trusts that whatever wikiID is supplied by the remote wiki's API is indeed going to exactly match the GlobalID.

That may well be true at WMF, but a third-party wiki could be naming their local server's databases just about anything.

Okay, how can we get the GlobalID from the API?

T221550 says that there is no GlobalID available from API because there is no support for GlobalID in the core code.

According to that task, "Anomie: As far as MediaWiki-API is concerned, this is blocked on someone moving this "global ID" concept into core and probably having wfWikiID() itself return that ID. If that were done, the API would follow naturally. I'd recommend you see if someone from Wikidata would be interested in doing that move."

Hence the need to get globalID from a translation table or a config file. It's not in the API and it's not in MW core.

Dvorapa added a comment. Edited Apr 27 2019, 11:51 PM

So it seems Pywikibot has no way to get the GlobalID? Then the only option is to presume it is the same as the wikiID and allow users to specify (override) it in the family file. What's wikiID good for anyway? Is it used somewhere, or does it just serve as a unique wiki identifier and nothing else?

WikiID is useful internally, within that one MediaWiki instance, as it's unique within one database server. It's not useful to an external process (such as interwikidata.py) which has no direct connection to the SQL database.

It looks best, for this application, to get GlobalID and work with that - as that's what we ultimately have to submit as data to the repository.

Okay, now I see it more clearly and I'm more confident in recommending what I said above:

a) add an Uncy family file to the Pywikibot library, where you can easily override site.py's dbName with your own dbName (GlobalID), or b) improve generate_family_files.py to allow some dbName corrections.

Because I don't think we need a separate GlobalID and wikiID, as Pywikibot uses it only as a unique identifier and as the GlobalID.

BTW, are wikiID/GlobalID the official names? And why do we call it dbName? What would be the best name for the unique identifier/GlobalID in Pywikibot?

I also recommend linking the dbName docstring to MediaWiki's GlobalID.

Xqt added a subscriber: Xqt. Apr 28 2019, 5:16 PM