
pywikibot's interwikidata.py won't handle projects where one externally-hosted language doesn't have access to the wikibase repo
Open, Needs Triage, Public

Description

Wikibase and the associated pywikibot scripts tend to make a lot of assumptions about the way a wiki family is structured, ranging from the WMF-style naming convention (T172076) to the database name matching the GlobalID (T221550), which is invariably an ISO language code ('xx') plus a group name (itself invariably 'wiki*' or 'wiktionary').
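
As a concrete illustration, the convention being assumed amounts to something like this (illustrative only, not actual pywikibot code):

def assumed_global_id(lang, group):
    """Illustrative sketch of the assumed convention, e.g. ('en', 'wiki')
    maps to 'enwiki' and ('pt', 'wiktionary') to 'ptwiktionary'."""
    return lang.replace('-', '_') + group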

In some cases, even if the Wikibase repository allows something (such as making outbound interlanguage links to an externally-hosted site) the Pywikibot scripts apply their own restrictions.

That's not a huge issue for the WMF wiki families (where every inter-language link points to a wiki in the same cluster, with SQL access to/from the same repository), but it is an issue for third-party projects (such as the Uncyclopedia family) which allow individual languages to host themselves wherever they like.

If the majority of the languages are on one server cluster (for instance, *.uncyclopedia.info) but one language is on an independent server, the core Wikibase repo will find the API for the independent project from the sites table ('kouncyc', 'https://uncyclopedia.kr/wiki/$1', 'https://uncyclopedia.kr/w/$1') and use that API to determine if an article exists on the independent wiki whenever a user manually adds an interlanguage link to the Wikidata repo. (That won't force the externally-hosted wiki to link back to us, but it does centralise the creation of outbound links from the cluster - which is convenient.)
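
In effect, that sites-table row gives the repo everything it needs to make a check along these lines (a simplified sketch of the kind of lookup involved, not the actual Wikibase code):

import requests

# Simplified sketch, not the actual Wikibase code. The endpoint follows
# from the sites-table row ('kouncyc', ..., 'https://uncyclopedia.kr/w/$1').
API = 'https://uncyclopedia.kr/w/api.php'

def article_exists(title):
    """Ask the externally-hosted wiki's API whether a page exists."""
    data = requests.get(API, params={
        'action': 'query',
        'titles': title,
        'format': 'json',
    }).json()
    # MediaWiki marks missing pages with a 'missing' key in the result.
    return not any('missing' in page
                   for page in data['query']['pages'].values())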

Pywikibot, on the other hand, is less forgiving. When interwikidata.py finds interwiki links on a page, it does this:

def get_items(self):
    """Return all items of pages linked through the interwiki."""
    wd_data = set()
    for iw_page in self.iwlangs.values():
        if not iw_page.exists():
            warning('Interwiki {} does not exist, skipping...'
                    .format(iw_page.title(as_link=True)))
            continue
        try:
            wd_data.add(pywikibot.ItemPage.fromPage(iw_page))
        except pywikibot.NoPage:
            output('Interwiki {} does not have an item'
                   .format(iw_page.title(as_link=True)))
    return wd_data

which causes pywikibot.ItemPage.fromPage() - a call into page.py - to interrogate every linked language site's API, asking each for its repository URL.
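
In other words, each linked page gets interrogated along these lines (a simplified sketch of my reading of fromPage(), not the actual page.py code path):

import pywikibot

def repo_for(iw_page):
    """Sketch of the per-site check inside ItemPage.fromPage()."""
    site = iw_page.site
    # Ask the site whether it knows of a Wikibase repository at all; an
    # externally-hosted wiki with no repo access answers 'no' here.
    if not site.has_data_repository:
        raise pywikibot.WikiBaseError('{0} has no data repository'
                                      ''.format(site))
    return site.data_repository()  # the repository's own Site object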

Wikipedia will likely give sane answers; an independently-hosted Uncyclopedia will more likely answer a request for the repository URI like this:

Pywikibot:  Please, please good people.  I am in haste.  Who lives in that castle? 
kouncyc:  No one lives there.  We don't have a lord.  We're an anarcho-syndicalist commune...

descending through:

Pywikibot:  Be quiet!  I order you to be quiet!  I am your king!
kouncyc:  Well, I didn't vote for you.
Pywikibot:  You don't vote for kings.

and, when the "take me to your leader" demands for the identity of a central repository for the externally-hosted project inevitably fail, ending with:

Pywikibot: I am going to throw you so far...
if not page.site.has_data_repository:
    raise pywikibot.WikiBaseError('{0} has no data repository'
                                  ''.format(page.site))

kouncyc:  Help, help, I'm being oppressed! See the violence inherent in the system...
Pywikibot:  Bloody peasant!
kouncyc: Dead giveaway, that...

Such are the hazards of giving Uncyclopedians a Python script to run. The outcome is a comedy of errors. It's just not particularly useful.

There is no code in interwikidata.py to recover from finding the one independent wiki that has "no lord" (that is, no central repository access), so instead of merely adding [[ko:]] as an outbound-only link, the whole thing ends very abruptly. There is no handler for this sort of condition:

except pywikibot.WikiBaseError:
    output('Site {} has no Wikibase repository'
           .format(iw_page.title(as_link=True)))

What would be the best way to change this script so that, when it finds the one externally-hosted independent which doesn't have access to the repository, it merely creates the outbound link (one-way) from the projects which are on the cluster with the repository and continues in some sane manner instead of revolting?
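
(One possible shape for such a handler, purely as a sketch: catch the failure and record the link on the repo side instead. Here link_oneway and its global_id parameter are hypothetical names; global_id would have to be looked up in the repo's sites table.)

import pywikibot

def link_oneway(item, iw_page, global_id):
    """Hypothetical handler: return the linked page's item if its site can
    reach the repo; otherwise just record a one-way sitelink on the existing
    item. global_id is the site's ID in the sites table, e.g. 'kouncyc'."""
    try:
        return pywikibot.ItemPage.fromPage(iw_page)
    except pywikibot.WikiBaseError:
        item.setSitelink({'site': global_id, 'title': iw_page.title()},
                         summary='Add one-way interlanguage link')
        return None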

Event Timeline

Restricted Application added subscribers: pywikibot-bugs-list, Aklapper.

Can you provide some steps to reproduce this issue?

Yeah, steps to reproduce would be handy, as generate_family_file works well, and so does the generated family file (tested on cs:uncy).

This is specific to Pywikibot's interwikidata.py, which requires that your 'bot run on a "home" wiki with access to a Wikidata-style (Wikibase) repository. You can't do this on cs:uncyc, but it should be possible on pt:uncyc (as one example). OK, here goes:

$ wget https://tools.wmflabs.org/pywikibot/core_stable.zip
$ unzip core_stable.zip
$ cd core_stable
$ python pwb.py generate_family_file
   Please insert URL to wiki: https://data.uncyclomedia.org
   Please insert a short name (eg: freeciv): uncyclopedia
   Generating family file from https://data.uncyclomedia.org
   ==================================
   API url: https://data.uncyclomedia.org/api.php
   MediaWiki version: 1.31.1
   ==================================
   Determining other languages...af ang ar ast bar be bg bn bs ca cmn cs cy da de dlm el en en-gb eo es et fa fi fo fr fy ga gl got grc he hr hu hy id ie io is it ja jv ka km ko kw la lb lfn li lo lt lv mg mk mn mo ms mwl nap nds nl nn no oc olb pl pt pt-br ro ru rue sco simple sk sl sr su sv th tl tlh tr uk vi vls xh yi yue zea zh zh-cn zh-hk zh-tw
   There are 95 languages available. Do you want to generate interwiki links? This might take a long time. ([y]es/[N]o/[e]dit) 
   Writing /var/www/hymie/test_bot/pywikibot/families/uncyclopedia_family.py... 

OK, it created the family file. Now try to run the interwikidata bot:

$ python pwb.py interwikidata -create -merge -start
  NOTE: 'user-config.py' was not found!  Please follow the prompts to create it:  You can abort at any time by pressing ctrl-c
   1: commons
   2: i18n
   3: incubator
  ...
  13: uncyclopedia
  ...
  Select family of sites we are working on, just enter the number or name (default: wikipedia): 13
  This is the list of known languages: af, ang, ar, ast, bar, be, bg, bn, bs, ca, cmn, cs, cy, da, de, dlm, el, en, en-gb, eo, es, et, fa, fi, fo, fr, fy, ga, gl, grc, he, hr, hu, hy, id, ie, io, is, it, ja, jv, ka, km, ko, kw, la, lb, lfn, li, lo, lt, lv, mg, mk, mn, mo, ms, mwl, nap, nds, nl, nn, no, oc, olb, pl, pt, pt-br, ro, ru, rue, sco, simple, sk, sl, sr, su, sv, th, tl, tlh, tr, uk, vi, vls, xh, yi, yue, zea, zh, zh-cn, zh-hk, zh-tw
  The language code of the site we're working on (default: en): pt
  Username on pt:uncyclopedia: Hymie le robot
  Do you want to add any other projects? ([y]es, [N]o): 
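
(For reference, those answers should leave a user-config.py containing roughly the following:)

# user-config.py, roughly as generated by the prompts above
family = 'uncyclopedia'
mylang = 'pt'
usernames['uncyclopedia']['pt'] = 'Hymie le robot'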

OK, it created user-config.py ... now try to run the bot (again):

$ python pwb.py interwikidata -create -merge -start
WARNING: Site "uncyclopedia:pt" supports wikibase at "http://data.uncyclomedia.org//index.php", but creation failed: Unknown URL 'http://data.uncyclomedia.org//index.php'..

Uh-oh. Maybe it doesn't like this (from pywikibot/site.py line 2814):

url = data['base'] + data['scriptpath'] + '/index.php'

so I shall change it to:

url = data['base'] + data['scriptpath'] + 'index.php'

to get rid of that //index.php with the double-slash.
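
That one-character change works here because this wiki is installed at the web root. A join along these lines would tolerate both a root install and the usual '/w' layout (a sketch, not a tested patch):

# Tolerate stray slashes on either component, so that both a root install
# (script path '/') and the usual '/w' layout yield a single clean slash:
url = data['base'].rstrip('/') + '/' + data['scriptpath'].strip('/')
url = url.rstrip('/') + '/index.php'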

I try the 'bot again:

$ python pwb.py interwikidata -create -merge -start
  Retrieving 50 pages from uncyclopedia:pt.
  >>> "Groovy Guy" Russell <<<
  No interlanguagelinks on [["Groovy Guy" Russell]]
  >>> "Você foi banido da Desciclopédia" na página de usuário <<<
  No interlanguagelinks on [["Você foi banido da Desciclopédia" na página de usuário]]
  >>> "Weird Al" Yankovic <<<
  WARNING: [getLanguageLinks] 2 or more interwiki links point to site uncyclopedia:da.
  WARNING: [getLanguageLinks] 2 or more interwiki links point to site uncyclopedia:en.
  WARNING: [getLanguageLinks] 2 or more interwiki links point to site uncyclopedia:it.
  WARNING: [getLanguageLinks] 2 or more interwiki links point to site uncyclopedia:sv.
  WARNING: [getLanguageLinks] 2 or more interwiki links point to site uncyclopedia:tr.

and it runs for a while (at one point asking to log me into uncyclopedia:en for some reason), until eventually:

>>> ( ͡° ͜ʖ ͡°) <<<
No interlanguagelinks on [[( ͡° ͜ʖ ͡°)]]

>>> * <<<
13 pages read
0 pages written
Execution time: 110 seconds
Read operation time: 8 seconds
Script terminated by exception:
ERROR: WikiBaseError: uncyclopedia:fr has no data repository
Traceback (most recent call last):
  File "pwb.py", line 250, in <module>
    if not main():
  File "pwb.py", line 243, in main
    run_python_file(filename, [filename] + args, argvu, file_package)
  File "pwb.py", line 95, in run_python_file
    main_mod.__dict__)
  File "./scripts/interwikidata.py", line 246, in <module>
    main()
  File "./scripts/interwikidata.py", line 239, in main
    bot.run()
  File "/var/www/hymie/test_bot/pywikibot/bot.py", line 1508, in run
    self.treat(page)
  File "/var/www/hymie/test_bot/pywikibot/bot.py", line 1735, in treat
    self.treat_page()
  File "./scripts/interwikidata.py", line 96, in treat_page
    item = self.try_to_merge(item)
  File "./scripts/interwikidata.py", line 193, in try_to_merge
    wd_data = self.get_items()
  File "./scripts/interwikidata.py", line 165, in get_items
    wd_data.add(pywikibot.ItemPage.fromPage(iw_page))
  File "/var/www/hymie/test_bot/pywikibot/page.py", line 4390, in fromPage
    ''.format(page.site))
pywikibot.exceptions.WikiBaseError: uncyclopedia:fr has no data repository
CRITICAL: Exiting due to uncaught exception <class 'pywikibot.exceptions.WikiBaseError'>

Indeed, fr:uncyc has no Wikibase data repository (as, relative to pt:uncyc, it's hosted externally). That shouldn't prevent me from making outbound links one-way from pt:uncyc's Wikibase repository to fr:, but it does. That's a bug in the Pywikibot interwikidata.py script.

OK, so what happened?

  • The script retrieved [[pt:*]] and found a bunch of interwikis on that page: [[en:*]] [[`~:*]] [[de:Asterix]] [[fr:Astérix]] [[it:Asterix (fumetto)]] [[ja:*]] [[nl:Asterix]] [[pl:Asterix]]
  • As [[pt:*]] already has a Wikibase item, it tried to follow each of the interwikis on the page to see whether they could be merged into the existing item
  • self.try_to_merge(item) calls self.get_items() to retrieve the Wikibase Q-item number for every one of those other pages. Presumably, if that comes back with more than one Q-item number, that's a conflicting link (as appeared in the "Weird Al" Yankovic example a few lines earlier), so the script will skip those. That seems to be the only reason it's retrieving all those items (see the sketch after this list).
  • get_items() finds no repository at all on fr:uncyc (which is true, because it's an externally-hosted project). It should just treat that as there being no Q-item linked from the French page, but it doesn't do that... it fails to handle the error and exits.
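
Roughly, my reading of how try_to_merge() consumes those items would be (a sketch, not the script's actual code):

def merge_target(item, wd_data):
    """Sketch of the presumed conflict check; not the script's actual code."""
    if not wd_data:
        return None  # no linked page has an item yet
    if len(wd_data) > 1:
        # Interwikis point to two or more different items - a conflict, as
        # in the "Weird Al" Yankovic warnings above - so skip this page.
        return None
    target = wd_data.pop()
    # A merge only makes sense when the linked item differs from our own.
    return target if target != item else None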

So now what? If scripts/interwikidata.py lines 156-169 look like this:

def get_items(self):
    """Return all items of pages linked through the interwiki."""
    wd_data = set()
    for iw_page in self.iwlangs.values():
        if not iw_page.exists():
            warning('Interwiki {} does not exist, skipping...'
                    .format(iw_page.title(as_link=True)))
            continue
        try:
            wd_data.add(pywikibot.ItemPage.fromPage(iw_page))
        except pywikibot.NoPage:
            output('Interwiki {} does not have an item'
                   .format(iw_page.title(as_link=True)))
    return wd_data

then there's a handler for NoPage, but none for an externally-hosted project with no direct access to the repo.

Change that routine to this and the script will run:

def get_items(self):
    """Return all items of pages linked through the interwiki."""
    wd_data = set()
    for iw_page in self.iwlangs.values():
        if not iw_page.exists():
            warning('Interwiki {} does not exist, skipping...'
                    .format(iw_page.title(as_link=True)))
            continue
        try:
            wd_data.add(pywikibot.ItemPage.fromPage(iw_page))
        except pywikibot.NoPage:
            output('Interwiki {} does not have an item'
                   .format(iw_page.title(as_link=True)))
        except pywikibot.WikiBaseError:
            output('Site {} has no Wikibase repository'
                   .format(iw_page.title(as_link=True)))
    return wd_data

as a WikiBaseError (which will occur whenever a linked wiki has no repo access) will then be treated the same way as a page containing no existing Wikibase link: logged and skipped, so the run continues.
