testLinks RuntimeError: Found multiple matches for URL
Closed, Resolved (Public)

Description

https://travis-ci.org/wikimedia/pywikibot-core/jobs/210337931

___________________________ TestPageObject.testLinks ___________________________

self = <tests.page_tests.TestPageObject testMethod=testLinks>

    def testLinks(self):
        """Test the different types of links from a page."""
        if self.site.family.name == 'wpbeta':
            raise unittest.SkipTest('Test fails on betawiki; T69931')
        mainpage = self.get_mainpage()
        for p in mainpage.linkedPages():
            self.assertIsInstance(p, pywikibot.Page)
>       iw = list(mainpage.interwiki(expand=True))

tests/page_tests.py:495: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pywikibot/page.py:1425: in interwiki
    if link.site != self.site:
pywikibot/page.py:5360: in site
    self.parse()
pywikibot/page.py:5266: in parse
    newsite = self._site.interwiki(prefix)
pywikibot/site.py:949: in interwiki
    return self._interwikimap[prefix].site
pywikibot/site.py:705: in __getitem__
    raise self._iw_sites[prefix].site
pywikibot/site.py:668: in site
    self._site = pywikibot.Site(url=self.url)
pywikibot/__init__.py:854: in Site
    code = family.from_url(url)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = Family("wsbeta"), url = 'https://wikisource.org/wiki/$1'

    def from_url(self, url):
        """
            Return whether this family matches the given url.

            It is first checking if a domain of this family is in the the domain of
            the URL. If that is the case it's checking all codes and verifies that
            a path generated via L{APISite.article_path} and L{Family.path} matches
            the path of the URL together with the hostname for that code.

            It is using L{Family.domains} to first check if a domain applies and
            then iterates over L{Family.codes} to actually determine which code
            applies.

            @param url: the URL which may contain a C{$1}. If it's missing it is
                assumed to be at the end and if it's present nothing is allowed
                after it.
            @type url: str
            @return: The language code of the url. None if that url is not from
                this family.
            @rtype: str or None
            @raises RuntimeError: When there are multiple languages in this family
                which would work with the given URL.
            @raises ValueError: When text is present after $1.
            """
        if self._ignore_from_url is True:
            return None
        else:
            ignored = self._ignore_from_url

        parsed = urlparse.urlparse(url)
        if not re.match('^(https?)?$', parsed.scheme):
            return None
        path = parsed.path
        if parsed.query:
            path += '?' + parsed.query

        # Discard $1 and everything after it
        path, _, suffix = path.partition('$1')
        if suffix:
            raise ValueError('Text after the $1 placeholder is not supported '
                             '(T111513).')

        matched_sites = []
        for domain in self.domains:
            if domain in parsed.netloc:
                break
        else:
            domain = False
        if domain is not False:
            for code in self.codes:
                if code in ignored:
                    continue
                if self._hostname(code)[1] == parsed.netloc:
                    # Use the code and family instead of the url
                    # This is only creating a Site instance if domain matches
                    site = pywikibot.Site(code, self.name)
                    pywikibot.log('Found candidate {0}'.format(site))

                    if path in site._interwiki_urls():
                        matched_sites += [site]

        if len(matched_sites) == 1:
            return matched_sites[0].code
        elif not matched_sites:
            return None
        else:
            raise RuntimeError(
                'Found multiple matches for URL "{0}": {1}'
>               .format(url, ', '.join(str(s) for s in matched_sites)))
E           RuntimeError: Found multiple matches for URL "https://wikisource.org/wiki/$1": wsbeta:tyv, wsbeta:hak, wsbeta:bxr, wsbeta:azb, wsbeta:io, wsbeta:fur, wsbeta:bug, wsbeta:myv, wsbeta:glk, wsbeta:srn, wsbeta:ast, wsbeta:vro, wsbeta:kg, wsbeta:fiu-vro, wsbeta:nrm, wsbeta:gv, wsbeta:pdc, wsbeta:kaa, wsbeta:ba, wsbeta:rn, wsbeta:mn, wsbeta:bm, wsbeta:lij, wsbeta:ti, wsbeta:cu, wsbeta:so, wsbeta:ie, wsbeta:ng, wsbeta:lmo, wsbeta:kl, wsbeta:sn, wsbeta:fy, wsbeta:wuu, wsbeta:bat-smg, wsbeta:lg, wsbeta:lzh, wsbeta:szl, wsbeta:bjn, wsbeta:mzn, wsbeta:sc, wsbeta:ur, wsbeta:xal, wsbeta:lez, wsbeta:km, wsbeta:nov, wsbeta:kbd, wsbeta:dz, wsbeta:om, wsbeta:ckb, wsbeta:sq, wsbeta:hi, wsbeta:pms, wsbeta:ps, wsbeta:uz, wsbeta:vo, wsbeta:pcd, wsbeta:got, wsbeta:kv, wsbeta:krc, wsbeta:ii, wsbeta:lbe, wsbeta:cho, wsbeta:nds, wsbeta:map-bms, wsbeta:arc, wsbeta:cr, wsbeta:ha, wsbeta:aa, wsbeta:wo, wsbeta:pa, wsbeta:ksh, wsbeta:kw, wsbeta:sh, wsbeta:lo, wsbeta:tw, wsbeta:egl, wsbeta:vls, wsbeta:pih, wsbeta:ku, wsbeta:nn, wsbeta:tt, wsbeta:bpy, wsbeta:ch, wsbeta:rup, wsbeta:rue, wsbeta:eu, wsbeta:lrc, wsbeta:zh-classical, wsbeta:roa-tara, wsbeta:hz, wsbeta:ho, wsbeta:mus, wsbeta:sw, wsbeta:gag, wsbeta:am, wsbeta:mg, wsbeta:eml, wsbeta:tpi, wsbeta:ab, wsbeta:ny, wsbeta:tk, wsbeta:ilo, wsbeta:nap, wsbeta:ace, wsbeta:cbk-zam, wsbeta:ve, wsbeta:ss, wsbeta:tg, wsbeta:ady, wsbeta:mai, wsbeta:ay, wsbeta:iu, wsbeta:sd, wsbeta:mwl, wsbeta:qu, wsbeta:scn, wsbeta:gan, wsbeta:zh-yue, wsbeta:olo, wsbeta:frr, wsbeta:ceb, wsbeta:na, wsbeta:ak, wsbeta:ga, wsbeta:tum, wsbeta:pnb, wsbeta:frp, wsbeta:gn, wsbeta:rw, wsbeta:ik, wsbeta:yo, wsbeta:se, wsbeta:ext, wsbeta:gom, wsbeta:hif, wsbeta:cv, wsbeta:pag, wsbeta:ln, wsbeta:ka, wsbeta:sco, wsbeta:bcl, wsbeta:diq, wsbeta:koi, wsbeta:mh, wsbeta:sgs, wsbeta:oc, wsbeta:arz, wsbeta:nv, wsbeta:ug, wsbeta:dv, wsbeta:rm, wsbeta:csb, wsbeta:mdf, wsbeta:chy, wsbeta:av, wsbeta:nds-nl, wsbeta:ltg, wsbeta:ms, wsbeta:ne, wsbeta:sm, wsbeta:cdo, wsbeta:to, 
wsbeta:min, wsbeta:jv, wsbeta:mt, wsbeta:ff, wsbeta:jbo, wsbeta:zea, wsbeta:ts, wsbeta:wa, wsbeta:tl, wsbeta:udm, wsbeta:kk, wsbeta:mhr, wsbeta:tn, wsbeta:nso, wsbeta:lad, wsbeta:sg, wsbeta:mrj, wsbeta:af, wsbeta:haw, wsbeta:si, wsbeta:mi, wsbeta:xmf, wsbeta:gsw, wsbeta:pfl, wsbeta:rmy, wsbeta:jam, wsbeta:ky, wsbeta:ee, wsbeta:su, wsbeta:pnt, wsbeta:pi, wsbeta:nah, wsbeta:an, wsbeta:bar, wsbeta:new, wsbeta:mo, wsbeta:kj, wsbeta:ig, wsbeta:st, wsbeta:chr, wsbeta:my, wsbeta:bo, wsbeta:war, wsbeta:fj, wsbeta:gd, wsbeta:lv, wsbeta:hsb, wsbeta:pam, wsbeta:os, wsbeta:tcy, wsbeta:roa-rup, wsbeta:als, wsbeta:xh, wsbeta:tet, wsbeta:yue, wsbeta:zu, wsbeta:simple, wsbeta:bh, wsbeta:ty, wsbeta:ia, wsbeta:crh, wsbeta:ce, wsbeta:ki, wsbeta:kab, wsbeta:lb, wsbeta:co, wsbeta:bi, wsbeta:dsb, wsbeta:vep, wsbeta:pap, wsbeta:stq, wsbeta:kr, wsbeta:ks, wsbeta:za

pywikibot/family.py:1228: RuntimeError

Event Timeline

The problem can be reduced to the following minimal case.

(Make sure wsbeta_family.py has already been generated with the python -m generate_family_file 'http://en.wikisource.beta.wmflabs.org/' 'wsbeta' 'y' command.)

>>> site = Site('en', 'wikipedia')
>>> site.interwiki('oldwikisource') # takes some time
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "...ibot-core\pywikibot\site.py", line 944, in interwiki
    return self._interwikimap[prefix].site
  File "...ibot-core\pywikibot\site.py", line 699, in __getitem__
    if isinstance(self._iw_sites[prefix].site, BaseSite):
  File "...ibot-core\pywikibot\site.py", line 667, in site
    self._site = pywikibot.Site(url=self.url)
  File "...ibot-core\pywikibot\__init__.py", line 854, in Site
    code = family.from_url(url)
  File "...ibot-core\pywikibot\family.py", line 1228, in from_url
    .format(url, ', '.join(str(s) for s in matched_sites)))
RuntimeError: Found multiple matches for URL "https://wikisource.org/wiki/$1": wsbeta:vep, wsbeta:kbd, wsbeta:mt, wsbeta:srn, wsbeta:pap, wsbeta:sg, wsbeta:pcd, wsbeta:su, wsbeta:war, wsbeta:bug, wsbeta:ksh, wsbeta:ki, wsbeta:olo, wsbeta:kw, wsbeta:stq, wsbeta:nds, wsbeta:ky, wsbeta:gsw, wsbeta:pam, wsbeta:bat-smg, wsbeta:pa, wsbeta:tg, wsbeta:dv, wsbeta:egl, wsbeta:hak, wsbeta:ng, wsbeta:bpy, wsbeta:sn, wsbeta:rmy, wsbeta:lmo, wsbeta:ln, wsbeta:lb, wsbeta:lbe, wsbeta:bm, wsbeta:mo, wsbeta:arz, wsbeta:pms, wsbeta:arc, wsbeta:ext, wsbeta:lad, wsbeta:nap, wsbeta:ab, wsbeta:nv, wsbeta:wuu, wsbeta:an, wsbeta:nah, wsbeta:ie, wsbeta:ny, wsbeta:ve, wsbeta:frp, wsbeta:ka, wsbeta:sq, wsbeta:rm, wsbeta:ts, wsbeta:lij, wsbeta:yo, wsbeta:ik, wsbeta:min, wsbeta:oc, wsbeta:si, wsbeta:bjn, wsbeta:cr, wsbeta:roa-rup, wsbeta:ch, wsbeta:ps, wsbeta:ku, wsbeta:mi, wsbeta:hif, wsbeta:pdc, wsbeta:cu, wsbeta:tk, wsbeta:chy, wsbeta:nrm, wsbeta:yue, wsbeta:ne, wsbeta:wo, wsbeta:sc, wsbeta:ee, wsbeta:nov, wsbeta:dsb, wsbeta:ba, wsbeta:sgs, wsbeta:glk, wsbeta:bi, wsbeta:eu, wsbeta:nds-nl, wsbeta:ce, wsbeta:om, wsbeta:av, wsbeta:mai, wsbeta:szl, wsbeta:chr, wsbeta:got, wsbeta:gag, wsbeta:jam, wsbeta:mg, wsbeta:xmf, wsbeta:hi, wsbeta:fur, wsbeta:zu, wsbeta:ss, wsbeta:cbk-zam, wsbeta:ha, wsbeta:rn, wsbeta:tt, wsbeta:jbo, wsbeta:gan, wsbeta:pnt, wsbeta:zea, wsbeta:eml, wsbeta:af, wsbeta:ga, wsbeta:bar, wsbeta:rup, wsbeta:vro, wsbeta:io, wsbeta:tn, wsbeta:tcy, wsbeta:als, wsbeta:am, wsbeta:ak, wsbeta:ast, wsbeta:simple, wsbeta:gv, wsbeta:ilo, wsbeta:bh, wsbeta:pih, wsbeta:aa, wsbeta:myv, wsbeta:mus, wsbeta:wa, wsbeta:lrc, wsbeta:xal, wsbeta:ay, wsbeta:cv, wsbeta:lg, wsbeta:ug, wsbeta:ady, wsbeta:vls, wsbeta:pfl, wsbeta:tl, wsbeta:fj, wsbeta:se, wsbeta:pag, wsbeta:so, wsbeta:ur, wsbeta:lv, wsbeta:ace, wsbeta:ff, wsbeta:mrj, wsbeta:ii, wsbeta:diq, wsbeta:ltg, wsbeta:sw, wsbeta:my, wsbeta:gn, wsbeta:ia, wsbeta:xh, wsbeta:kg, wsbeta:lez, wsbeta:tet, wsbeta:to, wsbeta:sh, wsbeta:new, wsbeta:pi, 
wsbeta:rw, wsbeta:scn, wsbeta:gom, wsbeta:tw, wsbeta:roa-tara, wsbeta:cho, wsbeta:fiu-vro, wsbeta:km, wsbeta:krc, wsbeta:kv, wsbeta:mh, wsbeta:ty, wsbeta:uz, wsbeta:os, wsbeta:csb, wsbeta:kab, wsbeta:sco, wsbeta:zh-classical, wsbeta:koi, wsbeta:iu, wsbeta:ho, wsbeta:ti, wsbeta:cdo, wsbeta:sd, wsbeta:mzn, wsbeta:kr, wsbeta:azb, wsbeta:kl, wsbeta:st, wsbeta:vo, wsbeta:haw, wsbeta:hsb, wsbeta:map-bms, wsbeta:tyv, wsbeta:dz, wsbeta:mn, wsbeta:kk, wsbeta:sm, wsbeta:ms, wsbeta:fy, wsbeta:mwl, wsbeta:co, wsbeta:crh, wsbeta:udm, wsbeta:bo, wsbeta:tpi, wsbeta:lo, wsbeta:nso, wsbeta:qu, wsbeta:zh-yue, wsbeta:ig, wsbeta:nn, wsbeta:jv, wsbeta:kaa, wsbeta:lzh, wsbeta:mhr, wsbeta:mdf, wsbeta:kj, wsbeta:ks, wsbeta:frr, wsbeta:hz, wsbeta:bxr, wsbeta:pnb, wsbeta:rue, wsbeta:ceb, wsbeta:tum, wsbeta:na, wsbeta:za, wsbeta:ckb, wsbeta:gd, wsbeta:bcl

Some observations:

oldwikisource is mapped to https://wikisource.org/wiki/$1 via siteinfo:

>>> next(filter(lambda i: i['prefix'] == 'oldwikisource', site.siteinfo['interwikimap']))['url']
'https://wikisource.org/wiki/$1'

>>> next(filter(lambda i: i['prefix'] == 'wikisource', site.siteinfo['interwikimap']))['url']
'https://en.wikisource.org/wiki/$1'
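
As a small illustration of why these two prefixes behave differently, the sketch below extracts the hostname from each URL. It runs on a hand-copied list holding only the two entries shown above, not on a live siteinfo query, and hostname_for_prefix is a hypothetical helper, not a Pywikibot function.

```python
from urllib.parse import urlparse

# Two entries copied from the siteinfo output quoted above.
interwikimap = [
    {'prefix': 'oldwikisource', 'url': 'https://wikisource.org/wiki/$1'},
    {'prefix': 'wikisource', 'url': 'https://en.wikisource.org/wiki/$1'},
]


def hostname_for_prefix(interwikimap, prefix):
    """Return the host part of the URL registered for an interwiki prefix."""
    for entry in interwikimap:
        if entry['prefix'] == prefix:
            return urlparse(entry['url']).netloc
    return None  # unknown prefix


old_host = hostname_for_prefix(interwikimap, 'oldwikisource')
en_host = hostname_for_prefix(interwikimap, 'wikisource')
print(old_host, en_host)
```

Only the oldwikisource URL resolves to the bare wikisource.org host, which is the host that many codes in the generated family file share.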

At least part of the problem lies in the generated family file:

self.langs = {
    'en': 'en.wikisource.beta.wmflabs.org',
    'aa': 'wikisource.org',
    'ab': 'wikisource.org',
    'ace': 'wikisource.org',
    'ady': 'wikisource.org',
    'af': 'wikisource.org',
    'ak': 'wikisource.org',
    'als': 'wikisource.org',
    'am': 'wikisource.org',
    'an': 'wikisource.org',
    'ang': 'ang.wikisource.org',
    'ar': 'ar.wikisource.org',
    'arc': 'wikisource.org',

Many language codes map to the same value, 'wikisource.org', in this dictionary.

When no wiki exists for a language code, for example https://aa.wikisource.org/wiki/, the request is redirected to https://wikisource.org/wiki/Main_Page. That redirect is the source of the duplicate values in self.langs.
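
A quick way to see which hostnames are shared is to group the codes by value. The sketch below does this on a six-entry excerpt of the dictionary quoted above; the helper names are made up for illustration and are not part of Pywikibot.

```python
from collections import defaultdict

# Excerpt of the generated mapping quoted above (not the full file).
langs = {
    'en': 'en.wikisource.beta.wmflabs.org',
    'aa': 'wikisource.org',
    'ab': 'wikisource.org',
    'ang': 'ang.wikisource.org',
    'ar': 'ar.wikisource.org',
    'arc': 'wikisource.org',
}


def codes_by_hostname(langs):
    """Group language codes by the hostname they map to."""
    groups = defaultdict(list)
    for code, hostname in langs.items():
        groups[hostname].append(code)
    return dict(groups)


def shared_hostnames(langs):
    """Return only the hostnames that more than one code maps to."""
    return {host: sorted(codes)
            for host, codes in codes_by_hostname(langs).items()
            if len(codes) > 1}


duplicates = shared_hostnames(langs)
print(duplicates)  # {'wikisource.org': ['aa', 'ab', 'arc']}
```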

Here is what's happening:

  1. site.interwiki('oldwikisource') is called.
  2. After a few internal calls, site._iw_sites is searched for an entry that matches the requested prefix.
  3. _iw_sites first looks into siteinfo['interwikimap'] to find the URL for the requested prefix. For oldwikisource that URL is https://wikisource.org/wiki/$1.
  4. It then tries to create a site from that URL by calling pywikibot.Site(url=self.url).
  5. Site calls Family.from_url(url) to find the code for the given URL.
  6. from_url uses the codes in Family.langs (provided by the generated family file) to find a matching code.
  7. As stated earlier, the generated family file maps many codes to the same hostname, so multiple codes match.
  8. from_url does not tolerate multiple matching codes and raises the RuntimeError.
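
Steps 6 to 8 can be imitated without Pywikibot. The sketch below is a deliberately simplified stand-in for from_url, assuming a plain code-to-hostname dictionary: it only compares hostnames and omits the path check against the site's interwiki URLs, and it lists bare codes rather than site objects in the error message. match_codes and code_from_url are hypothetical names.

```python
from urllib.parse import urlparse

# Excerpt of the generated mapping quoted earlier in this task.
langs = {
    'en': 'en.wikisource.beta.wmflabs.org',
    'aa': 'wikisource.org',
    'ab': 'wikisource.org',
    'ang': 'ang.wikisource.org',
    'ar': 'ar.wikisource.org',
    'arc': 'wikisource.org',
}


def match_codes(langs, url):
    """Return every code whose hostname equals the host of the URL."""
    netloc = urlparse(url).netloc
    return sorted(code for code, host in langs.items() if host == netloc)


def code_from_url(langs, url):
    """Return the single matching code, None if there is none.

    Raises RuntimeError when several codes share the URL's hostname,
    mirroring the three-way branch at the end of from_url.
    """
    matches = match_codes(langs, url)
    if len(matches) == 1:
        return matches[0]
    if not matches:
        return None
    raise RuntimeError('Found multiple matches for URL "{0}": {1}'
                       .format(url, ', '.join(matches)))


print(code_from_url(langs, 'https://ar.wikisource.org/wiki/$1'))
try:
    code_from_url(langs, 'https://wikisource.org/wiki/$1')
except RuntimeError as error:
    print(error)
```

With the excerpt above, the ar URL yields a single code, while the bare wikisource.org URL matches aa, ab and arc and therefore raises.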

Here are some possible solutions:

  1. When generating the family file, somehow eliminate duplicate values in Family.langs and hope that it won't cause other issues.
  2. Don't raise RuntimeError inside from_url, instead just warn the user and use the first matching code. Sounds a little unsafe.
  3. Don't run this test on wsbeta. Wait for T120427 and other related issues on beta cluster to be resolved. The problem is that if someone tries to generate a family file for wsbeta, they may face the same issues that we are having here (even on English Wikipedia). They may need to delete their generated family file.

I'm going with the third option here.
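
For illustration only, a skip guard along the lines of the existing wpbeta check could look like the sketch below. This is not the content of the actual patch: FakeFamily, FakeSite, SKIPPED_FAMILIES and the wsbeta reason string are all made up so that the example runs without Pywikibot.

```python
import unittest


class FakeFamily:
    """Hypothetical stand-in for a Pywikibot family object."""

    def __init__(self, name):
        self.name = name


class FakeSite:
    """Hypothetical stand-in for a Pywikibot site object."""

    def __init__(self, family_name):
        self.family = FakeFamily(family_name)


# Families on which the test is skipped, with the reason to report.
# The wpbeta reason is quoted from the test shown in the description;
# the wsbeta reason is an assumed wording.
SKIPPED_FAMILIES = {
    'wpbeta': 'Test fails on betawiki; T69931',
    'wsbeta': 'Generated wsbeta family file maps many codes to one host',
}


def skip_reason(site):
    """Return the skip reason for the site's family, or None to run."""
    return SKIPPED_FAMILIES.get(site.family.name)


class TestLinksSketch(unittest.TestCase):
    """Minimal test case showing where the guard would sit."""

    site = FakeSite('wsbeta')

    def test_links(self):
        reason = skip_reason(self.site)
        if reason:
            raise unittest.SkipTest(reason)
        # The real interwiki assertions would follow here.
```

Keeping the reasons in one mapping makes it easy to drop an entry again once the underlying Beta Cluster issue is resolved.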

Change 342449 had a related patch set uploaded (by Dalba):
[pywikibot/core] page_tests.py: Skip testLinks on wsbeta

https://gerrit.wikimedia.org/r/342449

Change 342449 merged by jenkins-bot:
[pywikibot/core] page_tests.py: Skip testLinks on wsbeta

https://gerrit.wikimedia.org/r/342449

Change 582016 had a related patch set uploaded (by Dvorapa; owner: Dvorapa):
[pywikibot/core@master] [tests] Re-enable some upstream-fixed Beta Cluster tests

https://gerrit.wikimedia.org/r/582016

The Beta Cluster interwiki map has been fixed, so it no longer produces a broken family file.

Change 582016 merged by jenkins-bot:
[pywikibot/core@master] [tests] Re-enable some upstream-fixed Beta Cluster tests

https://gerrit.wikimedia.org/r/582016