site.interwiki can be broken by a family that needs _get_path_regex
Closed, Resolved (Public)

Description

Generated family files can break site.interwiki.

$ python pwb.py generate_family_file.py http://wiki-commons.genealogy.net/Hauptseite genealogy2
Generating family file from http://wiki-commons.genealogy.net/Hauptseite

==================================
api url: http://wiki-commons.genealogy.net/w/api.php
MediaWiki version: 1.14.1
==================================

Determining other languages...de en nl

There are 4 languages available.
Do you want to generate interwiki links? This might take a long time. ([y]es/[N]o/[e]dit)y
Loading wikis... 
  * de... 'utf8' codec can't decode byte 0xfc in position 26478: invalid start byte
  * en... downloaded
  * nl... downloaded
  * de... in cache
Writing pywikibot/families/genealogy2_family.py... 
pywikibot/families/genealogy2_family.py already exists. Overwrite? (y/n)y
[jayvdb@localhost new]$ cat pywikibot/families/genealogy2_family.py
# -*- coding: utf-8 -*-
"""
This family file was auto-generated by $Id: 2dd21e4aaf7a93cf8749be841552881a80684b52 $
Configuration parameters:
  url = http://wiki-commons.genealogy.net/Hauptseite
  name = genealogy2

Please do not commit this to the Git repository!
"""

from pywikibot import family

class Family(family.Family):
    def __init__(self):
        family.Family.__init__(self)
        self.name = 'genealogy2'
        self.langs = {
            'nl': 'wiki-nl.genealogy.net',
            'de': 'wiki-commons.genealogy.net',
            'en': 'wiki-en.genealogy.net',
        }



    def scriptpath(self, code):
        return {
            'nl': '/w',
            'de': '/w',
            'en': '/w',
        }[code]

    def version(self, code):
        return {
            'nl': u'1.14.1',
            'de': u'1.14.1',
            'en': u'1.14.1',
        }[code]

That family has three different hostnames, and the keys differ from the subdomains (for example, 'de' maps to wiki-commons.genealogy.net). That might be relevant.

When I alter APISite._cache_interwikimap to re-raise the Error it catches, we see:

$ python -m unittest tests.link_tests.TestFullyQualifiedNoLangFamilyImplicitLinkParser.test_fully_qualified_NS1_family
max_retries reduced from 25 to 1 for tests
======================================================================
ERROR: test_fully_qualified_NS1_family (tests.link_tests.TestFullyQualifiedNoLangFamilyImplicitLinkParser)
Test 'wikidata:testwiki:Talk:Q6' on enwp is namespace 1.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "tests/link_tests.py", line 813, in test_fully_qualified_NS1_family
    link.parse()
  File "pywikibot/page.py", line 4189, in parse
    newsite = self._site.interwiki(prefix)
  File "pywikibot/site.py", line 692, in interwiki
    self._cache_interwikimap()
  File "pywikibot/site.py", line 676, in _cache_interwikimap
    site = (pywikibot.Site(url=iw['url']), 'local' in iw)
  File "pywikibot/__init__.py", line 564, in Site
    code = family.from_url(url)
  File "pywikibot/family.py", line 1076, in from_url
    '\$1'.format(self._get_path_regex()), url)
  File "pywikibot/family.py", line 1058, in _get_path_regex
    'family.'.format(self.name))
Error: Pywikibot is unable to generate an automatic path regex for the family genealogy2. It is recommended to overwrite "_get_path_regex" in that family.

----------------------------------------------------------------------
Ran 1 test in 2.645s

FAILED (errors=1)

Event Timeline

jayvdb raised the priority of this task from to Needs Triage.
jayvdb updated the task description.
jayvdb added a project: Pywikibot.
jayvdb added subscribers: Unknown Object (MLST), jayvdb, XZise.

Change 182406 had a related patch set uploaded (by John Vandenberg):
Fix Family._get_path_regex

https://gerrit.wikimedia.org/r/182406

Patch-For-Review

Ehm, what exactly is broken? If the family file is not well defined, interwiki links to that family will cause an exception. But I don't see how it would usually cause an exception in that test.

jayvdb triaged this task as High priority. May 2 2015, 8:10 AM

The biggest problem is that Site() in pywikibot/__init__.py calls code = family.from_url(url) without a try block. Because any generated family caused from_url to fail, Site() would also fail instead of continuing to try from_url with the other families.
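
To illustrate the failure mode, here is a minimal sketch of a lookup loop that tolerates one unusable family. It is not pywikibot's code: FakeFamily, PathRegexError and find_site are hypothetical names made up for this example, and the hostname matching is deliberately simplistic.

```python
class PathRegexError(Exception):
    """Hypothetical stand-in for the error raised by a family without a path regex."""


class FakeFamily:
    """Hypothetical stand-in for a family; not pywikibot's real Family class."""

    def __init__(self, name, hosts=None, broken=False):
        self.name = name
        self.hosts = hosts or {}  # maps code -> hostname
        self.broken = broken

    def from_url(self, url):
        """Return the code whose hostname appears in url, or None."""
        if self.broken:
            raise PathRegexError(
                'unable to generate a path regex for ' + self.name)
        for code, host in self.hosts.items():
            if host in url:
                return code
        return None


def find_site(url, families):
    """Return (family name, code) for the first family that matches url."""
    for fam in families:
        try:
            code = fam.from_url(url)
        except PathRegexError:
            # One unusable family must not stop the search.
            continue
        if code is not None:
            return fam.name, code
    raise ValueError('no family matches ' + url)


families = [
    FakeFamily('genealogy2', broken=True),
    FakeFamily('wikidata', {'test': 'test.wikidata.org'}),
]
print(find_site('https://test.wikidata.org/wiki/$1', families))
```

Without the try block, the first broken family would abort the whole lookup, which matches the traceback in the description.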

Change 182406 merged by jenkins-bot:
Add Family.from_url support for generated families

https://gerrit.wikimedia.org/r/182406

So the main problem referred to above is now solved.

Just noting the output is now different, without "'utf8' codec can't decode byte 0xfc in position 26478: invalid start byte" which is a bug that I believe was fixed.

And now that family file contains two URLs for 'de':

$ cat pywikibot/families/genealogy2_family.py
# -*- coding: utf-8 -*-
"""
This family file was auto-generated by $Id: 4993fd66518a2c61c49b9e1bdf8f4b622459ee34 $
Configuration parameters:
  url = http://wiki-commons.genealogy.net/Hauptseite
  name = genealogy2

Please do not commit this to the Git repository!
"""

from pywikibot import family
from pywikibot.tools import deprecated


class Family(family.Family):
    def __init__(self):
        family.Family.__init__(self)
        self.name = 'genealogy2'
        self.langs = {
            'de': 'wiki-de.genealogy.net',
            'nl': 'wiki-nl.genealogy.net',
            'de': 'wiki-commons.genealogy.net',
            'en': 'wiki-en.genealogy.net',
        }

    def scriptpath(self, code):
        return {
            'de': '/w',
            'nl': '/w',
            'de': '/w',
            'en': '/w',
        }[code]

    @deprecated('APISite.version()')
    def version(self, code):
        return {
            'de': u'1.14.1',
            'nl': u'1.14.1',
            'de': u'1.14.1',
            'en': u'1.14.1',
        }[code]

On both Python 2 & 3, the main 'de' wiki disappears.

>>> pywikibot.Site('de', 'genealogy2').family.langs
{'de': 'wiki-commons.genealogy.net', 'nl': 'wiki-nl.genealogy.net', 'en': 'wiki-en.genealogy.net'}

That may already be a separate bug.
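
The disappearing entry is ordinary Python behaviour: when a dict literal repeats a key, the last value silently wins. The sketch below reproduces that with the hostnames from the generated file. The find_duplicate_codes helper is hypothetical, shown only as one way a generator could notice the clash before writing the file.

```python
from collections import Counter

# A dict literal with a repeated key keeps only the last value for that key.
langs = {
    'de': 'wiki-de.genealogy.net',
    'nl': 'wiki-nl.genealogy.net',
    'de': 'wiki-commons.genealogy.net',
    'en': 'wiki-en.genealogy.net',
}
print(langs['de'])   # wiki-commons.genealogy.net
print(len(langs))    # 3


def find_duplicate_codes(pairs):
    """Return the sorted codes that occur more than once in (code, host) pairs.

    Hypothetical helper, not part of pywikibot.
    """
    counts = Counter(code for code, _host in pairs)
    return sorted(code for code, n in counts.items() if n > 1)


pairs = [
    ('de', 'wiki-de.genealogy.net'),
    ('nl', 'wiki-nl.genealogy.net'),
    ('de', 'wiki-commons.genealogy.net'),
    ('en', 'wiki-en.genealogy.net'),
]
print(find_duplicate_codes(pairs))   # ['de']
```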

jayvdb lowered the priority of this task from High to Low. Jun 17 2015, 9:44 AM
jayvdb removed a project: Patch-For-Review.
jayvdb set Security to None.

Okay, I don't see how that is an issue with _get_path_regex. This is just an oversight by whoever added the second entry to all the dicts. We can do nothing about it without manually reading the file and interpreting it, because that is simply wrong code.

And regarding the encoding error: the HTML file was simply encoded incorrectly (using latin-1 instead of utf-8), and I contacted the wiki admin, who fixed that.
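
For reference, the byte 0xfc from the error message is "ü" in latin-1 and can never start a UTF-8 sequence, which is why the decoder reported "invalid start byte". A small sketch follows; the sample word is made up and is not taken from the wiki page.

```python
# 'für' is a made-up sample; in latin-1 the letter 'ü' is the single byte 0xfc.
raw = 'für'.encode('latin-1')
print(hex(raw[1]))   # 0xfc

try:
    raw.decode('utf-8')
    reason = None
except UnicodeDecodeError as error:
    # 0xfc is not a valid first byte of any UTF-8 sequence.
    reason = error.reason
print(reason)   # invalid start byte

# Decoding with the encoding the bytes were actually written in works.
print(raw.decode('latin-1'))   # für
```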

jayvdb claimed this task.

Great. I am pretty sure the generated family file issue is already raised separately.

Change 220722 had a related patch set uploaded (by John Vandenberg):
Fix Family.from_url support for generated families

https://gerrit.wikimedia.org/r/220722

It wasn't fixed properly by my first patch.

Change 221439 had a related patch set uploaded (by XZise):
[FEAT] APISite.article_path and redesigned Site(url)

https://gerrit.wikimedia.org/r/221439

Change 221448 had a related patch set uploaded (by XZise):
[FEAT] Fully flexible Site(url)

https://gerrit.wikimedia.org/r/221448

Change 220722 abandoned by John Vandenberg:
Fix Family.from_url support for generated families

Reason:
I4cc8bd7b

https://gerrit.wikimedia.org/r/220722

Change 221448 merged by jenkins-bot:
[FEAT] Fully flexible Site(url)

https://gerrit.wikimedia.org/r/221448

jayvdb reassigned this task from jayvdb to XZise.