
Encoding error on some sites of the IWM
Closed, Resolved · Public

Description

The HTML data of the following wiki site on the InterWiki Map fails to decode:
http://wiki.genealogy.net/index.php/$1

The following non-wiki site also fails:
https://lists.wikimedia.org/mailman/listinfo/$1

The following is the traceback produced by detect_site_type:

TRACEBACK:

>>> pywikibot.detect_site_type("http://wiki.genealogy.net/index.php/$1")
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "C:\Users\Acer\Documents\GitHub\core\pywikibot\__init__.py", line 122, in detect_site_type
    data = request.content
  File "C:\Users\Acer\Documents\GitHub\core\pywikibot\comms\threadedhttp.py", line 496, in content
    return self.decode(self.encoding)
  File "C:\Users\Acer\Documents\GitHub\core\pywikibot\comms\threadedhttp.py", line 486, in encoding
    raise self._encoding
UnicodeDecodeError: 'utf8' codec can't decode byte 0xfc in position 10989: invalid start byte

SAMPLE CODE:

try:  # Python 3
    from html.parser import HTMLParser
    from urllib.parse import urlparse
except ImportError:  # Python 2
    from HTMLParser import HTMLParser
    from urlparse import urlparse

import pywikibot
import pywikibot.comms.http


def detect_site_type(url):
    """Fetch url and report the wiki engine named in its generator meta tag."""
    up = urlparse(url)
    if up.scheme in ('http', 'https', ''):
        if up.scheme == '':
            url = 'http:' + url
        try:
            request = pywikibot.comms.http.fetch(url)
        except Exception as e:
            return 'Detection Failed : ' + str(e)
        data = request.content
    elif up.scheme == 'ftp':
        return 'Not a wikiengine site - ftp'
    elif up.scheme == 'irc':
        return 'Not a wikiengine site - irc'
    else:
        return 'No scheme satisfied'

    wp = WikiHTMLPageParser()
    wp.feed(data)
    if wp.generator:
        if "MediaWiki" not in wp.generator:
            return 'Not a MediaWiki site.'
        version = wp.generator
        return version
    else:
        return 'generator is empty'


class WikiHTMLPageParser(HTMLParser):

    """Wiki HTML page parser."""

    def __init__(self, *args, **kwargs):
        HTMLParser.__init__(self, *args, **kwargs)
        self.generator = None
        self.edituri = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            if attrs.get("name") == "generator":
                self.generator = attrs["content"]
        if tag == "link":
            if attrs.get("rel") == "EditURI":
                self.edituri = attrs["href"]

Event Timeline

Omegat claimed this task.
Omegat raised the priority of this task to Needs Triage.
Omegat updated the task description. (Show Details)
Omegat subscribed.
valhallasw subscribed.

The site states it's UTF-8 in the HTTP headers:

Content-Type: text/html; charset=utf-8

and even repeats this in the HTML:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

yet the body contains both "Verein f\xfcr Computerge" (= "Verein für Computerge" in latin-1) /and/ "\xc3\x9cber GenWiki" (= "Über GenWiki" in UTF-8).

You can work around this by manually decoding and replacing characters on errors:

data = request.raw.decode(request.encoding, 'replace')
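For illustration, this is roughly what the 'replace' error handler does with the two byte sequences quoted above (a minimal sketch; the byte string is shortened from the actual page):

# The latin-1 byte 0xfc is not a valid UTF-8 start byte, so 'replace'
# turns it into U+FFFD; the valid UTF-8 sequence \xc3\x9c survives as 'Ü'.
mixed = b'Verein f\xfcr Computerge ... \xc3\x9cber GenWiki'
print(mixed.decode('utf-8', 'replace'))
# Verein f\ufffdr Computerge ... Über GenWiki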

I think raising an Exception when the server provides inconsistent information is sensible, so closing as 'Invalid'.

Sorry, I misunderstood the bug. This is not something to fix in pywikibot, but something to fix for the interwikimap project.

I don't think this is a valid complaint about pywikibot core, because the page has simply encoded part of the content (at least one character) using the latin-1 encoding. At least the first 10989 bytes are encoded as UTF-8, which is what is reported in the HTTP and HTML headers. The best solution would be to notify the site to fix their output so that it contains only UTF-8-encoded content.

>>> import pywikibot.comms.http
>>> req = pywikibot.comms.http.fetch('http://wiki.genealogy.net/index.php/$1')
WARNING: Http response status 404
>>> req.content
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/xzise/Programms/core/pywikibot/comms/threadedhttp.py", line 487, in content
    return self.decode(self.encoding)
  File "/home/xzise/Programms/core/pywikibot/comms/threadedhttp.py", line 477, in encoding
    raise self._encoding
  File "/home/xzise/Programms/core/pywikibot/comms/threadedhttp.py", line 459, in encoding
    self.raw.decode(self.header_encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 10989: invalid start byte
>>> req.raw[:10989].decode('utf-8')
'<!DOCTYPE html
…
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
…
<img src=\'http://www.genealogy.net/startseite/compgen.png\' alt=\'Verein f'

On pywikibot's side it seems the most sensible solution is to change HttpRequest's decode method. Like Python's decode, it could accept an errors argument to handle such errors.
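A rough sketch of what that could look like (hypothetical; the real method lives in pywikibot.comms.threadedhttp.HttpRequest and may differ):

# inside pywikibot.comms.threadedhttp.HttpRequest
def decode(self, encoding, errors='strict'):
    """Return the decoded response body.

    The errors argument is passed straight through to bytes.decode,
    so callers can opt into 'replace' or 'ignore' while 'strict'
    remains the default.
    """
    return self.raw.decode(encoding, errors)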

Okay, I've contacted the wiki's administrator, so maybe that gets fixed, although this doesn't solve any problems you might encounter with other sites that also use different encodings throughout the response.

Thank you for contacting them, XZise. For now I have implemented what XZise and valhallasw suggested:

data = request.raw.decode(request.header_encoding, errors='replace')

The two erring sites work fine with this change, and it seems this should keep any such problem at bay too. So should we implement what XZise suggested (an error handling scheme for HttpRequest's decode method)?

I'd be OK with adapting request.decode to take an additional errors parameter, but I don't think turning it on by default is a good idea; encoding errors can easily make a mess of the site contents, and a bot should be conservative in this, by erroring out rather than trying to guess what is right.

Yes, it shouldn't use error replacement by default. But otherwise I'd say implementing it there would mimic how Python does it.
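From the caller's side that would look roughly like this (hypothetical usage, assuming the parameter lands as discussed):

request.decode(request.encoding)                    # strict default, still raises UnicodeDecodeError
request.decode(request.encoding, errors='replace')  # caller explicitly opts into replacement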

gerritbot subscribed.

Change 189975 had a related patch set uploaded (by Maverick):
Enabling the error handling in decode method.

https://gerrit.wikimedia.org/r/189975

Patch-For-Review

Change 189975 merged by jenkins-bot:
Enabling the error handling in decode method.

https://gerrit.wikimedia.org/r/189975