UnicodeDecodeError: 'utf8' codec can't decode byte 0xfb in position 12: invalid start byte
Open, LowPublic

Description

Probably a mix-up between unicode strings and UTF-8 byte strings.

pwb.py interwiki -wiktionary -lang:pt -auto -async -cleanup -pt:1 -cat:!Entrada_(Francês) -debug

No changes needed on page [[el:août]]
Updating links on page [[fi:août]].
Changes to be made: Bot: Adding [[an:août]], [[ie:août]]
@@ -18,0 +19 @@
+ [[an:août]]

@@ -38,0 +40 @@
+ [[ie:août]]

@@ -66 +68 @@
- [[zh:août]]
+ [[zh:août]]

NOTE: Updating live wiki...
Updating links on page [[pt:août]].
Changes to be made: Bot: Adding [[an:août]], [[ie:août]]
Dump pt (wiktionary) appended.
Traceback (most recent call last):
  File "C:\Work\pywikipedia\pwb.py", line 239, in <module>
    if not main():
  File "C:\Work\pywikipedia\pwb.py", line 233, in main
    run_python_file(filename, [filename] + args, argvu, file_package)
  File "C:\Work\pywikipedia\pwb.py", line 111, in run_python_file
**    Page [[fi:août]] saved**
main_mod.__dict__)
  File ".\scripts\interwiki.py", line 2641, in <module>
    main()
  File ".\scripts\interwiki.py", line 2616, in main
    bot.run()
  File ".\scripts\interwiki.py", line 2360, in run
    self.queryStep()
  File ".\scripts\interwiki.py", line 2338, in queryStep
    subj.finish()
  File ".\scripts\interwiki.py", line 1785, in finish
    if self.replaceLinks(page, new):
  File ".\scripts\interwiki.py", line 1945, in replaceLinks
    if not botMayEdit(page):
  File ".\scripts\interwiki.py", line 2432, in botMayEdit
    templates = page.templatesWithParams()
  File "C:\Work\pywikipedia\pywikibot\tools\__init__.py", line 1259, in wrapper
    return obj(*__args, **__kw)
  File "C:\Work\pywikipedia\pywikibot\page.py", line 2034, in templatesWithParams
    defaultNamespace=10)
  File "C:\Work\pywikipedia\pywikibot\page.py", line 4681, in __init__
    self._text = url2unicode(self._text, encodings=encodings)
  File "C:\Work\pywikipedia\pywikibot\tools\__init__.py", line 1259, in wrapper
    return obj(*__args, **__kw)
  File "C:\Work\pywikipedia\pywikibot\page.py", line 5282, in url2unicode
    raise firstException
UnicodeDecodeError: 'utf8' codec can't decode byte 0xfb in position 12: invalid start byte
<type 'exceptions.UnicodeDecodeError'>
CRITICAL: Closing network session.

In log file with -debug:

2015-09-01 22:32:47             api.py, 1961 in             submit: DEBUG    API response received from wiktionary:fi:
{"edit":{"result":"Success","pageid":7993,"title":"ao\u00fbt","contentmodel":"wikitext","oldrevid":2562419,"newrevid":2652380,"newtimestamp":"2015-09-01T21:32:46Z"}}
2015-09-01 22:32:47            site.py, 4555 in           editpage: DEBUG    editpage response: {u'edit': {u'pageid': 7993, u'title': u'ao\xfbt', u'newtimestamp': u'2015-09-01T21:32:46Z', u'contentmodel': u'wikitext', u'result': u'Success', u'oldrevid': 2562419, u'newrevid': 2652380}}
2015-09-01 22:32:47            page.py, 1108 in              _save: INFO     Page [[fi:août]] saved
2015-09-01 22:32:47        __init__.py,  678 in             stopme: DEBUG    stopme() called
2015-09-01 22:32:47        __init__.py,  711 in             stopme: VERBOSE  Dropped throttle(s).
2015-09-01 22:32:47            http.py,   87 in             _flush: CRITICAL Closing network session.
2015-09-01 22:32:47            http.py,   91 in             _flush: VERBOSE  Network session closed.

0xfb seems to be the 'û' character in 'août'. This also seems to be a recent issue, as I didn't experience any similar problems with accented Latin characters in page titles before.
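
To illustrate the observation above, here is a minimal sketch (written for Python 3, whereas the report itself is from Python 2.7) showing that 0xFB is the code point and Latin-1 byte for 'û', while UTF-8 encodes the same character as two bytes. The helper name `decodes_as_utf8` is made up for this example and is not part of pywikibot.

```python
# U+00FB is 'û'. In Latin-1 it is the single byte 0xFB,
# but in UTF-8 it takes two bytes (0xC3 0xBB).
char = "\u00fb"
latin1_bytes = char.encode("latin-1")
utf8_bytes = char.encode("utf-8")


def decodes_as_utf8(data):
    """Return True if the byte string is valid UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# A lone 0xFB byte inside otherwise ASCII text is not valid UTF-8,
# while the properly encoded title is.
print(decodes_as_utf8(b"ao\xfbt"))
print(decodes_as_utf8("ao\u00fbt".encode("utf-8")))
```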

pwb.py version

Pywikibot: [https] r-pywikibot-core.git (e00b43f, g6330, 2015/09/01, 14:40:16, ok)
Release version: 2.0b3
requests version: 2.7.0
  cacerts: C:\Program Files\Python27\lib\site-packages\requests\cacert.pem
    certificate test: ok
Python: 2.7.10 (default, May 23 2015, 09:44:00) [MSC v.1500 64 bit (AMD64)]
PYWIKIBOT2_DIR: Not set
PYWIKIBOT2_DIR_PWB: C:\Work\pywikipedia
PYWIKIBOT2_NO_USER_CONFIG: Not set

Event Timeline

Malafaya created this task. Sep 1 2015, 10:01 PM
Malafaya raised the priority of this task to High.

I can reproduce it every time (at least for now, while the page still needs to be updated) with:

pwb.py interwiki -lang:pt -family:wiktionary -wiktionary -auto -cleanup -debug août

XZise added a subscriber: XZise. Sep 1 2015, 10:32 PM

Are you able to determine on which site the failing page is? I'm having trouble understanding how this issue can happen: if the title is bytes, it shouldn't be just 0xFB, because that byte on its own is not a valid sequence in UTF-8, which is the expected encoding. And if it's unicode, the decode shouldn't even get that far, because Python 2 first tries to encode it using ASCII:

>>> u'û'.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/xzise/.pyenv/versions/2.7.8/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfb' in position 0: ordinal not in range(128)

If you are able to, you could help by adding print(type(title)); print(repr(title)) above the for-loop in url2unicode (which is at line 5272 in my copy).

<type 'unicode'>
u'urlencode:ao%FBt'

page being updated is [[pt:août]]

XZise added a comment. Sep 1 2015, 11:01 PM

Okay, thank you, that helps a lot. Here are all the steps to understand what is happening: the page août on the Portuguese Wiktionary uses {{urlencode:ao%FBt}}. Our code searches the text for templates to make sure the page is not protected against bot edits, and it picks up {{urlencode:ao%FBt}} as a template. It then tries to create a Link instance for it, and in doing so tries to decode the percent encoding. That is why urlencode:ao%FBt is the text you got when you printed the title.

The rest is straightforward: it encodes that text using the site's encoding, resolves the percent encoding, and then decodes the resulting bytes with the site's encoding again. So u'urlencode:ao%FBt' first becomes b'urlencode:ao%FBt' using UTF-8 (all characters are ASCII), then the percent encoding is decoded to b'urlencode:ao\xFBt', and finally it tries to decode that using UTF-8, which fails because 0xFB on its own is not a valid UTF-8 sequence.
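
The three steps above can be reproduced with a short sketch. This is not pywikibot's actual url2unicode code: it is a simplified stand-in written for Python 3 (the report is from Python 2.7), and the function name `naive_url2unicode` is invented for this example. It only uses the standard library's `urllib.parse.unquote_to_bytes`.

```python
from urllib.parse import unquote_to_bytes


def naive_url2unicode(title, encoding="utf-8"):
    """Simplified stand-in for the encode / unquote / decode round trip."""
    # Step 1: encode the text with the site's encoding (pure ASCII here).
    raw = title.encode(encoding)
    # Step 2: resolve percent escapes, so b'%FB' becomes the byte 0xFB.
    unquoted = unquote_to_bytes(raw)
    # Step 3: decode with the site's encoding again; this is where a
    # lone 0xFB byte raises UnicodeDecodeError under UTF-8.
    return unquoted.decode(encoding)


try:
    naive_url2unicode("urlencode:ao%FBt")
except UnicodeDecodeError as exc:
    # exc.start is the offset of the offending byte in the unquoted bytes.
    print("decode failed at position", exc.start)
```

A title whose escapes are valid UTF-8, such as ao%C3%BBt, passes through the same function without an error, which is consistent with the problem only showing up for this particular page text.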

Now, to fix this particular case (as you've already done), it's possible to just fix the usage in the page, as it doesn't make sense to percent-encode an already percent-encoded string.

But while the fault lies with whoever wrote that text and not really with pywikibot, I think we need to mitigate that. I don't think it's possible to get percent-encoded text from the API, as it will use \u00FB instead, so I think we could skip that step; it is probably only there because previous versions screen-scraped an HTML page, which might use %-encoded text.

Alternatively, we should provide more sensible output that includes the original values, which would make it more obvious what went wrong if some page has the same problem in the future.
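
As a rough sketch of what such a more informative error could look like (written for Python 3; the function name `unquote_title` and the message wording are assumptions for illustration, not pywikibot's API or an actual patch), the decoding step could wrap the first failure in an exception that reports the original title and the encodings that were tried:

```python
from urllib.parse import unquote_to_bytes


def unquote_title(title, encodings=("utf-8",)):
    """Resolve percent escapes, reporting the original title on failure.

    Hypothetical helper: tries each encoding in turn and, if none works,
    raises ValueError naming the title that could not be decoded.
    """
    first_error = None
    for encoding in encodings:
        try:
            return unquote_to_bytes(title.encode(encoding)).decode(encoding)
        except UnicodeError as exc:
            # Remember only the first failure, as the traceback in the
            # report suggests the original code does (firstException).
            if first_error is None:
                first_error = exc
    raise ValueError(
        "Could not decode percent-encoded title %r using encodings %r: %s"
        % (title, tuple(encodings), first_error)
    ) from first_error


try:
    unquote_title("urlencode:ao%FBt")
except ValueError as exc:
    print(exc)
```

With a message like this, the offending text (here the urlencode: call picked up as a template) would be visible directly in the output, without having to add print statements to the library.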

XZise lowered the priority of this task from High to Low. Sep 1 2015, 11:03 PM

Changing the priority as the original issue is fixed and similar issues can be easily mitigated.