
UnicodeDecodeError in color_format()
Closed, Resolved (Public)

Description

The page being updated when the exception occurred was lt:Kategorija:Vaisiai/Baskų kalba

C:\Work\pywikipedia>pwb.py interwiki -lang:sl -family:wiktionary -simulate Kategorija:Sadje_(baskovščina)
Retrieving 1 pages from wiktionary:sl.
[[sl:Kategorija:Sadje (baskovščina)]]: [[sl:Kategorija:Sadje (baskovščina)]] gives new interwiki [[en:Category:eu:Fruits]]
Retrieving 1 pages from wiktionary:en.
[[sl:Kategorija:Sadje (baskovščina)]]: [[en:Category:eu:Fruits]] gives new interwiki [[az:Kateqoriya:Meyvələr (Bask dili)]]
[[sl:Kategorija:Sadje (baskovščina)]]: [[en:Category:eu:Fruits]] gives new interwiki [[es:Categoría:EU:Frutos]]
[[sl:Kategorija:Sadje (baskovščina)]]: [[en:Category:eu:Fruits]] gives new interwiki [[eu:Kategoria:Fruituak euskaraz]]
[[sl:Kategorija:Sadje (baskovščina)]]: [[en:Category:eu:Fruits]] gives new interwiki [[fi:Luokka:Baskin kielen hedelmät]]
[[sl:Kategorija:Sadje (baskovščina)]]: [[en:Category:eu:Fruits]] gives new interwiki [[fr:Catégorie:Fruits en basque]]
[[sl:Kategorija:Sadje (baskovščina)]]: [[en:Category:eu:Fruits]] gives new interwiki [[lt:Kategorija:Vaisiai/Baskų kalba]]
[[sl:Kategorija:Sadje (baskovščina)]]: [[en:Category:eu:Fruits]] gives new interwiki [[pt:Categoria:Fruta (Basco)]]
[[sl:Kategorija:Sadje (baskovščina)]]: [[en:Category:eu:Fruits]] gives new interwiki [[ro:Categorie:Fructe în bască]]
[[sl:Kategorija:Sadje (baskovščina)]]: [[en:Category:eu:Fruits]] gives new interwiki [[ru:Категория:Фрукты/eu]]
[[sl:Kategorija:Sadje (baskovščina)]]: [[en:Category:eu:Fruits]] gives new interwiki [[sv:Kategori:Baskiska/Frukter]]
[[sl:Kategorija:Sadje (baskovščina)]]: [[en:Category:eu:Fruits]] gives new interwiki [[tr:Kategori:Meyve (Baskça)]]
Retrieving 1 pages from wiktionary:lt.
Retrieving 1 pages from wiktionary:tr.
Retrieving 1 pages from wiktionary:az.
Retrieving 1 pages from wiktionary:ro.
Retrieving 1 pages from wiktionary:fr.
Retrieving 1 pages from wiktionary:pt.
Retrieving 1 pages from wiktionary:sv.
Retrieving 1 pages from wiktionary:ru.
Retrieving 1 pages from wiktionary:es.
Retrieving 1 pages from wiktionary:fi.
Retrieving 1 pages from wiktionary:eu.
======Post-processing [[sl:Kategorija:Sadje (baskovščina)]]======
Dump sl (wiktionary) appended.
Traceback (most recent call last):
  File "C:\Work\pywikipedia\pwb.py", line 248, in <module>
    if not main():
  File "C:\Work\pywikipedia\pwb.py", line 242, in main
    run_python_file(filename, [filename] + args, argvu, file_package)
  File "C:\Work\pywikipedia\pwb.py", line 120, in run_python_file
    main_mod.__dict__)
  File ".\scripts\interwiki.py", line 2647, in <module>
    main()
  File ".\scripts\interwiki.py", line 2622, in main
    bot.run()
  File ".\scripts\interwiki.py", line 2366, in run
    self.queryStep()
  File ".\scripts\interwiki.py", line 2344, in queryStep
    subj.finish()
  File ".\scripts\interwiki.py", line 1787, in finish
    if self.replaceLinks(page, new):
  File ".\scripts\interwiki.py", line 1939, in replaceLinks
    '{lightpurple}Updating links on page {0}.{default}', page))
  File "C:\Work\pywikipedia\pywikibot\tools\formatter.py", line 122, in color_format
    return _ColorFormatter().format(text, *args, **kwargs)
  File "C:\Program Files\Python27\lib\string.py", line 559, in format
    return self.vformat(format_string, args, kwargs)
  File "C:\Work\pywikipedia\pywikibot\tools\formatter.py", line 112, in vformat
    kwargs)
  File "C:\Program Files\Python27\lib\string.py", line 563, in vformat
    result = self._vformat(format_string, args, kwargs, used_args, 2)
  File "C:\Program Files\Python27\lib\string.py", line 598, in _vformat
    return ''.join(result)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 28: ordinal not in range(128)
<type 'exceptions.UnicodeDecodeError'>
CRITICAL: Closing network session.



C:\Work\pywikipedia>pwb.py version
Pywikibot: [https] r-pywikibot-core.git (4e1dfad, g6532, 2015/09/22, 12:24:19, ok)
Release version: 2.0b3
requests version: 2.7.0
  cacerts: C:\Program Files\Python27\lib\site-packages\requests\cacert.pem
    certificate test: ok
Python: 2.7.10 (default, May 23 2015, 09:44:00) [MSC v.1500 64 bit (AMD64)]
PYWIKIBOT2_DIR: Not set
PYWIKIBOT2_DIR_PWB: C:\Work\pywikipedia
PYWIKIBOT2_NO_USER_CONFIG: Not set
Config base dir: C:\Work\pywikipedia

Event Timeline

Malafaya raised the priority of this task to High.
Malafaya updated the task description.
Malafaya added a project: Pywikibot.
Malafaya subscribed.

It would probably also be best to add a test for unicode text in tools_formatter_tests.py.

Well, Python greets us with a surprise: string.Formatter isn't actually the formatter Python itself uses. For example, unicode's format method returns a unicode, but string.Formatter doesn't care and may return a bytes. I'm not sure yet how this actually causes your failure, as it seems to be trying to decode a bytes instance.

Just for reference, afe2555d applied color_format to a lot of cases, so using a version before that will help in most cases. My original patch is 25980447, which only used it in one place that isn't used by many scripts (at least not by interwiki.py). Alternatively, you can try using Python 3.

Here are a few commands to test stuff out:

>>> from pywikibot.tools.formatter import _ColorFormatter as C, color_format
>>> from string import Formatter as F
>>> import pywikibot as py
>>> s = py.Site()
>>> p = py.Page(s, u'ü')
>>> u'%s' % p
WARNING: /home/xzise/.pyenv/versions/2.7/lib/python2.7/site-packages/requests/packages/urllib3/util/ssl_.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
  InsecurePlatformWarning

u'[[en:\xdc]]'
>>> u'{0}'.format(p)
u'[[en:\xdc]]'
>>> color_format(u'{0}', p)
'[[en:\xc3\x9c]]'
>>> color_format(u'{red}{0}', p)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pywikibot/tools/formatter.py", line 124, in color_format
    return _ColorFormatter().format(text, *args, **kwargs)
  File "/home/xzise/.pyenv/versions/2.7/lib/python2.7/string.py", line 545, in format
    return self.vformat(format_string, args, kwargs)
  File "pywikibot/tools/formatter.py", line 114, in vformat
    kwargs)
  File "/home/xzise/.pyenv/versions/2.7/lib/python2.7/string.py", line 549, in vformat
    result = self._vformat(format_string, args, kwargs, used_args, 2)
  File "/home/xzise/.pyenv/versions/2.7/lib/python2.7/string.py", line 584, in _vformat
    return ''.join(result)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 5: ordinal not in range(128)

As you can see, the %-notation (used before afe2555d) returns a unicode with the character U+00DC (Ü). The same happens when you use unicode.format. color_format without a color “works” too, but it returns a bytes instance, which shouldn't happen. And when you add a color, it crashes. To see what it actually returns when a color field is used, I use an ASCII title, and that works:

>>> color_format(u'{red}{0}', py.Page(s, u'u'))
u'\x03{red}[[en:U]]'
>>> color_format(u'{0}', py.Page(s, u'u'))
'[[en:U]]'
>>> u'{0}'.format(py.Page(s, u'u'))
u'[[en:U]]'
>>> F().format(u'{0}', py.Page(s, u'u'))
'[[en:U]]'

Now the fun part: with a color field it actually returns a unicode, but as soon as the color field is removed it's back to bytes, while unicode.format still works as expected. Finally, I verify that it's not _ColorFormatter but Formatter itself that returns a bytes.

Okay, I seem to have found the culprit. The Formatter allows specifications two levels deep, for something like '{0:0{1}}' where the second argument is actually the width of the first one. To format the string, it splits it up into chunks, buffers the “filled” chunks in a list and concatenates that list at the end using ''.join(result). If one of the elements in result is a unicode, the joined result is a unicode. But if the list is empty, it returns bytes, which is the case for the second round when there are no nested specifications. This turns a unicode specification into a bytes specification. That specification is then used to format the field using the builtin function format, which returns unicode if the specification is a unicode and bytes otherwise (as long as the value is not already a unicode, afaik).
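To make the chunking and the nested specification concrete, here is a small sketch using only the standard library. It runs on Python 3, where the result types no longer differ, so it only illustrates how the format string is split and how the inner field fills in the width; the values 7 and 3 are arbitrary.

```python
from string import Formatter

formatter = Formatter()

# parse() yields one (literal_text, field_name, format_spec, conversion)
# tuple per chunk. Both fields here have an empty specification.
chunks = list(formatter.parse('{0}{1}'))

# A nested specification is kept unexpanded in the chunk; it is only
# resolved in a second formatting round.
nested_chunks = list(formatter.parse('{0:0{1}}'))

# The inner field {1} supplies the width for the outer field {0}.
padded = formatter.format('{0:0{1}}', 7, 3)

print(chunks)
print(nested_chunks)
print(padded)
```

That second round is where, on Python 2, an empty specification comes back as a byte string.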

As an example, take Formatter().format(u'{0}{1}', u'a', 'ä'): it splits the string into two parts and then splits each part into a name and a specification (e.g. u'0' and u''). The specification is then parsed again, which is similar to Formatter().format(u'') and returns a bytes instance, so the specification is now b'' (Python 2 won't show that prefix; I add it here just for clarity). It then takes the value associated with that name (u'a' for the first entry) and basically does format(u'a', ''), which returns u'a'. For the second entry it's format('ä', ''), which returns 'ä', so result is [u'a', 'ä'] and it crashes on the concatenation.
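The failing concatenation can be imitated on Python 3, where mixing text and bytes has to be done explicitly. The helper py2_join below is hypothetical and only models the behaviour described above (an empty list gives bytes, one unicode element promotes the result, byte strings are decoded as ASCII); it is not code from Pywikibot or the standard library.

```python
def py2_join(parts):
    """Emulate how Python 2's ''.join(parts) picks its result type."""
    if any(isinstance(part, str) for part in parts):
        # One text element promotes the result to text; every byte
        # string is then decoded with the default 'ascii' codec.
        return ''.join(
            part if isinstance(part, str) else part.decode('ascii')
            for part in parts)
    # No text element at all (including an empty list): bytes come back.
    return b''.join(parts)


print(repr(py2_join([])))
print(repr(py2_join(['a', b'b'])))
try:
    # Stands in for result == [u'a', 'ä'] with a UTF-8 encoded 'ä'.
    py2_join(['a', 'ä'.encode('utf-8')])
except UnicodeDecodeError as error:
    print(error)
```

With a byte string such as b'[[en:\xc3\x9c]]' in the list, the decode fails on byte 0xc3 at position 5, which is the position reported in the traceback for the page titled Ü.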

At the moment I have two approaches to fix it. Either change format_field to return unicode if the format string is one (independently of the specification), or override _vformat so that it returns a unicode, similarly to format_field, which would prevent the specification from changing type. While the latter is closer to fixing the actual bug (as it would prevent format from returning bytes for an empty string), it would change a “private” method which isn't part of the official API. Anyway, a fix is probably near, and I first want to design tests which fail now and will be fixed by the patch, to be sure that I got it.
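A rough sketch of the first approach could look like the following. It is not the patch that was eventually merged: the class name TextFormatter is made up for illustration, the sketch always returns text instead of checking the type of the format string, and decoding byte strings as UTF-8 is an assumption.

```python
import sys
from string import Formatter

# The text type on either major Python version.
text_type = str if sys.version_info[0] >= 3 else unicode  # noqa: F821


class TextFormatter(Formatter):
    """Hypothetical formatter whose fields always come back as text."""

    def format_field(self, value, format_spec):
        result = super(TextFormatter, self).format_field(value, format_spec)
        if isinstance(result, text_type):
            return result
        # Only reachable on Python 2, where format() may return a byte
        # string. Decoding as UTF-8 is an assumption of this sketch.
        return result.decode('utf-8')


formatted = TextFormatter().format(u'{0}{1}', u'a', u'\xe4')
print(type(formatted).__name__)
```

Because every field is text before the final join, the list never mixes unicode and non-ASCII bytes. An empty format string would still go through the unmodified join, which is the case the _vformat approach is meant to cover.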

Change 240291 had a related patch set uploaded (by XZise):
[FIX] color_format: Only handle unicode strings

https://gerrit.wikimedia.org/r/240291

Change 240291 merged by jenkins-bot:
[FIX] color_format: Only handle unicode strings

https://gerrit.wikimedia.org/r/240291

Should be solved now.