
listpages logging fails with UnicodeEncodeError
Closed, Resolved (Public)

Description

C:\pwb\GIT\core>pwb.py listpages -newpages:10
   1 Zweibruch-Kreuzbruch
   2 Österreichischer Bauherrenpreis 2010
   3 Menhir von Barrocal
   4 Wasserbach (Saalach)
   5 ATP Challenger Punta del Este
   6 Sękowo (Nowy Tomyśl)
--- Logging error ---
Traceback (most recent call last):
  File "C:\Program Files (x86)\Python36-32\lib\logging\__init__.py", line 994, in emit
    stream.write(msg)
  File "C:\Program Files (x86)\Python36-32\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u0119' in position 83: character maps to <undefined>
Call stack:
  File "C:\pwb\GIT\core\pwb.py", line 264, in <module>
    if not main():
  File "C:\pwb\GIT\core\pwb.py", line 257, in main
    run_python_file(filename, [filename] + args, argvu, file_package)
  File "C:\pwb\GIT\core\pwb.py", line 120, in run_python_file
    main_mod.__dict__)
  File ".\scripts\listpages.py", line 283, in <module>
    main()
  File ".\scripts\listpages.py", line 261, in main
    pywikibot.stdout(output_list[-1])
  File "C:\pwb\GIT\core\pywikibot\logging.py", line 144, in stdout
    logoutput(text, decoder, newline, STDOUT, **kwargs)
  File "C:\pwb\GIT\core\pywikibot\logging.py", line 105, in logoutput
    logger.log(_level, text, extra=context, **kwargs)
Message: '   6 Sękowo (Nowy Tomyśl)'
Arguments: ()
   7 Punta Open 2018
   8 Ernst Züllig
   9 Österreichischer Bauherrenpreis 2009
  10 BMÖ Bundesverband Materialwirtschaft Einkauf Logistik in Österreich
10 page(s) found

Version info:

C:\pwb\GIT\core>pwb.py version
Pywikibot: [ssh] pywikibot-core (cd5b327, g9118, 2018/02/24, 16:42:40, ok)
Release version: 3.0-dev
requests version: 2.18.4
  cacerts: C:\Program Files (x86)\Python36-32\lib\site-packages\certifi\cacert.pem
    certificate test: ok
Python: 3.6.1 (v3.6.1:69c0db5, Mar 21 2017, 17:54:52) [MSC v.1900 32 bit (Intel)]
PYWIKIBOT2_DIR: Not set
PYWIKIBOT2_DIR_PWB: C:\pwb\GIT\core
PYWIKIBOT2_NO_USER_CONFIG: Not set
Config base dir: C:\pwb\GIT\core

Event Timeline

Xqt triaged this task as Medium priority. Feb 26 2018, 9:39 AM
Xqt updated the task description.

Try the following commands in your terminal and see if results differ or not:

>python
Python 3.6.4 (v3.6.4:d48eceb, Dec 19 2017, 06:54:40) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.stdout.encoding
'utf-8'
>>> sys.stdin.encoding
'utf-8'
>>> import pywikibot
>>> pywikibot.stdout('   6 Sękowo (Nowy Tomyśl)')
   6 Sękowo (Nowy Tomyśl)
>>> pywikibot.stdout(b'   6 S\\u0119kowo (Nowy Tomy\\u015bl)'.decode('unicode-escape'))
   6 Sękowo (Nowy Tomyśl)

@Dalba:

C:\pwb\GIT\core>python
Python 3.6.1 (v3.6.1:69c0db5, Mar 21 2017, 17:54:52) [MSC v.1900 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.stdout.encoding
'utf-8'
>>> sys.stdin.encoding
'utf-8'
>>> import pywikibot
>>> pywikibot.stdout('   6 Sękowo (Nowy Tomyśl)')
   6 Sękowo (Nowy Tomyśl)
--- Logging error ---
Traceback (most recent call last):
  File "C:\Program Files (x86)\Python36-32\lib\logging\__init__.py", line 994, in emit
    stream.write(msg)
  File "C:\Program Files (x86)\Python36-32\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u0119' in position 83: character maps to <undefined>
Call stack:
  File "<stdin>", line 1, in <module>
  File "C:\pwb\GIT\core\pywikibot\logging.py", line 148, in stdout
    logoutput(text, decoder, newline, STDOUT, **kwargs)
  File "C:\pwb\GIT\core\pywikibot\logging.py", line 109, in logoutput
    logger.log(_level, text, extra=context, **kwargs)
Message: '   6 Sękowo (Nowy Tomyśl)'
Arguments: ()
>>> pywikibot.stdout(b'   6 S\\u0119kowo (Nowy Tomy\\u015bl)'.decode('unicode-escape'))
   6 Sękowo (Nowy Tomyśl)
--- Logging error ---
Traceback (most recent call last):
  File "C:\Program Files (x86)\Python36-32\lib\logging\__init__.py", line 994, in emit
    stream.write(msg)
  File "C:\Program Files (x86)\Python36-32\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u0119' in position 83: character maps to <undefined>
Call stack:
  File "<stdin>", line 1, in <module>
  File "C:\pwb\GIT\core\pywikibot\logging.py", line 148, in stdout
    logoutput(text, decoder, newline, STDOUT, **kwargs)
  File "C:\pwb\GIT\core\pywikibot\logging.py", line 109, in logoutput
    logger.log(_level, text, extra=context, **kwargs)
Message: '   6 Sękowo (Nowy Tomyśl)'
Arguments: ()
>>>

@Xqt: could you add a breakpoint/print statement to

File "C:\Program Files (x86)\Python36-32\lib\logging\__init__.py", line 994, in emit
  stream.write(msg)

to figure out which stream this is going to? Is this the console or is it trying to log to a file?
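The same question can probably be answered without editing the standard library, by listing the handlers attached to the logger and looking at each handler's stream. This is only a sketch: the helper name describe_handlers is made up for illustration, and the logger name 'pywiki' in the usage comment is an assumption about how Pywikibot names its logger.

```python
import logging


def describe_handlers(logger):
    """Return (handler class, stream name, stream encoding) per handler."""
    rows = []
    for handler in logger.handlers:
        # Not every handler has a stream; fall back to None if it is absent.
        stream = getattr(handler, 'stream', None)
        rows.append((
            type(handler).__name__,
            getattr(stream, 'name', None),
            getattr(stream, 'encoding', None),
        ))
    return rows


# Usage (logger name is an assumption, adjust as needed):
#     for row in describe_handlers(logging.getLogger('pywiki')):
#         print(row)
```

A file handler whose stream reports an encoding such as cp1252 would point at the log file rather than the console.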

I'm wondering whether this is a weird confusion between our Windows console code and the new Python 3.6 console code (see https://vstinner.github.io/python36-utf8-windows.html).

In general, I would not expect the 'charmap' codec to be used anywhere anymore with 3.6, but apparently this is not entirely true.

EDIT: I'm wrong there -- files opened without an explicit encoding still use the locale's default codec (here cp1252, a 'charmap' codec). So maybe there's a weird logging setup where the log file is not explicitly opened with the utf-8 codec?

I suspect the encoding argument might be missing in the initialization of file_handler = RotatingFileHandler in bot.py (?)
I have no Windows PC, so I can't test.
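A minimal sketch of what passing the encoding explicitly could look like. The function name make_file_handler and the size and backup values are placeholders for illustration, not the actual bot.py code; only RotatingFileHandler and its encoding parameter come from the standard library.

```python
import logging
import logging.handlers


def make_file_handler(path, max_bytes=1024 * 1024, backup_count=5):
    """Create a rotating log handler that always writes UTF-8.

    Without encoding=..., the file is opened with the locale's default
    codec, which is cp1252 on many Windows systems.
    """
    return logging.handlers.RotatingFileHandler(
        path,
        maxBytes=max_bytes,
        backupCount=backup_count,
        encoding='utf-8',
    )
```

With a handler built this way, a message such as '   6 Sękowo (Nowy Tomyśl)' should be written to the log file regardless of the console or locale code page.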

@valhallasw

@Xqt: could you add a breakpoint/print statement to ... to figure out which stream this is going to? Is this the console or is it trying to log to a file?

I added:

stream = self.stream
print('>>>', msg, '<<<')
stream.write(msg)

with result:

>>> 2018-07-05 18:14:49       listpages.py,  261 in               main: STDOUT
  192 Liste der Staatsoberhäupter 937 v. Chr. <<<
 193 Motoko Ishii
>>> 2018-07-05 18:14:49       listpages.py,  261 in               main: STDOUT
  193 Motoko Ishii <<<
 194 Dubičné
>>> 2018-07-05 18:14:49       listpages.py,  261 in               main: STDOUT
  194 Dubičné <<<
--- Logging error ---
Traceback (most recent call last):
  File "C:\Program Files (x86)\Python36-32\lib\logging\__init__.py", line 995, in emit
    stream.write(msg)
  File "C:\Program Files (x86)\Python36-32\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u010d' in position 86: character maps to <undefined>
Call stack:
  File "C:\pwb\GIT\core\pwb.py", line 251, in <module>
    if not main():
  File "C:\pwb\GIT\core\pwb.py", line 244, in main
    run_python_file(filename, [filename] + args, argvu, file_package)
  File "C:\pwb\GIT\core\pwb.py", line 115, in run_python_file
    main_mod.__dict__)
  File ".\scripts\listpages.py", line 283, in <module>
    main()
  File ".\scripts\listpages.py", line 261, in main
    pywikibot.stdout(output_list[-1])
  File "C:\pwb\GIT\core\pywikibot\logging.py", line 148, in stdout
    logoutput(text, decoder, newline, STDOUT, **kwargs)
  File "C:\pwb\GIT\core\pywikibot\logging.py", line 109, in logoutput
    logger.log(_level, text, extra=context, **kwargs)
Message: ' 194 Dubičné'
Arguments: ()
 195 Fräulein Lausbub
>>> 2018-07-05 18:14:49       listpages.py,  261 in               main: STDOUT
  195 Fräulein Lausbub <<<
 196 Louis Sosson
>>> 2018-07-05 18:14:49       listpages.py,  261 in               main: STDOUT
  196 Louis Sosson <<<
 197 1. Division (Belgien) 1949/50
>>> 2018-07-05 18:14:49       listpages.py,  261 in               main: STDOUT
  197 1. Division (Belgien) 1949/50 <<<
 198 Yalmakan FC
>>> 2018-07-05 18:14:49       listpages.py,  261 in               main: STDOUT
  198 Yalmakan FC <<<
 199 Goldgrund (Begriffsklärung)
>>> 2018-07-05 18:14:49       listpages.py,  261 in               main: STDOUT
  199 Goldgrund (Begriffsklärung) <<<

and for print(stream) I got
<_io.TextIOWrapper name='C:\\pwb\\GIT\\core\\logs\\listpages-bot.log' mode='a' encoding='cp1252'>
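That stream encoding would explain why only some titles fail: cp1252 covers characters like ä and é but not č, ę or ś. A small check outside Pywikibot, assuming nothing beyond the standard library (the helper name can_write is made up for illustration):

```python
import os
import tempfile


def can_write(text, encoding):
    """Return True if text can be written to a file opened with encoding."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        # Same mode as the log stream above, but with a chosen encoding.
        with open(path, 'a', encoding=encoding) as stream:
            stream.write(text)
        return True
    except UnicodeEncodeError:
        return False
    finally:
        os.remove(path)


print(can_write(' 195 Fräulein Lausbub', 'cp1252'))  # True
print(can_write(' 194 Dubičné', 'cp1252'))           # False
print(can_write(' 194 Dubičné', 'utf-8'))            # True
```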

Change 444014 had a related patch set uploaded (by Dalba; owner: dalba):
[pywikibot/core@master] bot.py: Open RotatingFileHandler with utf-8 encoding

https://gerrit.wikimedia.org/r/444014

Xqt claimed this task.

Change 444014 merged by jenkins-bot:
[pywikibot/core@master] bot.py: Open RotatingFileHandler with utf-8 encoding

https://gerrit.wikimedia.org/r/444014