input encoding is switched to plain ascii when redirecting output to a file or other commands, mangling non-ascii characters
Closed, Resolved, Public

Description

OS: Linux

When I redirect the output of the python command that runs the bot to a file or another command, the output simply disappears. I wanted to generate a file based on the listpages.py script, but apparently I'm unable to do so...

> python pwb.py listpages.py -family: -cat:'somecategory'
   1 Page 1
   2 Page 2
  ...
> python pwb.py listpages.py -family: -cat:'somecategory' > filelist.txt
> ls -l filelist.txt
-rw-r--r-- 1 jesus users 0 mar 21 16:55 filelist.txt
> python pwb.py listpages.py -family: -cat:'somecategory' | uniq
>

Piping the output to another command produces no output, and the same happens when redirecting to a file.

Printing text from Python directly works, though:

> python -c "print 'test'" | uniq
test

Workaround: in user_config.py, add

console_encoding = 'utf-8'

Event Timeline

Ciencia_Al_Poder set the priority of this task to Needs Triage.
Ciencia_Al_Poder updated the task description.
Ciencia_Al_Poder added a project: Pywikibot.
Ciencia_Al_Poder subscribed.
Restricted Application added subscribers: Aklapper, Unknown Object (MLST). Mar 21 2015, 4:10 PM

That is weird. On my laptop with bash it does work as expected:

xzise@localhost:~/Programms/pywikibot/core$ python pwb.py listpages -cat:Metatemplates
   1 Celestial Bodies/Link
   2 Celestial period table/row
…
xzise@localhost:~/Programms/pywikibot/core$ python pwb.py listpages -cat:Metatemplates | grep Info
   6 Infobox/Kerbonaut/bar
   7 Infobox/Line
xzise@localhost:~/Programms/pywikibot/core$ python pwb.py listpages -cat:Metatemplates > test_out
xzise@localhost:~/Programms/pywikibot/core$ cat test_out 
   1 Celestial Bodies/Link
   2 Celestial period table/row
…
xzise@localhost:~/Programms/pywikibot/core$ python -c "print('test')" | uniq
test

I've just discovered that when the category name contains non-ASCII characters, the redirected output is lost. If the category name contains only plain ASCII characters, the output is redirected successfully:

jesus@charmander:~/git/mediawiki/pywikibot/core> python pwb.py listpages.py -family:wikipedia -lang:es -cat:'.hack'
   1 .hack
   2 .hack//G.U.
   3 .hack//Liminality
   4 .hack//Roots
   5 .hack//SIGN
   6 The World (.hack)
jesus@charmander:~/git/mediawiki/pywikibot/core> python pwb.py listpages.py -family:wikipedia -lang:es -cat:'.hack' |uniq
   1 .hack
   2 .hack//G.U.
   3 .hack//Liminality
   4 .hack//Roots
   5 .hack//SIGN
   6 The World (.hack)
jesus@charmander:~/git/mediawiki/pywikibot/core> python pwb.py listpages.py -family:wikipedia -lang:es -cat:'1. FC Nürnberg'
   1 FC Nuremberg II
   2 F. C. Núremberg
   3 Stadion Nürnberg
jesus@charmander:~/git/mediawiki/pywikibot/core> python pwb.py listpages.py -family:wikipedia -lang:es -cat:'1. FC Nürnberg' |uniq
jesus@charmander:~/git/mediawiki/pywikibot/core>

I'm using Python 2.7.8 on openSUSE 13.2.

After debugging with xzise, we figured out it was a combination of issues.

  1. When piping, sys.stdout.encoding is set to None (on Python 2.7)
  2. When sys.stdout.encoding is set to None, config2.py falls back to 'iso-8859-1' (!)
  3. sys.argv is decoded using sys.stdout.encoding

Thus the category '1. FC Nürnberg' (passed by the shell as UTF-8 bytes) is decoded as ISO-8859-1 to u'1. FC NÃ¼rnberg', a category which obviously doesn't exist, and then no pages are listed.
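The mis-decoding can be reproduced in isolation. The sketch below assumes the shell hands the category name to Python as UTF-8 bytes, which is the usual case on a Linux terminal; the variable names are illustrative and this is not pywikibot's actual code.

```python
# UTF-8 bytes for the category name; the u-umlaut is the pair 0xC3 0xBC.
# Assumption: this is what the shell passes in sys.argv on Python 2.
raw = b'1. FC N\xc3\xbcrnberg'

# Decoding with the encoding the bytes were written in gives the real name.
correct = raw.decode('utf-8')

# Decoding with the iso-8859-1 fallback turns each byte into one character,
# so the two-byte umlaut becomes two unrelated characters.
mangled = raw.decode('iso-8859-1')

print(repr(correct))
print(repr(mangled))
print(correct == mangled)
```

The two decoded strings differ, so a category lookup with the second one asks the wiki for a name that does not exist.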

By the way, I think it would have been easier if listpages reported that no pages were found. That way we would have known that the problem was not the output but something in the generator.

Change 198515 had a related patch set uploaded (by Merlijn van Deen):
listpages: report number of pages found

https://gerrit.wikimedia.org/r/198515

Change 198515 merged by jenkins-bot:
listpages: report number of pages found

https://gerrit.wikimedia.org/r/198515

Ciencia_Al_Poder renamed this task from "stdout output from script is lost when redirecting to a file or other commands" to "input encoding is switched to plain ascii when redirecting output to a file or other commands, mangling non-ascii characters". Mar 22 2015, 5:56 PM
Ciencia_Al_Poder set Security to None.

Another workaround is using Python 3, as Python 3 strings are UTF-8 encoded by default.

That is … technically not correct. Python 3 does work, but not because its strings are UTF-8 encoded; it works because the encoding of the streams is known. On Python 2, sys.stdout.encoding is None when piping, and that then leads to pywikibot using Latin-1 or cp850.

Now, I'm not sure how strings are encoded internally, but for someone writing Python scripts it shouldn't matter. I just wanted to point out that a string is not automatically a UTF-8 encoded string: you need to encode it in any case (even into UTF-8) when you want to store it.
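A short sketch of that distinction, written for Python 3: a str is a sequence of code points, and bytes only appear once an encoding is chosen. The sample text is just an illustration.

```python
text = 'N\u00fcrnberg'  # str: a sequence of Unicode code points

# Encoding is an explicit step, and the result depends on the encoding.
as_utf8 = text.encode('utf-8')          # the umlaut becomes two bytes
as_latin1 = text.encode('iso-8859-1')   # the umlaut becomes one byte

print(len(text), len(as_utf8), len(as_latin1))

# Decoding with the matching encoding restores the original text.
roundtrip_utf8 = as_utf8.decode('utf-8')
roundtrip_latin1 = as_latin1.decode('iso-8859-1')
print(roundtrip_utf8 == text, roundtrip_latin1 == text)
```

The same text yields different byte sequences under different encodings, which is why the encoding has to be known on both the writing and the reading side.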

Change 231568 had a related patch set uploaded (by Merlijn van Deen):
Fall back to utf-8 console encoding

https://gerrit.wikimedia.org/r/231568

valhallasw claimed this task.

We now fall back to utf-8 by default.
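As a rough illustration of such a fallback (a sketch only, not the code from the merged change), the helper below returns the stream's encoding when it is known and 'utf-8' otherwise. The names pick_console_encoding, PipedStream and TerminalStream are hypothetical.

```python
import sys


def pick_console_encoding(stream=None, fallback='utf-8'):
    """Return the stream's encoding, or the fallback if it is unknown.

    Hypothetical helper for illustration; not pywikibot's implementation.
    """
    if stream is None:
        stream = sys.stdout
    # On Python 2.7 a piped stdout reports its encoding as None.
    return getattr(stream, 'encoding', None) or fallback


class PipedStream(object):
    """Stand-in for a piped stdout on Python 2.7 (encoding is None)."""
    encoding = None


class TerminalStream(object):
    """Stand-in for an interactive terminal with a known encoding."""
    encoding = 'iso-8859-15'


print(pick_console_encoding(PipedStream()))
print(pick_console_encoding(TerminalStream()))
```

A known encoding is respected, and only the unknown case is replaced by the fallback, so an explicit console_encoding setting in user_config.py would still be the way to override it.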

Change 231568 merged by jenkins-bot:
Fall back to utf-8 console encoding

https://gerrit.wikimedia.org/r/231568