Problem with utf-8 on Grid
Closed, Duplicate · Public

Description

I have a problem with UTF-8. It seems to be related to the encoding used by Python or by the console. Here is the script, saved in a file encoded as UTF-8:

# coding: utf8
print('слово')

It works on Tool Labs: python3 utf8.py → слово.

But it breaks on the Grid: jsub -l release=trusty -N utf8test python3 utf8.py → utf8test.err:

print('\u0441\u043b\u043e\u0432\u043e')  
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)

Variants with ".encode, .decode, ('utf-8'), ('ascii')" do not help.
Could you set UTF-8 as the global default, as on Tool Labs?
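For illustration, the error in utf8test.err can be reproduced without the grid. This is a minimal sketch, assuming only that stdout falls back to the ASCII codec when no UTF-8 locale is set; the helper name ascii_failure_span is hypothetical.

```python
def ascii_failure_span(text):
    """Return (start, end) of the first run ASCII cannot encode, or None."""
    try:
        text.encode('ascii')
    except UnicodeEncodeError as err:
        # err.end is exclusive, so the message "position 0-4" means (0, 5).
        return err.start, err.end
    return None


# Only the numeric span is printed, so this is safe on an ASCII-only stdout.
print(ascii_failure_span('слово'))
```

The span covers the whole word because every character in it is outside the 0-127 range that ASCII can represent.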

Event Timeline

@valhallasw, thank you. I added export PYTHONIOENCODING=UTF-8 to .bash_profile and ran source .bash_profile. It seems to work.
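A quick way to check what PYTHONIOENCODING changes is to start a child interpreter with the variable set and ask it for its stdout encoding. The sketch below assumes a POSIX-like host where sys.executable can be started as a child process; child_stdout_encoding is a hypothetical helper name.

```python
import os
import subprocess
import sys


def child_stdout_encoding(io_encoding):
    """Start a child Python with PYTHONIOENCODING set; return its stdout encoding."""
    env = dict(os.environ)
    env['PYTHONIOENCODING'] = io_encoding
    out = subprocess.check_output(
        [sys.executable, '-c', 'import sys; print(sys.stdout.encoding)'],
        env=env,
    )
    return out.decode('ascii').strip()


if __name__ == '__main__':
    print(child_stdout_encoding('UTF-8'))
```

Note that this variable only covers stdin, stdout and stderr. It does not change the encoding Python uses for command-line arguments or file names.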

The problem still exists. print works, but calling other programs does not.
The following script is in UTF-8 with Unix (LF) newlines:

# coding: utf8
import os
import sys
print(sys.stdin.encoding, sys.stdout.encoding)

string = 'python3 scripts/add_text.py -dir:~/ -simulate -file:listpages.txt -text:"{{Нет полных библиографических описаний}}"'
print(string)
os.system(string)

Output files:

utf-8 utf-8
python3 scripts/add_text.py -dir:~/ -simulate -file:listpages.txt -text:"{{Нет полных библиографических описаний}}"

Traceback (most recent call last):
  File "myscript.py", line 97, in <module>
    os.system(string)	
UnicodeEncodeError: 'ascii' codec can't encode characters in position 75-77: ordinal not in range(128)

This also does not work: 'PYTHONIOENCODING=utf8 python3 scripts/add_text.py ...'

You can try setting LANG=en_US.UTF-8 instead, but I can't guarantee that'll work. In general, you should not depend on Python's magic conversion from text to bytes, and just do that conversion yourself (using x.encode('utf-8')).
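As a rough sketch of "doing the conversion yourself" for output: encode the text explicitly and write the bytes to the binary layer under sys.stdout, so the locale-derived codec is never consulted. The helper name write_utf8 is hypothetical.

```python
import sys


def write_utf8(text, stream=None):
    """Encode text as UTF-8 ourselves and write the bytes to a binary stream."""
    if stream is None:
        stream = sys.stdout.buffer  # binary layer underneath sys.stdout
    stream.write(text.encode('utf-8') + b'\n')
    stream.flush()


if __name__ == '__main__':
    write_utf8('слово')
```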

On a sidenote, you should probably use subprocess.Popen instead of os.system.
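One way this could look, as a sketch rather than a tested fix for the grid: pass subprocess a list of arguments that are already bytes, so Python does not have to encode them with the locale's codec. This assumes a POSIX system, where subprocess accepts bytes arguments; encode_argv and the echoing child are hypothetical.

```python
import subprocess
import sys


def encode_argv(args, encoding='utf-8'):
    """Encode every str argument to bytes; leave bytes arguments untouched."""
    return [a.encode(encoding) if isinstance(a, str) else a for a in args]


if __name__ == '__main__':
    # Hypothetical child: writes its first argument back as raw bytes.
    child = 'import os, sys; os.write(1, os.fsencode(sys.argv[1]))'
    argv = encode_argv([sys.executable, '-c', child, '-text:{{Нет}}'])
    # The result is printed as a bytes repr, which is pure ASCII.
    print(subprocess.check_output(argv))
```

With a list there is also no shell in between, so quoting inside the -text: argument is not needed.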

sys.getfilesystemencoding():
On bastion =>utf-8
On jsub -l release=trusty => ascii

The docs for that call point to NL_LANGINFO(3), a C library function.
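To compare the bastion and a grid job side by side, a small diagnostic like the following could be run in both places. It is only a sketch; encoding_report is a hypothetical name, and locale.nl_langinfo is not available on every platform, hence the guard.

```python
import locale
import sys


def encoding_report():
    """Collect the encodings Python derived from the current locale."""
    has_langinfo = hasattr(locale, 'nl_langinfo')
    return {
        'stdout': getattr(sys.stdout, 'encoding', None),
        'filesystem': sys.getfilesystemencoding(),
        'preferred': locale.getpreferredencoding(False),
        # Python wrapper around the C library call nl_langinfo(CODESET).
        'codeset': locale.nl_langinfo(locale.CODESET) if has_langinfo else None,
    }


for name, value in sorted(encoding_report().items()):
    print(name, '=>', value)
```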

My os.environ:

environ({'UPSTART_EVENTS': 'runlevel', 'LIBRARY_PATH': '/data/project/yifeibot/.local/lib', 'UPSTART_JOB': 'rc', 'UPSTART_INSTANCE': '', 'NSLOTS': '1', 'TMP': '/tmp/187097.1.task', 'RUNLEVEL': '2', 'SGE_JOB_SPOOL_DIR': '/var/spool/gridengine/execd/tools-exec-1409/active_jobs/187097.1', 'SGE_O_HOST': 'tools-bastion-02', 'HOSTNAME': 'tools-exec-1409.eqiad.wmflabs', 'SGE_O_LOGNAME': 'tools.yifeibot', 'SGE_TASK_ID': 'undefined', 'QUEUE': 'task', 'LOGNAME': 'tools.yifeibot', 'SGE_CWD_PATH': '/data/project/yifeibot', 'SGE_O_SHELL': '/bin/bash', 'SGE_CELL': 'default', 'PATH': '/tmp/187097.1.task:/usr/local/bin:/bin:/usr/bin:/data/project/yifeibot/.local/bin', 'NHOSTS': '1', 'TMPDIR': '/tmp/187097.1.task', 'ARC': 'lx26-amd64', 'CPATH': '/data/project/yifeibot/.local/include', 'SGE_O_HOME': '/data/project/yifeibot', 'JOB_NAME': 'T143691', 'SGE_O_MAIL': '/var/mail/tools.yifeibot', 'SGE_ROOT': '/var/lib/gridengine', 'PKG_CONFIG_PATH': '/data/project/yifeibot/.local/lib/pkgconfig', 'JOB_ID': '187097', 'SHLVL': '1', 'runlevel': '2', 'SGE_TASK_STEPSIZE': 'undefined', 'USER': 'tools.yifeibot', 'PREVLEVEL': 'N', 'SGE_STDIN_PATH': '/dev/null', 'SGE_BINARY_PATH': '/usr/sbin/lx26-amd64', 'REQUEST': 'T143691', 'REQNAME': 'T143691', 'SHELL': '/bin/bash', 'SGE_ACCOUNT': 'sge', 'SGE_STDERR_PATH': '/data/project/yifeibot/T143691.err', 'SGE_ARCH': 'lx26-amd64', 'SGE_EXECD_PIDFILE': '/var/run/gridengine/execd.pid', 'TERM': 'linux', 'ENVIRONMENT': 'BATCH', '_': '/usr/bin/python3.4', 'SGE_TASK_FIRST': 'undefined', 'PWD': '/data/project/yifeibot', 'SGE_STDOUT_PATH': '/data/project/yifeibot/T143691.out', 'SGE_O_WORKDIR': '/data/project/yifeibot', 'RESTARTED': '0', 'NQUEUES': '1', 'SGE_O_PATH': '/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/data/project/yifeibot/.local/bin', 'HOME': '/data/project/yifeibot', 'previous': 'N', 'SGE_TASK_LAST': 'undefined', 'JOB_SCRIPT': '/usr/bin/python3.4'})

SGE lacks locale env vars?

SGE does not pass any environment variables unless explicitly specified, except for the ones you noted. Explicitly passing is possible, but will likely break when tools switches to kubernetes -- just submit a bash script that sets the environment variables and then calls the command you're trying to run.
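The same idea can also be sketched from inside Python instead of a bash wrapper: build an environment with the locale variables set and restart the script once under it. This is an alternative technique, not what was suggested above. The helper names are hypothetical, and it assumes the en_US.UTF-8 locale is installed on the exec host.

```python
import os
import sys


def utf8_env(base=None, locale_name='en_US.UTF-8'):
    """Return a copy of the environment with LANG and LC_ALL set."""
    env = dict(os.environ if base is None else base)
    env['LANG'] = locale_name
    env['LC_ALL'] = locale_name  # LC_ALL overrides every other LC_* variable
    return env


def reexec_with_utf8_locale(locale_name='en_US.UTF-8'):
    """Restart the current script once under a UTF-8 locale if needed."""
    fs_encoding = sys.getfilesystemencoding().lower().replace('-', '')
    already_tried = os.environ.get('LC_ALL') == locale_name
    if fs_encoding != 'utf8' and not already_tried:
        # Replaces the current process; code after this call does not run.
        os.execve(sys.executable, [sys.executable] + sys.argv,
                  utf8_env(locale_name=locale_name))
```

A script would call reexec_with_utf8_locale() as its first statement. The already_tried check is there to avoid an endless restart loop if the locale turns out to be missing.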

On a sidenote, you should probably use subprocess.Popen instead of os.system.

Unfortunately, it breaks in the same way. It seems the "os" module is still being called. Python:

string = 'LANG="ru_RU.UTF-8" python3 scripts/add_text.py -dir:~/ -simulate -file:listpages.txt -text:"{{Нет полных библиографических описаний}}"'
# Same with: LANG="en_US.UTF-8"

import subprocess
subprocess.call(string)

Breaks:

Traceback (most recent call last):
  File "/data/project/vltools/locale.py", line 13, in <module>
    subprocess.call(string, shell=False)
  File "/usr/lib/python3.4/subprocess.py", line 533, in call
    with Popen(*popenargs, **kwargs) as p:
  File "/usr/lib/python3.4/subprocess.py", line 848, in __init__
    restore_signals, start_new_session)
  File "/usr/lib/python3.4/subprocess.py", line 1368, in _execute_child
    executable = os.fsencode(executable)
  File "/usr/lib/python3.4/os.py", line 766, in fsencode
    return filename.encode(encoding, errors)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 94-96: ordinal not in range(128)

With subprocess.call(string, shell=True) the result is the same, but without these last lines:

    executable = os.fsencode(executable)
  File "/usr/lib/python3.4/os.py", line 766, in fsencode
    return filename.encode(encoding, errors)
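The shell=False traceback shows the whole command string being passed to os.fsencode as the executable, because without a shell a plain string is treated as the program name. A sketch of splitting such a string into environment assignments and an argument list with shlex follows; split_env_and_argv is a hypothetical helper, and the command is shortened for the example.

```python
import shlex


def split_env_and_argv(command):
    """Split a shell-style command into leading NAME=value pairs and argv."""
    tokens = shlex.split(command)
    env = {}
    while tokens and '=' in tokens[0] and tokens[0].split('=', 1)[0].isidentifier():
        name, value = tokens.pop(0).split('=', 1)
        env[name] = value
    return env, tokens


cmd = 'LANG="ru_RU.UTF-8" python3 scripts/add_text.py -simulate -text:"{{Нет полных}}"'
env, argv = split_env_and_argv(cmd)
print(env)
for arg in argv:
    # ascii() keeps the demo printable even on an ASCII-only stdout.
    print(ascii(arg))
```

Splitting alone would probably not be enough under the POSIX locale, since each non-ASCII argument still has to be encoded. It would need to be combined with bytes arguments or a UTF-8 locale. Also, without a shell, ~ in -dir:~/ is not expanded.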

You can try setting LANG=en_US.UTF-8 instead, but I can't guarantee that'll work.

That does not work, see above.
Output of locale on the bastion:

LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

But under jsub -l release=trusty, locale prints:

LANG=
LANGUAGE=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=

In general, you should not depend on Python's magic conversion from text to bytes, and just do that conversion yourself (using x.encode('utf-8')).

Is there a working example?

Here is what works.
Create myscript.sh:

#!/bin/bash
LANG="ru_RU.UTF-8" ./myscript.py
# "en_US.UTF-8" gives the same result

Create myscript.py:

#!/usr/bin/env python3
# coding: utf8
import os
string = 'python3 scripts/add_text.py -dir:~/ -always -file:listpages.txt -text:"{{Нет полных библиографических описаний}}"'
os.system(string)

Run: jsub -l release=trusty ./myscript.sh

scfc assigned this task to Vladis13.