Shell LOCALE neither consistent nor sane across grid engine nodes
Open, Normal, Public

Description

A Python 3 script containing this code was executed with jsub:

import sys
print(sys.stdout.encoding)

The resulting .out file contained "ANSI_X3.4-1968".
Normally, people expect the output encoding to be UTF-8. "ANSI_X3.4-1968" is plain ASCII, so code that assumes UTF-8 fails as soon as it prints a non-ASCII character.
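For context, "ANSI_X3.4-1968" is the formal name of ASCII. A minimal sketch that asks Python's codec registry what the reported name resolves to (the variable names are only illustrative):

```python
import codecs

# The encoding name reported in the .out file above.
reported = "ANSI_X3.4-1968"

# codecs.lookup() resolves aliases to the canonical codec name.
canonical = codecs.lookup(reported).name
print(canonical)
```

This prints the canonical codec name, which is why the later traceback mentions the 'ascii' codec rather than "ANSI_X3.4-1968".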

Another Python 3 script containing this code was executed with jsub:

print("Talk:Gülen movement")

The resulting .err file contained this:

Traceback (most recent call last):
  File "...", line 5, in <module>
    print("Talk:G\xfclen movement")
UnicodeEncodeError: 'ascii' codec can't encode character '\xfc' in position 6: ordinal not in range(128)
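The failure can be reproduced without the grid by encoding the same string by hand. This is only a sketch of what print() effectively does when stdout uses ASCII; the variable names are made up for illustration:

```python
# Same title as in the failing script; \u00fc is the u-umlaut.
title = "Talk:G\u00fclen movement"

try:
    # What print() effectively attempts when stdout is ASCII.
    title.encode("ascii")
except UnicodeEncodeError as exc:
    # Record which codec failed, where, and on which character.
    failure = (exc.encoding, exc.start, exc.object[exc.start])
    print(failure)

# The same string encodes fine as UTF-8.
encoded = title.encode("utf-8")
print(encoded)
```

The reported position matches the "position 6" in the traceback above.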

jsub is written in Perl, which is perfectly capable of using utf8 as its output encoding. Unicode matters to all of us, so I propose that jsub be changed to do so.

I am not an expert with Perl, but I would try to add "use utf8;\nuse open qw/:std :utf8/;" to the top of the file, right under "use warnings;".

On a slightly related note, scripts running as regular CGI also use the "ANSI_X3.4-1968" encoding. This may be out of scope of this bug though.


Version: unspecified
Severity: normal

Details

Reference
bz58784
Sigma created this task. Dec 21 2013, 8:03 AM
scfc added a comment. Dec 21 2013, 9:01 AM

I can't reproduce either claim:

scfc@tools-login:~$ cat > test.py && chmod +x test.py && rm -f test.{out,err} && jsub ./test.py && while ! job test > /dev/null; do sleep 1; done && cat test.{out,err}
#!/usr/bin/python3
import sys
print(sys.stdout.encoding)
Your job 1933102 ("test") has been submitted
UTF-8
scfc@tools-login:~$ cat > test.py && chmod +x test.py && rm -f test.{out,err} && jsub ./test.py && while ! job test > /dev/null; do sleep 1; done && cat test.{out,err}
#!/usr/bin/python3
print("Talk:Gülen movement")
Your job 1933103 ("test") has been submitted
Talk:Gülen movement
scfc@tools-login:~$

Please provide a minimal example.

(Just to clear up some confusion: jsub doesn't actually execute the script; it just submits it to the job grid aka SGE/OGS.)

Partially reproduced it.

Using the first script:

local-legobot@tools-login:~/$ jsub ./test.py && while ! job test > /dev/null; do sleep 1; done && cat test.{out,err}
Your job 1933479 ("test") has been submitted
ANSI_X3.4-1968

Second script:

local-legobot@tools-login:~/$ jsub ./test.py && while ! job test > /dev/null; do sleep 1; done && cat test.{out,err}
Your job 1933488 ("test") has been submitted
Talk:Gülen movement

I think this should be a more generic request to make sure the environment on the exec hosts is the same as what someone has when testing in the interactive shell.

In any case, the problem is the following:

valhallasw@tools-login:~$ cat > test.sh
#!/bin/bash
locale
valhallasw@tools-login:~$ chmod +x test.sh
valhallasw@tools-login:~$ ./test.sh
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

valhallasw@tools-login:~$ jsub ./test.sh
valhallasw@tools-login:~$ cat test.out
LANG=
LANGUAGE=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=

Setting LANG="en_US.UTF-8" (or any other UTF-8 locale) should solve this issue.
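To compare the interactive and grid environments from inside a script rather than by eyeballing "locale" output, something like the following could be used. It is a sketch only: the helper names and the two sample dictionaries are invented for illustration, and in practice one would pass os.environ captured in each environment.

```python
def locale_settings(environ):
    """Keep only the variables that influence the locale."""
    return {
        name: value
        for name, value in environ.items()
        if name in ("LANG", "LANGUAGE") or name.startswith("LC_")
    }


def differing(first, second):
    """Return {name: (first_value, second_value)} for settings that differ."""
    names = sorted(set(first) | set(second))
    return {
        name: (first.get(name), second.get(name))
        for name in names
        if first.get(name) != second.get(name)
    }


# Hypothetical sample data standing in for the two environments.
interactive = {"LANG": "en_US.UTF-8", "HOME": "/home/example"}
grid = {"HOME": "/home/example"}

difference = differing(locale_settings(interactive), locale_settings(grid))
print(difference)
```

A value of None on one side means the variable is not set there at all, which is the situation shown in the grid output above.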

Oh, and to reproduce the issues: compare

LANG=C python -c "print u'\xe4'"

to

LANG=en_US.UTF-8 python -c "print u'\xe4'"

scfc added a comment. Dec 21 2013, 1:23 PM

(In reply to comment #3)

[...]
Setting LANG="en_US.UTF-8" (or any other UTF-8 locale) should solve this
issue.

It apparently does, because I have:

export LANG=de_DE.UTF-8

in ~/.profile, and for me:

scfc@tools-login:~$ diff -u test.out <(./test.sh)
scfc@tools-login:~$

My test account, however, shows "LANG=en_US.UTF-8" interactively, while "jsub locale" gives "LANG=", even after "export LANG". The same happens if I set a locale other than en_US.UTF-8 before calling jsub, for example with "export LANG=de_DE.UTF-8".

My assumption (and fear :-)) is that SGE sources ~/.profile before job execution, which means that there will be a *lot* of confusion on where to configure locales and how they are evaluated.

I don't want to go down that road if it can be avoided. Is it possible to explicitly set the locale in Python? Otherwise we could change jsub so that users can use qsub's "-v" option to set the locale in the environment:

scfc-test@tools-login:~$ qsub -b y -N locale-en -v LANG=en_US.UTF-8 locale
Your job 1934859 ("locale-en") has been submitted
scfc-test@tools-login:~$ qsub -b y -N locale-de -v LANG=de_DE.UTF-8 locale
Your job 1934865 ("locale-de") has been submitted
scfc-test@tools-login:~$ fgrep LANG locale-*.o*
locale-de.o1934865:LANG=de_DE.UTF-8
locale-de.o1934865:LANGUAGE=
locale-en.o1934859:LANG=en_US.UTF-8
locale-en.o1934859:LANGUAGE=
scfc-test@tools-login:~$

However that does not seem to solve the Python error:

scfc-test@tools-login:~$ cat test.py
#!/usr/bin/python
print u"\xe4"
scfc-test@tools-login:~$ qsub -b y -N python-locale-en -v LANG=en_US.UTF-8 ./test.py
Your job 1934872 ("python-locale-en") has been submitted
scfc-test@tools-login:~$ cat python-locale-en.*
Traceback (most recent call last):
  File "./test.py", line 2, in <module>
    print u"\xe4"
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 0: ordinal not in range(128)
scfc-test@tools-login:~$

And for the dbreps tool I indeed had to use:

# Wrap sys.stdout into a StreamWriter to allow writing unicode.
sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout)

But that is Python 2.7.3 (cf. http://stackoverflow.com/questions/1473577/writing-unicode-strings-via-sys-stdout-in-python, http://pythonhosted.org/kitchen/unicode-frustrations.html, https://wiki.python.org/moin/PrintFails).

I don't know what the situation is for Python 3+.
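For Python 3, a rough equivalent of the StreamWriter trick is to wrap the underlying binary stream in a text wrapper with a fixed encoding. The sketch below demonstrates this on an in-memory buffer so the result can be inspected; utf8_text_stream is a made-up helper name, and hard-coding UTF-8 is an assumption about what the script wants.

```python
import io


def utf8_text_stream(binary_stream):
    """Wrap a binary stream so text written to it is always encoded as UTF-8."""
    # newline="\n" disables platform newline translation on write.
    return io.TextIOWrapper(binary_stream, encoding="utf-8", newline="\n")


# Demonstrate on an in-memory buffer instead of the real stdout.
buffer = io.BytesIO()
out = utf8_text_stream(buffer)
print("\u00e4", file=out)
out.flush()

captured = buffer.getvalue()
print(captured)
```

In a real script the same helper would be applied to the real stream, i.e. sys.stdout = utf8_text_stream(sys.stdout.buffer). On Python 3.7 and later, sys.stdout.reconfigure(encoding="utf-8") should have the same effect without replacing the object.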

Ahh, there's another catch.

valhallasw@tools-login:~$ python ./test.py | tee
Traceback (most recent call last):
  File "./test.py", line 2, in <module>
    print u"\xe4"
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 0: ordinal not in range(128)

valhallasw@tools-login:~$ PYTHONIOENCODING=utf-8 python ./test.py | tee
ä

but that's painful to say the least.
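The effect of PYTHONIOENCODING can also be checked from Python itself by starting a child interpreter with the variable set and asking it what it would use for stdout. This is a sketch for Python 3; child_stdout_encoding is a hypothetical helper name.

```python
import codecs
import os
import subprocess
import sys


def child_stdout_encoding(io_encoding):
    """Start a child interpreter with PYTHONIOENCODING set and
    return the canonical name of the encoding it uses for stdout."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
    )
    reported = result.stdout.decode("ascii").strip()
    # Normalise spelling variants such as "UTF-8" vs "utf-8".
    return codecs.lookup(reported).name


print(child_stdout_encoding("utf-8"))
print(child_stdout_encoding("ascii"))
```

Note that stdout is a pipe here, as in the "| tee" example, so the variable takes effect even without a terminal.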

Python 3 has no issues -- it will just use utf-8 if the LANG says so:

(test.py: print("\xe4") -- remember, str in py3 is unicode in py2)

valhallasw@tools-login:~$ python3 ./test.py | tee
ä

scfc added a comment. Dec 21 2013, 2:54 PM

(In reply to comment #5)

[...]
My assumption (and fear :-)) is that SGE sources ~/.profile before job
execution, which means that there will be a *lot* of confusion on where to
configure locales and how they are evaluated.

I don't want to go down that road if it can be avoided. Is it possible to
explicitly set the locale in Python? Otherwise we could change jsub so that
users can use qsub's "-v" option to set the locale in the environment:
[...]

No, we can't, as a test on my account with LANG set to de_DE.UTF-8 in ~/.profile shows:

scfc@tools-login:~$ qsub -b y -v LANG=it_IT.UTF-8 env
Your job 1935416 ("env") has been submitted
scfc@tools-login:~$ fgrep LANG env.o1935416
LANG=de_DE.UTF-8
scfc@tools-login:~$

In bug #48811 we encountered a similar problem: We need "-b y" for binary programs, but "-b y" adds a (login) shell to the call stack:

scfc@tools-login:~$ { echo '#!/usr/bin/python'; echo 'import os'; echo 'print os.environ["LANG"]'; } > env-test.py && chmod +x env-test.py
scfc@tools-login:~$ qsub -N test-without-b-y -v LANG=it_IT.UTF-8 ./env-test.py
Your job 1935503 ("test-without-b-y") has been submitted
scfc@tools-login:~$ qsub -N test-with-b-y -b y -v LANG=it_IT.UTF-8 ./env-test.py
Your job 1935504 ("test-with-b-y") has been submitted
scfc@tools-login:~$ grep . test-with*-b-y.*
test-with-b-y.o1935504:de_DE.UTF-8
test-without-b-y.o1935503:it_IT.UTF-8

There is a configuration variable login_shells in sge_conf(5), but I'll need to whip Toolsbeta into shape to evaluate options.

For the time being I suggest wrapper scripts.

coren added a comment. Dec 21 2013, 3:48 PM

Part of the difficulty is that there is a combinatorial explosion of starting environments depending on more factors than you can shake a stick at (given the gridengine's propensity to try to "guess" at what you're trying to do, and to (silently) add a shell anytime it thinks you need to evaluate shell arguments).

The best rule of thumb is "if you need something specific in your environment, set it explicitly". I would recommend that one /always/ uses a shell wrapper that sets the environment; a simple generic one might be:

#! /bin/bash

export STUFF_I_NEED="foobar"
export PATH="/all:/the/places"
exec "$@"

This sets STUFF_I_NEED and PATH, then execs the program given as arguments without needlessly keeping a subshell around. The same script can then be used to launch everything reliably.

I *could* make a globally available script that relies on sourcing, say, .bashrc:

#! /bin/bash

. ~/.bashrc
exec "$@"

Which everyone could then use. I could even have it invoked implicitly by jsub at need.
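For tools that are launched from Python anyway, the first wrapper (explicit exports followed by exec "$@") could be sketched in Python as below. The function names and the LC_ALL override are assumptions for illustration, not an existing Toolforge script.

```python
import os


def wrapper_environment(base, overrides):
    """Return a copy of the base environment with explicit overrides applied."""
    env = dict(base)
    env.update(overrides)
    return env


def exec_with_environment(argv, overrides):
    """Replace the current process with argv[0], like `exec "$@"` in the
    shell wrapper, so no extra parent process is kept around."""
    os.execvpe(argv[0], argv, wrapper_environment(os.environ, overrides))


# Show the environment that would be passed on, using sample values.
sample = wrapper_environment({"PATH": "/usr/bin"}, {"LC_ALL": "C.UTF-8"})
print(sorted(sample.items()))
```

A launcher built on this would call exec_with_environment(sys.argv[1:], {"LC_ALL": "C.UTF-8"}) as its last step; the call is left out here because it does not return.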

coren added a comment. Mar 25 2014, 6:04 PM

Is this still a relevant issue?

Left without comment for >six months; reopen if the issue is still relevant.

valhallasw reopened this task as Open. Oct 15 2015, 8:59 PM

Reopening this. A simple example is the following:

valhallasw@tools-bastion-01:~$ cat unitest.py
#!/usr/bin/env python3

print('\u1234')
valhallasw@tools-bastion-01:~$ jsub -l release=trusty unitest.py
Your job 591928 ("unitest") has been submitted
valhallasw@tools-bastion-01:~$ tail unitest.*
==> unitest.err <==
Traceback (most recent call last):
  File "/home/valhallasw/unitest.py", line 3, in <module>
    print('\u1234')
UnicodeEncodeError: 'ascii' codec can't encode character '\u1234' in position 0: ordinal not in range(128)

which can be reproduced with

valhallasw@tools-bastion-01:~$ LC_ALL=C ./unitest.py
Traceback (most recent call last):
  File "./unitest.py", line 3, in <module>
    print('\u1234')
UnicodeEncodeError: 'ascii' codec can't encode character '\u1234' in position 0: ordinal not in range(128)

which is because locale is set to POSIX by SGE:

valhallasw@tools-bastion-01:~$ jsub locale
Your job 591974 ("locale") has been submitted
valhallasw@tools-bastion-01:~$ tail locale.out
==> locale.out <==
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=

I think a sane default would be to use C.UTF-8 instead:

valhallasw@tools-bastion-01:~$ LC_ALL=C.UTF-8 ./unitest.py
ሴ
coren removed coren as the assignee of this task. Nov 16 2015, 6:10 PM
coren added a subscriber: coren.
chasemp triaged this task as Normal priority. Nov 30 2015, 6:32 PM
chasemp added a subscriber: chasemp.
valhallasw moved this task from Triage to Backlog on the Toolforge board. Dec 22 2015, 12:07 PM
bd808 renamed this task from jsub and utf8 to Shell LOCALE neither consistent nor sane across grid engine nodes. Jun 16 2017, 9:19 PM
Kotz added a subscriber: Kotz. Jul 14 2017, 2:11 PM

I think the best solution would be to set both the execution environments in the grid and the user environment on bastion to default to UTF8 which is a modern de-facto standard.

bd808 added a subscriber: bd808. Jul 14 2017, 3:30 PM

I think the best solution would be to set both the execution environments in the grid and the user environment on bastion to default to UTF8 which is a modern de-facto standard.

I would generally agree with this. If we set the default locale to C.UTF-8 things should mostly work as expected. I would actually be in favor of making C.UTF-8 the default locale across all Foundation servers. I fixed a goofy bug in Striker recently by explicitly setting this locale in its uwsgi configuration.

Samwalton9 added a subscriber: Samwalton9. Edited Nov 10 2017, 11:01 AM

I think the best solution would be to set both the execution environments in the grid and the user environment on bastion to default to UTF8 which is a modern de-facto standard.

Having just spent a load of time trying to fix encoding errors when running through the grid, I agree.

The best rule of thumb is "if you need something specific in your environment, set it explicitly".

For what it's worth, this was my fix. Added

export LC_ALL="en_US.UTF-8"

to the top of my jsub script and everything went well again.

Just spent an hour and a half debugging what could be making @Alchimista's bot not work through grid.
I agree with the folk calling for C.UTF-8 as the default locale across the board.

zhuyifei1999 claimed this task. Edited Jan 26 2018, 8:24 PM

This can be fixed if -v LC_ALL=C.UTF-8 is a default argument to jsub. Let me see if this can break existing scripts.

Initial tests look good:

tools.zhuyifei1999-test@tools-bastion-02:~$ cat unitest.sh 
#! /bin/bash
exec 2>&1
set -x

python2 -c 'print(u"\u1234".encode("utf-8"))'
python2 -c 'print(u"\u1234")'
python3 -c 'print(u"\u1234".encode("utf-8"))'
python3 -c 'print(u"\u1234")'
tools.zhuyifei1999-test@tools-bastion-02:~$ LC_ALL=C bash unitest.sh 
+ python2 -c 'print(u"\u1234".encode("utf-8"))'
ሴ
+ python2 -c 'print(u"\u1234")'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u1234' in position 0: ordinal not in range(128)
+ python3 -c 'print(u"\u1234".encode("utf-8"))'
b'\xe1\x88\xb4'
+ python3 -c 'print(u"\u1234")'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\u1234' in position 0: ordinal not in range(128)
tools.zhuyifei1999-test@tools-bastion-02:~$ LC_ALL=C.UTF-8 bash unitest.sh 
+ python2 -c 'print(u"\u1234".encode("utf-8"))'
ሴ
+ python2 -c 'print(u"\u1234")'
ሴ
+ python3 -c 'print(u"\u1234".encode("utf-8"))'
b'\xe1\x88\xb4'
+ python3 -c 'print(u"\u1234")'
ሴ

But on grid python2 seems to ignore the locale:

tools.zhuyifei1999-test@tools-bastion-02:~$ rm LOCALE_C.* &> /dev/null; jsub -N LOCALE_C bash unitest.sh; sleep 5; cat LOCALE_C.out
Your job 357341 ("LOCALE_C") has been submitted
+ python2 -c 'print(u"\u1234".encode("utf-8"))'
ሴ
+ python2 -c 'print(u"\u1234")'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u1234' in position 0: ordinal not in range(128)
+ python3 -c 'print(u"\u1234".encode("utf-8"))'
b'\xe1\x88\xb4'
+ python3 -c 'print(u"\u1234")'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\u1234' in position 0: ordinal not in range(128)
tools.zhuyifei1999-test@tools-bastion-02:~$ rm LOCALE_C.UTF-8.* &> /dev/null; jsub -N LOCALE_C.UTF-8 -v LC_ALL=C.UTF-8 bash unitest.sh; sleep 5; cat LOCALE_C.UTF-8.out
Your job 357343 ("LOCALE_C.UTF-8") has been submitted
+ python2 -c 'print(u"\u1234".encode("utf-8"))'
ሴ
+ python2 -c 'print(u"\u1234")'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u1234' in position 0: ordinal not in range(128)
+ python3 -c 'print(u"\u1234".encode("utf-8"))'
b'\xe1\x88\xb4'
+ python3 -c 'print(u"\u1234")'
ሴ
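The idea of making -v LC_ALL=C.UTF-8 a default argument could look roughly like this. It is a hypothetical sketch in Python for illustration only: jsub itself is written in Perl, with_default_locale is an invented name, and the scan is deliberately naive (it does not distinguish submit options from the job's own arguments).

```python
def with_default_locale(args, default="LC_ALL=C.UTF-8"):
    """Prepend `-v LC_ALL=C.UTF-8` unless the caller already sets LC_ALL via -v."""
    for index, arg in enumerate(args[:-1]):
        if arg == "-v":
            # -v takes a comma-separated list of NAME or NAME=value items.
            names = [item.split("=", 1)[0] for item in args[index + 1].split(",")]
            if "LC_ALL" in names:
                # Respect the user's explicit choice.
                return list(args)
    return ["-v", default] + list(args)


defaulted = with_default_locale(["-N", "LOCALE_C", "bash", "unitest.sh"])
explicit = with_default_locale(["-v", "LC_ALL=en_US.UTF-8", "bash", "unitest.sh"])
print(defaulted)
print(explicit)
```

Going by the grid runs above, a default like this would help the python3 case but would not by itself fix the python2 one.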
zhuyifei1999 removed zhuyifei1999 as the assignee of this task. Jan 26 2018, 8:40 PM