Page MenuHomePhabricator

Toolforge crontab command cannot handle unicode crontab when the locale is non-utf8
Closed, ResolvedPublic

Description

Hello,
I just wanted to update my crontab and if I ask for a list with "crontab -l", I receive the following error message:

Traceback (most recent call last):
  File "/usr/local/bin/crontab", line 337, in <module>
    main()
  File "/usr/local/bin/crontab", line 280, in main
    crontab_text = crontab.load()
  File "/usr/local/bin/crontab", line 120, in load
    return self._remote('-l')
  File "/usr/local/bin/crontab", line 94, in _remote
    stdoutdata, stderrdata = ssh.communicate(input=stdin)
  File "/usr/lib/python3.4/subprocess.py", line 949, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)
  File "/usr/lib/python3.4/subprocess.py", line 1628, in _communicate
    self.stdout.encoding)
  File "/usr/lib/python3.4/subprocess.py", line 877, in _translate_newlines
    data = data.decode(encoding)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf3 in position 10: ordinal not in range(128)

Event Timeline

If I want to update with "crontab crontab.v3.03.txt" I receive the following error message:

Traceback (most recent call last):

File "/usr/local/bin/crontab", line 337, in <module>
  main()
File "/usr/local/bin/crontab", line 276, in main
  crontab.save(args.file.read())
File "/usr/lib/python3.4/encodings/ascii.py", line 26, in decode
  return codecs.ascii_decode(input, self.errors)[0]

UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)

Aklapper renamed this task from I can not list nor edit my crontab on Toolforge to Cannot list nor edit my crontab on Toolforge: "UnicodeDecodeError: 'ascii' codec can't decode byte".Dec 3 2017, 8:40 PM

Which tool is this and on which host is it being run on? It works for me:

tools.zhuyifei1999-test@tools-bastion-03:~$ cat testcrontab 
# 啊1
tools.zhuyifei1999-test@tools-bastion-03:~$ crontab testcrontab
tools.zhuyifei1999-test@tools-bastion-03:~$ crontab -l
# 啊1
File "/usr/lib/python3.4/subprocess.py", line 949, in communicate
  stdout, stderr = self._communicate(input, endtime, timeout)
File "/usr/lib/python3.4/subprocess.py", line 1628, in _communicate
  self.stdout.encoding)
File "/usr/lib/python3.4/subprocess.py", line 877, in _translate_newlines
  data = data.decode(encoding)

This is different from that of tools-bastion-03

Funny things could happen with the python based crontab command. We are using Python 3.4 and subprocess.Popen.communicate in "universal_newlines" mode. Per the docs:

If universal_newlines is True, these file objects will be opened as text streams in universal newlines mode using the encoding returned by locale.getpreferredencoding(False).

locale.getpreferredencoding says:

Return the encoding used for text data, according to user preferences. User preferences are expressed differently on different systems, and might not be available programmatically on some systems, so this function only returns a guess

The remote side of this is a shell-less ssh session as the tool user to tools-crontab-01.

If the input from the local or remote sides contains non-UTF8 chars I would expect things to blow up. This is not a great thing, especially if the error is from the remote side when trying to run crontab -l too see the current crontab contents. Originally I tried to make this all use the latin-1 encoding which should be binary safe and pass through any data. That wasn't implemented properly apparently because it failed to pass any non-ASCII chars. We then changed things to work as they do now with the UTF-8 assumptions, or more accurately the assumption that locale.getpreferredencoding(False) would return UTF-8 and that that would be correct "most of the time".

I really don't like that this needs to care about encodings at all. In my ideal implementation /usr/bin/oge-crontab would not care at all and just pass bytes through from one side to the other like the prior Perl script did. Only the user's terminal session (for display) and the cron service (for parsing and execution) should care about the contents of the crontab. Thus far, coding that binary transparent wrapper in Python3 has eluded us however.

Only the user's terminal session (for display) and the cron service (for parsing and execution) should care about the contents of the crontab. Thus far, coding that binary transparent wrapper in Python3 has eluded us however.

The thing holding us back is add_jsub doing the parsing and adding jsubs, using regexes. I'm not sure how well regexes will operate on binary strings (the docs don't even mention binary). Maybe we could use binary for every other process? latin1 (possibly with errors='replace') could be used in the parsing to determine if jsub should be added, but I can't think of a way to reliably position the addition.

@zhuyifei1999 It is tool ato. I am trying to upgrade our tool with PuTTY on tools-dev.wmflabs.org as I usually did it.

This is indeed reproduceable with a non-utf8 locale:

tools.zhuyifei1999-test@tools-bastion-02:~$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC=en_US.UTF-8
LC_TIME=en_US.UTF-8
LC_COLLATE="en_US.UTF-8"
LC_MONETARY=en_US.UTF-8
LC_MESSAGES="en_US.UTF-8"
LC_PAPER=en_US.UTF-8
LC_NAME=en_US.UTF-8
LC_ADDRESS=en_US.UTF-8
LC_TELEPHONE=en_US.UTF-8
LC_MEASUREMENT=en_US.UTF-8
LC_IDENTIFICATION=en_US.UTF-8
LC_ALL=
tools.zhuyifei1999-test@tools-bastion-02:~$ LANG=en_US LC_ALL=en_US locale
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: No such file or directory
LANG=en_US
LANGUAGE=
LC_CTYPE="en_US"
LC_NUMERIC="en_US"
LC_TIME="en_US"
LC_COLLATE="en_US"
LC_MONETARY="en_US"
LC_MESSAGES="en_US"
LC_PAPER="en_US"
LC_NAME="en_US"
LC_ADDRESS="en_US"
LC_TELEPHONE="en_US"
LC_MEASUREMENT="en_US"
LC_IDENTIFICATION="en_US"
LC_ALL=en_US
tools.zhuyifei1999-test@tools-bastion-02:~$ LANG=en_US LC_ALL=en_US crontab -l
Traceback (most recent call last):
  File "/usr/local/bin/crontab", line 337, in <module>
    main()
  File "/usr/local/bin/crontab", line 297, in main
    print(crontab.load(), end='')  # crontab already has lf
  File "/usr/local/bin/crontab", line 120, in load
    return self._remote('-l')
  File "/usr/local/bin/crontab", line 94, in _remote
    stdoutdata, stderrdata = ssh.communicate(input=stdin)
  File "/usr/lib/python3.4/subprocess.py", line 949, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)
  File "/usr/lib/python3.4/subprocess.py", line 1628, in _communicate
    self.stdout.encoding)
  File "/usr/lib/python3.4/subprocess.py", line 877, in _translate_newlines
    data = data.decode(encoding)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 2: ordinal not in range(128)
zhuyifei1999 renamed this task from Cannot list nor edit my crontab on Toolforge: "UnicodeDecodeError: 'ascii' codec can't decode byte" to Toolforge crontab command cannot handle unicode crontab when the locale is non-utf8.Dec 4 2017, 6:24 AM
$ ssh tools-cron-01.tools.eqiad.wmflabs
$ file -bi /var/spool/cron/crontabs/tools.ato
text/plain; charset=iso-8859-1

iso-8859-1 is "latin-1" encoding.

When I view the file with less in a session using UTF8 encoding I see:
#perc <F3>ra nap h<F3>nap h<E9>tnapja
This shows the latin-1 characters that are invalid UTF8 codepoints.

I checked all of the crontabs on tools-cron-01 and it turns out that this is the only one that file sees as anything other than UTF-8 or plain ASCII.

I used vim to convert the latin-1 encoding to utf-8 on the cron server directly with :set fileencoding=utf-8. The same line from above now displays as:
#perc óra nap hónap hétnapja

I think this will fix the crontab -l problem. It should be impossible for another non-utf-8/ascii crontab file to be created. Can we call this done, or do we need to force a UTF-8 locale on the remote side of the ssh connection?

Python3 uses surrogates when the system pass in unicode information that cannot be interpreted with a given locale (PEP 383):

zhuyifei1999@zhuyifei1999-ThinkPad-X260:~$ LC_ALL=en_US LANG=en_US python3 -c 'print(repr(__import__("sys").argv[-1]))' /dev/null '啊''\udce5\udc95\udc8a'
zhuyifei1999@zhuyifei1999-ThinkPad-X260:~$ LC_ALL=en_US.UTF-8 LANG=en_US.UTF-8 python3 -c 'print(repr(__import__("sys").argv[-1]))' /dev/null '啊'
'啊'

Looking into python codecs docs, we can actually (ab)use it:

>>> b'\xe5\x95\x8a'.decode('ascii', 'surrogateescape')
'\udce5\udc95\udc8a'
>>> b'\xe5\x95\x8a'.decode('ascii', 'surrogateescape').encode('ascii', 'surrogateescape')
b'\xe5\x95\x8a'
>>> b'\xe5\x95\x8a'.decode('ascii', 'surrogateescape').encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

Change 394923 had a related patch set uploaded (by Zhuyifei1999; owner: Zhuyifei1999):
[labs/toollabs@master] crontab: convert to mostly binary processing, and use surrogates for text

https://gerrit.wikimedia.org/r/394923

Thank you! At the end of the day I could update my crontab.

Unfortunatelly I had to delete the following two explanation lines from my crontab file:

#perc ora nap honap hetnapja
#(0-59) (0-23) (1-30/31) (1-12) (0-6)

Because with them I received the following error message:

"-":0: bad minute
errors in crontab file, can't install.
/usr/local/bin/crontab: unable to execute remote crontab command

So this task can be call done from my part.

I would have two more bonus questions. I hope any of you can answer.

  1. How can l set back my locale to default, as it originally was?
  2. How can I receive again a nice log about different tasks of my bot in my (Hungarian) language? It was nice up to ~2015-08-10.

Since 2015 August something is changed and I can not solve it.
See for example a log here: https://tools.wmflabs.org/ato/log/frissdump.txt

(we still need to make crontab command disregard the locale)

As for your bonus questions, could you come to irc freenode #wikimedia-cloud? I do not fully understand your issue and cause.

On IRC, we found that their $ python3 -c 'print(__import__("locale").getpreferredencoding(False))' returns UTF-8 (this fails to explain why the tracebacks able uses acsii to decode the binary streams), and subsequent calls to crontab errors out with UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf3 in position 10: invalid continuation byte. I was able to reproduce this:

tools.zhuyifei1999-test@tools-bastion-02:~$ python3 -c '__import__("sys").stdout.buffer.write("#perc     óra       nap      hónap  hétnapja".encode("latin-1"))' > T181948
tools.zhuyifei1999-test@tools-bastion-02:~$ crontab T181948
Traceback (most recent call last):
  File "/usr/local/bin/crontab", line 337, in <module>
    main()
  File "/usr/local/bin/crontab", line 276, in main
    crontab.save(args.file.read())
  File "/usr/lib/python3.4/codecs.py", line 313, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf3 in position 10: invalid continuation byte

@Ato_01 Regarding the logs: does https://tools.wmflabs.org/yifeibot/frissdump.txt looks good to you? I converted the file from half-utf-8-half-latin-1 to entirely utf-8 with this script:

import sys

data = sys.stdin.buffer.read()

for line in data.split(b'\n'):
    try:
        print(line.decode('utf-8'))
    except UnicodeDecodeError:
        print(line.decode('latin-1'))

@zhuyifei1999 character types are excellent, I am missing only my line brakes. :)
I thought there is a switch somewhere in Linux OP system and all hungarian caracters in my log files (including ó, á, é etc.) will be written in the right form.

Change 394923 merged by Arturo Borrero Gonzalez:
[labs/toollabs@master] crontab: convert to mostly binary processing, and use surrogates for text

https://gerrit.wikimedia.org/r/394923

Change 405325 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[labs/toollabs@master] d/changelog: refresh entry for relase

https://gerrit.wikimedia.org/r/405325

Change 405325 merged by jenkins-bot:
[labs/toollabs@master] d/changelog: refresh entry for release

https://gerrit.wikimedia.org/r/405325

Steps to reproduce a bad crontab:

  1. prepare a file that does not decode with utf-8:
tools.zhuyifei1999-test@tools-bastion-02:~$ python -c 'open("T181948", "wb").write("#perc     \xf3ra       nap      h\xf3nap  h\xe9tnapja")'
tools.zhuyifei1999-test@tools-bastion-02:~$ python -c 'print(repr(open("T181948", "rb").read()))'
'#perc     \xf3ra       nap      h\xf3nap  h\xe9tnapja'
tools.zhuyifei1999-test@tools-bastion-02:~$ xxd T181948 
0000000: 2370 6572 6320 2020 2020 f372 6120 2020  #perc     .ra   
0000010: 2020 2020 6e61 7020 2020 2020 2068 f36e      nap      h.n
0000020: 6170 2020 68e9 746e 6170 6a61            ap  h.tnapja
  1. load the crontab:
tools.zhuyifei1999-test@tools-bastion-02:~$ crontab T181948 
Traceback (most recent call last):
  File "/usr/local/bin/crontab", line 337, in <module>
    main()
  File "/usr/local/bin/crontab", line 276, in main
    crontab.save(args.file.read())
  File "/usr/lib/python3.4/codecs.py", line 313, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf3 in position 10: invalid continuation byte

Because it insists on reading the file as utf-8, we use the surrogate fix:

tools.zhuyifei1999-test@tools-bastion-02:~$ curl https://phab.wmfusercontent.org/file/data/ztkgxykofgv2ghkga34n/PHID-FILE-g65lazz2f42fcvzwwdzr/oge-crontab > oge-crontab.py
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 12306    0 12306    0     0   125k      0 --:--:-- --:--:-- --:--:--  126k
tools.zhuyifei1999-test@tools-bastion-02:~$ chmod a+x oge-crontab.py 
tools.zhuyifei1999-test@tools-bastion-02:~$ ./oge-crontab.py T181948 
tools.zhuyifei1999-test@tools-bastion-02:~$
  1. kaboom:
tools.zhuyifei1999-test@tools-bastion-02:~$ crontab -l
Traceback (most recent call last):
  File "/usr/local/bin/crontab", line 337, in <module>
    main()
  File "/usr/local/bin/crontab", line 297, in main
    print(crontab.load(), end='')  # crontab already has lf
  File "/usr/local/bin/crontab", line 120, in load
    return self._remote('-l')
  File "/usr/local/bin/crontab", line 94, in _remote
    stdoutdata, stderrdata = ssh.communicate(input=stdin)
  File "/usr/lib/python3.4/subprocess.py", line 949, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)
  File "/usr/lib/python3.4/subprocess.py", line 1628, in _communicate
    self.stdout.encoding)
  File "/usr/lib/python3.4/subprocess.py", line 877, in _translate_newlines
    data = data.decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf3 in position 10: invalid continuation byte
tools.zhuyifei1999-test@tools-bastion-02:~$ crontab -e
Traceback (most recent call last):
  File "/usr/local/bin/crontab", line 337, in <module>
    main()
  File "/usr/local/bin/crontab", line 280, in main
    crontab_text = crontab.load()
  File "/usr/local/bin/crontab", line 120, in load
    return self._remote('-l')
  File "/usr/local/bin/crontab", line 94, in _remote
    stdoutdata, stderrdata = ssh.communicate(input=stdin)
  File "/usr/lib/python3.4/subprocess.py", line 949, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)
  File "/usr/lib/python3.4/subprocess.py", line 1628, in _communicate
    self.stdout.encoding)
  File "/usr/lib/python3.4/subprocess.py", line 877, in _translate_newlines
    data = data.decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf3 in position 10: invalid continuation byte
tools.zhuyifei1999-test@tools-bastion-02:~$ crontab -r
Traceback (most recent call last):
  File "/usr/local/bin/crontab", line 337, in <module>
    main()
  File "/usr/local/bin/crontab", line 305, in main
    crontab.load()
  File "/usr/local/bin/crontab", line 120, in load
    return self._remote('-l')
  File "/usr/local/bin/crontab", line 94, in _remote
    stdoutdata, stderrdata = ssh.communicate(input=stdin)
  File "/usr/lib/python3.4/subprocess.py", line 949, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)
  File "/usr/lib/python3.4/subprocess.py", line 1628, in _communicate
    self.stdout.encoding)
  File "/usr/lib/python3.4/subprocess.py", line 877, in _translate_newlines
    data = data.decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf3 in position 10: invalid continuation byte
tools.zhuyifei1999-test@tools-bastion-02:~$ ./oge-crontab.py -r
tools.zhuyifei1999-test@tools-bastion-02:~$

It appears in the case of tools-bastion-02 that the package is the correct current version w/ the patch from https://gerrit.wikimedia.org/r/#/c/383770/1/misctools/oge-crontab but still demonstrates the issue :)

@aborrero did you potentially upgrade the jobutils package and not misctools?

root@tools-bastion-03:~# ls -al /usr/local/bin/crontab
lrwxrwxrwx 1 root root 20 Nov 20 20:54 /usr/local/bin/crontab -> /usr/bin/oge-crontab

root@tools-bastion-03:~# dpkg -L misctools | grep crontab
/usr/bin/oge-crontab

based on https://phabricator.wikimedia.org/T185314#3913246

Mentioned in SAL (#wikimedia-cloud) [2018-01-22T10:18:03Z] <arturo> T181948 deploy misctools 1.27 in the cluster

@aborrero did you potentially upgrade the jobutils package and not misctools?

True. Deployed now and updated the docs at: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin#Bd808%27s_method_(pdebuild)

@zhuyifei1999 could you please test again?

I see now this:

tools-bastion-02.tools.eqiad.wmflabs: Traceback (most recent call last):
tools-bastion-02.tools.eqiad.wmflabs:   File "/usr/local/bin/crontab", line 345, in <module>
tools-bastion-02.tools.eqiad.wmflabs:     main()
tools-bastion-02.tools.eqiad.wmflabs:   File "/usr/local/bin/crontab", line 284, in main
tools-bastion-02.tools.eqiad.wmflabs:     crontab.save(args.file.read())
tools-bastion-02.tools.eqiad.wmflabs:   File "/usr/local/bin/crontab", line 130, in save
tools-bastion-02.tools.eqiad.wmflabs:     jsub_text = self._add_jsub(text)
tools-bastion-02.tools.eqiad.wmflabs:   File "/usr/local/bin/crontab", line 115, in _add_jsub
tools-bastion-02.tools.eqiad.wmflabs:     text = text.decode('ascii', 'surrogateescape').split('\n')
tools-bastion-02.tools.eqiad.wmflabs: AttributeError: 'str' object has no attribute 'decode'

and

tools-bastion-05.tools.eqiad.wmflabs: Traceback (most recent call last):
tools-bastion-05.tools.eqiad.wmflabs:   File "/usr/local/bin/crontab", line 345, in <module>
tools-bastion-05.tools.eqiad.wmflabs:     main()
tools-bastion-05.tools.eqiad.wmflabs:   File "/usr/local/bin/crontab", line 284, in main
tools-bastion-05.tools.eqiad.wmflabs:     crontab.save(args.file.read())
tools-bastion-05.tools.eqiad.wmflabs:   File "/usr/local/bin/crontab", line 130, in save
tools-bastion-05.tools.eqiad.wmflabs:     jsub_text = self._add_jsub(text)
tools-bastion-05.tools.eqiad.wmflabs:   File "/usr/local/bin/crontab", line 115, in _add_jsub
tools-bastion-05.tools.eqiad.wmflabs:     text = text.decode('ascii', 'surrogateescape').split('\n')
tools-bastion-05.tools.eqiad.wmflabs: AttributeError: 'str' object has no attribute 'decode'

Mentioned in SAL (#wikimedia-cloud) [2018-01-22T11:21:09Z] <arturo> puppet in the cluster is mostly fine, except for a couple of deprecation warnings, a conn timeout to services-01 and https://phabricator.wikimedia.org/T181948#3916790

What's the command that was sent to crontab? Is it reading from stdin or a file? In either case they should both be bytes, not str, in my understanding:

group.add_argument(
    'file', default=sys.stdin.buffer, nargs='?',
    type=argparse.FileType('rb'),
    help='replace crontab with file (default)')

I have trouble reproducing this in either cases:

tools.zhuyifei1999-test@tools-bastion-05:~$ crontab -r; echo '# 啊'  | crontab; crontab -l
# 啊
tools.zhuyifei1999-test@tools-bastion-05:~$ crontab -r; crontab <(echo '# 啊'); crontab -l
# 啊

Does the error happen on all three bastions? It it possible to print the contents of sys.argv, os.environ, repr(sys.stdin), & repr(sys.stdin.buffer)?

From what I can tell from puppet's source code, puppet uses stdin, with - as its argument to specify stdin.
https://github.com/puppetlabs/puppet/blob/4.10.x/lib/puppet/util/filetype.rb#L204

I got this:

tools.zhuyifei1999-test@tools-bastion-05:~$ crontab -r; echo '# 啊' | crontab -; crontab -l
Traceback (most recent call last):
  File "/usr/local/bin/crontab", line 345, in <module>
    main()
  File "/usr/local/bin/crontab", line 284, in main
    crontab.save(args.file.read())
  File "/usr/local/bin/crontab", line 130, in save
    jsub_text = self._add_jsub(text)
  File "/usr/local/bin/crontab", line 115, in _add_jsub
    text = text.decode('ascii', 'surrogateescape').split('\n')
AttributeError: 'str' object has no attribute 'decode'
no crontab for tools.zhuyifei1999-test

Apparently argparse.FileType does not respect whether the mode string is binary or text. This is https://bugs.python.org/issue14156

There are two ways I thought of that could fix this:

  • make the replace action, or the parent after parser.parse_args(), check if args.file is sys.stdin and if so set it to its buffer.
  • or make the argument open the file always in text mode, and always read on the file's buffer.

I slightly more prefer the former due to its seems less weird and we can point to the python bug so that nobody would undo the workaround, but the latter may seem more explicitly binary.

Change 405734 had a related patch set uploaded (by Zhuyifei1999; owner: Zhuyifei1999):
[labs/toollabs@master] crontab: Make the input file is binary when it is sys.stdin

https://gerrit.wikimedia.org/r/405734

tools.zhuyifei1999-test@tools-bastion-05:~$ diff `which crontab` crontab.py 
253a254,257
>     # https://bugs.python.org/issue14156
>     if args.file is sys.stdin:
>         args.file = args.file.buffer
> 
tools.zhuyifei1999-test@tools-bastion-05:~$ crontab -r; echo '# 啊' | ./crontab.py -; crontab -l
# 啊

Change 405734 merged by jenkins-bot:
[labs/toollabs@master] crontab: Make the input file is binary when it is sys.stdin

https://gerrit.wikimedia.org/r/405734

Mentioned in SAL (#wikimedia-cloud) [2018-01-22T18:32:22Z] <arturo> T181948 T185314 deploying jobutils and misctools v1.28 in the cluster

Now both tools-bastion-02 and tools-bastion-05 seems OK with regards to this issue.

zhuyifei1999 claimed this task.