Stretch in docker registry forces ascii encoding
Closed, Resolved · Public

Description

I have this blubber file for Wikilabels (source)

version: v3
base: docker-registry.wikimedia.org/wikimedia-stretch:latest
apt:
  packages: [postgresql-server-dev-all, postgresql, libffi-dev, g++, python3-dev, libmemcached-dev, python3-setuptools, ca-certificates, libz-dev]
python:
  version: python3.5
lives:
  in: /srv/service
runs:
  environment:
    PYTHONUTF8: 1
    PYTHONIOENCODING: utf-8
variants:
  build:
    python:
      requirements: [requirements.txt]
  development:
    includes: [build]
    entrypoint: [./utility, dev_server]

  test:
    includes: [build]
    python:
      requirements: [requirements.txt, requirements-test.txt]
    runs:
      insecurely: true
    entrypoint: ["pytest", "-vvv", "--cov=wikilabels", "-m 'not nottravis'"]

  prep:
    includes: [build]
    node:
      env: production

  production:
    base: debian:stretch-slim
    node:
      env: production
    copies: prep
    entrypoint: [node, server.js]

After building the Dockerfile and running the test variant, it always fails with a UnicodeDecodeError because Python uses ascii as the default encoding. This is the Dockerfile:

FROM docker-registry.wikimedia.org/wikimedia-stretch:latest
USER "root"
ENV HOME="/root"
ENV DEBIAN_FRONTEND="noninteractive"
RUN apt-get update && apt-get install -y "postgresql-server-dev-all" "postgresql" "libffi-dev" "g++" "python3-dev" "libmemcached-dev" "python3-setuptools" "ca-certificates" "libz-dev" && rm -rf /var/lib/apt/lists/*
RUN python3.5 "-m" "easy_install" "pip" && python3.5 "-m" "pip" "install" "-U" "setuptools" "wheel" "tox"
RUN groupadd -o -g "65533" -r "somebody" && useradd -o -m -d "/home/somebody" -r -g "somebody" -u "65533" "somebody" && mkdir -p "/srv/service" && chown "65533":"65533" "/srv/service" && mkdir -p "/opt/lib" && chown "65533":"65533" "/opt/lib"
RUN groupadd -o -g "900" -r "runuser" && useradd -o -m -d "/home/runuser" -r -g "runuser" -u "900" "runuser"
USER "somebody"
ENV HOME="/home/somebody"
WORKDIR "/srv/service"
ENV PYTHONIOENCODING="utf-8" PYTHONUTF8="1"
ENV PIP_FIND_LINKS="file:///opt/lib/python" PIP_WHEEL_DIR="/opt/lib/python"
RUN mkdir -p "/opt/lib/python"
COPY --chown=65533:65533 ["requirements.txt", "requirements-test.txt", "./"]
RUN python3.5 "-m" "pip" "wheel" "-r" "requirements.txt" "-r" "requirements-test.txt" && python3.5 "-m" "pip" "install" "--target" "/opt/lib/python/site-packages" "-r" "requirements.txt" "-r" "requirements-test.txt"
COPY --chown=65533:65533 [".", "."]
ENV PATH="/opt/lib/python/site-packages/bin:${PATH}" PIP_NO_INDEX="1" PYTHONPATH="/opt/lib/python/site-packages"
ENTRYPOINT ["pytest", "-vvv", "--cov=wikilabels", "-m 'not nottravis'"]
LABEL blubber.variant="test" blubber.version="0.4.0+882f0fc"

I'm enforcing UTF-8 as much as possible but still I get something like this:

wikilabels/tests/test_auth_routes.py:1: in <module>
    from .routes_test_fixture import app  # noqa
wikilabels/tests/routes_test_fixture.py:6: in <module>
    from ..wsgi import server
wikilabels/wsgi/server.py:10: in <module>
    from . import assets, routes, sessions
wikilabels/wsgi/routes/__init__.py:7: in <module>
    from . import form_builder
wikilabels/wsgi/routes/form_builder.py:5: in <module>
    from ..util import build_script_tags, build_style_tags, static_file_path
wikilabels/wsgi/util.py:13: in <module>
    from ..i18n import i18n
wikilabels/i18n/__init__.py:12: in <module>
    MESSAGES[lang_code] = json.load(open(file_loc))
/usr/lib/python3.5/json/__init__.py:265: in load
    return loads(fp.read(),
/usr/lib/python3.5/encodings/ascii.py:26: in decode
    return codecs.ascii_decode(input, self.errors)[0]
E   UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 105: ordinal not in range(128)

I also tried installing the locales package, but that didn't work either.
One hint: if I change the base from the Wikimedia stretch image to the Docker Hub stretch image, it fails for a different reason.
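
A note on the traceback above: the failing call is `open(file_loc)` without an `encoding` argument, so Python falls back to the locale's preferred encoding, which is ASCII under the POSIX locale. `PYTHONIOENCODING` only covers stdin/stdout/stderr, and `PYTHONUTF8` is only recognized from Python 3.7 on, so neither changes what `open()` does on python3.5. Below is a minimal sketch of the failure and of a locale-independent workaround. `load_messages` and the temporary file are hypothetical stand-ins, not Wikilabels code.

```python
import json
import os
import tempfile


def load_messages(path):
    """Load a JSON message file as UTF-8 regardless of the process locale."""
    # Passing encoding explicitly avoids the locale-dependent default of open().
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# 0xc3 (the byte named in the traceback) is the first byte of "é" in UTF-8.
raw = json.dumps({"star": "étoile"}, ensure_ascii=False).encode("utf-8")

try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    # Same class of error as in the traceback when the default codec is ASCII.
    ascii_failed = True

# Write the bytes to a temporary file and read them back with load_messages().
fd, tmp_path = tempfile.mkstemp(suffix=".json")
with os.fdopen(fd, "wb") as f:
    f.write(raw)
try:
    messages = load_messages(tmp_path)
finally:
    os.remove(tmp_path)

# Print the UTF-8 bytes so the output is safe even on an ASCII-only stdout.
print(ascii_failed, messages["star"].encode("utf-8"))
```

Fixing the locale in the image is still worthwhile, since other code may also rely on the default encoding, but an explicit `encoding="utf-8"` makes this particular load independent of it.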

Event Timeline

docker-registry.wikimedia.org/wikimedia-stretch:latest does not have locale config:
docker run --entrypoint=/usr/bin/locale docker-registry.wikimedia.org/wikimedia-stretch:latest

LANG=
LANGUAGE=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=

For CI we have our own base containers based on that one, eg docker-registry.wikimedia.org/releng/ci-stretch which does:

Dockerfile.template
# Locale generation, auto generated by installing 'locales'
RUN echo "en_US.UTF-8 UTF-8" > /etc/locale.gen \
    && {{ "ca-certificates git locales" | apt_install }} \
    && install --directory --mode 777 "${XDG_CACHE_HOME}" /log /src

ENV LANG='en_US.UTF-8' LANGUAGE='en_US:en' LC_ALL='en_US.UTF-8'

So yes: you need a locale.gen file, the locales package installed, and LC_ALL set. That should be sufficient.
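
To see what Python actually derives from the locale inside a given container, something like the following can be run with `python3`. This is only a diagnostic sketch; `encoding_report` is a name made up for illustration.

```python
import locale
import sys


def encoding_report():
    """Collect the encodings Python derives from the current environment."""
    return {
        # Default text encoding taken from the locale (what open() falls back to).
        "preferred": locale.getpreferredencoding(False),
        # Encoding used when printing to the terminal or a pipe.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding used for file names and command-line arguments.
        "filesystem": sys.getfilesystemencoding(),
    }


for name, value in sorted(encoding_report().items()):
    print(name, value)
```

With an unset or POSIX locale on python3.5, the first value is expected to be an ASCII codec name; with a UTF-8 locale it should report UTF-8.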

And I forgot to post the repro case:

$ LC_ALL=POSIX python3 -c "print('étoile')"
Unable to decode the command from the command line:
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 7: surrogates not allowed

But:

$ LC_ALL=fr_FR.UTF-8 python3 -c "print('étoile')"
étoile

Did you know? "étoile" is French for "star".

Oh thanks! It looks awesome. Do you think we should put the locale bits into Blubber so they are added automatically?

I am not that familiar with the Release Pipeline (Blubber), but I would guess that when one does:

python:
  version: python3.5

Then Blubber should configure the locales as we did for the CI containers (T210260#4770132). I think that is what most people would expect, and surely anyone would want Unicode support out of the box.

The alternative is to create another base image, based on docker-registry.wikimedia.org/wikimedia-stretch, which would have Unicode support. But that seems like too many images to maintain. The overhead of having Blubber always configure the locale seems minimal.

Seems like Blubber for python should add locale and set LC_ALL to something having UTF-8. Maybe C.UTF-8?

I confirm that generating the locales in the image, or installing locales-all (which has all locales generated) will make the étoile test work. I agree that it might be good for Blubber to automatically add one of those, but at least there's a workaround without Blubber changing. locales-all adds about 125 MB to the size of the image, so generating a UTF8 locale is probably better.

I don't know which locale should be used. The C.UTF8 locale does not seem to work with the test case.

I ran this command: docker run unitest env LC_ALL=fr_FR.UTF8 python3 -c "print('étoile')"

Varying that to set C.UTF8 or fi_FI.UTF8 gives different results: C.UTF8 fails, while the fi and fr ones work.

What about en_US.UTF8? Most servers have this locale installed.

19:21 exolobe4:~/uni $ docker run unitest env LC_ALL=en_US.UTF8 python3 -c "print('étoile')"
étoile

The en_US.UTF8 locale seems to also work. Note that the locale needs to be in the docker image; it doesn't matter what's on the host.

That sounds good to add to Blubber. My knowledge of Go (and Docker) is not good enough to add it on my own, so I'll leave it to the team. Thanks <3

On my local Debian Stretch machine:

$ locale -a
C
C.UTF-8
fr_FR.utf8
POSIX
$ LC_ALL=C.UTF-8 python3 -c "print('étoile')"
étoile

But I guess we are fine going with en_US.UTF-8.

The base container apparently already has the appropriate locale, so we can have the container set ENV LC_ALL=C.UTF-8 or let Blubber do it.

$ docker run --rm -it docker-registry.wikimedia.org/stretch

 # locale
LANG=
LANGUAGE=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=

# locale -a
C
C.UTF-8
POSIX
# apt update && apt install python3-minimal
...
# export LC_ALL=C.UTF-8
# bash
# python3 -c 'print("étoile")'
étoile
#

Note, I had to reload bash to get LC_ALL to apply and be able to write étoile in the terminal.

Hm, this is strange now. This works on my host:

env LC_ALL=C.UTF8 python3 -c "print('étoile')"

This does not (in case I do something stupid there):

docker run unitest env LC_ALL=C.UTF8 python3 -c "print('étoile')"

This is my Dockerfile:

FROM docker-registry.wikimedia.org/wikimedia-stretch:latest
USER "root"
ENV HOME="/root"
ENV DEBIAN_FRONTEND="noninteractive"
RUN apt-get update && apt-get install -y "postgresql-server-dev-all" "postgresql" "libffi-dev" "g++" "python3-dev" "libmemcached-dev" "python3-setuptools" "ca-certificates" "libz-dev" && rm -rf /var/lib/apt/lists/*
RUN python3.5 "-m" "easy_install" "pip" && python3.5 "-m" "pip" "install" "-U" "setuptools" "wheel" "tox"
RUN groupadd -o -g "65533" -r "somebody" && useradd -o -m -d "/home/somebody" -r -g "somebody" -u "65533" "somebody" && mkdir -p "/srv/service" && chown "65533":"65533" "/srv/service" && mkdir -p "/opt/lib" && chown "65533":"65533" "/opt/lib"
RUN groupadd -o -g "900" -r "runuser" && useradd -o -m -d "/home/runuser" -r -g "runuser" -u "900" "runuser"
USER "somebody"
ENV HOME="/home/somebody"
WORKDIR "/srv/service"
ENV PYTHONIOENCODING="utf-8" PYTHONUTF8="1"
ENV PIP_FIND_LINKS="file:///opt/lib/python" PIP_WHEEL_DIR="/opt/lib/python"
RUN mkdir -p "/opt/lib/python"

# COPY --chown=65533:65533 ["requirements.txt", "requirements-test.txt", "./"]
# RUN python3.5 "-m" "pip" "wheel" "-r" "requirements.txt" "-r" "requirements-test.txt" && python3.5 "-m" "pip" "install" "--target" "/opt/lib/python/site-packages" "-r" "requirements.txt" "-r" "requirements-test.txt"

COPY --chown=65533:65533 [".", "."]
ENV PATH="/opt/lib/python/site-packages/bin:${PATH}" PIP_NO_INDEX="1" PYTHONPATH="/opt/lib/python/site-packages"

ENTRYPOINT ["pytest", "-vvv", "--cov=wikilabels", "-m 'not nottravis'"]

LABEL blubber.variant="test" blubber.version="0.4.0+882f0fc"

This does report the C.UTF-8 locale as being available:

docker run unitest locale -a

C.UTF8 does not exist. In every other locale I try, a UTF8 suffix is an alias to the UTF-8 suffix (with the dash).

This works: docker run unitest env LC_ALL=C.UTF-8 python3 -c "print('étoile')"
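
Whether a given spelling is accepted can also be probed from Python inside the image, which avoids guessing at aliases. A rough sketch follows; `usable_locales` is a hypothetical helper, and the results depend entirely on which locales the image's libc provides.

```python
import locale


def usable_locales(candidates):
    """Return {name: bool} telling whether setlocale() accepts each name."""
    # Calling setlocale() without a value only queries the current setting.
    saved = locale.setlocale(locale.LC_CTYPE)
    results = {}
    try:
        for name in candidates:
            try:
                locale.setlocale(locale.LC_CTYPE, name)
                results[name] = True
            except locale.Error:
                # Raised when the C library does not know the locale name.
                results[name] = False
    finally:
        # Restore whatever was active before probing.
        locale.setlocale(locale.LC_CTYPE, saved)
    return results


print(usable_locales(["C", "C.UTF-8", "C.UTF8", "en_US.UTF-8"]))
```

Running it in the container rather than on the host matters, since only the locales present in the image count.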

I'd suggest that we use the C.UTF-8 locale, but I see no strong reason to prefer it over the en_US.UTF-8 locale.

If we want to set a default locale for images built by Blubber, we can set LC_ALL or LC_CTYPE. The former overrides all other locale environment variables, the latter only for character sets. The former is simpler, the latter is a smaller change to status quo. I don't know which is better.
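
The precedence between these variables can be sketched as a small lookup. This follows the order POSIX describes (LC_ALL first, then the specific LC_* variable, then LANG); the function name and the "POSIX" fallback string are illustrative choices that match the `locale` output shown earlier in this task.

```python
def effective_locale(category, env):
    """Resolve one locale category: LC_ALL, then the LC_* variable, then LANG."""
    for variable in ("LC_ALL", category, "LANG"):
        value = env.get(variable)
        if value:  # unset and empty are both treated as "not set"
            return value
    return "POSIX"  # fallback assumed here when nothing is set


base = {"LANG": "", "LC_ALL": ""}
print(effective_locale("LC_CTYPE", base))                           # POSIX
print(effective_locale("LC_CTYPE", dict(base, LC_CTYPE="C.UTF-8")))  # C.UTF-8
print(effective_locale("LC_TIME", dict(base, LC_CTYPE="C.UTF-8")))   # POSIX
print(effective_locale("LC_TIME", dict(base, LC_ALL="C.UTF-8")))     # C.UTF-8
```

Setting only LC_CTYPE leaves categories such as LC_TIME untouched, while LC_ALL changes every category at once.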

@dduvall I'm looking at Blubber source and not sure where to add a default variable to generated Dockerfiles. Meaning, the Dockerfile should have a line like this:

ENV LC_ALL=C.UTF-8

The best I can come up with on my own is adding the following line as line 33 of config/runs.go (I think the function is called Merge):

run.Environment["LC_ALL"] = "C.UTF-8"

Is that a good way of doing it?

C.UTF8 does not exist. In every other locale I try, a UTF8 suffix is an alias to the UTF-8 suffix (with the dash).

This works: docker run unitest env LC_ALL=C.UTF-8 python3 -c "print('étoile')"

I'd suggest that we use the C.UTF-8 locale, but I see no strong reason to prefer it over the en_US.UTF-8 locale.

Actually there is. C.UTF-8 avoids all the other, non-C, locale-related behavior (LC_NUMERIC, LC_COLLATE, etc.) and gives you just the UTF-8 functionality. Not only that, but the en_US and C locales are not fully interchangeable:

The C locale is the standard locale, it implements the ISO C standard and basically is a en_US locale with a metric system and 24 hours time format.

For a while there was even a package in Debian to provide just that (https://tracker.debian.org/pkg/open-infrastructure-locales-c.utf-8). It was not accepted into anything but sid and was removed about 1.5 years ago.

That being said as far as I am aware, this is a Debian (and derivatives) specific locale (not that it matters much, our images are based on Debian) and glibc is still ironing out the details (per https://sourceware.org/bugzilla/show_bug.cgi?id=17318)

I would advise using C.UTF-8 unless there is a clear reason not to. And since this is generic enough, it should not be Blubber's job to make sure the locale exists, only that it is used (potentially overridable, although I don't see a use case yet). So ping me if there is a problem with that and I'll make sure it exists in the base images.
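
The "used by default, but overridable" behavior could look roughly like the following. This is not Blubber code (Blubber is written in Go); `with_default_locale` and `DEFAULT_LOCALE` are hypothetical names used only to illustrate the idea.

```python
DEFAULT_LOCALE = "C.UTF-8"  # the value suggested in this thread


def with_default_locale(environment):
    """Return a copy of the environment with LC_ALL defaulted, not forced."""
    merged = dict(environment)
    # setdefault() keeps any LC_ALL the image author already chose.
    merged.setdefault("LC_ALL", DEFAULT_LOCALE)
    return merged


print(with_default_locale({"PYTHONIOENCODING": "utf-8"}))
print(with_default_locale({"LC_ALL": "en_US.UTF-8"}))
```

An unconditional assignment would instead overwrite a value set in the Blubber file, which is the difference between a default and an enforced setting.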

Wow, thank you Alexandros for the details. The world of locales puzzles me every time I have to look into it.

I am tempted to suggest we add the locale tweak to the base images docker-registry.wikimedia.org/wikimedia-jessie and docker-registry.wikimedia.org/wikimedia-stretch. It seems we want to set LC_ALL to adjust all categories (and avoid the imperial measurement system).

The CI containers will need to be adjusted as well:

integration/config$ git grep LC_ALL
ci-jessie/Dockerfile.template:ENV LANG='en_US.UTF-8' LANGUAGE='en_US:en' LC_ALL='en_US.UTF-8'
ci-stretch/Dockerfile.template:ENV LANG='en_US.UTF-8' LANGUAGE='en_US:en' LC_ALL='en_US.UTF-8'
operations-puppet/Dockerfile.template:ENV LANG='en_US.UTF-8' LANGUAGE='en_US:en' LC_ALL='en_US.UTF-8'

I did just do a quick check on wikimedia-stretch image for this

$ docker run --rm -it docker-registry.wikimedia.org/wikimedia-stretch:latest
root@92dc0302edca:/# ls
bin  boot  dev	etc  home  lib	lib64  media  mnt  opt	proc  root  run  sbin  srv  sys  tmp  usr  var
root@92dc0302edca:/# locale
LANG=
LANGUAGE=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=
root@92dc0302edca:/# locale -a
C
C.UTF-8
POSIX
root@92dc0302edca:/# export LC_ALL=C.UTF-8
root@92dc0302edca:/# echo "\303toile"
étoile

so support for C.UTF-8 already exists in our images.

Yes that is what I have stated in one of my first comments ;)

My latest reply was probably poorly phrased. We need to set LC_ALL=C.UTF-8. My question is where do we set it: in the base containers jessie and stretch or delegate to Blubber / child images to set it.

I have a preference to have it set in the base containers provided by SRE.

Change 478200 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] baseimages: Add a default LC_ALL C.UTF-8 locale

https://gerrit.wikimedia.org/r/478200

Change 478200 merged by Alexandros Kosiaris:
[operations/puppet@production] baseimages: Add a default LC_ALL C.UTF-8 locale

https://gerrit.wikimedia.org/r/478200

Following the merge of https://gerrit.wikimedia.org/r/478200 , can you possibly rebuild the two images please? :)

Repository                                        Tag     Image id      Created        Size
docker-registry.wikimedia.org/wikimedia-stretch   latest  ac576ceda671  13 months ago  56.1MB
docker-registry.wikimedia.org/wikimedia-jessie    latest  a81cc7ec7998  13 months ago  80.4MB
akosiaris claimed this task.

Done. Note, whatever images use a different LC_ALL without installing/generating locales first will probably fail so per T210260#4788262 the CI images will also have to be updated.

Oops, closed this by mistake. Re-opened, feel free to close when the issue is indeed resolved.

$ docker run --rm -it docker-registry.wikimedia.org/wikimedia-stretch:latest # 9086bef6f35b
# apt update && apt install python3
...

# locale
LANG=
LANGUAGE=
LC_CTYPE="C.UTF-8"
LC_NUMERIC="C.UTF-8"
LC_TIME="C.UTF-8"
LC_COLLATE="C.UTF-8"
LC_MONETARY="C.UTF-8"
LC_MESSAGES="C.UTF-8"
LC_PAPER="C.UTF-8"
LC_NAME="C.UTF-8"
LC_ADDRESS="C.UTF-8"
LC_TELEPHONE="C.UTF-8"
LC_MEASUREMENT="C.UTF-8"
LC_IDENTIFICATION="C.UTF-8"
LC_ALL=C.UTF-8

# python3 -c "print('étoile')"
étoile

@Ladsgroup if you retry Blubber for Wikilabels, you should no longer encounter the error. And I don't think you need to set PYTHONUTF8=1 or PYTHONIOENCODING: utf-8 anymore.

I will look at rebuilding the CI images:

ci-jessie/Dockerfile.template:ENV LANG='en_US.UTF-8' LANGUAGE='en_US:en' LC_ALL='en_US.UTF-8'
ci-stretch/Dockerfile.template:ENV LANG='en_US.UTF-8' LANGUAGE='en_US:en' LC_ALL='en_US.UTF-8'
operations-dnslint/Dockerfile.template:ENV LANG='en_US.UTF-8' LANGUAGE='en_US:en' LC_ALL='en_US.UTF-8'
operations-puppet/Dockerfile.template:ENV LANG='en_US.UTF-8' LANGUAGE='en_US:en' LC_ALL='en_US.UTF-8'

Which really means that every single container has to be rebuilt entirely. That would upgrade several packages, which WILL cause issues for some builds.

@Ladsgroup if you retry Blubber for Wikilabels, you should no longer encounter the error. And I don't think you need to set PYTHONUTF8=1 or PYTHONIOENCODING: utf-8 anymore.

Yup, it works for me. Thanks!

greg subscribed.

re-resolving