Loading non-unicode image data fails sometimes
Open, In Progress, High, Public

Description

When exiftool is run against some files whose metadata contains non-ASCII characters, Thumbor fails with this error:

2023-01-16 11:04:38,734 ???? thumbor:ERROR UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9c in position 19: invalid start byte

2023-01-16 11:04:38,734 ???? thumbor:ERROR ERROR: Traceback (most recent call last):
  File "/opt/lib/python/site-packages/thumbor/handlers/__init__.py", line 212, in get_image
    result = await self._fetch(self.context.request.image_url)
  File "/opt/lib/python/site-packages/thumbor/handlers/__init__.py", line 876, in _fetch
    raise fetch_result.exception
  File "/opt/lib/python/site-packages/thumbor/handlers/__init__.py", line 844, in _fetch
    self.context.request.engine.load(fetch_result.buffer, extension)
  File "/srv/service/wikimedia_thumbor/engine/proxy/proxy.py", line 125, in load
    self.lcl[enginename].load(buffer, extension)
  File "/opt/lib/python/site-packages/thumbor/engines/__init__.py", line 195, in load
    image_or_frames = self.create_image(buffer)
  File "/srv/service/wikimedia_thumbor/engine/imagemagick/imagemagick.py", line 77, in create_image
    self.read_exif(temp_file)
  File "/srv/service/wikimedia_thumbor/engine/imagemagick/imagemagick.py", line 164, in read_exif
    values = s.decode('utf-8').split(': ', 1)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9c in position 19: invalid start byte

2023-01-16 11:04:38,734 ???? thumbor:ERROR [BaseHandler] get_image failed for url `https%3A//swift.discovery.wmnet/v1/AUTH_mw/wikipedia-commons-local-public.50/5/50/2023-01-15_15-49-13_voeux-maire-Belfort.jpg`. error: `'utf-8' codec can't decode byte 0x9c in position 19: invalid start byte`
2023-01-16 11:04:38,735 ???? tornado.access:ERROR 500 GET /wikipedia/commons/thumb/5/50/2023-01-15_15-49-13_voeux-maire-Belfort.jpg/1169px-2023-01-15_15-49-13_voeux-maire-Belfort.jpg (10.64.48.230) 4516.76ms

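This occurs because the exiftool output includes the line Image Description : Vœux du Maire de Belfort, Damien Meslot, au gymnase Le Phare, Belfort, le 15 janvier 2023. In that output, "œ" is emitted as the single byte 0x9c (its value in a legacy single-byte encoding such as Windows-1252). That byte is not a valid UTF-8 start byte, so s.decode('utf-8') in read_exif raises UnicodeDecodeError.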
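The failure can be reproduced in isolation. The snippet below is a minimal sketch, not the Thumbor code: the sample bytes are a shortened, hypothetical stand-in for the exiftool output, and Windows-1252 (cp1252) is an assumption chosen because it maps 0x9c to "œ". The task does not state which encoding exiftool actually used.

```python
# Hypothetical sample: "Vœux du Maire" with "œ" stored as the single byte 0x9c.
raw = b"V\x9cux du Maire"

# Decoding as UTF-8 fails, because 0x9c cannot start a UTF-8 sequence.
try:
    raw.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = str(exc)

# Decoding with cp1252 (an assumed encoding) recovers the intended text.
decoded = raw.decode("cp1252")

# In UTF-8, the same character takes two bytes.
utf8_bytes = "œ".encode("utf-8")
```

Guessing a legacy encoding like this is fragile, which is part of why structured output from exiftool is the preferable fix.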

This is the affected image

According to the author of exiftool, Unicode output isn't guaranteed unless JSON or XML output is used. We shouldn't be manually splitting string values in the first place, and this is another good reason to use JSON output everywhere.
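As an illustration of that approach, the sketch below reads metadata through exiftool's -json option and parses it with the standard json module instead of splitting lines on ': '. It is not the patch uploaded for this task. The function names parse_exiftool_json and read_exif_json are hypothetical, and decoding with errors="replace" is a defensive assumption rather than documented exiftool behaviour.

```python
import json
import subprocess


def parse_exiftool_json(raw: bytes) -> dict:
    """Parse the bytes printed by `exiftool -json` for a single file.

    exiftool prints a JSON array with one object per input file, so the
    first element holds the tags for the file that was passed in.
    """
    # Assumption: replace any undecodable byte instead of raising, so one
    # bad tag value cannot fail the whole request.
    records = json.loads(raw.decode("utf-8", errors="replace"))
    return records[0] if records else {}


def read_exif_json(path: str) -> dict:
    """Hypothetical helper: run exiftool on `path` and return its tags."""
    result = subprocess.run(
        ["exiftool", "-json", path],
        capture_output=True,
        check=True,
    )
    return parse_exiftool_json(result.stdout)
```

With this shape, a caller looks up a tag by key, for example read_exif_json(temp_file).get("ImageDescription"), and no value is ever split by hand.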

Event Timeline

hnowlan renamed this task from "Attempting to load JPEG image data as unicode" to "Loading non-unicode image data fails sometimes". Wed, Jan 25, 12:46 PM
hnowlan updated the task description.

Change 883564 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/software/thumbor-plugins@master] imagemagick: use JSON output from exiftool

https://gerrit.wikimedia.org/r/883564

hnowlan changed the task status from Open to In Progress. Wed, Jan 25, 5:03 PM