Encoding issues when handling unicode characters in filenames
Closed, Resolved (Public)

Description

Most likely a Python 3 migration issue: something needs to be encoded properly/explicitly somewhere.

2022-11-15 13:22:58,803 ???? thumbor:DEBUG [RESULT_STORAGE] IMAGE FOUND: /wikipedia/commons/thumb/0/01/NLC403-312001059925-93996_%E8%99%9E%E5%9F%8E%E7%B8%A3%E8%AA%8C_%E6%B8%85%E4%B9%BE%E9%9A%868%E5%B9%B4%281743%29_%E5%8D%B7%E5%9B%9B.pdf/page24-320px-NLC403-312001059925-93996_%E8%99%9E%E5%9F%8E%E7%B8%A3%E8%AA%8C_%E6%B8%85%E4%B9%BE%E9%9A%868%E5%B9%B4%281743%29_%E5%8D%B7%E5%9B%9B.pdf.jpg
2022-11-15 13:22:58,813 ???? thumbor:ERROR UnicodeEncodeError: 'latin-1' codec can't encode characters in position 37-40: ordinal not in range(256)

2022-11-15 13:22:58,813 ???? thumbor:ERROR ERROR: Traceback (most recent call last):
  File "/opt/lib/python/site-packages/tornado/web.py", line 1713, in _execute
    result = await result
  File "/opt/lib/python/site-packages/thumbor/handlers/imaging.py", line 119, in get
    return await self.check_image(kw)
  File "/srv/service/wikimedia_thumbor/handler/images/images.py", line 460, in check_image
    await self.execute_image_operations()
  File "/opt/lib/python/site-packages/thumbor/handlers/__init__.py", line 172, in execute_image_operations
    await self.finish_request(result)
  File "/opt/lib/python/site-packages/thumbor/handlers/__init__.py", line 547, in finish_request
    result_from_storage, content_type
  File "/opt/lib/python/site-packages/thumbor/handlers/__init__.py", line 655, in _write_results_to_client
    self.finish()
  File "/opt/lib/python/site-packages/tornado/web.py", line 1159, in finish
    future = self.flush(include_footers=True)
  File "/opt/lib/python/site-packages/tornado/web.py", line 1097, in flush
    start_line, self._headers, chunk
  File "/opt/lib/python/site-packages/tornado/http1connection.py", line 454, in write_headers
    lines.extend(line.encode("latin1") for line in header_lines)
  File "/opt/lib/python/site-packages/tornado/http1connection.py", line 454, in <genexpr>
    lines.extend(line.encode("latin1") for line in header_lines)
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 37-40: ordinal not in range(256)

This appears to have been seen elsewhere in Tornado.
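
For reference, the failing step can probably be reproduced without Tornado at all. The sketch below only mimics the line.encode("latin1") call from the traceback. The header name X-Example-Header is made up, since the log does not show which header carried the filename, and the filename fragment is just the first percent-encoded run from the logged path.

```python
from urllib.parse import unquote

# First non-ASCII run of the filename, percent-decoded from the logged path.
filename_part = unquote("%E8%99%9E%E5%9F%8E%E7%B8%A3%E8%AA%8C")

# Hypothetical header line; the real header name is not shown in the log.
header_line = "X-Example-Header: " + filename_part + ".pdf"

failed = False
error_message = None
try:
    # Same operation as the genexpr in http1connection.write_headers.
    header_line.encode("latin1")
except UnicodeEncodeError as exc:
    failed = True
    error_message = str(exc)

print(failed, error_message)
```

Any character above U+00FF triggers this, so filenames with accented Latin-1 characters would pass through while CJK filenames like this one fail.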

Event Timeline

Change 859026 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/software/thumbor-plugins@master] Encode headers before passing

https://gerrit.wikimedia.org/r/859026

Change 859026 merged by jenkins-bot:

[operations/software/thumbor-plugins@master] Encode headers before passing

https://gerrit.wikimedia.org/r/859026

This appears to be fixed. The issue relates to us calling Tornado's set_header with a string that contains non-ASCII characters. Encoding these strings as UTF-8 byte strings fixes the issue.

Thank you, Hugh!
To add a few more details: Tornado requires us to pass a value of type bytes when we want to set a header value that contains non-ASCII characters; there is a note about this in the official repo. RequestHandler._convert_header_value is called inside RequestHandler.set_header, which is called in thumbor_plugins.
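
A rough sketch of why passing bytes helps, for anyone reading later. This is not the code from the merged change; to_header_bytes is a hypothetical helper. It assumes, based on my reading of that note, that Tornado decodes bytes header values as latin-1 when the header is set, so the later latin-1 encode returns the original UTF-8 bytes unchanged. The round trip is simulated here with plain Python rather than a real RequestHandler.

```python
from urllib.parse import unquote


def to_header_bytes(value):
    """Hypothetical helper: encode str header values as UTF-8 bytes.

    Values that are not str (for example bytes or int) are returned as-is.
    """
    if isinstance(value, str):
        return value.encode("utf-8")
    return value


# Filename fragment percent-decoded from the logged path.
filename = unquote("%E8%99%9E%E5%9F%8E%E7%B8%A3%E8%AA%8C") + ".pdf"
as_bytes = to_header_bytes(filename)

# Assumption: bytes are decoded as latin-1 when the header is set,
# then encoded as latin-1 again when the headers are written.
on_the_wire = as_bytes.decode("latin1").encode("latin1")

print(on_the_wire == filename.encode("utf-8"))
```

Every byte value maps to exactly one latin-1 code point, which is why the decode/encode pair cannot fail for bytes input. A client still has to interpret the received bytes as UTF-8 to recover the filename.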