
Docker outputted wiki files cannot handle unicode filenames
Closed, Resolved (Public)

Description

Running the bot locally through Docker redirects its output to local files. Each file is named after the wiki page where the live output would have been saved. When a page title contains non-ASCII characters, creating the file fails with a UnicodeEncodeError, so the filenames do not seem to be UTF-8 encoded properly.

To reproduce:
• Start Docker (you need commons-db, which can be obtained e.g. by checking out gerrit:448579).
• Harvest se-ship_sv.
• Run docker-compose run --rm bot python erfgoedbot/unused_monument_images.py -countrycode:se-ship -langcode:sv -log

Error

Traceback (most recent call last):
  File "erfgoedbot/unused_monument_images.py", line 361, in <module>
    main()
  File "erfgoedbot/unused_monument_images.py", line 338, in main
    (countrycode, lang)), conn, cursor, conn2, cursor2)
  File "erfgoedbot/unused_monument_images.py", line 97, in processCountry
    totals = output_country_report(unused_images, page)
  File "erfgoedbot/unused_monument_images.py", line 164, in output_country_report
    common.save_to_wiki_or_local(report_page, comment, text, minorEdit=False)
  File "/code/erfgoedbot/common.py", line 112, in save_to_wiki_or_local
    with open(filename, 'w', encoding='utf-8') as f:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 60: ordinal not in range(128)
<type 'exceptions.UnicodeEncodeError'>
CRITICAL: Closing network session.

It is unclear whether this never worked for page titles containing non-ASCII characters or whether it is due to some change in pywikibot.Page.title().
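The failure in the traceback above can be narrowed down with a small check. This is only a sketch: the helper name can_encode and the sample title are hypothetical, and it assumes (without having confirmed it for this container) that the text filename is being converted with an ASCII filesystem encoding, which is common when no UTF-8 locale is configured.

```python
import sys


def can_encode(name, encoding):
    """Return True if `name` can be encoded with `encoding` (hypothetical helper)."""
    try:
        name.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# Hypothetical page title containing the character from the error message.
title = u'Anv\xe4ndare'

# The encoding Python uses when it has to turn a text path into bytes.
print(sys.getfilesystemencoding())

# The ASCII codec rejects u'\xe4', while UTF-8 accepts it.
print(can_encode(title, 'ascii'))
print(can_encode(title, 'utf-8'))
```

If the first print inside the container reports an ASCII variant, that would be consistent with the 'ascii' codec named in the error.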

Event Timeline

Change 458484 had a related patch set uploaded (by Lokal Profil; owner: Lokal Profil):
[labs/tools/heritage@master] Ensure filename is utf-8 encoded

https://gerrit.wikimedia.org/r/458484
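Going by the patch title, the idea is to encode the filename as UTF-8 before opening the file. The following is a minimal sketch of that idea and not the actual change in the patch: the function name save_locally, the .wiki suffix, and the replacement of "/" are all assumptions made for illustration.

```python
import io
import os


def save_locally(title, text, directory):
    """Write `text` to a file named after a wiki page title (hypothetical helper)."""
    # Replace path separators so a title such as "A/B" stays a single file name.
    filename = os.path.join(directory, title.replace(u'/', u'_') + u'.wiki')
    # Encode the path explicitly so that open() does not have to pick a codec
    # for the filename itself.
    encoded = filename.encode('utf-8')
    # The file content is written as UTF-8, independently of the filename.
    with io.open(encoded, 'w', encoding='utf-8') as f:
        f.write(text)
    return encoded
```

Passing a byte string path means the filename no longer depends on the locale configured inside the container.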

Lokal_Profil added a subscriber: JeanFred.

Moved the explanations here from the gerrit comments. Also makes those comments easier to format correctly ;)

Change 458484 merged by jenkins-bot:
[labs/tools/heritage@master] Ensure filename is utf-8 encoded

https://gerrit.wikimedia.org/r/458484