Handle encoding of sort keys
Closed, Resolved (Public)

Description

From the 2018-07-24 logs:

Page [[en:Wikipedia:WikiProject Historic sites/Unused images of Historic Places in Canada]] saved
Traceback (most recent call last):
  File "/data/project/heritage/heritage/erfgoedbot/unused_monument_images.py", line 363, in <module>
    main()
  File "/data/project/heritage/heritage/erfgoedbot/unused_monument_images.py", line 356, in main
    cursor2))
  File "/data/project/heritage/heritage/erfgoedbot/unused_monument_images.py", line 85, in processCountry
    photos, withoutPhoto, countryconfig)
  File "/data/project/heritage/heritage/erfgoedbot/unused_monument_images.py", line 35, in group_unused_images_by_source
    pywikibot.warning(u'Got value error for {0}'.format(catSortKey))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd8 in position 9: ordinal not in range(128)
CRITICAL: Closing network session.
<type 'exceptions.UnicodeDecodeError'>

Note that common.get_id_from_sort_key() starts by calling unicode(sort_key, 'utf-8').

If that conversion is the missing step, then we probably want to do it already in getMonumentPhotos() (and in every other file that calls common.get_id_from_sort_key() or queries for the sort_key).
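
A minimal sketch of why the missing conversion matters, written for Python 3, where the Python 2 call unicode(sort_key, 'utf-8') corresponds to sort_key.decode('utf-8'). The helper name to_text and the sample sort key are hypothetical and not taken from the erfgoedbot code. In Python 2, formatting a byte string into a unicode template decodes it as ASCII implicitly, which is the step that fails in the traceback above.

```python
def to_text(value, encoding='utf-8'):
    """Decode raw bytes to text; pass already-decoded text through."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Hypothetical sort key: nine ASCII bytes followed by Arabic text in UTF-8.
raw_sort_key = b'123456789' + u'\u0627\u0644'.encode('utf-8')

# Mimic the implicit ASCII decode that Python 2 performs when a byte
# string is formatted into a unicode template.
try:
    raw_sort_key.decode('ascii')
    failing_position = None
except UnicodeDecodeError as error:
    failing_position = error.start

# Decoding explicitly as UTF-8 first makes the same formatting call safe.
message = u'Got value error for {0}'.format(to_text(raw_sort_key))
print(failing_position)
```

Doing the conversion once, right where the query results are read, would mean later string handling only ever sees text.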

Event Timeline

Change 447794 had a related patch set uploaded (by Lokal Profil; owner: Lokal Profil):
[labs/tools/heritage@master] [WIP]Ensure unicode encoding of query results

https://gerrit.wikimedia.org/r/447794

Looking at database_connection.py, I see that connect_to_commons_database() sets use_unicode=True, charset='latin1' whereas connect_to_monuments_database() sets use_unicode=True, charset='utf8'.

Is that potentially the root cause of these problems, or would we still have to cast the response to unicode even with a utf8 charset?

For archive happiness, this does not work.

So with the current patch (ensuring both page_title and sort_key are converted to unicode) we get issues with very long non-latin filenames.

This is because [[https://www.mediawiki.org/wiki/Manual:Categorylinks_table#cl_sortkey |cl_sortkey]] is stored as binary and is not guaranteed to convert to a valid string. Specifically, when the sortkey is too long, the last character may be only partially stored (e.g. 2 out of 4 bytes), meaning it cannot be converted to a string afterwards. One affected file is საქართველო, ქ. თელავი სახლის — შესასვლელი კარები — ახვლედიანის ქუჩაზე (ე. ახვლედიანის 24).jpg.
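
The truncation effect can be reproduced in isolation. This is only a sketch: crop_to_bytes and is_valid_utf8 are hypothetical helpers, the three-letter sample stands in for a long Georgian title, and it assumes the cut happens at a fixed byte count as described above.

```python
def crop_to_bytes(text, max_bytes):
    """Encode text as UTF-8 and keep only the first max_bytes bytes."""
    return text.encode('utf-8')[:max_bytes]


def is_valid_utf8(raw):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        raw.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True


# Three Georgian letters; each one takes three bytes in UTF-8.
georgian = u'\u10e1\u10d0\u10e5'

# Cutting on a character boundary keeps the bytes decodable...
whole = crop_to_bytes(georgian, 6)
# ...while cutting one byte into the third letter does not.
partial = crop_to_bytes(georgian, 7)

print(is_valid_utf8(whole), is_valid_utf8(partial))
```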

The solution is to use [[https://www.mediawiki.org/wiki/Manual:Categorylinks_table#cl_sortkey_prefix |cl_sortkey_prefix]] instead.

Even with cl_sortkey_prefix I get a similar error for some silly long sortkeys (قصر البارون امبان بمصر الجديدة.jpg). Still investigating those.

Looking into the case of the Egyptian image, it seems to be a similar cropping issue: the sort key is longer than 255 characters but gets cropped mid Unicode character. I am unsure whether this is a MediaWiki error; the documentation says "[cl_sortkey_prefix] is the human readable version of cl_sortkey", so one could argue that it should handle Unicode better.

Anyway the solution for us is to set errors='replace' or errors='ignore' in the unicode function call.
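
A sketch of what the two error handlers do with a cropped key, again in Python 3 form, where raw.decode('utf-8', errors) plays the role of unicode(raw, 'utf-8', errors=...) in Python 2. The function name decode_sort_key and the sample bytes are hypothetical.

```python
def decode_sort_key(raw, errors='replace'):
    """Decode a possibly truncated UTF-8 byte string without raising."""
    return raw.decode('utf-8', errors)


# Three Georgian letters, cropped one byte into the third letter.
cropped = u'\u10e1\u10d0\u10e5'.encode('utf-8')[:7]

# 'replace' substitutes U+FFFD for the undecodable tail.
replaced = decode_sort_key(cropped, 'replace')
# 'ignore' silently drops the undecodable tail.
ignored = decode_sort_key(cropped, 'ignore')

# The default strict mode is what raises on such input.
try:
    decode_sort_key(cropped, 'strict')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(len(replaced), len(ignored), strict_failed)
```

With 'replace' the damaged tail stays visible as a marker character, whereas 'ignore' returns a shorter but clean string; which of the two suits the bot better depends on how the result is used afterwards.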

Also I'm pretty sure the Egyptian template is being used incorrectly (either in this and a few specific cases or systematically) since only the first part of the parameter looks like the actual id.

Anyway the solution for us is to set errors='replace' or errors='ignore' in the unicode function call.

This is correct. Note that you will still have to use this option even after T200623 is fixed, since sortkeys with invalid UTF-8 will remain in the database until the affected pages are edited (or null-edited).

Will do. Thanks for fixing the underlying issue.

Change 447794 merged by jenkins-bot:
[labs/tools/heritage@master] Ensure unicode encoding of query results

https://gerrit.wikimedia.org/r/447794

Mentioned in SAL (#wikimedia-cloud) [2018-08-13T09:24:12Z] <Lokal_Profil> Deploy latest from Git master: 5ea3c21, 0d6158d (T200325)