
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-3: ordinal not in range(256)
Closed, Resolved · Public


This error is showing up in Eranbot for the Arabic language, which was recently enabled with T271823.

Stack trace:

Traceback (most recent call last):
  File "/data/project/eranbot/gitPlagiabot/plagiabot/", line 861, in <module>
  File "/data/project/eranbot/gitPlagiabot/plagiabot/", line 856, in main
  File "/data/project/eranbot/gitPlagiabot/plagiabot/", line 617, in run
    self.report_uploads()  # report checked edits
  File "/data/project/eranbot/gitPlagiabot/plagiabot/", line 510, in report_uploads
    self.report_log.add_report(rep['new'], rep['diff_date'], rep['title_no_ns'], rep['ns'], rep['report_id'], rep['source'])
  File "/mnt/nfs/labstore-secondary-tools-project/eranbot/gitPlagiabot/plagiabot/", line 90, in add_report
  File "/usr/lib/python2.7/dist-packages/MySQLdb/", line 207, in execute
    args = tuple(map(db.literal, args))
  File "/usr/lib/python2.7/dist-packages/MySQLdb/", line 304, in literal
    s = self.escape(o, self.encoders)
  File "/usr/lib/python2.7/dist-packages/MySQLdb/", line 222, in unicode_literal
    return db.literal(u.encode(unicode_literal.charset))
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-3: ordinal not in range(256)

Our database encoding is set to utf8mb4, which I believe should work with Arabic.
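A minimal sketch of what the traceback suggests, not the actual patch: the last frame encodes the parameter with `unicode_literal.charset`, which reflects the character set of the client connection rather than that of the database, so a utf8mb4 database can still hit a latin-1 encode on the way in. The sample string, the `try_encode` helper, and `CONNECT_KWARGS` are hypothetical names used only for illustration. `charset` and `use_unicode` are real `MySQLdb.connect` keyword arguments, but whether this is the change that was eventually made, and whether the installed MySQLdb version accepts `utf8mb4` as a charset, are assumptions that would need checking on the server.

```python
# -*- coding: utf-8 -*-
# Sketch only: reproduces the codec mismatch without needing a database.

# Hypothetical sample value standing in for an Arabic page title.
title = u"\u0645\u0631\u062d\u0628\u0627"


def try_encode(text, charset):
    """Return the encoded bytes, or None if the codec cannot represent text."""
    try:
        return text.encode(charset)
    except UnicodeEncodeError:
        return None


# latin-1 only covers code points below 256, so Arabic cannot be encoded.
latin1_result = try_encode(title, "latin-1")

# UTF-8 can represent the same text (MySQL's utf8mb4 is full UTF-8).
utf8_result = try_encode(title, "utf-8")

# Assumption: asking the driver for a UTF-8 connection explicitly, e.g.
#   MySQLdb.connect(host=..., db=..., **CONNECT_KWARGS)
# Host, database, and credentials are deliberately left out.
CONNECT_KWARGS = {
    "charset": "utf8mb4",
    "use_unicode": True,
}
```

Here `latin1_result` is `None` while `utf8_result` holds the encoded bytes, which is consistent with the error appearing only for wikis whose titles fall outside the latin-1 range.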

Event Timeline

Restricted Application added subscribers: Cyberpower678, Aklapper.

@eranroz Could I get a sanity check on ? I admit it's a bit of guesswork but it seems promising based on my quick research. If there's no opposition, I might go ahead and try this in production to see if it fixes support for Arabic. Thanks for the help.

I have a minor concern about differences in library behavior (MySQLdb vs. oursql) and about which exact version is used on the server, so I wanted to test it experimentally before actually merging. Please go ahead and give it a try; if we see it works well, we can merge it. And thanks for helping to improve the tool!

Okay, that worked! There is now a feed for Arabic :) I suspect T244665 is now fixed too, but I have not confirmed that yet.

@eranroz I guess we are safe to merge that PR. I see the English feed is still populating as expected, too.

PR merged! Thank you. I'll go ahead and resolve this. Arabic Wikipedia has already confirmed CopyPatrol to be working for them: T271823