
Add Arabic Wikipedia support for CopyPatrol tool
Closed, Resolved · Public · 1 Estimated Story Points

Description

Hi. Following this community discussion, we would like to request that arwiki be added to the CopyPatrol tool. There have been no objections for two weeks. Greetings to you all.

Event Timeline

Mohnd_Kh created this task. · Jan 12 2021, 3:44 PM
Restricted Application added subscribers: alaa, Aklapper. · Jan 12 2021, 3:44 PM
MusikAnimal added a subscriber: MusikAnimal.

Just a note that this has been running for a few days now, there just apparently haven't been any copyvios yet. Once there are, https://copypatrol.toolforge.org/ar should magically start working.

While iThenticate, our plagiarism detection service, explicitly states that it supports Arabic, it's possible that its Arabic detection simply isn't that good. But from the logs and all indications, the bot is running. So let's just wait and see if anything shows up.

MusikAnimal set the point value for this task to 1.
alaa added a comment. · Jan 26 2021, 10:44 PM

Thanks @MusikAnimal

Just a note that this has been running for a few days now, there just apparently haven't been any copyvios yet. Once there are, https://copypatrol.toolforge.org/ar should magically start working.

Is there an expected time when https://copypatrol.toolforge.org/ar will start working? So far (after 7 days) it still gives 404 Page Not Found: "The page you are looking for could not be found. Check the address bar to ensure your URL is spelled correctly. If all else fails, you can visit our home page at the link below."

MusikAnimal changed the task status from Open to Stalled. · Edited · Jan 26 2021, 11:13 PM

Okay, I think there is a bug with the bot. Here is a stack trace of a recent error:

Traceback (most recent call last):
  File "/data/project/eranbot/gitPlagiabot/plagiabot/plagiabot.py", line 861, in <module>
    main()
  File "/data/project/eranbot/gitPlagiabot/plagiabot/plagiabot.py", line 856, in main
    bot.run()
  File "/data/project/eranbot/gitPlagiabot/plagiabot/plagiabot.py", line 617, in run
    self.report_uploads()  # report checked edits
  File "/data/project/eranbot/gitPlagiabot/plagiabot/plagiabot.py", line 510, in report_uploads
    self.report_log.add_report(rep['new'], rep['diff_date'], rep['title_no_ns'], rep['ns'], rep['report_id'], rep['source'])
  File "/mnt/nfs/labstore-secondary-tools-project/eranbot/gitPlagiabot/plagiabot/report_logger.py", line 90, in add_report
    report))
  File "/usr/lib/python2.7/dist-packages/MySQLdb/cursors.py", line 207, in execute
    args = tuple(map(db.literal, args))
  File "/usr/lib/python2.7/dist-packages/MySQLdb/connections.py", line 304, in literal
    s = self.escape(o, self.encoders)
  File "/usr/lib/python2.7/dist-packages/MySQLdb/connections.py", line 222, in unicode_literal
    return db.literal(u.encode(unicode_literal.charset))
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-3: ordinal not in range(256)

It was calling the add_report method, meaning it found a copyright violation, before it errored out due to an apparent encoding issue. That would explain why https://copypatrol.toolforge.org/ar still doesn't work!
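
The error at the bottom of the traceback can be reproduced in isolation. This is a minimal sketch, not the bot's actual code: the sample Arabic title and the helper `try_encode` are made up for illustration, and it is written for Python 3 rather than the Python 2.7 the bot runs on. It shows that Arabic text cannot be encoded as Latin-1, while UTF-8 handles it without trouble.

```python
# Hypothetical sample title; any Arabic string behaves the same way.
# The four characters spell "ويكي" and all fall outside the Latin-1 range.
title = "\u0648\u064a\u0643\u064a"


def try_encode(text, charset):
    """Return the encoded bytes, or the error message if encoding fails."""
    try:
        return text.encode(charset)
    except UnicodeEncodeError as exc:
        return str(exc)


# Latin-1 only covers code points 0-255, so this yields an error message.
latin1_result = try_encode(title, "latin-1")

# UTF-8 can represent every Unicode code point, so this yields bytes.
utf8_result = try_encode(title, "utf-8")

print(latin1_result)
print(utf8_result)
```

The traceback suggests the database connection's charset was Latin-1 at the time of the error, which is why the first Arabic title passed to add_report would fail in this way.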

There are other known encoding issues with CopyPatrol (T244665) that may or may not be related.

For the time being, it seems Arabic is not supported due to this bug. I have filed T273017 to investigate this further.

Sorry for the long wait!

MusikAnimal changed the task status from Stalled to Open. · Edited · Thu, Feb 4, 8:30 PM

@alaa Sorry for the long wait. The fix for T273017 appears to have worked :) There is now an Arabic feed: https://copypatrol.toolforge.org/ar

I will wait for confirmation from you that all looks good before resolving this task.

alaa added a comment. · Fri, Feb 5, 4:15 PM

I will wait for confirmation from you that all looks good before resolving this task.

Thanks a lot @MusikAnimal, it's working well.

alaa closed this task as Resolved. · Fri, Feb 5, 4:15 PM