Add Arabic Wikipedia support for CopyPatrol tool
Closed, Resolved | Public | 1 Estimated Story Point

Description

Hi. Following this community discussion, which drew no objections over two weeks, we would like to request that arwiki be added to the CopyPatrol tool. Greetings to you all.

Event Timeline

MusikAnimal subscribed.

Just a note that this has been running for a few days now; there apparently just haven't been any copyvios yet. Once there are, https://copypatrol.toolforge.org/ar should magically start working.

While iThenticate, our plagiarism detection service, explicitly states they support Arabic, it's possible that it simply isn't that good. But from the logs and all indications, the bot is running. So let's just wait and see if anything shows up.

Thanks @MusikAnimal

Is there an expected time when https://copypatrol.toolforge.org/ar will start working? So far (after 7 days) it still gives 404 Page Not Found: "The page you are looking for could not be found. Check the address bar to ensure your URL is spelled correctly. If all else fails, you can visit our home page at the link below."

MusikAnimal changed the task status from Open to Stalled. Edited Jan 26 2021, 11:13 PM

Okay, I think there is a bug with the bot. Here is a stack trace of a recent error:

Traceback (most recent call last):
  File "/data/project/eranbot/gitPlagiabot/plagiabot/plagiabot.py", line 861, in <module>
    main()
  File "/data/project/eranbot/gitPlagiabot/plagiabot/plagiabot.py", line 856, in main
    bot.run()
  File "/data/project/eranbot/gitPlagiabot/plagiabot/plagiabot.py", line 617, in run
    self.report_uploads()  # report checked edits
  File "/data/project/eranbot/gitPlagiabot/plagiabot/plagiabot.py", line 510, in report_uploads
    self.report_log.add_report(rep['new'], rep['diff_date'], rep['title_no_ns'], rep['ns'], rep['report_id'], rep['source'])
  File "/mnt/nfs/labstore-secondary-tools-project/eranbot/gitPlagiabot/plagiabot/report_logger.py", line 90, in add_report
    report))
  File "/usr/lib/python2.7/dist-packages/MySQLdb/cursors.py", line 207, in execute
    args = tuple(map(db.literal, args))
  File "/usr/lib/python2.7/dist-packages/MySQLdb/connections.py", line 304, in literal
    s = self.escape(o, self.encoders)
  File "/usr/lib/python2.7/dist-packages/MySQLdb/connections.py", line 222, in unicode_literal
    return db.literal(u.encode(unicode_literal.charset))
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-3: ordinal not in range(256)

It was calling the add_report method, meaning it found a copyright violation, before it errored out due to an apparent encoding issue. That would explain why https://copypatrol.toolforge.org/ar still doesn't work!
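For anyone following along, here is a minimal sketch of the failure mode shown in the last line of the trace. It is not the bot's code and not the fix; it only illustrates why a 'latin-1' codec rejects Arabic text while UTF-8 accepts it. The title string and the try_encode helper are made up for illustration, and it is written for Python 3, whereas the bot runs on Python 2.7.

```python
# Hypothetical Arabic page title (illustration only, not taken from the bot's data).
# Every character is above U+00FF, which is the highest code point latin-1 can encode.
title = "\u0645\u0631\u062d\u0628\u0627"


def try_encode(text, charset):
    """Return the encoded bytes, or None if the charset cannot represent the text."""
    try:
        return text.encode(charset)
    except UnicodeEncodeError:
        # Same exception class as in the traceback above.
        return None


latin1_result = try_encode(title, "latin-1")  # fails: no Arabic in latin-1
utf8_result = try_encode(title, "utf-8")      # succeeds: UTF-8 covers all of Unicode

print("latin-1:", latin1_result)
print("utf-8:  ", utf8_result)
```

In the trace, the codec comes from the database connection (unicode_literal.charset), so the character set the connection is opened with is a likely place to look, though that is an assumption until the bug is investigated.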

There are other known encoding issues with CopyPatrol (T244665) that may or may not be related.

For the time being, it seems Arabic is not supported due to this bug. I have filed T273017 to investigate this further.

Sorry for the long wait!

MusikAnimal changed the task status from Stalled to Open. Edited Feb 4 2021, 8:30 PM

@alaa Sorry for the long wait. The fix for T273017 appears to have worked :) There is now an Arabic feed: https://copypatrol.toolforge.org/ar

I will wait for confirmation from you that all looks good before resolving this task.

Thanks a lot @MusikAnimal, it's working well.

@alaa @Mohnd_Kh There has only been one case closed since we enabled Arabic CopyPatrol: https://copypatrol.toolforge.org/ar/leaderboard

Is anyone using it? Or perhaps you are forgetting to mark cases as "Fixed" or "No action needed"? CopyPatrol is expensive to maintain, so if you are not using it we would prefer to turn it off. Thanks for your understanding!

I have disabled the cron job for Arabic Wikipedia, so no more cases will be added to the feed unless someone comes forward and says they are actively using CopyPatrol. I'll wait another week or so before removing the Arabic feed altogether.