
plagiabot is erroring out and not adding new reports
Closed, ResolvedPublic

Description

See this discussion. No new reports were added from 01:02, 19 March 2019 until at least 12:11, 19 March 2019. That was resolved, but the same issue reappeared at 13:17, 21 March 2019. The bot is still down.

Backtrace from enwiki.err (other wikis are experiencing the same error):

Traceback (most recent call last):
  File "/data/project/eranbot/gitPlagiabot/plagiabot/plagiabot.py", line 849, in <module>
    main()
  File "/data/project/eranbot/gitPlagiabot/plagiabot/plagiabot.py", line 844, in main
    bot.run()
  File "/data/project/eranbot/gitPlagiabot/plagiabot/plagiabot.py", line 596, in run
    for page in live_gen:
  File "/data/project/eranbot/gitPlagiabot/plagiabot/plagiabot.py", line 586, in <genexpr>
    filter_gen = lambda gen: (p for p in gen if self.page_filter(p))
  File "/data/project/eranbot/gitPlagiabot/plagiabot/plagiabot.py", line 574, in page_filter
    old_size = rcinfo['length']['old'] or 0
KeyError: u'old'
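The traceback shows `rcinfo['length']['old']` failing when the recent-changes event has no `'old'` key (which can happen for event types such as page creations or log entries). A minimal sketch of the failing pattern and a defensive fix, assuming `rcinfo` is a plain dict from the recent-changes feed (the function names here are mine, not the bot's):

```python
def old_size_buggy(rcinfo):
    # Original pattern: raises KeyError when 'length' has no 'old' key
    return rcinfo['length']['old'] or 0

def old_size_fixed(rcinfo):
    # Defensive version: fall back to 0 when 'length' or 'old' is missing
    return (rcinfo.get('length') or {}).get('old') or 0
```

Using `dict.get()` with a fallback keeps the existing `or 0` behavior for edits while tolerating events that lack the old-size field.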

Event Timeline


A core tool for copyvio detection on en.wiki. Raising priority to UBN (Unbreak Now!).

MusikAnimal renamed this task from Copypatrol is down to plagiabot is erroring out and not adding new reports.Mar 25 2019, 7:01 PM
MusikAnimal updated the task description.

Still a newbie to Python, I'm not sure how to debug this. But whenever we do fix it, please truncate enwiki.err, as it is currently over 1 GB in size.

Just a quick comment to emphasize that addressing copyright issues is far easier when looking at them close to contemporaneously. They become more difficult as they get older. I'll explain if this isn't obvious.

In addition, we have a limited number of people handling the hundreds of reports each week, so if this takes some time, the backlog will become onerous.

Yes, it's key that we get this up and running again. I have pinged Ryan Kaldari.

@eranroz Your help fixing this would be much appreciated.

Okay, the bot is now running without errors. It looks as though it's going through all the older recent changes it missed, but I am not certain. More info to come.

MusikAnimal lowered the priority of this task from Unbreak Now! to High.Mar 25 2019, 9:22 PM

The bot is functioning again. It did not pick up where it left off, unfortunately. I am not sure what level of effort would be involved to backfill the data. As far as I can tell, the bot works by watching live recent changes, so we'd need a script of some sort to go back through the past four days of changes.
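For reference, a backfill would essentially mean re-scanning a bounded time window of recent changes rather than the live feed. A sketch of how the window might be computed, assuming the bot can accept explicit time bounds (the helper name here is mine, not part of plagiabot):

```python
from datetime import datetime, timedelta

def backfill_window(days, now=None):
    """Return (start, end) UTC timestamps covering the last `days` days."""
    end = now or datetime.utcnow()
    return end - timedelta(days=days), end

# Sketch only: a real backfill would feed these bounds into whatever
# recent-changes generator the bot uses, e.g. pywikibot's
# site.recentchanges(), which accepts start/end timestamps.
start, end = backfill_window(4)
```

This is only the scheduling half; the hard part, as noted above, is that the bot's filtering logic assumes a live event stream.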

The fix is in plagiabot.py, and I will create a PR for it shortly. I think the bug happened after we moved the bot to Debian Stretch, which has a newer version of Python 2.

Lowering to High priority until we get the fix merged.

I do see a new addition, good sign.

I'm sorry to report that backfilling would be desirable. (I'll happily defer on this point if someone has an alternative solution.)

PR: https://github.com/valhallasw/plagiabot/pull/16

This is the first Python code I've ever written. There is likely a better way to do what I did :)

It looks like the main script has a flag to look over the past N days of recent changes, e.g. -recentchanges:3. However, there is yet another bug: the script needs to be updated to use the new comment storage on the replicas. PR for that is at https://github.com/valhallasw/plagiabot/pull/17

@MBinder_WMF This is an accidental addition to the sprint board, I assume?

MusikAnimal moved this task from Ready to Done on the Community-Tech (Resolved 2018-19 Q4) board.

I think we can close this. If we still want to backfill the data, we should open a new ticket. Unfortunately, the PR to fix the backfill script didn't get merged in time; those old changes are no longer in recentchanges. We'd need to rework the script to query the revision table instead, and also account for the actor migration.
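To spell out what that rework implies: recentchanges only retains a rolling window (typically 30 days), so older edits have to come from the revision table, and the actor migration means user columns are resolved through the actor table rather than rev_user/rev_user_text. A very rough sketch of the shape such a query might take (schema per MediaWiki core; I have not tested this against the replicas):

```python
# Hypothetical backfill query over revision history.
# rev_actor replaced rev_user/rev_user_text as part of the actor migration.
REVISION_BACKFILL_QUERY = """
SELECT rev_id, rev_timestamp, actor_name
FROM revision
JOIN actor ON rev_actor = actor_id
WHERE rev_timestamp >= :start AND rev_timestamp < :end
"""
```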

I guess the question is how useful the data would be if run on older edits? It might be interesting to look. I assume it will not be useful due to all the false positives from mirrors of Wikipedia, but I'm not certain.