
Deploy new CopyPatrol bot and update frontend
Closed, Resolved · Public · 8 Estimated Story Points

Description

The old bot (EranBot) that powered CopyPatrol was written in Python 2, which is now EOL. Our gracious rockstar of a volunteer @JJMC89 has rewritten the bot from scratch in Python 3 (T293688), and it is now ready to be deployed.

Repository for the bot: https://github.com/JJMC89/copypatrol-backend

Staging checklist

  • First deploy the bot to staging to ensure everything works smoothly
  • For now, use the test db s52615__copypatrol_migrate_test_02_p
  • Rework CopyPatrol to interface with the new bot (T340600)
  • Deploy the new CopyPatrol code to a VPS test instance
  • Seek approval from Turnitin. We're currently only using the sandbox version of TCA (Turnitin Core API). This conversation could also cover negotiating a long-term supply of credits (T305318)

Production checklist

Details

Other Assignee
JJMC89

Related Objects

Event Timeline

KSiebert set the point value for this task to 8. (Apr 5 2023, 2:38 PM)

Initial findings with new bot versus the old:

These seem to be copyvios, reported by new bot but not the old (GOOD):

Confirmed copyvios NOT reported by the new bot, but were reported by the old (BAD, maybe?):

https://copypatrol.toolforge.org/en/?id=100989249 is https://plagiabot.toolforge.org/en?id=19b407b6-a536-429d-ae0e-fca4ea996228

Since the new bot is behind, if the page gets deleted or the edit gets revdeled, the new bot will skip it, since it cannot access the text. I'm guessing that is the case for https://copypatrol.toolforge.org/en/?id=100990412.

> Since the new bot is behind, if the page gets deleted or the edit gets revdeled, the new bot will skip it, since it cannot access the text. I'm guessing that is the case for https://copypatrol.toolforge.org/en/?id=100990412.

Ah, that makes sense! Also, I fixed two bugs: one with the regex for the ithenticate route that you mentioned in the PR, and another that prevented https://plagiabot.toolforge.org/en?id=19b407b6-a536-429d-ae0e-fca4ea996228 from loading (IPs have no editcount).

T342942 definitely helped! Going back to comparing the old feed versus the new, there are still some reports shown in one and not the other. However, the new feed appears to be more comprehensive: in the 20th hour of 2023-07-28 alone, the new feed has 20 reports while the old one has only 1. That one report does appear to be an actual copyvio, so it's concerning that the new feed isn't surfacing it. But I suppose the new feed is still better, as nearly all of the 20 cases I checked also seem legit. Maybe this is something we should report to Turnitin, as it would appear the logic changed between their old API and the new one, and I think we should be upfront with our users about what those changes are.

> The one report does appear to be an actual copyvio, so that's concerning that the new feed isn't surfacing it. But I suppose the new feed is still better as nearly all the 20 cases I checked also seem legit. Maybe this is something we should report to Turnitin, as it would appear the logic changed from their old API versus the new one, and I think we should be upfront with our users about what those changes are.

This was due to a bug in the backend. I had assumed that the rev_content_changed key would always be present for events in the revision-create stream, but it isn't present for page creations.
Should be fixed with https://github.com/JJMC89/copypatrol-backend/pull/37.
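The fix presumably boils down to not assuming the key exists. A minimal sketch of that pattern (the handler name and event shape here are illustrative, not the bot's actual code; only the `rev_content_changed` field name comes from the revision-create stream):

```python
# Sketch of the defensive pattern the fix implies (hypothetical handler).
# Page creations in the revision-create stream omit rev_content_changed
# entirely, so a plain event["rev_content_changed"] lookup raises KeyError
# and the new page is never checked.
def should_check(event):
    """Decide whether a revision-create event warrants a copyvio check."""
    # Treat a missing key as "content changed" instead of crashing:
    return event.get("rev_content_changed", True)

# Page creation: no rev_content_changed key at all
assert should_check({"rev_id": 123}) is True
# Edit that did not change content: explicitly skipped
assert should_check({"rev_id": 124, "rev_content_changed": False}) is False
```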

Awesome, thanks!

Here are a few more I found that don't appear in the new feed:

Not sure if the bot is at fault, but if not I'd like to ask Turnitin about it. Either way, the new bot (or API, whichever it is) seems to pick up a lot more copyvios so I think we're definitely in net-positive territory :)

^ I was going to test with more CPU (couldn't since the max is 1), so I cleared the pending records from the database to be able to compare how the bot would keep up with edits.

@JJMC89 Okay, doing some more comparisons, here are a few outliers that I found:

I can keep digging, but I thought I'd stop there. Again, I hope this doesn't come off as complaints about the new bot (I'm not even sure it's the bot's fault!); I just want to reiterate that overall the new bot seems to pick up a lot more copyvios \o/ I almost wonder if our users will feel overwhelmed? They'll have the final say on the validity of the reports, as well as on any issues with volume. If there are options to play with (e.g. "only show me diffs larger than this size"), maybe we could integrate them into the UI. Since we now store the rev_parent_id, I can cheaply compute diff sizes and add a filter for that. We also have the percentages in storage, so we could add a filter for "only reports with matches > N% of the diff". I'll create tickets for these later!
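The diff-size and match-percentage filters floated above could be sketched like this (the `Report` shape and field names are hypothetical, not CopyPatrol's actual schema):

```python
from dataclasses import dataclass

@dataclass
class Report:
    rev_size: int         # byte length of the checked revision
    parent_size: int      # byte length of the rev_parent_id revision
    match_percent: float  # stored Turnitin match percentage

    @property
    def diff_size(self) -> int:
        # Cheap proxy for diff size: difference in revision byte lengths,
        # computable because rev_parent_id is now stored
        return self.rev_size - self.parent_size

def filter_reports(reports, min_diff=0, min_percent=0.0):
    """Keep reports with diffs larger than min_diff bytes and
    matches above min_percent, per the proposed UI filters."""
    return [r for r in reports
            if r.diff_size > min_diff and r.match_percent > min_percent]

reports = [
    Report(rev_size=5000, parent_size=4900, match_percent=80.0),  # tiny diff
    Report(rev_size=9000, parent_size=4000, match_percent=55.0),  # large diff
]
assert filter_reports(reports, min_diff=500) == [reports[1]]
```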

> @JJMC89 Okay, doing some more comparisons, here's a few outliers that I found:

Both diffs were checked by the new bot.
1168252837 was just moving text.
1168264746 moved some text and added the below.

Harold V. Cohen of ''[[Pittsburgh Post-Gazette]]'' praised the film's "sharp and unmistakable" art style and animation, but found the characters underdeveloped and "not exactly memorable".<ref>{{cite magazine|url=https://books.google.com/books?id=nNJaAAAAIBAJ&pg=PA12&dq=sleeping+beauty+1959&article_id=5478,1150281&hl=en&sa=X&redir_esc=y#v=onepage&q=sleeping%20beauty%201959&f=false|title=Walt Disney's 'Sleeping Beauty' Comes to Nixon|magazine=[[Pittsburgh Post-Gazette]]|page=10|date=March 9, 1959|access-date=August 1, 2023|via=Google News Archive|last=Cohen|first=Harold V.}}</ref>

The old bot only looks at inserted text, but the new one looks at inserted and replaced text (as determined by Python's difflib.SequenceMatcher, which cannot detect moves). The new bot saw 1168264746 as too small to send to Turnitin, but 1168252837 was big enough. Unless that one added sentence was copied, both are false positives due to moving text.
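The difflib behavior described above is easy to demonstrate: SequenceMatcher reports a moved element as a delete in one place and an insert in another, never as a move, so any checker that submits 'insert'/'replace' spans will treat moved text as new:

```python
import difflib

old = ["A", "B", "C", "D"]
new = ["B", "C", "D", "A"]  # "A" was only moved to the end

# SequenceMatcher cannot express a move: it emits a delete at the old
# position and an insert at the new one.
ops = [tag for tag, *_ in difflib.SequenceMatcher(a=old, b=new).get_opcodes()]
assert ops == ["delete", "equal", "insert"]
# A bot submitting the 'insert' spans would send "A" to Turnitin even
# though no new text was actually added.
```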

1168197489 was reverted (detected by mw-reverted tag), and the new bot skips reverted edits.
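The mw-reverted skip can be sketched as a simple tag check (mw-reverted is MediaWiki's real change tag for undone edits; the function and list-of-tags input are illustrative, not the bot's actual code):

```python
def is_reverted(change_tags):
    """True if the revision was undone, per MediaWiki's mw-reverted
    change tag. An illustrative helper, assuming the caller already has
    the revision's tag list."""
    return "mw-reverted" in change_tags

assert is_reverted(["mw-reverted", "mobile edit"])
assert not is_reverted(["mobile edit"])
```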

  • (old) and (new) reference the same diff, but give a different percentage of edit for the same URL. I'm not sure if this is expected or not. Other identical diffs have different percentages too on old vs. new, but I haven't seen as big of a difference as 75% vs 54%.

Different amounts of text were submitted to Turnitin due to differences in wikitext cleaning. For example, the old bot submitted some incomplete parts of the list from the Administrative and Appeal Division section, but the new bot has the complete list.

The report for the new bot shows a maximum match of 34%, which is below the minimum of 50% to add to CopyPatrol.

> I can keep digging but I thought I'd stop there. Again, I hope this doesn't come off as complaints about the new bot or anything (I'm not even sure if it's the bot's fault!); I just want to again reiterate that overall the new bot seems to pick up a lot more copyvios \o/ I almost wonder if our users will feel overwhelmed? I guess they'll have the final say on authenticity of the reports as well as issues if any with frequency. If there are options to play with (i.e. only give me diffs > this size), maybe we could integrate things like that in the UI. I know we now store the rev_parent_id so I can cheaply get diff sizes and add a filter for it. Also we have the percentages in storage, so we could add a filter for "only reports with matches > N% of the diff". I'll create tickets for these later!

T341217 has some related thoughts.

Btw, implementing the report viewer for the new reports would be helpful when reviewing and linking these. Finding reports in the Turnitin statistics can take a while, and they cannot be linked.

Questions:

  • Do we want to skip reverted edits?
  • Should 'replaced' text be included when checking diffs?
  • Should we change the minimum match threshold from 50%?

> Do we want to skip reverted edits?

I think it's a fantastic feature! The one issue, however, is that admins will usually want to revdel the text, too. Since mw-reverted is so easy to check now, maybe we can include all reverted edits but clearly mark them as such on the frontend. Maybe even add a link to revdel it if the logged-in user is an admin.

> Should 'replaced' text be included when checking diffs?

I don't see why not.

> Should we change the minimum match threshold from 50%?

Maybe. I personally felt that https://copypatrol.toolforge.org/en/?id=101162599 was worthy of inclusion, as it was a new article. The iThenticate report (working link this time!) suggests that unless the source is freely licensed, it's a copyvio, no matter how small the diff may have been.

All of this, of course, is just my opinion. I think at the least we should include reverted edits that aren't revdel'd (I understand the bot has no choice but to ignore revdel'd edits anyway), but save the other questions for our users. We can try to get that feedback now, but I'm sure we'll get some once we flip the switch to the new bot, based on the larger influx of reports alone.

I asked because of the volume of reports to review. Compare what was in the old database for enwiki at migration (~215k) to what is pending from running for about 16 days (~22k; ~1.4k/day). So I'm looking for ways to reduce the number of reports, especially false positives.

> select lang, status, count(*) from diffs where status >= 0 group by lang, status;
+--------+--------+----------+
| lang   | status | count(*) |
+--------+--------+----------+
| en     |      0 |    22153 |
| en     |      1 |   104371 |
| en     |      2 |   110187 |
| es     |      0 |    35340 |
| es     |      1 |    13940 |
| es     |      2 |     4430 |
| fr     |      0 |    36718 |
| fr     |      1 |    10404 |
| fr     |      2 |      680 |
| simple |      0 |       35 |
| simple |      1 |       50 |
| simple |      2 |       19 |
+--------+--------+----------+

Besides those questions/issues, the old bot removed <ref>...</ref> if it contained fewer than 50 words, but the new one does not, which could be contributing to false positives.
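The old bot's heuristic as described could look roughly like this (the regex and helper are illustrative, not either bot's actual code):

```python
import re

# Match non-self-closing <ref ...>...</ref> elements and their contents.
REF_RE = re.compile(r"<ref[^>/]*>(.*?)</ref>", re.DOTALL | re.IGNORECASE)

def strip_short_refs(wikitext, min_words=50):
    """Drop <ref> elements whose contents are under min_words words,
    so short citation templates don't get submitted as prose."""
    def repl(match):
        words = match.group(1).split()
        return "" if len(words) < min_words else match.group(0)
    return REF_RE.sub(repl, wikitext)

text = "Some prose.<ref>{{cite web|url=https://example.org}}</ref> More prose."
assert strip_short_refs(text) == "Some prose. More prose."
```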

>> Do we want to skip reverted edits?
>
> I think it's a fantastic feature! The one issue, however, is that admins will usually want to revdel the text, too. Since mw-reverted is so easy to check now, maybe we can include all reverted edits but clearly mark them as such on the frontend. Maybe even add a link to revdel it if the logged-in user is an admin.

I can remove the check that skips these.

>> Should 'replaced' text be included when checking diffs?
>
> I don't see why not.

It could be causing a number of false positives.

>> Should we change the minimum match threshold from 50%?
>
> Maybe. I personally felt that https://copypatrol.toolforge.org/en/?id=101162599 was worthy of inclusion, as it was a new article. The iThenticate report (working link this time!) suggests that unless the source is freely licensed, it's a copyvio, no matter how small the diff may have been.
>
> All of this, of course, is just my opinion. I think at the least we should include reverted edits that aren't revdel'd (I understand the bot has no choice but to ignore revdel'd edits anyway), but save the other questions for our users. We can try to get that feedback now, but I'm sure we'll get some once we flip the switch to the new bot, based on the larger influx of reports alone.

I'd prefer to get some feedback via the staging tool before going live, if possible.

JJMC89 changed the task status from Open to Stalled. (Feb 6 2024, 7:19 PM)

This is blocked on the WMF signing an agreement with Turnitin.

JJMC89 changed the task status from Stalled to Open. (Mon, Apr 8, 5:37 PM)
JJMC89 updated the task description.

Mentioned in SAL (#wikimedia-cloud) [2024-04-09T17:42:40Z] <wmbot~jjmc89@tools-bastion-12> delete toolforge jobs arwiki enwiki eswiki frwiki simplewiki for T333724

JJMC89 changed the task status from Open to In Progress. (Tue, Apr 9, 5:47 PM)

Mentioned in SAL (#wikimedia-cloud) [2024-04-09T18:07:51Z] <JJMC89> backend deployed on copypatrol-backend-prod-01 for T333724

JJMC89 updated the task description.