
Deploy new CopyPatrol bot and update frontend
Closed, Resolved · Public · 8 Estimated Story Points

Description

The old bot (EranBot) that powered CopyPatrol was written in Python 2, which is now EOL. Our gracious rockstar of a volunteer @JJMC89 has rewritten the bot from scratch in Python 3 (T293688), and it is now ready to be deployed.

Repository for the bot: https://github.com/JJMC89/copypatrol-backend

Staging checklist

  • First deploy the bot to staging to ensure everything works smoothly
  • For now, use the test db s52615__copypatrol_migrate_test_02_p
  • Rework CopyPatrol to interface with the new bot (T340600)
  • Deploy the new CopyPatrol code to a VPS test instance
  • Seek approval from Turnitin. We're currently only using the sandbox version of TCA (Turnitin Core API). This conversation could also cover negotiating a long-term supply of credits (T305318)

Production checklist

Details

Other Assignee
JJMC89

Related Objects

Event Timeline

KSiebert set the point value for this task to 8. (Apr 5 2023, 2:38 PM)

Initial findings with new bot versus the old:

These seem to be copyvios, reported by new bot but not the old (GOOD):

Confirmed copyvios NOT reported by the new bot, but were reported by the old (BAD, maybe?):

https://copypatrol.toolforge.org/en/?id=100989249 is https://plagiabot.toolforge.org/en?id=19b407b6-a536-429d-ae0e-fca4ea996228

Since the new bot is behind, if the page gets deleted or the edit gets revdeled, the new bot will skip it, since it cannot access the text. I'm guessing that is the case for https://copypatrol.toolforge.org/en/?id=100990412.

> Since the new bot is behind, if the page gets deleted or the edit gets revdeled, the new bot will skip it, since it cannot access the text. I'm guessing that is the case for https://copypatrol.toolforge.org/en/?id=100990412.

Ah, that makes sense! Also, I fixed two bugs: one with the regex for the ithenticate route that you mentioned in the PR, and another that prevented https://plagiabot.toolforge.org/en?id=19b407b6-a536-429d-ae0e-fca4ea996228 from loading (IPs have no editcount).

T342942 definitely helped! Going back to comparing the old feed versus the new, there are still some reports shown in one and not the other. However, the new feed appears to be more comprehensive: in the 20th hour of 2023-07-28 alone, the new feed has 20 reports while the old one has only 1. That one report does appear to be an actual copyvio, so it's concerning that the new feed isn't surfacing it. But I suppose the new feed is still better, as nearly all of the 20 cases I checked also seem legit. Maybe this is something we should report to Turnitin, as it would appear the logic changed between their old API and the new one, and I think we should be upfront with our users about what those changes are.

> The one report does appear to be an actual copyvio, so that's concerning that the new feed isn't surfacing it. But I suppose the new feed is still better as nearly all the 20 cases I checked also seem legit. Maybe this is something we should report to Turnitin, as it would appear the logic changed from their old API versus the new one, and I think we should be upfront with our users about what those changes are.

This was due to a bug in the backend. I had assumed that the rev_content_changed key would always be present for events in the revision-create stream, but it isn't present for page creations.
Should be fixed with https://github.com/JJMC89/copypatrol-backend/pull/37.
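The fix presumably boils down to not assuming the key exists. A minimal sketch of that pattern (the handler name and event shape here are illustrative, not the bot's actual code; only the `rev_content_changed` field name comes from the revision-create stream):

```python
# Sketch of the defensive pattern the fix implies (hypothetical handler).
# Page creations in the revision-create stream omit rev_content_changed
# entirely, so a plain event["rev_content_changed"] lookup raises KeyError
# and the new page is never checked.
def should_check(event):
    """Decide whether a revision-create event warrants a copyvio check."""
    # Treat a missing key as "content changed" instead of crashing:
    return event.get("rev_content_changed", True)

# Page creation: no rev_content_changed key at all
assert should_check({"rev_id": 123}) is True
# Edit that did not change content: explicitly skipped
assert should_check({"rev_id": 124, "rev_content_changed": False}) is False
```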

Awesome, thanks!

Here are a few more I found that don't appear in the new feed:

Not sure if the bot is at fault, but if not I'd like to ask Turnitin about it. Either way, the new bot (or API, whichever it is) seems to pick up a lot more copyvios so I think we're definitely in net-positive territory :)

^ I was going to test with more CPU (couldn't since the max is 1), so I cleared the pending records from the database to be able to compare how the bot would keep up with edits.

@JJMC89 Okay, doing some more comparisons, here are a few outliers that I found:

I can keep digging, but I thought I'd stop there. Again, I hope this doesn't come off as complaints about the new bot (I'm not even sure it's the bot's fault!); I just want to reiterate that overall the new bot seems to pick up a lot more copyvios \o/ I almost wonder if our users will feel overwhelmed? They'll have the final say on the validity of the reports, as well as on any issues with volume. If there are options to play with (e.g. "only show me diffs larger than this size"), maybe we could integrate them into the UI. Since we now store the rev_parent_id, I can cheaply compute diff sizes and add a filter for that. We also have the percentages in storage, so we could add a filter for "only reports with matches > N% of the diff". I'll create tickets for these later!
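The diff-size and match-percentage filters floated above could be sketched like this (the `Report` shape and field names are hypothetical, not CopyPatrol's actual schema):

```python
from dataclasses import dataclass

@dataclass
class Report:
    rev_size: int         # byte length of the checked revision
    parent_size: int      # byte length of the rev_parent_id revision
    match_percent: float  # stored Turnitin match percentage

    @property
    def diff_size(self) -> int:
        # Cheap proxy for diff size: difference in revision byte lengths,
        # computable because rev_parent_id is now stored
        return self.rev_size - self.parent_size

def filter_reports(reports, min_diff=0, min_percent=0.0):
    """Keep reports with diffs larger than min_diff bytes and
    matches above min_percent, per the proposed UI filters."""
    return [r for r in reports
            if r.diff_size > min_diff and r.match_percent > min_percent]

reports = [
    Report(rev_size=5000, parent_size=4900, match_percent=80.0),  # tiny diff
    Report(rev_size=9000, parent_size=4000, match_percent=55.0),  # large diff
]
assert filter_reports(reports, min_diff=500) == [reports[1]]
```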

> @JJMC89 Okay, doing some more comparisons, here's a few outliers that I found:

Both diffs were checked by the new bot.
1168252837 was just moving text.
1168264746 moved some text and added the below.

Harold V. Cohen of ''[[Pittsburgh Post-Gazette]]'' praised the film's "sharp and unmistakable" art style and animation, but found the characters underdeveloped and "not exactly memorable".<ref>{{cite magazine|url=https://books.google.com/books?id=nNJaAAAAIBAJ&pg=PA12&dq=sleeping+beauty+1959&article_id=5478,1150281&hl=en&sa=X&redir_esc=y#v=onepage&q=sleeping%20beauty%201959&f=false|title=Walt Disney's 'Sleeping Beauty' Comes to Nixon|magazine=[[Pittsburgh Post-Gazette]]|page=10|date=March 9, 1959|access-date=August 1, 2023|via=Google News Archive|last=Cohen|first=Harold V.}}</ref>

The old bot only looks at inserted text, but the new one looks at inserted and replaced text (as determined by Python's difflib.SequenceMatcher, which cannot detect moves). The new bot saw 1168264746 as too small to send to Turnitin, but 1168252837 was big enough. Unless that one added sentence was copied, both are false positives due to moving text.
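The difflib behavior described above is easy to demonstrate: SequenceMatcher reports a moved element as a delete in one place and an insert in another, never as a move, so any checker that submits 'insert'/'replace' spans will treat moved text as new:

```python
import difflib

old = ["A", "B", "C", "D"]
new = ["B", "C", "D", "A"]  # "A" was only moved to the end

# SequenceMatcher cannot express a move: it emits a delete at the old
# position and an insert at the new one.
ops = [tag for tag, *_ in difflib.SequenceMatcher(a=old, b=new).get_opcodes()]
assert ops == ["delete", "equal", "insert"]
# A bot submitting the 'insert' spans would send "A" to Turnitin even
# though no new text was actually added.
```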

1168197489 was reverted (detected by mw-reverted tag), and the new bot skips reverted edits.
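The mw-reverted skip can be sketched as a simple tag check (mw-reverted is MediaWiki's real change tag for undone edits; the function and list-of-tags input are illustrative, not the bot's actual code):

```python
def is_reverted(change_tags):
    """True if the revision was undone, per MediaWiki's mw-reverted
    change tag. An illustrative helper, assuming the caller already has
    the revision's tag list."""
    return "mw-reverted" in change_tags

assert is_reverted(["mw-reverted", "mobile edit"])
assert not is_reverted(["mobile edit"])
```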

  • (old) and (new) reference the same diff, but give a different percentage of edit for the same URL. I'm not sure if this is expected or not. Other identical diffs have different percentages too on old vs. new, but I haven't seen as big of a difference as 75% vs 54%.

Different amounts of text were submitted to Turnitin due to differences in wikitext cleaning. For example, the old bot submitted some incomplete parts of the list from the Administrative and Appeal Division section, but the new bot has the complete list.

The report for the new bot shows a maximum match of 34%, which is below the minimum of 50% to add to CopyPatrol.

> I can keep digging but I thought I'd stop there. Again, I hope this doesn't come off as complaints about the new bot or anything (I'm not even sure if it's the bot's fault!); I just want to again reiterate that overall the new bot seems to pick up a lot more copyvios \o/ I almost wonder if our users will feel overwhelmed? I guess they'll have the final say on authenticity of the reports as well as issues if any with frequency. If there are options to play with (i.e. only give me diffs > this size), maybe we could integrate things like that in the UI. I know we now store the rev_parent_id so I can cheaply get diff sizes and add a filter for it. Also we have the percentages in storage, so we could add a filter for "only reports with matches > N% of the diff". I'll create tickets for these later!

T341217 has some related thoughts.

Btw, implementing the report viewer for the new reports would be helpful when reviewing and linking these. Finding reports in the Turnitin statistics can take a while, and they cannot be linked.

Questions:

  • Do we want to skip reverted edits?
  • Should 'replaced' text be included when checking diffs?
  • Should we change the minimum match threshold from 50%?

> Do we want to skip reverted edits?

I think it's a fantastic feature! The one issue, however, is that admins will usually want to revdel the text, too. Since mw-reverted is so easy to check now, maybe we can include all reverted edits but clearly mark them as such on the frontend. Maybe even add a link to revdel it if the logged-in user is an admin.

> Should 'replaced' text be included when checking diffs?

I don't see why not.

> Should we change the minimum match threshold from 50%?

Maybe. I personally felt that https://copypatrol.toolforge.org/en/?id=101162599 was worthy of inclusion, as it was a new article. The iThenticate report (working link this time!) suggests that unless the source is freely licensed, it's a copyvio, no matter how small the diff may have been.

All of this, of course, is just my opinion. I think at the least we should include reverted edits that aren't revdel'd (I understand the bot has no choice but to ignore revdel'd edits anyway), but save the other questions for our users. We can try to get that feedback now, but I'm sure we'll get some once we flip the switch to the new bot, based on the larger influx of reports alone.

I asked because of the volume of reports to review. Compare what was in the old database for enwiki at migration (~215k) to what is pending from running for about 16 days (~22k; ~1.4k/day). So I'm looking for ways to reduce the number of reports, especially false positives.

> select lang, status, count(*) from diffs where status >= 0 group by lang, status;
+--------+--------+----------+
| lang   | status | count(*) |
+--------+--------+----------+
| en     |      0 |    22153 |
| en     |      1 |   104371 |
| en     |      2 |   110187 |
| es     |      0 |    35340 |
| es     |      1 |    13940 |
| es     |      2 |     4430 |
| fr     |      0 |    36718 |
| fr     |      1 |    10404 |
| fr     |      2 |      680 |
| simple |      0 |       35 |
| simple |      1 |       50 |
| simple |      2 |       19 |
+--------+--------+----------+

Besides those questions/issues, the old bot removed <ref>...</ref> if it contained fewer than 50 words, but the new one does not, which could be contributing to false positives.
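The old bot's heuristic as described could look roughly like this (the regex and helper are illustrative, not either bot's actual code):

```python
import re

# Match non-self-closing <ref ...>...</ref> elements and their contents.
REF_RE = re.compile(r"<ref[^>/]*>(.*?)</ref>", re.DOTALL | re.IGNORECASE)

def strip_short_refs(wikitext, min_words=50):
    """Drop <ref> elements whose contents are under min_words words,
    so short citation templates don't get submitted as prose."""
    def repl(match):
        words = match.group(1).split()
        return "" if len(words) < min_words else match.group(0)
    return REF_RE.sub(repl, wikitext)

text = "Some prose.<ref>{{cite web|url=https://example.org}}</ref> More prose."
assert strip_short_refs(text) == "Some prose. More prose."
```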

>> Do we want to skip reverted edits?
>
> I think it's a fantastic feature! The one issue, however, is that admins will usually want to revdel the text, too. Since mw-reverted is so easy to check now, maybe we can include all reverted edits but clearly mark them as such on the frontend. Maybe even add a link to revdel it if the logged-in user is an admin.

I can remove the check that skips these.

>> Should 'replaced' text be included when checking diffs?
>
> I don't see why not.

It could be causing a number of false positives.

>> Should we change the minimum match threshold from 50%?
>
> Maybe. I personally felt that https://copypatrol.toolforge.org/en/?id=101162599 was worthy of inclusion, as it was a new article. The iThenticate report (working link this time!) suggests that unless the source is freely licensed, it's a copyvio, no matter how small the diff may have been.
>
> All of this, of course, is just my opinion. I think at the least we should include reverted edits that aren't revdel'd (I understand the bot has no choice but to ignore revdel'd edits anyway), but save the other questions for our users. We can try to get that feedback now, but I'm sure we'll get some once we flip the switch to the new bot, based on the larger influx of reports alone.

I'd prefer to get some feedback via the staging tool before going live, if possible.

JJMC89 changed the task status from Open to Stalled. (Feb 6 2024, 7:19 PM)

This is blocked on the WMF signing an agreement with Turnitin.

JJMC89 changed the task status from Stalled to Open. (Mon, Apr 8, 5:37 PM)
JJMC89 updated the task description.

Mentioned in SAL (#wikimedia-cloud) [2024-04-09T17:42:40Z] <wmbot~jjmc89@tools-bastion-12> delete toolforge jobs arwiki enwiki eswiki frwiki simplewiki for T333724

JJMC89 changed the task status from Open to In Progress. (Tue, Apr 9, 5:47 PM)

Mentioned in SAL (#wikimedia-cloud) [2024-04-09T18:07:51Z] <JJMC89> backend deployed on copypatrol-backend-prod-01 for T333724

JJMC89 updated the task description.