Incorporate Earwig's copyright detector into PageTriage
Open, Needs Triage · Public · Feature

Assigned To

None

Authored By

	Novem_Linguae
	Feb 22 2023, 10:27 PM

Description

Earwig's copyright detection tool: https://copyvios.toolforge.org/

At first I dismissed incorporating this tool into PageTriage as too difficult on the technical side. But there are multiple benefits that would justify spending some time and money on this.

Keep in mind, this tool is probably the best copyright detector, and is the one we encourage our patrollers to run manually on every new article.

Benefits of incorporating it into PageTriage:

We could methodically scan every new mainspace article and draft with it, achieving a 100% check rate. No chance of forgetting.
It usually takes around 15-40 seconds to run. Doing this check automatically before NPP and AFC reviewers get to the article or draft would save time.
Building it into PageTriage, rather than leaving it as a separate third-party tool, means fewer clicks.
There are legal benefits to Wikipedia and WMF to being very methodical with copyright.

Technical challenges:

Sounds like Earwig's tool uses a search engine API, and that there is some kind of daily limit, after which the tool shuts off. We'd have to investigate the details of this limit (would money raise this limit? how much does it cost? do we already spend money on this or are we using a freemium API?)
Tool is currently third party. To incorporate it would take some work.
Coding it to only run once per article (rather than once per view, once per revision, etc.) would necessitate some kind of background process when the article is first created.
Depends on an external API, which may sometimes go down, be slow, etc.
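
The daily limit and the once-per-article requirement above could be handled together by a small piece of bookkeeping. The following is only a sketch under stated assumptions: the `ScanBudget` class, its `try_claim` method and the limit value are hypothetical, and the real quota of the search engine API still needs to be investigated.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional, Set


@dataclass
class ScanBudget:
    """Tracks which pages were scanned and how many scans were used today."""

    daily_limit: int  # hypothetical value; the real API quota is not known here
    day: date = field(default_factory=date.today)
    used_today: int = 0
    scanned_pages: Set[int] = field(default_factory=set)

    def try_claim(self, page_id: int, today: Optional[date] = None) -> bool:
        """Return True if page_id should be scanned now, and record the claim."""
        today = today or date.today()
        if today != self.day:
            # A new day started, so the daily counter resets.
            self.day = today
            self.used_today = 0
        if page_id in self.scanned_pages:
            return False  # already scanned once; never scan again
        if self.used_today >= self.daily_limit:
            return False  # quota exhausted; page stays unclaimed for a retry
        self.scanned_pages.add(page_id)
        self.used_today += 1
        return True
```

Pages refused because the quota ran out are not recorded as scanned, so a later run can pick them up. A production version would need persistent storage rather than an in-memory set.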

Implementation ideas (similar to T330346: Detection and flagging of articles that are AI/LLM-generated)

We could create something similar to the pagetriagetagcopyvio API. Some external tool would use the EventStreams API or something similar to check every new article, then use the pagetriagetagcopyvio API to apply the tags
- Note: look into which tool currently uses the pagetriagetagcopyvio API. Look at how that tool works. See if it is as good as Earwig.
- Note: see pagetriagetagcopyvio API parent task T199359
We could do the scan internally in PageTriage, adding the scan code somewhere near the code that sets the SQL pagetriage_page_tags (ArticleMetadata class), which as I recall already runs as a background process (it runs on article creation in PHP, and keeps running after output is sent to the browser, so the user notices no page load lag)
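
For the first idea, the external tool's event filtering might look roughly like the sketch below. It assumes the EventStreams `page-create` stream delivers one JSON object per `data:` line with `database`, `page_namespace` and `page_title` fields; those field names, the stream URL and all function names here are assumptions to verify against the EventStreams documentation. The copyright scan itself and the call to the pagetriagetagcopyvio API are deliberately left out.

```python
import json
from typing import Iterable, Iterator

# Assumed stream URL for page creation events; verify before relying on it.
STREAM_URL = "https://stream.wikimedia.org/v2/stream/page-create"

# On English Wikipedia, namespace 0 is mainspace and 118 is Draft.
SCAN_NAMESPACES = {0, 118}


def iter_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield decoded JSON payloads from the `data:` lines of an SSE stream."""
    for line in lines:
        if not line.startswith("data:"):
            continue  # skip `event:`, `id:` and blank keep-alive lines
        try:
            payload = json.loads(line[len("data:"):].strip())
        except json.JSONDecodeError:
            continue  # ignore partial or malformed payloads
        if isinstance(payload, dict):
            yield payload


def should_scan(event: dict, wiki: str = "enwiki") -> bool:
    """True for new pages on the target wiki in mainspace or draftspace."""
    return (
        event.get("database") == wiki
        and event.get("page_namespace") in SCAN_NAMESPACES
    )


def pages_to_scan(lines: Iterable[str], wiki: str = "enwiki") -> Iterator[str]:
    """Yield titles of newly created pages that deserve a copyvio scan."""
    for event in iter_events(lines):
        title = event.get("page_title")
        if title and should_scan(event, wiki):
            yield title
```

In a live run, `lines` would be the decoded lines of a long-lived HTTP response from `STREAM_URL`; each yielded title would then be passed to the copyright check and, on a hit, to the tagging API.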

Credit to @Samwalton9 for the idea.

Related Objects

		Status	Subtype	Assigned	Task
		Open	Feature	None	T330348 Incorporate Earwig's copyright detector into PageTriage
		Open		None	T330435 explore how CopyPatrol, Eranbot, and Earwig copyvio tool work

Event Timeline

Novem_Linguae created this task. Feb 22 2023, 10:27 PM

Restricted Application added a project: Growth-Team. Feb 22 2023, 10:27 PM

Restricted Application added a subscriber: Aklapper.

Novem_Linguae moved this task from Backlog to Priority big features on the PageTriage board. Feb 22 2023, 10:28 PM

Could you clarify what you mean by 'incorporating'? Does this mean using an API or somesuch to get a result from Earwig and then presenting it to users in PageTriage, or fully duplicating the Earwig functionality into PageTriage?

No idea. Would like to discuss in this ticket. Any of the "implementation idea" bullets above are possible technical solutions.

Novem_Linguae added a subtask: T330435: explore how CopyPatrol, Eranbot, and Earwig copyvio tool work. Feb 23 2023, 8:33 PM

Tool is currently third party. To incorporate it would take some work.

When we looked at this a few years ago, we decided we couldn't directly integrate Earwig's tool due to rules about calling non-production services from production.

We could look at the possibility of packaging its code https://github.com/earwig/copyvios for deployment in Wikimedia's Kubernetes cluster; then it would be possible to call its API directly from MediaWiki.

The rule is only for the server side though, right? Although if it's the best tool available, it would be nice to get it productionized (as long as that has no negative effect on development).

HouseBlaster updated the task description. Feb 28 2023, 5:50 PM

HouseBlaster subscribed.

DFlhb subscribed. Mar 30 2023, 12:47 AM

Aklapper removed a project: PageTriage. Oct 17 2023, 9:06 PM

Aklapper edited projects, added PageTriage; removed Growth-Team. Oct 17 2023, 9:11 PM

Hey_man_im_josh subscribed. Dec 7 2023, 1:59 PM

DFlhb unsubscribed. Jan 31 2024, 1:46 PM

DFlhb subscribed.

ppelberg mentioned this in T359107: Copyvio Check: Prompt people pasting text to consider risk of copyright violation. Mar 4 2024, 9:05 PM