Incorporate Earwig's copyright detector into PageTriage
Open, Needs Triage · Public · Feature

Description

Earwig's copyright detection tool: https://copyvios.toolforge.org/

At first I dismissed incorporating this tool into PageTriage as too difficult on the technical side. But there are multiple benefits that would justify spending some time and money on this.

Keep in mind, this tool is probably the best copyright detector, and is the one we encourage our patrollers to run manually on every new article.

Benefits of incorporating it into PageTriage:

  • We could methodically scan every new mainspace article and draft with it, achieving a 100% check rate. No chance of forgetting.
  • It usually takes around 15-40 seconds to run. Doing this check automatically before an NPP or AFC reviewer gets to the article/draft would save time.
  • Incorporating it into PageTriage, instead of leaving it as a third-party tool, means fewer clicks.
  • There are legal benefits to Wikipedia and WMF to being very methodical with copyright.

Technical challenges:

  • It sounds like Earwig's tool uses a search engine API with some kind of daily limit, after which the tool shuts off. We'd have to investigate the details of this limit. (Would money raise the limit? How much does it cost? Do we already spend money on this, or are we using a freemium API?)
  • Tool is currently third party. To incorporate it would take some work.
  • Coding it to only run once per article (rather than once per view, once per revision, etc.) would necessitate some kind of background process when the article is first created.
  • Depends on an external API, which may sometimes go down, be slow, etc.
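To make the quota, run-once, and downtime concerns above more concrete, here is a minimal sketch of a gate that a background job could consult before scanning a page. Everything in it is an assumption for illustration: the `ScanGate` class, its method names, and the quota value are hypothetical, and the real daily limit of the search engine API is not known here.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ScanGate:
    """Hypothetical helper: decides whether a copyvio scan may run right now."""

    daily_quota: int  # assumed value; the real search API limit is unknown
    used_today: int = 0
    day: date = field(default_factory=date.today)
    claimed: set = field(default_factory=set)     # page IDs scanned or in progress
    deferred: list = field(default_factory=list)  # page IDs waiting for quota or a retry

    def request_scan(self, page_id, today=None):
        """Return "scan", "skip" (already handled), or "defer" (no quota left)."""
        today = today or date.today()
        if today != self.day:
            # A new day has started, so the quota counter resets.
            self.day = today
            self.used_today = 0
        if page_id in self.claimed:
            return "skip"
        if self.used_today >= self.daily_quota:
            if page_id not in self.deferred:
                self.deferred.append(page_id)
            return "defer"
        self.used_today += 1
        self.claimed.add(page_id)
        if page_id in self.deferred:
            self.deferred.remove(page_id)
        return "scan"

    def record_result(self, page_id, ok):
        """On failure (API down or slow), release the claim and queue a retry."""
        if not ok:
            self.claimed.discard(page_id)
            if page_id not in self.deferred:
                self.deferred.append(page_id)
```

A real implementation would need persistent storage (for example a database table) rather than in-memory state, and the failed scan here still counts against the day's quota, which may or may not match how the upstream API bills requests.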

Implementation ideas (similar to T330346: Detection and flagging of articles that are AI/LLM-generated)

  • We could create something similar to the pagetriagetagcopyvio API. Some external tool would use the EventStreams API or something similar to check every new article, then use the pagetriagetagcopyvio API to apply the tags
    • Note: look into which tool currently uses the pagetriagetagcopyvio API. Look at how that tool works. See if it is as good as Earwig's.
    • Note: see pagetriagetagcopyvio API parent task T199359
  • We could do the scan internally in PageTriage, adding the scan code somewhere near the code that sets the SQL pagetriage_page_tags (ArticleMetadata class), which as I recall already runs as a background process (it runs on article creation in PHP, and keeps running after output is sent to the browser, so the user notices no page load lag)
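As a rough sketch of the first idea (an external tool), the pure functions below show the three decisions such a tool would make: which page-creation events to scan, what request to send to Earwig's tool, and when to apply the tag. Several details are assumptions that would need checking: the event field names (`database`, `page_namespace`, `rev_id`) are what I'd expect from an EventStreams page-creation stream, the `api.json` endpoint and its query parameters are my understanding of Earwig's API, the `revid` parameter for pagetriagetagcopyvio is unverified, and the 0.5 confidence threshold is made up. No network calls are made here, and the CSRF token a real write request needs is omitted.

```python
from urllib.parse import urlencode

# Assumed endpoint of Earwig's tool; verify against its documentation.
EARWIG_API = "https://copyvios.toolforge.org/api.json"

# On enwiki, namespace 0 is mainspace and 118 is Draft.
SCANNED_NAMESPACES = {0, 118}


def should_scan(event, wiki="enwiki"):
    """Keep only page creations on the target wiki in mainspace or Draft."""
    return (
        event.get("database") == wiki
        and event.get("page_namespace") in SCANNED_NAMESPACES
    )


def earwig_search_url(title, lang="en", project="wikipedia"):
    """Build the scan request URL (parameter names are assumptions)."""
    params = {
        "version": 1,
        "action": "search",
        "project": project,
        "lang": lang,
        "title": title,
        "use_engine": 1,  # use the search engine API (subject to the daily limit)
        "use_links": 1,   # also compare against links found in the page
    }
    return EARWIG_API + "?" + urlencode(params)


def tag_request(rev_id, confidence, threshold=0.5):
    """Return API parameters for tagging, or None if below the threshold.

    The threshold is a hypothetical value, not one taken from Earwig's tool.
    """
    if confidence < threshold:
        return None
    return {"action": "pagetriagetagcopyvio", "revid": rev_id, "format": "json"}
```

The same three steps would apply to the second idea (scanning inside PageTriage); only the trigger would change, from a stream consumer to the existing background metadata job.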

Credit to @Samwalton9 for the idea.

Event Timeline

Restricted Application added a subscriber: Aklapper.

Could you clarify what you mean by 'incorporating'? Does this mean using an API or somesuch to get a result from Earwig and then presenting it to users in PageTriage, or fully duplicating the Earwig functionality into PageTriage?

No idea. Would like to discuss in this ticket. Any of the "implementation idea" bullets above are possible technical solutions.

Tool is currently third party. To incorporate it would take some work.

When we looked at this a few years ago, we decided we couldn't directly integrate Earwig's tool due to rules about calling non-production services from production.

We could look at the possibility of packaging its code https://github.com/earwig/copyvios for deployment in Wikimedia's Kubernetes cluster; then it would be possible to call its API directly from MediaWiki.

Tgr subscribed.

The rule is only for the server side though, right? Although if it's the best tool available, it would be nice to get it productionized (as long as that has no negative effect on development).