
Copyvio: backfilling on existing pages
Closed, Declined · Public

Description

@Etonkovidova brought this up.

We want all pages in the New Pages Feed to be checked for copyvio, including the several thousand pages that are already in there. This is similar to how we wanted all pages to have ORES scores applied: T198982

Therefore, this task is for backfilling copyvio checking on the pages in the feed using the same business rules we will use on new pages: if any of a page's revisions has a copyvio flag, the page gets the flag. Hopefully, the CopyPatrol database will allow this to happen easily.

Event Timeline

My recommendation is to approach this from the CopyPatrol/Toolforge side rather than the NewPagesFeed/enwiki side. Suggested approach:

  1. Create a backfill script to run on the eranbot Toolforge space
  2. This script would operate on the 29,000 records in copyright_diffs that were created since 2016 and do not have "fixed" or "false positive" status (select count(distinct(page_title)) from copyright_diffs WHERE diff_timestamp > 20160620000000 AND lang = 'en' AND status NOT IN ('fixed', 'falsepositive');). I know this doesn't precisely match the logic we have elsewhere (if any revision of a page ever had copyvio, whether it's fixed or not, we want to show "Potential copyvio" in the UI), but if we do it this way we have 29k records to churn through instead of 59k.
  3. The script would take each record in copyright_diffs and make a POST request to the PageTriage copyvio tag API (a sketch of this request follows the list). That API takes care of checking that the revision ID exists and that the page is in the PageTriage queue, and it also generates a log entry for Special:Log and publishes it to Recent Changes, which is nice for transparency.
  4. However, we'd need to think about how to avoid flooding the Recent Changes table, and about the maximum number of successful POST requests (each of which generates a log entry) that we want to make per hour. @MMiller_WMF any thoughts on that limit?
  5. I recommend running this script after T201073: Copyvio: Make Eranbot call PageTriage with copyvio info is in production, so we don't have to worry about possibly missing some records.
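
For concreteness, here is a minimal sketch (in PHP, using curl) of the request in step 3. The action name pagetriagetagcopyvio and its revid parameter are my assumptions based on the T201073 work, and the snippet assumes an already-authenticated bot session and a CSRF token fetched beforehand; treat it as illustration, not a finished script.

// Hypothetical sketch: POST one CopyPatrol revision ID to the PageTriage
// copyvio tag API on enwiki. $cookieJar is an authenticated bot session and
// $csrfToken was fetched via action=query&meta=tokens beforehand (assumptions).
function tagRevisionAsCopyvio( $revId, $csrfToken, $cookieJar ) {
	$ch = curl_init( 'https://en.wikipedia.org/w/api.php' );
	curl_setopt_array( $ch, [
		CURLOPT_POST => true,
		CURLOPT_POSTFIELDS => http_build_query( [
			'action' => 'pagetriagetagcopyvio', // assumed action name (T201073)
			'revid' => $revId,
			'token' => $csrfToken,
			'format' => 'json',
		] ),
		CURLOPT_COOKIEFILE => $cookieJar,
		CURLOPT_COOKIEJAR => $cookieJar,
		CURLOPT_RETURNTRANSFER => true,
	] );
	$response = curl_exec( $ch );
	curl_close( $ch );
	sleep( 2 ); // crude throttle; see the rate-limit question in step 4
	return $response === false ? null : json_decode( $response, true );
}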

Other approaches

If we approach this from the enwiki / NewPagesFeed side, things are more complicated. I don't recommend we do it this way but here are notes that I wrote out while I was thinking through this.

We'd have to figure out which subset of pages we want to backfill data for. There are 116,523 reviewed and unreviewed pages across NPP (Article/User) and AfC. My assumption is that we don't need to backfill data for every page that's ever been in PageTriage.

AfC has 4,007 pages "Awaiting Review", so we definitely want to backfill those. Do we want to backfill data for the 20,272 unsubmitted drafts?

In NPP we have 25,721 unreviewed pages in the user namespace and 2,300 in the article namespace. We definitely want to backfill the article namespace; do we also want to backfill the user namespace pages?

Depending on the answers to the above, we'd be backfilling data for a minimum of 6,307 pages (AfC pages awaiting review plus unreviewed article-namespace pages), but potentially much more (~34k - 116k).

1. Backfill on demand with a system user

When a user visits Special:NewPagesFeed, their browser sends a request to the PageTriage List API to get pages. On the backend, we could check whether the copyvio tag already exists in enwiki. If it doesn't, we could trigger a job that checks whether this data is in CopyPatrol and then posts it to the PageTriage Tag Copyvio API endpoint, using a user created with User::newSystemUser( 'PageTriage Copyvio backfill manager', [ 'steal' => true ] ).
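
As a rough sketch of what that trigger could look like in the List API code path (the job name and the pageHasCopyvioTag() helper are hypothetical; User::newSystemUser() and JobSpecification are existing MediaWiki facilities):

// Hypothetical sketch of the on-demand trigger in the PageTriage List API.
// 'pageTriageCopyvioBackfill' is a made-up job name, and pageHasCopyvioTag()
// is a placeholder for "does this page already have the copyvio tag?".
if ( !$this->pageHasCopyvioTag( $title ) ) {
	JobQueueGroup::singleton()->lazyPush(
		new JobSpecification( 'pageTriageCopyvioBackfill', [
			'pageTitle' => $title->getPrefixedDBkey(),
		], [], $title )
	);
}

// Inside the job, actions would be attributed to the system user:
$user = User::newSystemUser( 'PageTriage Copyvio backfill manager', [ 'steal' => true ] );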

Advantages/disadvantages/concerns
  1. We'd need to create a system user.
  2. Backfilling the data with a system user that posts to the API means there will be transparency to the community about when/how these scores were added. Because backfilling happens on demand, we wouldn't in theory flood the Recent Changes page.
    1. That also means backfilling will be an incremental, slower process.
  3. With this approach the copyvio indicator would become available some time after the initial request, so a user might visit NewPagesFeed, not see any copyvio indicators, then refresh a few minutes later and see them.
  4. The job spawned by the PageTriage List API could make a GET request using the page title to https://tools.wmflabs.org/eranbot/plagiabot/api.py?action=suspected_diffs&page_title={title}, which returns data like this (a sketch of such a job follows further down):
[
    {
        "diff": "861150126",
        "project": "wikipedia",
        "ithenticate_id": "40125600",
        "diff_timestamp": "20180925131118",
        "page_ns": "0",
        "lang": "en"
    }
]

The diff value is the revision ID, which we can then use to post to the PageTriage copyvio API.

  5. After some period of time elapses, we could remove the code to backfill on demand as it would be redundant.
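
A minimal sketch of the run() method of the job described in point 4, assuming the job carries the page title as a parameter; Http::get() is a core MediaWiki helper, and the final hand-off to the copyvio tag API is only indicated by a comment:

// Hypothetical job body: look up the page title in CopyPatrol and, if found,
// hand each revision ID ("diff") on to the PageTriage copyvio tag API.
public function run() {
	$url = 'https://tools.wmflabs.org/eranbot/plagiabot/api.py?action=suspected_diffs&page_title='
		. urlencode( $this->params['pageTitle'] );
	$json = Http::get( $url, [ 'timeout' => 10 ], __METHOD__ );
	if ( $json === false ) {
		return true; // request failed; nothing to do
	}
	foreach ( (array)json_decode( $json, true ) as $row ) {
		$revId = (int)$row['diff'];
		// POST $revId to the PageTriage copyvio tag API as the system user
		// (see the earlier request sketch).
	}
	return true;
}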

2. Backfill with a dump of data and a system user

A variation on the above is to generate a CSV/JSON dump of the data in the CopyPatrol database, and then use that as the source for a maintenance script. The script would go through all the types of PageTriage pages we care about, check whether there's a matching entry in the CopyPatrol data dump, then post the data using a system user.

Advantages/disadvantages/concerns
  1. This would be the fastest way to import the data.
  2. People might be upset about the recent changes table getting flooded.
  3. We'd need to time things such that the dump of CopyPatrol data is generated fairly close to the time when we run the maintenance script, otherwise we'd miss some entries.

3. Backfill using MySQL queries

A variation on options 1 and 2: use straight SQL queries to backfill the data. No record of the changes would be left in Special:Log / Recent Changes.
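
For illustration, a direct-write version might look like the sketch below (in PHP via the DB abstraction rather than raw SQL). The pagetriage_tags / pagetriage_page_tags table and field names follow the PageTriage schema as I understand it, the 'copyvio' tag name is an assumption, and $pageId / $revId stand in for values taken from the CopyPatrol dump.

// Hypothetical direct-DB write: stores the copyvio revision ID against the
// page's 'copyvio' tag (assumed tag name). Nothing here shows up in
// Special:Log or Recent Changes, which is exactly the downside of this option.
$dbw = wfGetDB( DB_MASTER );
$tagId = $dbw->selectField(
	'pagetriage_tags', 'ptrt_tag_id', [ 'ptrt_tag_name' => 'copyvio' ]
);
$dbw->upsert(
	'pagetriage_page_tags',
	[ 'ptrpt_page_id' => $pageId, 'ptrpt_tag_id' => $tagId, 'ptrpt_value' => $revId ],
	[ [ 'ptrpt_page_id', 'ptrpt_tag_id' ] ],
	[ 'ptrpt_value' => $revId ]
);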

Recommendation after discussing with @SBisson:

  1. Create a maintenance script to run on enwiki prod
  2. Create a dump (JSON or CSV format) of the 29,000 records in CopyPatrol's copyright_diffs table that were created since 2016 and do not have "fixed" or "false positive" status.
  3. The maintenance script loads the dump from the previous step and iterates over each row in the dataset.
  4. Check if the page title is in the PageTriage queue; if not, skip the row.
  5. Call the PageTriage Tag Copyvio API with the revision ID, using Eranbot as the actor (see the sketch after this list).
  6. I recommend running this script after T201073: Copyvio: Make Eranbot call PageTriage with copyvio info is in production, so we don't have to worry about possibly missing some records.
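
For concreteness, a skeleton of that maintenance script is below. Assumptions: the dump is JSON with 'diff' and 'page_title' fields, the copyvio tag API can be invoked internally via DerivativeRequest/ApiMain, and the 'pagetriagetagcopyvio' action name comes from the T201073 work rather than being confirmed here; token/session handling in a CLI context may need adjustment. Whether the actor is Eranbot or a dedicated system user is discussed further down.

// Skeleton of the recommended backfill maintenance script (assumptions noted
// above; Maintenance.php require/bootstrap boilerplate omitted for brevity).
class BackfillPageTriageCopyvio extends Maintenance {
	public function __construct() {
		parent::__construct();
		$this->addArg( 'dump', 'Path to the CopyPatrol JSON dump' );
	}

	public function execute() {
		$rows = json_decode( file_get_contents( $this->getArg( 0 ) ), true );
		// The actor could be Eranbot (per step 5) or a dedicated system user.
		$user = User::newSystemUser( 'PageTriage Copyvio backfill user', [ 'steal' => true ] );
		$context = RequestContext::getMain();
		$context->setUser( $user );
		foreach ( $rows as $row ) {
			// Step 4 (queue check) and the revision check are sketched later
			// in this task; here we only show the internal API call of step 5.
			$request = new DerivativeRequest( $context->getRequest(), [
				'action' => 'pagetriagetagcopyvio', // assumed action name
				'revid' => (int)$row['diff'],
				'token' => $user->getEditToken(),
			], true );
			$api = new ApiMain( $request, true );
			$api->execute();
		}
	}
}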

To clarify: are you proposing using a system user? I think we should, to create a clear separation between imports from CopyPatrol that we did and things that EranBot does later.

@Catrope I updated my comment to clarify that we would use Eranbot as the actor since that's what I had in mind.

However, I'm fine with creating a system user for this, and I like the idea that it's kept separate, so if you and @SBisson agree I'll update my recommended approach text accordingly.

@kostajh -- I was able to follow along with some of this ticket, but I just want to confirm. Which pages and namespaces and time periods will you be covering with this approach?

@MMiller_WMF I was thinking about this some more yesterday.

I think it would make most sense for the backfill approach to mimic the logic we are implementing for Eranbot sending data to PageTriage. That is, any time an article is flagged with copyvio and its data is stored in the CopyPatrol DB, we mark the page as "potential copyvio" in PageTriage. The "potential copyvio" flag shows in PageTriage regardless of (1) whether that revision was marked as a false positive, or (2) whether further revisions fixed the problematic copy.

Therefore, I propose to use this query to obtain the first revision that appears for each unique page in the CopyPatrol database, regardless of its status:

SELECT s1.diff, s1.page_title
FROM copyright_diffs s1
JOIN
  ( SELECT page_title, MIN(diff) AS diff FROM copyright_diffs WHERE lang = 'en' GROUP BY page_title )
AS s2 ON s1.page_title = s2.page_title AND s1.diff = s2.diff;

This yields 71,125 page titles from English Wikipedia along with the first revision ID that appeared in the CopyPatrol DB. The output looks like:

| 861429910 | François_Letellier                 |
| 861430995 | Royal_Hong_Kong_Police_Association |
| 861432514 | NRL_salary_cap                     |
| 861437321 | Peppy_Hare                         |
| 861441273 | Drake_(musician)                   |

The backfill script will iterate over each row, using the following logic (a sketch of these checks follows the list):

  1. Check if the revision ID exists in the MediaWiki DB
  2. Check if the page ID is in the PageTriage queue
  3. As a system user ("PageTriage Copyvio backfill user"), call the API to add the copyvio data to PageTriage for the page
    1. Assumption: we should _not_ create log entries when backfilling, so that we don't flood the logs. Does that sound correct?
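
A sketch of checks 1 and 2, assuming $revId comes from the query above; the direct pagetriage_page lookup is my assumption about how "is the page in the queue" would be implemented:

// Sketch of the per-row checks. Returns false if the revision is missing or
// the page is not in the PageTriage queue.
use MediaWiki\MediaWikiServices;

function shouldBackfillRevision( $revId ) {
	$revisionStore = MediaWikiServices::getInstance()->getRevisionStore();
	$revision = $revisionStore->getRevisionById( $revId );
	if ( $revision === null ) {
		return false; // 1. revision ID does not exist in the MediaWiki DB
	}
	$dbr = wfGetDB( DB_REPLICA );
	return (bool)$dbr->selectField(
		'pagetriage_page',
		'ptrp_page_id',
		[ 'ptrp_page_id' => $revision->getPageId() ]
	); // 2. only backfill pages that are in the PageTriage queue
}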

@kostajh - We probably want to skip cases where the revision in question has already been "deleted" (which really means hidden), as the reviewer won't be able to see it (unless they are an admin). This can be checked with the rev_deleted field. See https://www.mediawiki.org/wiki/Bitfields_for_rev_deleted. I agree with the assumption that we should not create log entries for this script run, if possible.

Ugh, that documentation page is actually terrible. (It's basically just a bunch of 10-year-old discussions.) See https://www.mediawiki.org/wiki/Manual:Revision_table#rev_deleted instead.

@kaldari thanks for the heads up. So, something like this?

use MediaWiki\MediaWikiServices;

$revisionStore = MediaWikiServices::getInstance()->getRevisionStore();
$revision = $revisionStore->getRevisionById( $params['revid'] );
// Skip rows where the revision no longer exists, or where any part of it has
// been deleted/suppressed (rev_deleted != 0).
if ( $revision === null || $revision->getVisibility() !== 0 ) {
    // skip row.
}

@kostajh -- your mention of "obtain the first revision" makes me feel like this would not actually be mimicking the logic of our implementation, because our implementation looks at all the revisions of a page, and flags the page if any are found in CopyPatrol.

Setting aside the technical implementation, what is the net effect of your approach? Which pages/namespaces/ages will be covered with what business rules?

your mention of "obtain the first revision" makes me feel like this would not actually be mimicking the logic of our implementation, because our implementation looks at all the revisions of a page, and flags the page if any are found in CopyPatrol.

@MMiller_WMF I'm not sure I see the difference here. Per T201073: Copyvio: Make Eranbot call PageTriage with copyvio info, Eranbot will check each diff to see if it's possible copyvio, then store that info in the CopyPatrol DB and send it to the PageTriage Copyvio API, where we can then highlight the "Potential copyvio" flag in the PageTriage list UI.

With this logic, once T201073 is deployed to production, if "Draft:Some_awesome_page" with rev ID 123 has copyvio, but then rev 124 and rev 125 do not, the CopyPatrol DB will contain rev 123 and PageTriage will show "Draft:Some_awesome_page" with "potential copyvio".

Accordingly, the backfill script does not need to look at all the diffs for each page title in the CopyPatrol DB. It doesn't especially matter since we are not doing anything (at the moment) with the revision ID in the PageTriage Copyvio API, but we could use the most recent revision for each page title, rather than the oldest one, if you'd prefer.

Setting aside the technical implementation, what is the net effect of your approach? Which pages/namespaces/ages will be covered with what business rules?

After backfilling the data, I'm not sure of the total number of pages that will have "Potential copyvio" associated with them, but it will be less than 71,125 – that's the number of distinct page titles for English Wikipedia records in the CopyPatrol DB. It will include draft/user/article namespaces going back to the beginning of the CopyPatrol DB records.

We could significantly reduce the number of pages we backfill data for by excluding:

  • User-namespace pages
  • Already-reviewed pages
  • Declined AfC pages

Please let me know if you have questions or thoughts on what we backfill. I can keep working on the script while we hash out the particulars.

if ( $revision === null || $revision->getVisibility() !== 0 ) {
    // skip row.
}

Yeah, that looks right to me.

Change 463360 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[mediawiki/extensions/PageTriage@master] WIP: Refactor PageTriage Copyvio class, add backfill copyvio script

https://gerrit.wikimedia.org/r/463360

@kostajh and I just discussed this at length, and it caused us to take a step back and think about the pros and cons to backfilling in general. We now think that we should actually not backfill at all, unless it turns out that once this feature is in production, the reviewing community would prefer to backfill.

The main reason is that the majority of cases that would be backfilled have probably been resolved for a long time, either via CopyPatrol activities or other editor activities that search for and remove copyright violations. Reviewers would have little reason to revisit those cases, and they would probably just clutter up the feed. By not backfilling, the potential violations flagged in the feed will be just the new ones, making it easier to focus on new work.

We also discussed the business rule that causes potential copyvio flags not to be removed from the New Pages Feed once resolved at CopyPatrol. While we see benefits to that rule as well as downsides, we're going to keep it in place for now, and potentially reconsider depending on community usage and feedback.

kostajh removed kostajh as the assignee of this task.

We can re-open this later if necessary.

Change 463360 abandoned by Kosta Harlan:
WIP: Refactor PageTriage Copyvio class, add backfill copyvio script

Reason:
Task is declined, for now.

https://gerrit.wikimedia.org/r/463360