Copyvio: make links to CopyPatrol work for moved pages
Closed, ResolvedPublic

Description

@SBisson @Catrope @Etonkovidova -- I am doing some testing of copyvio detection in production. What I'm doing is looking at all the pages in New Pages Feed flagged with the issue, then making sure they are in CopyPatrol and vice versa. It mostly checks out, except for the two pages listed here:

image.png (452×1 px, 141 KB)

These two do not have entries in CopyPatrol. Why do they have flags in the New Pages Feed?

Event Timeline

@eranroz Any idea what can cause these pages to be in PageTriage but not in CopyPatrol?

As @Etonkovidova discovered, these pages were renamed. They're in CopyPatrol under their old names, not their new names.

To fix the links to CopyPatrol on Special:NewPagesFeed, I propose that we change the format of the copyvio tag (in pagetriage_page_tags) from revid to revid|pagetitle, e.g. 864370858|Meg_Mullach. That would give us the name the page had at the time it was tagged (which is the one CopyPatrol expects), and it'd also be backwards-compatible (older rows would only have the revid).
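A minimal sketch of how a consumer of the tag could stay backwards-compatible under the proposed revid|pagetitle format. It is written in Python for illustration only; `parse_copyvio_tag` and `CopyvioTag` are hypothetical names, not existing PageTriage code.

```python
from typing import NamedTuple, Optional


class CopyvioTag(NamedTuple):
    rev_id: int
    title: Optional[str]  # None for older rows that only stored the rev ID


def parse_copyvio_tag(value: str) -> CopyvioTag:
    """Parse a copyvio tag value in either the old or the proposed format."""
    # Split on the first "|" only, so everything after it is kept as the title.
    rev_part, sep, title = value.partition("|")
    return CopyvioTag(int(rev_part), title if sep else None)


print(parse_copyvio_tag("864370858|Meg_Mullach"))  # proposed format
print(parse_copyvio_tag("864370858"))              # older row, rev ID only
```

Older rows simply come back with no title, so the link builder could fall back to the page's current title for them.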

I do think we should do something about this: whenever a user clicks the "Copyvio" link in the New Pages Feed, they should land on an entry in CopyPatrol, not an empty screen.

Moving to "Upcoming Work".

To fix the links to CopyPatrol on Special:NewPagesFeed, I propose that we change the format of the copyvio tag (in pagetriage_page_tags) from revid to revid|pagetitle, e.g. 864370858|Meg_Mullach. That would give us the name the page had at the time it was tagged (which is the one CopyPatrol expects), and it'd also be backwards-compatible (older rows would only have the revid).

I think that could work. I'm not thrilled about putting title and rev ID in the same field but it's probably OK for this. In this approach we would then modify copyvio_link_url to not include drafts as a query param? Would this approach handle the (probably rare) scenario of someone creating Draft: Foo with copyvio, then adding a new revision with more copyvio and a title change to Draft: Bar?

A possible alternative would be to add searchCriteria=rev and revId= parameters around here, so the Meg Mullach URL on NewPagesFeed would change from:

  • https://tools.wmflabs.org/copypatrol/en?filter=all&searchCriteria=page_exact&searchText=Meg%20Mullach&drafts=1

to

  • https://tools.wmflabs.org/copypatrol/en?filter=all&searchCriteria=diff&diffId=864370858

The advantage is we have the latest rev ID with copyvio in the MediaWiki DB. The disadvantage is that the Copyvio URL then only shows the most recent copyvio report for a given page (here's an example where there are several), but maybe that's not a big problem.

MMiller_WMF renamed this task from Copyvio: a couple articles missing from CopyPatrol to Copyvio: make links to CopyPatrol work for moved pages.Oct 24 2018, 6:06 PM

In this approach we would then modify copyvio_link_url to not include drafts as a query param? Would this approach handle the (probably rare) scenario of someone creating Draft: Foo with copyvio, then adding a new revision with more copyvio and a title change to Draft: Bar?

Why would we do that, and why would it help? As I see it, the drafts param is just a namespace indicator: if you want CopyPatrol to show you edits for [[Foo]] you have to set searchText=Foo&drafts=0, to show [[Draft:Foo]] you need searchText=Foo&drafts=1, and if you omit the drafts param it'll show you both. I don't think the rename issue is fundamentally related to this, except that in the particular case where we noticed this bug, the move was from Foo to Draft:Foo.
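To make the three cases concrete, here is a small sketch that builds the CopyPatrol search URL with or without the drafts param. The helper name `copypatrol_title_url` is hypothetical; the query parameters are the ones that appear in the URLs quoted in this task.

```python
from typing import Optional
from urllib.parse import quote, urlencode

COPYPATROL_BASE = "https://tools.wmflabs.org/copypatrol/en"


def copypatrol_title_url(title: str, drafts: Optional[bool] = None) -> str:
    """Build an exact-title search URL; drafts=None omits the param (both namespaces)."""
    params = {"filter": "all", "searchCriteria": "page_exact", "searchText": title}
    if drafts is not None:
        params["drafts"] = "1" if drafts else "0"
    # quote_via=quote encodes spaces as %20 rather than "+".
    return COPYPATROL_BASE + "?" + urlencode(params, quote_via=quote)


print(copypatrol_title_url("Foo", drafts=False))  # [[Foo]] only
print(copypatrol_title_url("Foo", drafts=True))   # [[Draft:Foo]] only
print(copypatrol_title_url("Foo"))                # both
```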

The Growth team discussed this today. We decided to do the following:

  1. Change the copyvio_link_url variable to include the revision ID. So, for example this URL in PageTriage UI (https://tools.wmflabs.org/copypatrol/en?filter=all&searchCriteria=page_exact&searchText=The%20Cross%20(nightclub)&drafts=1) would change to https://tools.wmflabs.org/copypatrol/en?filter=all&searchCriteria=page_exact&searchText=The%20Cross%20(nightclub)&drafts=1&revision=865193757
  2. In the CopyPatrol repository, we will want to modify [getPlagiarismRecords()](https://github.com/wikimedia/CopyPatrol/blob/377912a5e997c207e6f6b31083302de340f08386/src/Dao/PlagiabotDao.php#L77) to check for revision in the query params, and include the match for the revision in the query in addition to any page_exact matches found for the title.
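The second step could look roughly like the sketch below, written in Python with an in-memory SQLite table rather than CopyPatrol's actual PHP and database. Only the copyright_diffs table and its diff column come from this task; the `page_title` and `page_ns` columns and the `get_records` function are assumptions made for illustration.

```python
import sqlite3
from typing import Optional

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified schema: `diff` holds the revision ID;
# `page_title` and `page_ns` are assumed column names.
conn.execute("CREATE TABLE copyright_diffs (diff INTEGER, page_title TEXT, page_ns INTEGER)")
conn.execute("INSERT INTO copyright_diffs VALUES (864370858, 'Meg_Mullach', 0)")


def get_records(title: str, ns: int, revision: Optional[int] = None) -> list:
    """Exact title match, plus the row for `revision` when one is supplied."""
    sql = ("SELECT diff, page_title, page_ns FROM copyright_diffs "
           "WHERE (page_title = ? AND page_ns = ?)")
    args = [title.replace(" ", "_"), ns]
    if revision is not None:
        sql += " OR diff = ?"
        args.append(revision)
    return conn.execute(sql, args).fetchall()


# The page was recorded as an article (ns 0) and later moved to Draft: (ns 118 on enwiki).
print(get_records("Meg Mullach", 118))             # title-only search in the new namespace
print(get_records("Meg Mullach", 118, 864370858))  # same search plus the revision match
```

The title-only search misses the record after the move, while the revision match still finds it.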

Change 471324 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[mediawiki/extensions/PageTriage@master] Add revision ID to CopyPatrol query string

https://gerrit.wikimedia.org/r/471324

Change 471324 merged by jenkins-bot:
[mediawiki/extensions/PageTriage@master] Add revision ID to CopyPatrol query string

https://gerrit.wikimedia.org/r/471324

@MusikAnimal @Niharika I can't add you as reviewers in GitHub so I'm adding you here, and sent a ping via IRC as well.

I'll take a look. Thanks for the heads up!

@Etonkovidova I pushed this change to production, you can see for example https://tools.wmflabs.org/copypatrol/en?filter=all&drafts=1&searchText=Meg+Mullach&searchCriteria=page&revision=864370858.

The modified links on enwiki's Special:NewPagesFeed won't be in place until Thursday, but if you don't want to wait until then to test, you can look up the copypatrol tag (revision ID) in the DB for any item in the NewPagesFeed and append it to the URL query with &revision={revId}.
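For manual testing, appending the parameter can be scripted. This is a sketch only: `with_revision` is a hypothetical helper, and the revision ID still has to be looked up in the DB as described.

```python
from urllib.parse import parse_qsl, quote, urlencode, urlsplit, urlunsplit


def with_revision(url: str, rev_id: int) -> str:
    """Return `url` with a single revision=<rev_id> query parameter appended."""
    parts = urlsplit(url)
    # Drop any existing revision param so the result carries exactly one.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != "revision"]
    query.append(("revision", str(rev_id)))
    return urlunsplit(parts._replace(query=urlencode(query, quote_via=quote)))


print(with_revision(
    "https://tools.wmflabs.org/copypatrol/en?filter=all&searchCriteria=page_exact"
    "&searchText=Meg%20Mullach&drafts=1",
    864370858,
))
```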

@kostajh I tested it on testwiki with three cases: (1) renaming an article, (2) moving Draft to Article, (3) moving Article to Draft (the problem described in this task).

(1) and (2) - work fine; Special:NewPagesFeed updates promptly, all info (ranking) is saved; the Copyvio link displays CopyPatrol with info as before the move.

(3) moving Article to Draft presents an issue:

Result: CopyPatrol displays some information.

Screen Shot 2018-11-07 at 4.39.27 PM.png (338×1 px, 85 KB)

Result: CopyPatrol page is blank. If I remove the check mark from 'Drafts only' and click 'Submit' - the correct result appears.

@Etonkovidova that example is not working because revision ID of 361919 does not exist in the copyright_diffs table that CopyPatrol uses. Where did that revision ID come from, is it something you posted via the API?

@kostajh
(1) CopyPatrol finds 361919 with https://tools.wmflabs.org/copypatrol/en?filter=all&searchCriteria=page_exact&searchText=Luna%20Park%20Sydney&drafts=0&revision=361919
(2) where copyright_diffs table exists?
(3) I was looking at the article in the NPP feed on testwiki :

mysql:research@s1-analytics-slave [testwiki]> select * from page where page_id=100732\G
*************************** 1. row ***************************
              page_id: 100732
       page_namespace: 0
           page_title: Luna_Park_Sydney

On testwiki, tag ID 19 is copyvio, so:

 [testwiki]> select * from pagetriage_page_tags where ptrpt_page_id=100732 and ptrpt_tag_id=19;
+---------------+--------------+-------------+
| ptrpt_page_id | ptrpt_tag_id | ptrpt_value |
+---------------+--------------+-------------+
|        100732 |           19 | 361919      |
+---------------+--------------+-------------+
1 row in set (0.00 sec)

@kostajh - yes, https://tools.wmflabs.org/copypatrol/en?filter=all&searchCriteria=page_exact&searchText=Meg%20Mullach&drafts=1&revision=864370858 displays the result now.

There are two cases that still need to be sorted out:

(1) On testwiki - Los Angeles Lakers B is marked as Copyvio:

Screen Shot 2018-11-13 at 4.38.10 PM.png (271×1 px, 71 KB)

It is in the 'Potential copyright violation log':
Screen Shot 2018-11-13 at 4.41.34 PM.png (374×971 px, 65 KB)

The copyvio link leads to a blank CopyPatrol page: https://tools.wmflabs.org/copypatrol/en?filter=all&searchText=Los+Angeles+Lakers+B&searchCriteria=page
Screen Shot 2018-11-13 at 4.45.07 PM.png (549×1 px, 103 KB)

(2) Checking CopyPatrol handling moving Gothic architecture -> Draft:Gothic architecture

Screen Shot 2018-11-13 at 5.01.27 PM.png (550×1 px, 129 KB)

  • moving Gothic architecture -> Draft:Gothic architecture: the article is moved and displayed with copyvio on AfC on NPP.

Clicking on the copyvio link displays a blank CopyPatrol page with 'Drafts only' checked:
https://tools.wmflabs.org/copypatrol/en?filter=all&searchCriteria=page_exact&searchText=Gothic%20architecture&drafts=1&revision=361925

Screen Shot 2018-11-13 at 5.09.10 PM.png (370×1 px, 100 KB)

If I un-check the 'Drafts only' check box and re-load the page, all info is displayed correctly. Un-checking the box and re-loading the page is not super intuitive - that's all.

@Etonkovidova what I was trying to say earlier is that this can't really be tested on testwiki since the bot is only posting results to enwiki.

Earlier you asked, (2) where copyright_diffs table exists? -- the answer is on CopyPatrol, outside of enwiki.

The reason you are getting blank pages on CopyPatrol is that the revision ID you're passing doesn't exist in the copyright_diffs table in CopyPatrol. In production, it should always be there.

The order of operations is:

  1. Eranbot listens to recentchanges feed and analyzes revisions by passing text to iThenticate
  2. Copyvio is found, so Eranbot stores some data in the copyright_diffs table on Toolforge. The diff column on this table contains the revision ID. The CopyPatrol web UI shows results from this database table.
  3. Eranbot then posts data to the PageTriage Copyvio API endpoint, and on enwiki, we store the revision ID from the previous step as the "Copyvio tag" value.
  4. In the PageTriage UI, we generate links for items flagged as having copyvio. Those links contain a query parameter ?revision={revisionId}, where {revisionId} is the value of the copyvio tag, which corresponds to the diff column on the copyright_diffs table in CopyPatrol.
  5. When you click the link from PageTriage, you're kicked over to CopyPatrol, which runs a SQL query on copyright_diffs. It searches for exact match, draft or not draft, etc. Let's say it finds 0 results. If the revision parameter is set in the URL, it will _also_ query for diff = {revision}. In production that returns 1 result, because there will always be a row whose diff column matches the revision ID.
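The steps above can be walked through with a toy in-memory model. Everything here is a stand-in written for illustration: the function names, the dict layouts, and page ID 1 are assumptions, and none of it is the real Eranbot, PageTriage, or CopyPatrol code.

```python
copyright_diffs = []  # CopyPatrol side: rows with "diff" (revision ID) and title at tagging time
copyvio_tags = {}     # PageTriage side: page_id -> revision ID stored as the copyvio tag
page_titles = {}      # wiki side: page_id -> current title


def report_copyvio(page_id, title, rev_id):
    """Steps 1-3: the bot records the diff, then posts the revision ID to PageTriage."""
    page_titles[page_id] = title
    copyright_diffs.append({"diff": rev_id, "title": title})
    copyvio_tags[page_id] = rev_id


def copypatrol_lookup(title, revision=None):
    """Step 5: title search, plus the diff match when a revision is supplied."""
    return [row for row in copyright_diffs
            if row["title"] == title
            or (revision is not None and row["diff"] == revision)]


report_copyvio(1, "Meg Mullach", 864370858)
page_titles[1] = "Draft:Meg Mullach"  # the page is moved after being tagged

# Step 4: the feed links by current title and by the stored tag value.
current_title, tag = page_titles[1], copyvio_tags[1]
print(copypatrol_lookup(current_title))       # title alone no longer matches
print(copypatrol_lookup(current_title, tag))  # the revision still finds the record
```

On testwiki the first step never happens, so there is no row for the revision lookup to find, which is why the links there come up blank.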

Thanks, @kostajh - based on

that this can't really be tested on testwiki since the bot is only posting results to enwiki

I am closing as Resolved.