
[4 hours] Can we figure out when redirects become complete articles and the previous reviewed state?
Closed, ResolvedPublicSpike

Event Timeline

Niharika renamed this task from [Investigation - 8 hours] Can we figure out when redirects become complete articles? to [Investigation - 4 hours] Can we figure out when redirects become complete articles?. Jun 6 2019, 5:35 PM
Niharika created this task.
Niharika moved this task from Untriaged to Estimated on the Community-Tech board.
Niharika renamed this task from [Investigation - 4 hours] Can we figure out when redirects become complete articles? to [4 hours] Can we figure out when redirects become complete articles?. Jun 6 2019, 5:41 PM
Niharika added a project: Spike.
Restricted Application changed the subtype of this task from "Task" to "Spike". Jun 6 2019, 5:41 PM
MusikAnimal added a subscriber: MusikAnimal. Edited Jun 7 2019, 7:55 PM

If you just need to find out if/when a given edit changed an article to a redirect, that is done with https://github.com/wikimedia/mediawiki-extensions-PageTriage/blob/master/includes/Hooks.php#L105-L106. Of course, this is currently broken (T223828). The fix is https://gerrit.wikimedia.org/r/c/mediawiki/extensions/PageTriage/+/514108, which is still blocked by the flaky test at https://gerrit.wikimedia.org/r/c/mediawiki/core/+/514105.

If you need to find out if an article was ever a redirect, that might be tricky. One idea is to query for older revisions that have the mw-removed-redirect tag associated with them. This probably isn't very efficient, and it would only work for revisions dating back to around December 2017 when these tags were first introduced in MediaWiki.
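As a rough illustration of the tag-based idea, here is a minimal sketch that only builds the query URL. It assumes the Action API's `prop=revisions` module with its `rvtag` filter, and it uses the English Wikipedia endpoint purely as an example. The function name `removed_redirect_query` is made up for this sketch, and no request is actually sent.

```python
from urllib.parse import urlencode

# Assumption: English Wikipedia is used only as an example endpoint.
API_URL = "https://en.wikipedia.org/w/api.php"


def removed_redirect_query(title: str, limit: int = 50) -> str:
    """Build a URL listing revisions of `title` tagged mw-removed-redirect.

    Hypothetical helper: it only assembles the query string.
    """
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids|timestamp|tags",
        "rvtag": "mw-removed-redirect",  # only revisions carrying this tag
        "rvlimit": limit,
        "format": "json",
    }
    return f"{API_URL}?{urlencode(params)}"


url = removed_redirect_query("Foo")
```

An empty result would not prove the page was never a redirect, since the tag only exists on revisions made after it was introduced.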

JTannerWMF moved this task from Inbox to External on the Growth-Team board. Jun 11 2019, 6:02 PM

Looks like Community-Tech is working on this, so the Growth team will put this in External.

MusikAnimal renamed this task from [4 hours] Can we figure out when redirects become complete articles? to [4 hours] Can we figure out when redirects become complete articles and the previous reviewed state?. Jun 12 2019, 12:03 AM
MusikAnimal claimed this task. Edited Jun 12 2019, 12:21 AM

We also need to make sure the previous state was reviewed. So consider this scenario:

  • 5 years ago the redirect [[Foo]] was created and reviewed (so it's no longer in PageTriage)
  • Vandal turns [[Foo]] into an article with multiple edits
  • [[Foo]] is added to the queue as an unreviewed article
  • Patroller reverts vandal and it goes back to the redirect state

In this case we want [[Foo]] to be removed from the queue and/or automatically marked as reviewed, since it was reverted back to a previously reviewed version. I don't think the tactics in T225239#5243761 are enough. Human reviews/unreviews are logged (as in the logging table), but no such log entry is created when the PageTriage system adds a page to the queue.

We might consider creating a new "tag" that is the SHA1 hash of the content of a reviewed revision. Let's call it reviewed_sha. Revisiting the above example, when the vandal first turns the redirect [[Foo]] into an article, we'd create the reviewed_sha tag with the SHA1 of the previous edit (the version that was reviewed). Then when the patroller reverts it back, we compare the SHA and if it matches we know we're back to the reviewed version and can safely mark it as reviewed. This is a little scary because we'd need to do these comparisons with every edit. If it's in a deferred update I think it should be okay performance-wise.
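To make the idea concrete, here is a minimal in-memory sketch, not actual PageTriage code. `ReviewTracker`, `content_sha1`, and `is_redirect` are hypothetical names, the redirect check is a crude prefix test, and the sketch assumes the revision just before a redirect-to-article conversion was the reviewed one.

```python
import hashlib


def content_sha1(wikitext: str) -> str:
    """Hex SHA-1 of a revision's wikitext."""
    return hashlib.sha1(wikitext.encode("utf-8")).hexdigest()


def is_redirect(wikitext: str) -> bool:
    """Crude check for illustration only; real redirect parsing is more involved."""
    return wikitext.lstrip().lower().startswith("#redirect")


class ReviewTracker:
    """Hypothetical stand-in for the review queue plus the reviewed_sha tag."""

    def __init__(self):
        self.reviewed_sha = {}  # page title -> SHA-1 of the last reviewed content
        self.queue = set()      # titles waiting for review

    def mark_reviewed(self, title, wikitext):
        """Manual review: remember the reviewed content and dequeue the page."""
        self.reviewed_sha[title] = content_sha1(wikitext)
        self.queue.discard(title)

    def on_edit(self, title, old_wikitext, new_wikitext):
        """Run on every edit (in the proposal, from a deferred update)."""
        if self.reviewed_sha.get(title) == content_sha1(new_wikitext):
            # Back at the reviewed version: auto-review.
            self.queue.discard(title)
        elif is_redirect(old_wikitext) and not is_redirect(new_wikitext):
            # Redirect became an article. If no tag exists yet, assume the
            # previous revision was the reviewed one and record its SHA.
            self.reviewed_sha.setdefault(title, content_sha1(old_wikitext))
            self.queue.add(title)


tracker = ReviewTracker()
redirect_text = "#REDIRECT [[Bar]]"
article_text = "Foo is a thing."
tracker.on_edit("Foo", redirect_text, article_text)  # vandal converts the redirect
queued_after_vandalism = "Foo" in tracker.queue
tracker.on_edit("Foo", article_text, redirect_text)  # patroller reverts
queued_after_revert = "Foo" in tracker.queue
```

The per-edit cost here is one hash and one lookup, which is why running it in a deferred update seems plausible.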

The same applies for manual reviews on newly created pages:

  • [[Foo]] is created as a redirect
  • A patroller marks it as reviewed
  • Create the reviewed_sha tag
  • Vandal turns it into an article
  • It's added back to the queue
  • Patroller reverts back to previously reviewed state
  • We compare the SHA, see that it matches, and auto-review the page.

This would be my approach. Let me know what you think!

No comment on the technical side as I know my own limitations but I'm not seeing any inherent process issue. The one piece I would encourage you to think about is how autopatrolled redirects are handled.

@Barkeep49 I imagine it would be

  • An autopatrolled user creates a redirect, and the reviewed_sha tag is created automatically
  • Vandal turns it into an article...

Something to note is that it shouldn't require that the user who reverts the page be a "patroller" in the user rights sense. Anyone should be able to do this.
Also, what will happen to the currently patrolled redirects? Will they all have reviewed_sha tags populated?

I also imagined that, just wanted to point autopatrol out in case it played out differently technically.

As for existing ones, ideally they get grandfathered in with that tag. If that presents challenges, well, going forward is better than nothing.

We should be able to use a maintenance script to add the tag to existing redirects in the queue. For old ones that are no longer in the Page Curation database, the tag for the redirect would get added once it gets changed into an article (and hence back in the Page Curation database).

That reminds me, we may not need to use SHAs at all. We can fetch the target of the redirect and use that. This is to negate trivial differences, say:

  • Redirect is turned into an article
  • Patroller restores the redirect but also adds a {{R from misspelling}} template or what have you, in the same edit
  • Redirect is still automatically marked as reviewed

I'm not 100% certain this is desired, but it may be helpful in that it still prevents patrollers from having to re-review a page that otherwise shouldn't be in the queue. E.g., if an editor added {{R from misspelling}} to an existing redirect, it does not get added to the queue (because it's still a redirect).
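A rough sketch of the target-based comparison follows. The regex is a simplification that only handles the English `#REDIRECT [[Target]]` form, and `redirect_target` and `same_target` are hypothetical helper names. It looks only at the target and ignores everything else on the page.

```python
import re
from typing import Optional

# Simplified pattern: English "#REDIRECT" keyword, target up to "]", "|" or "#".
REDIRECT_RE = re.compile(r"^\s*#REDIRECT\s*:?\s*\[\[([^\]|#]+)", re.IGNORECASE)


def redirect_target(wikitext: str) -> Optional[str]:
    """Return the redirect target title, or None if the text is not a redirect."""
    match = REDIRECT_RE.match(wikitext)
    return match.group(1).strip() if match else None


def same_target(reviewed_wikitext: str, current_wikitext: str) -> bool:
    """True when both revisions are redirects pointing at the same title."""
    reviewed = redirect_target(reviewed_wikitext)
    return reviewed is not None and reviewed == redirect_target(current_wikitext)


reviewed = "#REDIRECT [[Bar]]"
restored = "#REDIRECT [[Bar]]\n\n{{R from misspelling}}"
matches = same_target(reviewed, restored)
```

With this comparison, the restored redirect with an added template still counts as the reviewed version, because only the target is compared.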

I think that would be great. @DannyS712 has done some work around bot patrolling of these easy to patrol redirects and this seems like it would complement those efforts.

@MusikAnimal but that would require keeping track of all possible templates that can be safely ignored. Imagine the following:

#REDIRECT [[Foo]]

{{R from misspelling}}

Insert BLP name here is an idiot

The fact that the original redirect to [[Foo]] was okay does not mean that this new page, which also includes other content, should be automatically patrolled.

Yes, I had the same thought, but my point is that going by the target of the redirect would match the current behaviour of what makes a redirect->article appear in the feed in the first place. I suspect any sort of vandalism would naturally be picked up by recent changes patrollers.

Anyway, we don't need to go by the target; it was merely an idea. I just know that partial reverts, etc., are not uncommon, so you might still see redundant pages in the queue waiting for review when they are nearly identical to the previously reviewed version. Any change, even one as minute as whitespace, will mean the revision has a different SHA.
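A tiny illustration of that sensitivity, using Python's `hashlib` on example wikitext (the helper name `content_sha1` is only for this sketch):

```python
import hashlib


def content_sha1(wikitext: str) -> str:
    """Hex SHA-1 of the wikitext, encoded as UTF-8."""
    return hashlib.sha1(wikitext.encode("utf-8")).hexdigest()


reviewed = "#REDIRECT [[Bar]]"
restored = "#REDIRECT [[Bar]]\n"  # identical except for a trailing newline

# The two revisions differ by one whitespace character, so the hashes differ.
hashes_match = content_sha1(reviewed) == content_sha1(restored)
```

Here `hashes_match` is `False`, so a revert that differs only in trailing whitespace would not be recognised as the reviewed version.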

DannyS712 moved this task from Unsorted to Others on the User-DannyS712 board.
MusikAnimal closed this task as Resolved. Jul 3 2019, 5:48 PM

We have decided that storing the SHA is the best route to go, as outlined at T225239#5251954.