
Implement "last known good version" API
Open, Low, Public

Description

This has been a requirement from at least one internal partner. Basically, they'd like to be able to request a non-vandalized page. @dr0ptp4kt suggested that I solicit comments on the state of the algorithm from a few of you who have thought about this.

Background information

This has been a requirement from at least one internal partner.

What

They'd like to be able to request a non-vandalized page.

How

Description

  • If the article's edit protection level was lowered or expired in the last 3 days, pull, format, and update the information from the latest revision saved before the protection was lowered or expired. This makes it more likely that we're using the most accurate revision that isn't the subject of vandalism.
  • If the source article is protected and pending approval of changes, use only the latest revision that has been approved by a reviewer.
  • Always pull information using the most stable revision in the last 2 days, meaning the revision that was live for the longest amount of time in the last 2 days rather than simply the newest. This revision has the highest probability of being accurate and free of vandalism, since it is the one that was allowed to sit and be readable by the public for the longest time in that 2-day period.
  • Fall back to an earlier revision (or further back still) if the revision chosen for the update is an edit that reverts a previous edit via rollback, or an edit made immediately before a rollback, especially if the rollback was made immediately afterward by a respected bot such as ClueBot NG.

  • Newly created articles should have a cooling-off period to ensure that the community has had a chance to review them (7 days). Before that period, we can act as though the article did not exist.
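A minimal sketch of the "live for the longest amount of time in the last 2 days" rule, in Python. The input shape (a list of `(revid, saved_at)` tuples sorted oldest first) and the function name `longest_live_revision` are assumptions made for illustration; they are not part of any existing API.

```python
from datetime import datetime, timedelta, timezone


def longest_live_revision(revisions, now, window=timedelta(days=2)):
    """Return the revid that was the live version longest within the window.

    `revisions` is a list of (revid, saved_at) tuples sorted oldest first.
    This shape is an assumption for the sketch, not an API format.
    """
    window_start = now - window
    best_id, best_live = None, timedelta(0)
    for i, (revid, saved_at) in enumerate(revisions):
        # A revision stays live until the next one is saved, or until `now`.
        replaced_at = revisions[i + 1][1] if i + 1 < len(revisions) else now
        # Count only the part of its lifetime that falls inside the window.
        live = min(replaced_at, now) - max(saved_at, window_start)
        if live > best_live:
            best_id, best_live = revid, live
    return best_id
```

Revisions that were replaced before the window started get a negative live time and are never selected. If nothing was edited inside the window, the current revision was live for the whole window and is returned.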

Pseudo code - description of points above as algorithm

for each article edit in target set of articles:
  if protected or was-protected in last 3 days: ‘do not update cache’ - return
  else if is reverting another edit and is a bot edit: ‘do not update cache’
  else if article was created within the last 7 days: ‘update after 7 days with last stable’
  else (regular edit to an old article): ‘update if has not been reverted in prior 2 days’
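The pseudocode above could be sketched in Python roughly as follows. The `EditContext` fields and the returned action names are hypothetical; they stand in for whatever a real implementation would derive from the protection log and revision history. The last branch reads "update if has not been reverted in prior 2 days" as "skip the update if the edit was reverted", which is one possible interpretation.

```python
from dataclasses import dataclass


@dataclass
class EditContext:
    # All fields are hypothetical inputs, computed elsewhere.
    protected_or_recently_protected: bool  # protected now or in the last 3 days
    is_bot_revert: bool                    # reverts another edit and is a bot edit
    article_age_days: float                # days since the article was created
    reverted_in_prior_2_days: bool         # the edit was reverted within 2 days


def cache_action(edit: EditContext) -> str:
    """Map one article edit to a cache action, following the branch order above."""
    if edit.protected_or_recently_protected:
        return "skip"    # 'do not update cache'
    if edit.is_bot_revert:
        return "skip"    # 'do not update cache'
    if edit.article_age_days < 7:
        return "defer"   # 'update after 7 days with last stable'
    if edit.reverted_in_prior_2_days:
        return "skip"    # regular edit that did not survive
    return "update"      # regular edit to an old article that stuck
```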

References and other notes

How to get the last stable revision (referenced above)

Based on https://meta.wikimedia.org/wiki/Research:Revert, process the most recent 3 days of activity on the article.

if there’s no activity in the last 3 days:
  return the most recent revision
else:
  return the most recent revision that is not: reverted or a reverting bot edit
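As a rough Python sketch of the selection above: Research:Revert describes identity reverts, where a revision restores the exact content of an earlier one, and these can be detected by comparing sha1 checksums. The input shape (dicts with `revid`, `user` and `sha1`, oldest first), the `is_bot` callback, the `current_revid` fallback and the `radius` default of 15 are assumptions made for this illustration.

```python
def find_reverts(revisions, radius=15):
    """Return (reverted_ids, reverting_ids) using sha1 identity matching."""
    reverted, reverting = set(), set()
    for i, rev in enumerate(revisions):
        # Look back for an earlier revision with identical content,
        # skipping the direct parent (that would be a null edit).
        for j in range(i - 2, max(i - radius, 0) - 1, -1):
            if revisions[j]["sha1"] == rev["sha1"]:
                reverting.add(rev["revid"])
                # Everything between the match and this edit was undone.
                reverted.update(r["revid"] for r in revisions[j + 1:i])
                break
    return reverted, reverting


def last_stable_revision(recent, current_revid, is_bot, radius=15):
    """Pick the most recent revision that is neither reverted nor a bot revert.

    `recent` holds the last 3 days of revisions, oldest first.
    """
    if not recent:
        return current_revid  # no activity: the current revision is stable
    reverted, reverting = find_reverts(recent, radius)
    for rev in reversed(recent):
        if rev["revid"] in reverted:
            continue
        if rev["revid"] in reverting and is_bot(rev["user"]):
            continue
        return rev["revid"]
    return None  # nothing in the window qualified; caller must look further back
```

Note that the revisions API returns newest first by default, so the list would need reversing before being passed in. A revert that restores content from before the 3-day window is not detected unless some older revisions are included as well.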

How to get the last N days of an article (based on a date that was N days ago):

https://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Anachronism&rvlimit=100&rvprop=timestamp|user|ids|sha1&rvend=20180608000000

https://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Donald%20Trump&rvlimit=100&rvprop=timestamp|user|ids|sha1&rvend=20180608000000
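The `rvend` value in the URLs above is a timestamp from N days ago. A small Python sketch of building such a URL follows; the function name `recent_revisions_url` and its `lang` parameter are hypothetical, and `format=json` is an addition not present in the example URLs.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode


def recent_revisions_url(title, days, now=None, lang="en"):
    """Build a revisions query covering the last `days` days of an article."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvlimit": 100,
        "rvprop": "timestamp|user|ids|sha1",
        # rvend is the oldest timestamp to include (YYYYMMDDHHMMSS, UTC).
        "rvend": cutoff.strftime("%Y%m%d%H%M%S"),
        "format": "json",  # addition for machine-readable output
    }
    return f"https://{lang}.wikipedia.org/w/api.php?" + urlencode(params)
```

With `rvlimit=100`, a very active article may have more revisions in the window than one response returns, so a real client would also need to follow the API's continuation.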

Check if an edit was performed by a bot (“bot edit”)

Query the Wikipedia API for the relevant language using the username (note that it appears twice in the URL, as both aufrom and auto). This works for all bots.

https://en.wikipedia.org/w/api.php?action=query&list=allusers&aufrom=Username&auto=Username&auprop=groups

if ‘bot’ in “groups”:
  return True
else:
  return False

Example query for ClueBot NG (a bot):

https://en.wikipedia.org/w/api.php?action=query&list=allusers&aufrom=ClueBot%20NG&auto=ClueBot%20NG&auprop=groups

Example query for EpochFail (not a bot)

https://en.wikipedia.org/w/api.php?action=query&list=allusers&aufrom=EpochFail&auto=EpochFail&auprop=groups
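The bot check above could be sketched in Python as a function over the decoded JSON response. The response shape assumed here (`query.allusers` as a list of objects with `name` and `groups`) is what I would expect from the queries above when `format=json` is added, but treat it as an assumption and verify against a live response. The function name `has_bot_group` is hypothetical.

```python
def has_bot_group(response, username):
    """Return True if `username` is listed with the 'bot' group.

    `response` is the decoded JSON of an allusers query with
    aufrom=auto=username and auprop=groups (assumed shape).
    """
    users = response.get("query", {}).get("allusers", [])
    for user in users:
        if user.get("name") == username:
            return "bot" in user.get("groups", [])
    # User not found in the response: treat as not a bot.
    return False
```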

Open questions

  • Should the time periods mentioned in the algorithms be customisable per wiki language? T203127#4976367
    • Which time periods for which wikis?
  • What are reputable bots on each wiki project?

Event Timeline

I should also say this is not urgent, I've got this in a todo list and need to move to phab!

We should probably start by putting the algorithm that @Halfak gave to @DFoy on mediawiki.org somewhere (maybe with a {{draft}} tag). Just having a recipe for finding a revision that is unlikely to contain bad faith edits would go a long way for most folks who are attempting to use content from the projects in their products.

Aklapper renamed this task from Implement last known good version API to Implement "last known good version" API. Aug 30 2018, 9:00 AM

Probably an epic.

On flagrev wikis it should probably just return the stable revision. On wikis which have the ORES damaging model tuned, maybe return the last non-damaging revision? On other wikis maybe just the last revision that has stuck around for a while and/or was done by a non-fresh account?

If we are talking about the last good HTML snapshot (RESTBase tid), things get even more fun. Article HTML can change when Wikibase claims, templates, images or (in theory, although this one doesn't really happen) Lua modules get vandalized. MediaWiki does not identify different renderings of the same page revision; RESTBase at least has a stable UUID for them, but in neither case is there a way to tell what template/etc revisions have been used to render the page (for a recursive known-good check, although that would probably be unrealistic to do in real time anyway), much less to request a specific rendering of a revision in any non-blind way; the images or Wikibase claims might even be from a different wiki (and it's a much-requested feature for templates as well). Flagrevs has some pretty complex machinery to deal with this (where unreviewed template/image changes require an extra article review to appear), but the code is ancient and unmaintained; I don't think it deals with Lua or Wikibase or images transcluded from Commons. And outside MediaWiki Flagrevs is not universally supported at the moment (see e.g. T169116: Support flagged revisions in RESTBase).

Hi @dr0ptp4kt & @Tnegrin, can you clarify a bit more who needs this and what they are expecting?

Is this about flagged revisions, or about some possible different project with other tools like ORES?

We should probably start by putting the algorithm that @Halfak gave to @DFoy on mediawiki.org somewhere (maybe with a {{draft}} tag). Just having a recipe for finding a revision that is unlikely to contain bad faith edits would go a long way for most folks who are attempting to use content from the projects in their products.

Here's the algorithm I mentioned:

Retrieving a non-vandalized article revision

  • If the article's edit protection level was lowered or expired in the last 3 days, pull, format, and update the information from the latest revision saved before the protection was lowered or expired. This makes it more likely that we're using the most accurate revision that isn't the subject of vandalism.
  • If the source article is protected and pending approval of changes, use only the latest revision that has been approved by a reviewer.
  • Always pull information using the most stable revision in the last 2 days, meaning the revision that was live for the longest amount of time in the last 2 days rather than simply the newest. This revision has the highest probability of being accurate and free of vandalism, since it is the one that was allowed to sit and be readable by the public for the longest time in that 2-day period.
  • Fall back to an earlier revision (or further back still) if the revision chosen for the update is an edit that reverts a previous edit via rollback, or an edit made immediately before a rollback, especially if the rollback was made immediately afterward by a respected bot such as ClueBot NG.

  • Newly created articles should have a cooling-off period to ensure that the community has had a chance to review them (7 days). Before that period, we can act as though the article did not exist.

Pseudo code - description of points above as algorithm:

for each article edit in target set of articles:
  if protected or was-protected in last 3 days: ‘do not update cache’ - return
  else if is reverting another edit and is a bot edit: ‘do not update cache’
  else if article was created within the last 7 days: ‘update after 7 days with last stable’
  else (regular edit to an old article): ‘update if has not been reverted in prior 2 days’

How to get the last stable revision (referenced above)
Based on https://meta.wikimedia.org/wiki/Research:Revert, process the most recent 3 days of activity on the article.

if there’s no activity in the last 3 days:
  return the most recent revision
else:
  return the most recent revision that is not: reverted or a reverting bot edit

How to get the last N days of an article (based on a date that was N days ago):
https://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Anachronism&rvlimit=100&rvprop=timestamp|user|ids|sha1&rvend=20180608000000

https://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Donald%20Trump&rvlimit=100&rvprop=timestamp|user|ids|sha1&rvend=20180608000000

Check if an edit was performed by a bot (“bot edit”)
Query the Wikipedia API for the relevant language using the username (note that it appears twice in the URL, as both aufrom and auto). This works for all bots.

https://en.wikipedia.org/w/api.php?action=query&list=allusers&aufrom=Username&auto=Username&auprop=groups

if ‘bot’ in “groups”:
  return True
else:
  return False

Example query for ClueBot NG (a bot):
https://en.wikipedia.org/w/api.php?action=query&list=allusers&aufrom=ClueBot%20NG&auto=ClueBot%20NG&auprop=groups

Example query for EpochFail (not a bot)
https://en.wikipedia.org/w/api.php?action=query&list=allusers&aufrom=EpochFail&auto=EpochFail&auprop=groups

This sort of algorithm is in use in several prominent high-scale media properties, but people are recreating the work in their specific cases, as opposed to having one easy-to-call API that reflects this line of thinking. The idea was to expose something that, given a title, produces the correct revision. I strongly agree that it should also take into consideration whether that last correct revision is reportedly non-damaging (and scrub backwards further if not), as sometimes humans can't keep up with the backlog.

I don't see this as an urgent priority, although planning it as a small piece of work for a future quarter would be fine. We could then share this with the mailing lists and contacts we have at places where people are employing these sorts of algorithms in their own code.

I generally agree that exposing something like this is a good idea. We'll likely want to have different rules on a per-wiki basis. E.g. in English Wikipedia, you only need to wait 48 hours (max) to see if an edit will be reverted with high confidence. In smaller wikis with less robust patrolling, you'll likely want to wait longer to confirm an edit as good. There are also likely some improvements we can make to the algorithm to help 3rd party reuse remain fresher for such wikis. E.g. by looking at the edit history and user groups of an editor.

FWIW, there are many wikis where similar algorithms help editors prioritize flaggedrevs review. E.g. in fiwiki, there's a bot that patrols edits using ORES and a few heuristics.

Generally, this idea of "freshness" of a wiki is a good indicator that we might use to track some of the emerging wikis. If a wiki is growing quickly, but patrolling seems to be behind, then it might be worth investing in outreach/research to understand why and get the right support in place. This sounds like an interesting research project by itself, so I'll add it to my list of pitches that I send to researchers.
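To illustrate the per-wiki idea, here is a hedged Python sketch of a lookup table for waiting periods. Only the 48-hour figure for English Wikipedia comes from this discussion; the 7-day default for other wikis and all names (`WAIT_BY_WIKI`, `confirmation_wait`, `is_confirmed`) are hypothetical placeholders.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical default for wikis with less robust patrolling.
DEFAULT_WAIT = timedelta(days=7)

# Per-wiki overrides; only the enwiki value is taken from the discussion.
WAIT_BY_WIKI = {
    "enwiki": timedelta(hours=48),
}


def confirmation_wait(wiki):
    """How long an edit must survive before it is treated as good."""
    return WAIT_BY_WIKI.get(wiki, DEFAULT_WAIT)


def is_confirmed(wiki, saved_at, now, reverted):
    """An edit is confirmed if it was not reverted and has aged long enough."""
    return not reverted and (now - saved_at) >= confirmation_wait(wiki)
```

Keeping the periods in data rather than in code would let each wiki tune them without changing the algorithm itself.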

Jhernandez added a project: Epic.
Jhernandez updated the task description.

Ok, this makes a lot more sense now. Thanks all for all the extra information. I've summarised and formatted it in the description, feel free to edit.

Also brought it back to our Epic column and adjusted the priority based on reality (we won't be working on this soon).

If you have real use cases in mind it would be helpful to list them and add them to the description to help prioritise this when possible.

Adding Content-Transform-Team as Product-Infrastructure-Team-Backlog-Deprecated has been deprecated for a while, and as open valid tasks shall not be ignored and forgotten only because WMF internally reorgs without much change management in place (cf T328586).