@dr0ptp4kt suggested that I solicit comments on the state of the algorithm from a few of you who have thought about this.
Background information
This has been a requirement from at least one internal partner.
What
They'd like to be able to request a non-vandalized page.
How
Description
- If the article's edit protection level was lowered or expired in the last 3 days, pull, format, and update the information from the latest revision saved before the protection was lowered or expired. This ensures that we use the most accurate revision that has not been subject to vandalism.
- If the source article is protected with changes pending approval, use only the latest revision that has been approved by a reviewer.
- Always pull information from the most stable revision in the last 2 days, that is, the revision that was live for the longest amount of time in the last 2 days rather than simply the newest one. This revision has the highest probability of being accurate and free of vandalism, because it was publicly readable for longer than any other revision in that period.
- Use an earlier revision (going back further if needed) when the revision that would be chosen is a rollback of a previous edit, or is the edit made immediately before a rollback, especially if the rollback was made immediately afterward by a respected bot such as ClueBot NG.
- Newly created articles should have a 7-day cooling-off period so that the community has a chance to review them. Until that period ends, we can act as though the article does not exist.
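The "most stable revision" rule above can be sketched as follows. This is only an illustration: the function name `most_stable_revision` and the `(rev_id, saved_at)` tuple layout are assumptions made for the example, not part of any existing code.

```python
from datetime import datetime, timedelta, timezone

def most_stable_revision(revisions, now, window=timedelta(days=2)):
    """Return the rev_id that was live longest inside [now - window, now].

    `revisions` is a hypothetical list of (rev_id, saved_at) tuples,
    sorted oldest first, with timezone-aware UTC timestamps.
    """
    window_start = now - window
    best_id, best_live = None, timedelta(0)
    for i, (rev_id, saved_at) in enumerate(revisions):
        # A revision stays live until the next one is saved (or until now).
        replaced_at = revisions[i + 1][1] if i + 1 < len(revisions) else now
        # Only count the part of its lifetime that falls inside the window.
        live = min(replaced_at, now) - max(saved_at, window_start)
        if live > best_live:
            best_id, best_live = rev_id, live
    return best_id
```

Note that a revision saved before the window opened still counts for the time it remained live inside the window, so a long-standing revision can win over a burst of recent edits.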
Pseudo code - description of points above as algorithm
for each article edit in target set of articles:
    if protected or was-protected in last 3 days:
        ‘do not update cache’ - return
    else if is reverting another edit and is a bot edit:
        ‘do not update cache’
    else if article was created within the last 7 days:
        ‘update after 7 days with last stable’
    else (regular edit to an old article):
        ‘update if has not been reverted in prior 2 days’
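For discussion, the pseudocode above might translate into something like the following. The `Edit` dataclass, its field names, and the returned action strings are all hypothetical; a real implementation would fill these fields from the wiki's protection log and revision history.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

@dataclass
class Edit:
    # Hypothetical record; none of these field names come from MediaWiki.
    article_created: datetime
    is_protected: bool = False
    protection_ended: Optional[datetime] = None  # when protection was lowered or expired
    is_revert: bool = False
    is_bot_edit: bool = False

def cache_action(edit: Edit, now: datetime) -> str:
    """Map one edit to a cache decision, following the branches in order."""
    recently_protected = (
        edit.protection_ended is not None
        and now - edit.protection_ended <= timedelta(days=3)
    )
    if edit.is_protected or recently_protected:
        return "skip"                      # 'do not update cache'
    if edit.is_revert and edit.is_bot_edit:
        return "skip"                      # 'do not update cache'
    if now - edit.article_created < timedelta(days=7):
        return "defer"                     # 'update after 7 days with last stable'
    return "update_if_unreverted"          # 'update if has not been reverted in prior 2 days'
```

The 3-day and 7-day constants are hard-coded here only to mirror the pseudocode; they would need to become parameters if the periods end up configurable per wiki.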
References and other notes
How to get the last stable revision (referenced above)
Based on https://meta.wikimedia.org/wiki/Research:Revert, process the most recent 3 days of activity on the article.
if there’s no activity in the last 3 days:
    return the most recent revision
else:
    return the most recent revision that is not:
        reverted or a reverting bot edit
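A minimal sketch of this selection, assuming each revision has already been annotated with `reverted` and `reverting_bot_edit` flags (for example by a revert-detection pass along the lines of Research:Revert). The dictionary keys and the function name are made up for the example, and falling back to revisions older than 3 days when every recent one is excluded is an assumption the pseudocode does not spell out.

```python
from datetime import datetime, timedelta, timezone

def last_stable_revision(revisions, now, window=timedelta(days=3)):
    """Pick the last stable rev_id from `revisions` (newest first).

    Each revision is a dict with hypothetical keys:
    rev_id, timestamp, reverted, reverting_bot_edit.
    """
    if not revisions:
        return None
    recent = [r for r in revisions if now - r["timestamp"] <= window]
    if not recent:
        # No activity in the window: the newest revision is used as-is.
        return revisions[0]["rev_id"]
    for rev in revisions:
        # Skip anything that was reverted or is itself a reverting bot edit.
        if not rev["reverted"] and not rev["reverting_bot_edit"]:
            return rev["rev_id"]
    return None
```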
How to get the last N days of an article (based on a date that was N days ago):
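One way to request that window is sketched below, assuming the MediaWiki Action API's `prop=revisions` module is the intended endpoint. The helper name `recent_revisions_url` and the chosen `rvprop` fields are illustrative; `now` is expected to be a UTC timestamp.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

def recent_revisions_url(title, days, now, lang="en"):
    """Build an API URL listing revisions of `title` from now back to N days ago."""
    cutoff = now - timedelta(days=days)
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids|timestamp|user|comment",
        "rvlimit": "max",
        # Default direction is newest to oldest, so rvend is the older bound.
        "rvend": cutoff.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "format": "json",
    }
    return f"https://{lang}.wikipedia.org/w/api.php?{urlencode(params)}"
```

Long histories may span several response pages, so a caller would still need to follow the API's continuation values.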
Check if an edit was performed by a bot (“bot edit”)
Query the Wikipedia API for the relevant language using the username (note that it appears twice) (works for all bots)
if ‘bot’ in “groups”:
    return True
else:
    return False
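The check could look roughly like this. It assumes the query in question is `list=allusers` with `aufrom` and `auto` both set to the username (which would explain why the name appears twice) and `auprop=groups`; that reading, and both function names, are assumptions rather than something confirmed here.

```python
from urllib.parse import urlencode

def bot_check_url(username, lang="en"):
    """Build an allusers query bounded to exactly one username."""
    params = {
        "action": "query",
        "list": "allusers",
        "aufrom": username,   # start of the username range
        "auto": username,     # end of the username range (same name)
        "auprop": "groups",
        "format": "json",
    }
    return f"https://{lang}.wikipedia.org/w/api.php?{urlencode(params)}"

def is_bot(api_response, username):
    """Return True if the decoded JSON response lists `username` in the 'bot' group."""
    users = api_response.get("query", {}).get("allusers", [])
    return any(
        user.get("name") == username and "bot" in user.get("groups", [])
        for user in users
    )
```

Keeping the URL builder separate from the response check means the group test can be exercised without network access.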
Example query for ClueBot NG (a bot):
Example query for EpochFail (not a bot)
Open questions
- Should the time periods mentioned in the algorithms be customisable per wiki language? T203127#4976367
- Which time periods for which wikis?
- What are reputable bots on each wiki project?