Overview
The process by which content is reverted on Wikipedia can take multiple forms. One way of distinguishing these approaches is the speed at which they occur. At one extreme are filters that block edits prior to publication; at the other extreme (and far rarer) are edits that are only discovered serendipitously by editors, perhaps years after being published. For the purposes of this task, we will consider four speeds of patrolling:
- Prepublication Filters: various systems are in place to prevent edits from being published if they match some set of heuristics. This includes AbuseFilter, SpamBlacklist, and TitleBlacklist. While these are quite coarse, they can catch certain types of recurring vandalism. Newer forms of this approach that are more oriented towards helping editors are also being developed as part of the Edit check work.
- Note that this is different from Flagged Revs implementations that only show edits to readers once they have been patrolled, though that can also reasonably be called prepublication moderation (but in the context of readers).
- Automated Patrolling: once an edit has been published, there are additional automated systems in place that can revert the edit (often in a matter of seconds). Automoderator is the most centralized form of this, but various bots have been doing this work for well over a decade -- e.g., ClueBot NG.
- Fast Patrolling: if edits make it past the automated filters, there are still some mechanisms through which they might be reverted quite quickly (seconds to minutes) but manually by editors. This generally means that editors are patrolling edits via RecentChanges or their personalized Watchlist. Editors might also use various tools such as Huggle.
- Slow Patrolling: this is generally more serendipitous patrolling where an editor encounters vandalism through exploring Wikipedia or investigating a user who has made other concerning edits. While there is no hard boundary between fast and slow patrolling, this generally will be on the order of days.
Note: this categorization is heavily inspired by the fast vs. slow patrolling described in this 2019 report but I interpret fast vs. slow a bit differently.
Task
The goal of this task is to better understand the relative importance of these different levels of patrolling across various wikis. It extends this analysis, which gathered time-to-revert data for a number of language editions, by also examining prepublication filtering, distinguishing between automated and human reverts, and gathering various metadata relevant to these processes.
Data sources to compile are below. For each, we will want the volume of activity, details on how much time passed between the revision and the particular action, and information about the actor who took the action. Rough query sketches for these sources follow the list:
- Prepublication:
- Spam hits can be found in the logging table.
- AbuseFilter hits can be found in the abuse_filter_log table. There are multiple actions that AbuseFilter can take -- disallow is the relevant action here.
- Automated:
- Automoderator uses a local account on each wiki where it is enabled to revert edits. Tracking this data therefore requires identifying any reverts made by those usernames. Further documentation can be found at: https://www.mediawiki.org/wiki/Moderator_Tools/Automoderator/Unified_Activity_Dashboard. Note that Automoderator does not always have a bot flag, which is why it needs a special approach.
- Other bot-based reverts can be found by counting any reverts made by accounts that are flagged as bots (ensuring that you are not double-counting Automoderator in cases where it is treated as a bot). The event_user_is_bot_by_historical field in the mediawiki history dumps is the best place to gather this information, as it reflects the user's bot status at the time of the edit. The event_user_text field can be used for Automoderator.
- Fast+Slow Patrolling:
- Reverts: these can also be easily extracted from the mediawiki history dumps using the revision_is_identity_revert field (while probably excluding any that are also marked as revision_is_identity_reverted). For completeness (see T266374 for more details), you can also double-check for the presence of mw-undo, mw-rollback, and mw-manual-revert while skipping any mw-reverted edits. While reverts that get caught in an edit war may be legitimate patrolling in many cases, the subsequent reverts are likely far simpler to make than the original and can reasonably be grouped together as one event.
- Patrolled edits: marking edits as "patrolled" – i.e., reviewed and okay to not be reverted – is done through a variety of tools and according to a variety of norms depending on the wiki. Some of these can be found via the patrol log action in the logging table. For example, English Wikipedia patrols new articles via PageTriage and will mark them as patrolled if accepted. Other wikis can do this via RecentChanges and apply it to more than just new articles. Then there is Flagged Revisions on some wikis, or its simpler variant Pending changes on English Wikipedia. This also allows for marking edits as patrolled, but the resulting data can be found in the review log action in the logging table. This is probably the most complicated space, and further exploration might be needed to make sure data is not being missed.
- For those with webrequest access, it might be possible to also collect data on how often edit diffs are viewed through requests with the diff= parameter of the Compare API. This would take some work to standardize and correlate with data about patrolling, but could provide information about whether these explicit flags reduce the amount of time editors spend re-reviewing edits.
- Note: from a data perspective, it is very hard to distinguish between fast vs. slow patrolling in that it is generally not possible to infer how the editor in question discovered the vandalism. We must instead rely on a more tautological approach where actions are assigned to these categories based on time-to-revert. But we can imagine a scenario where an editor coincidentally loads a page immediately post-vandalism and reverts the damage in seconds, or conversely, an editor uses RecentChanges on a smaller wiki but reverts an edit hours after it has been published.
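For the prepublication sources, a minimal sketch assuming access to a wiki's MariaDB replica via pymysql is below. The host and database names, the time window, and the 'spamblacklist' log_type value are placeholders or assumptions to verify against the schema of the wiki being queried.

```python
import pymysql

# Placeholder connection details for a single wiki's replica database.
conn = pymysql.connect(
    host="enwiki.analytics.db.svc.wikimedia.cloud",  # placeholder host
    database="enwiki_p",                             # placeholder database
    read_default_file="~/replica.my.cnf",            # assumed credentials file
)

START, END = "20240501000000", "20240502000000"      # illustrative one-day window

with conn.cursor() as cur:
    # SpamBlacklist hits recorded in the logging table (log_type value assumed)
    cur.execute(
        """
        SELECT log_action, COUNT(*) AS hits
        FROM logging
        WHERE log_type = 'spamblacklist'
          AND log_timestamp BETWEEN %s AND %s
        GROUP BY log_action
        """,
        (START, END),
    )
    print(cur.fetchall())

    # AbuseFilter hits where the edit was actually disallowed
    cur.execute(
        """
        SELECT COUNT(*) AS disallowed_hits
        FROM abuse_filter_log
        WHERE afl_actions LIKE '%%disallow%%'
          AND afl_timestamp BETWEEN %s AND %s
        """,
        (START, END),
    )
    print(cur.fetchall())
```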
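For the automated and manual reverts (the Automoderator, bot, and human buckets described above), a PySpark sketch against the wmf.mediawiki_history dataset could look like the following. The snapshot value and the Automoderator account name list are assumptions; field names follow the published mediawiki_history schema.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

SNAPSHOT = "2024-05"                   # assumed snapshot partition
AUTOMOD_ACCOUNTS = ["Automoderator"]   # assumed local account name(s)
REVERT_TAGS = ["mw-undo", "mw-rollback", "mw-manual-revert"]

reverts = (
    spark.table("wmf.mediawiki_history")
    .where(F.col("snapshot") == SNAPSHOT)
    .where(F.col("event_entity") == "revision")
    .where(F.col("event_type") == "create")
    # identity reverts, double-checked against the revert change tags
    .where(
        F.col("revision_is_identity_revert")
        | F.arrays_overlap(
            F.col("revision_tags"), F.array(*[F.lit(t) for t in REVERT_TAGS])
        )
    )
    # skip reverts that were themselves reverted (e.g., edit wars)
    .where(~F.coalesce(F.col("revision_is_identity_reverted"), F.lit(False)))
    .withColumn(
        "actor_type",
        F.when(F.col("event_user_text").isin(AUTOMOD_ACCOUNTS), "automoderator")
         .when(F.size(F.col("event_user_is_bot_by_historical")) > 0, "bot")
         .otherwise("human"),
    )
)

reverts.groupBy("wiki_db", "actor_type").count().show()
```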
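In line with the caveat above that fast vs. slow can only be approximated from timing, a rough sketch of bucketing reverted edits by time-to-revert is below. The 60-second and 48-hour thresholds are illustrative assumptions, not values from the task description.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

reverted = (
    spark.table("wmf.mediawiki_history")
    .where(F.col("snapshot") == "2024-05")            # assumed snapshot
    .where(F.col("event_entity") == "revision")
    .where(F.col("event_type") == "create")
    .where(F.col("revision_is_identity_reverted"))
    .withColumn(
        "revert_speed",
        # seconds between the edit and its first identity revert
        F.when(F.col("revision_seconds_to_identity_revert") < 60, "automated_or_fast")
         .when(F.col("revision_seconds_to_identity_revert") < 48 * 3600, "fast")
         .otherwise("slow"),
    )
)

reverted.groupBy("wiki_db", "revert_speed").count().show()
```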
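For the patrol and review logs, a sketch of counting actions in the logging table (again via pymysql against a replica) is below. The log_type values ('patrol', 'review', 'pagetriage-curation') reflect my understanding of how manual patrolling, Flagged Revisions / Pending changes, and PageTriage record their actions; treat them as assumptions to verify per wiki.

```python
import pymysql

conn = pymysql.connect(
    host="enwiki.analytics.db.svc.wikimedia.cloud",  # placeholder host
    database="enwiki_p",                             # placeholder database
    read_default_file="~/replica.my.cnf",            # assumed credentials file
)

with conn.cursor() as cur:
    cur.execute(
        """
        SELECT log_type, log_action, COUNT(*) AS actions
        FROM logging
        WHERE log_type IN ('patrol', 'review', 'pagetriage-curation')
          AND log_timestamp BETWEEN %s AND %s
        GROUP BY log_type, log_action
        """,
        ("20240501000000", "20240502000000"),          # illustrative window
    )
    for row in cur.fetchall():
        print(row)
```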
Once the data is collected across these sources, the amount of activity for each stage on each wiki can be normalized by the total number of edits made on that wiki in that time period. Additional data about the type of actor should also be gathered -- e.g., bots vs. editors; level of editor experience. This will hopefully begin to give some insight into how wikis vary in how much they rely on patrolling in each of these areas. I would recommend starting with a smaller time period (e.g., a day), given the complexity of gathering these different datasets, before extending to longer time periods.
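A sketch of that normalization step is below, assuming per-wiki stage counts have already been produced by the queries above (the stage_counts rows here are purely illustrative, and the snapshot and date window are assumptions).

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Illustrative per-stage counts; in practice these come from the earlier queries.
stage_counts = spark.createDataFrame(
    [("enwiki", "abusefilter_disallow", 1200), ("enwiki", "bot_revert", 3400)],
    ["wiki_db", "stage", "n"],
)

# Total edits per wiki in the same (assumed) one-day window.
total_edits = (
    spark.table("wmf.mediawiki_history")
    .where(F.col("snapshot") == "2024-05")                       # assumed snapshot
    .where(F.col("event_entity") == "revision")
    .where(F.col("event_type") == "create")
    .where(F.col("event_timestamp").between("2024-05-01", "2024-05-02"))
    .groupBy("wiki_db")
    .agg(F.count("*").alias("total_edits"))
)

normalized = (
    stage_counts
    .join(total_edits, on="wiki_db", how="left")
    .withColumn("share_of_edits", F.col("n") / F.col("total_edits"))
)

normalized.show()
```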
Acceptance Criteria
- The output of this task will be a Meta page with details on the methodology and findings. Care should be taken with sources that contain private data – the Data publication guidelines will need to be followed before any public data sharing.