
[WE1.5.3] Wikipedia Patrolling Measurement
Closed, Resolved · Public

Description

Hypothesis
If we develop heuristics-based data pipelines for measuring patrolling activity on Wikipedia, we can prototype a model to detect moderation gaps at scale.

Scope

  • Wikipedia
  • Largest language editions
  • Main namespace (article pages)
  • While the focus is on newer moderators, measurement should cover all editor types. Incorporating AbuseFilter (pre-publication filters) is lower priority if it proves difficult.

Product requirements
There are no strict product requirements at the moment. However, the Moderator Tools team is ideating a project to organize on-wiki activity in a centralized way, including moderation practices. Given that this proposal is still at an early stage, it should not block the start of prototyping patrolling activity metrics. On the contrary, this research aims to proactively provide insights to guide Product.

Draft output

  • Dataset of all edits for a month that captures the following:
    • Edit:
      • What was changed (we will start with issue templates from Q3 work and can add further functionality if needed and feasible)
      • Who performed the edit (edit count, user rights)
      • Edit review status (at time of snapshot)
    • Reviewer:
      • Who "reviewed" or followed up on the edit (edit count, user rights) where applicable
      • Time to review
      • Time to next edit
      • Policies cited in edit summary if reverted
  • Superset dashboard that allows for querying of the data. This will enable stakeholders to identify gaps in patrolling and assess opportunities for building interventions or identifying potential new moderators.

The above should be largely stable though will likely shift a little as progress is made on measurement and more feedback is received. We have also discussed article-related features such as quality, importance, and topic. These are not prioritized for now.
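As a rough illustration of the draft output above, the per-edit record could be sketched as a dataclass. All field names here are hypothetical placeholders for the facets listed, not the dataset's actual columns:

```python
from dataclasses import dataclass, field
from typing import Optional, List

@dataclass
class EditRecord:
    """Hypothetical sketch of one row in the draft dataset of edits."""
    # Edit facets
    rev_id: int
    templates_changed: str                      # issue templates added/removed (Q3 work)
    editor_edit_count: int                      # who performed the edit
    editor_user_groups: List[str] = field(default_factory=list)
    review_status: str = 'unreviewed'           # at time of snapshot
    # Reviewer facets, populated only where a review/follow-up happened
    reviewer_edit_count: Optional[int] = None
    reviewer_user_groups: Optional[List[str]] = None
    seconds_to_review: Optional[float] = None
    seconds_to_next_edit: Optional[float] = None
    revert_policies_cited: Optional[List[str]] = None  # policies cited in summary if reverted
```

The split between always-present edit facets and optional reviewer facets mirrors the two bullet groups above: many edits will never be reviewed, so reviewer fields default to None.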

Draft methods

Background
This hypothesis emerges from the wider framework at T384860 and is aligned with work at T384600, where several metrics of crowdsourced content moderation based on article maintenance template usage were proposed. Our intention is to add measurement of patrolling to expand our ability to understand moderation and to provide insights guiding the development of a centralized on-wiki place for moderators. Some examples of questions that can help us decide what to include in the dataset:

  • How many editors have reverted at least Y edits in 30 days?
  • How many editors have added a messagebox to an article?
  • How many editors provided feedback when rolling back an edit?
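The first question above can be answered with a simple aggregation over an edits table. A minimal sketch in pandas, assuming column names modeled on mediawiki_history (`event_user_text`, `revision_is_identity_revert`); the real schema may differ:

```python
import pandas as pd

def editors_with_min_reverts(edits: pd.DataFrame, y: int) -> int:
    """Count editors who performed at least `y` reverting edits within
    the 30-day window covered by `edits`.

    Assumes columns `event_user_text` (editor) and
    `revision_is_identity_revert` (bool), as in mediawiki_history.
    """
    reverts = edits[edits['revision_is_identity_revert']]
    reverts_per_editor = reverts.groupby('event_user_text').size()
    return int((reverts_per_editor >= y).sum())
```

The same groupby pattern generalizes to the other two questions by swapping the filter (messagebox additions, rollbacks with a non-empty summary).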

Details

Due Date
Jun 30 2025, 4:00 AM

Event Timeline

Isaac triaged this task as High priority. Apr 17 2025, 3:51 PM

FYI T348863: Baseline: Size of content moderation backlog - FlaggedRevs by KC also has some notebooks for patrolling data that might be of use.

Isaac renamed this task from Metrics on Wikipedia Patrolling Work to [WE1.5.4] Wikipedia Patrolling Measurement. Apr 25 2025, 3:09 PM
Isaac added a project: OKR-Work.
Isaac updated the task description.
Isaac set Due Date to Jun 30 2025, 4:00 AM.

A thought: we might also consider adding the editing interface as a facet of the edit, in case certain decisions around gaps need to take it into consideration -- e.g., the Editing team focuses on VE (e.g., T354303). This should be pretty easy to add in later, so it's not something we need to decide now.

Below is the code that I've used in the past to classify edits based on their associated edit tags (from mediawiki_history via a UDF). The main effort is expanding the tool tags to be more comprehensive of the external tools that folks use instead of the core editing interfaces. In the past, I have just looked through Special:Tags for a language edition and identified any tags that seem to indicate someone was using an external tool to make the edit. It's not perfect, though, as not all tools have an edit tag, so we might also look for edit-summary patterns such as certain hashtags or links to tool documentation pages. Even then it won't be perfect, as I'm sure there are tools that give no indication when they're used for an edit, but hopefully we'd identify enough that the data would be useful.

# Edit tags that indicate an external tool was used instead of a core
# editing interface. This list was compiled for French Wikipedia (via
# Special:Tags), so it would need to be updated for more languages.
tool_tags = set(['HotCats', 'AWB', 'WPCleaner', 'RenommageCategorie', 'BandeauxPortails', 'contenttranslation', 'BandeauxEbauches', 'PublierBrouillon', 'PaStec'])

def edit_tags_to_interface(edit_tags):
    """Map an edit's tags to a '<platform>-<interface>' label."""
    # Default to the source wikitext editor unless a more specific tag is found
    interface = 'source-wikitext'
    if 'visualeditor' in edit_tags:
        interface = 'visual-rich'
    elif 'visualeditor-wikitext' in edit_tags:
        interface = 'visual-wikitext'
    else:
        for t in edit_tags:
            if t in tool_tags:
                interface = 'tool'
                break

    platform = 'desktop'
    if 'ios app edit' in edit_tags:
        platform = 'iOS'
    elif 'android app edit' in edit_tags:
        platform = 'Android'
    elif 'mobile edit' in edit_tags or 'mobile web edit' in edit_tags:
        platform = 'mobile'

    return f'{platform}-{interface}'
Isaac renamed this task from [WE1.5.4] Wikipedia Patrolling Measurement to [WE1.5.3] Wikipedia Patrolling Measurement. Apr 25 2025, 6:46 PM

Progress update on the hypothesis for the week (last week)
Started working on a notebook to create the dataset of edits in March 2025 with metadata, including mediawiki_history fields and patrolling information and status (prevented, delete, reverted, reviewed, edited_over, autopatrolled).

Hypothesis lifecycle stage
Research and Discovery

Any updates on metrics related to this hypothesis
No

Any emerging blockers or risks
Not a blocker but worth sharing. I am considering adding revert risk predictions of revisions from March 2025, but the Risk Observatory table only contains predictions until February 2025. I contacted Fabian, who explained to me that the revert risk base features have migrated to the new content diff dataset, but the risk observatory is still using the now deprecated wikidiff dataset (which is only available until 2025-02). A merge request has been made to re-run the March snapshot and predictions are expected to be available mid next week.

Any unresolved dependencies
No

Have there been any new lessons from the hypothesis?
I explored ways to integrate PageTriage data into this dataset, but linking logs to individual revisions proved challenging due to the variety of PageTriage log types and the differing motivations behind subsequent edits to the corresponding pages.

Have there been any changes to the hypothesis scope or timeline?
No

Progress update on the hypothesis for the week
Advanced the notebook that creates the dataset of edits in March 2025. The dataset has been expanded to include detailed information on reverting editors and predicted revert risk scores, enabling us to answer questions about reverting activity (see updates on metrics below). Furthermore, the notebook has been re-run to generate an analogous dataset for edits in October 2024, for which metadata on article maintenance templates will be available (see task T384600).

Hypothesis lifecycle stage
Research and Discovery

Any updates on metrics related to this hypothesis
An exploration notebook is already available addressing the proposed question: How many editors have reverted at least Y edits in 30 days?

Any emerging blockers or risks
No

Any unresolved dependencies
No

Have there been any new lessons from the hypothesis?
No

Have there been any changes to the hypothesis scope or timeline?
No

Progress update on the hypothesis for the week
Continued development on the data collection notebook. For the October 2024 snapshot, metadata now includes information on the addition and removal of article maintenance templates. For instance, English Wikipedia revision 1248733993 reflects this update with a new column indicating template changes as mbox:more footnotes-add | mbox:no footnotes-remove.
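A downstream consumer of the dataset could split that pipe-delimited column back into added and removed template lists. A minimal sketch, assuming the `mbox:<name>-<action>` format shown in the example above (the column name and exact serialization are illustrative):

```python
def parse_template_changes(cell):
    """Parse a template-changes cell such as
    'mbox:more footnotes-add | mbox:no footnotes-remove'
    into (added, removed) lists of template names."""
    added, removed = [], []
    if not cell:
        return added, removed
    for entry in cell.split(' | '):
        # Split on the last hyphen: template names may themselves contain spaces
        name, _, action = entry.rpartition('-')
        if name.startswith('mbox:'):
            name = name[len('mbox:'):]
        if action == 'add':
            added.append(name)
        elif action == 'remove':
            removed.append(name)
    return added, removed
```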

Hypothesis lifecycle stage
Research and Discovery

Any updates on metrics related to this hypothesis
The data exploration notebook has also been expanded accordingly. First, it shows the distribution of revisions per wiki based on the number of links to Wikipedia project namespaces found in edit summaries. This metric will be used to approximate the question: How many editors provided feedback when rolling back an edit? In addition, the notebook already includes data addressing the question: How many editors have added a messagebox to an article?
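Counting project-namespace links in an edit summary can be sketched with a wikilink regex. This is a simplified illustration: the hard-coded prefix list stands in for the proper language-agnostic namespace mapping described later, and the function name is hypothetical:

```python
import re

# Illustrative subset of project-namespace prefixes; the production notebook
# would need the localized prefixes for each language edition.
PROJECT_NS_PREFIXES = ('Wikipedia:', 'WP:', 'Wikipédia:')

def count_project_links(summary):
    """Count wikilinks in an edit summary that point to a project
    (Wikipedia:) namespace page, e.g. policy citations in revert summaries."""
    # Capture the link target of each [[target]] or [[target|label]] wikilink
    links = re.findall(r'\[\[([^\]|]+)(?:\|[^\]]*)?\]\]', summary or '')
    return sum(1 for target in links if target.strip().startswith(PROJECT_NS_PREFIXES))
```

For example, a summary like "Reverted per [[WP:VANDAL]]" would count one project link, while links to article pages would be ignored.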

Any emerging blockers or risks
No

Any unresolved dependencies
No

Have there been any new lessons from the hypothesis?
I initially planned to extract policy mentions in edit summaries using an existing notebook, which focused on English Wikipedia. To overcome this language limitation, a new approach was developed leveraging the language-agnostic MediaWiki tokenizer (mwtokenizer) and a library that helps map links to namespaces (mwconstants). The availability of these two libraries significantly enhances our capability to process multilingual data efficiently. :wikilove: to @Isaac, Martin Gerlach, Nazia Tasnim, and Aisha Khatun for their contributions.

Have there been any changes to the hypothesis scope or timeline?
No

Progress update on the hypothesis for the week
This week has been relatively lighter, as I was OoO for two days. That said, I updated the data collection notebook to incorporate revision tags and, for reverted revisions, the comment of the reverting revision. Progress with the data was reviewed with @Isaac during our 1:1 meeting. In addition, two other meetings have been scheduled with stakeholders, as the dataset is expected to be used in a Product Analytics + Moderator Tools teams' effort to measure current moderator activity to inform a Key Result target for WE 1.3 FY 25/26.

Hypothesis lifecycle stage
Research and Discovery

Any updates on metrics related to this hypothesis
The data exploration notebook has been updated to now include the answers to the three questions using the October 2024 dataset:

  • How many editors have reverted at least Y edits in 30 days?
  • How many editors have added a messagebox to an article?
  • How many editors provided feedback when rolling back an edit?

Any emerging blockers or risks
No

Any unresolved dependencies
No

Have there been any new lessons from the hypothesis?
No

Have there been any changes to the hypothesis scope or timeline?
No

Progress update on the hypothesis for the week
This week’s work focused on building a Superset dashboard to deliver the dataset, which also included modifications to the existing notebooks, e.g., moving the parsing of edit summaries (to extract links to the project namespace) into the data collection phase. The dashboard will be reviewed next week in meetings with colleagues from the Product Analytics and Moderator Tools teams.

Hypothesis lifecycle stage
Research and Discovery

Any updates on metrics related to this hypothesis
No

Any emerging blockers or risks
No

Any unresolved dependencies
No

Have there been any new lessons from the hypothesis?
The dashboard visualization of the number of editors in October 2024 introducing a messagebox to an article shows values for the following wikis: enwiki, eswiki, jawiki, ruwiki, svwiki, and zhwiki. In contrast, no data is shown for dewiki, frwiki, itwiki, nlwiki, and plwiki. This absence was somewhat expected for dewiki, given its outlier condition already reported here. For the other wikis, the missing data appears to come from an unresolved issue related to the HTML rendering of messageboxes, as documented here. Work is in progress to mitigate this issue.

The dashboard visualization of revisions by wiki and status reveals a notably high percentage of unreviewed revisions on the Swedish Wikipedia. I shared this observation with colleagues on the Research team during our weekly Moderators check-in call. @TAndic noted that the Swedish Wikipedia experienced a massive deletion of bot-created articles. While there is no direct evidence linking these two phenomena, this could be worth a closer look.

Have there been any changes to the hypothesis scope or timeline?
No

Progress update on the hypothesis for the week
This week's efforts primarily focused on interactions with the Moderator Tools and Product Analytics teams. During a joint meeting, I presented the current status of the dataset and dashboard, and gathered feedback (notes). Following the session, both teams engaged further to request additional details, particularly in relation to defining baselines for the projected increase in moderation actions for FY 25/26 WE 1.3 KR (T396493). In parallel, while awaiting input from stakeholders regarding specific product interventions to be tracked using this data, I began drafting a report for this project to document the dataset and a summary of findings.

Hypothesis lifecycle stage
Research and Discovery

Any updates on metrics related to this hypothesis
As noted in the previous update, an unresolved issue concerning the HTML rendering of message boxes may have impacted the metrics related to the usage of article maintenance templates. As recently shown here, I have tested a revised parsing approach that appears to successfully capture message boxes on wikis previously affected by this limitation (frwiki, itwiki, nlwiki, and plwiki).

Any emerging blockers or risks
No

Any unresolved dependencies
No

Have there been any new lessons from the hypothesis?
Conversations this week suggest that teams are not yet at a point where specific product interventions have been defined. This uncertainty may support the approach taken in this hypothesis: to build a maximal dataset that includes information on multiple forms of content moderation.

Have there been any changes to the hypothesis scope or timeline?
No

Progress update on the hypothesis for the week
This week’s work focused on two main areas. First, significant effort was dedicated to reviewing and refining both the dataset and the dashboard. This included adding a new table with metrics suggested by the Moderator Tools team. Second, a preliminary version of the report was completed, documenting the dataset, the dashboard, and key findings. The report has been shared with the stakeholders for feedback.

Hypothesis lifecycle stage
Research and Discovery

Any updates on metrics related to this hypothesis
No

Any emerging blockers or risks
No

Any unresolved dependencies
No

Have there been any new lessons from the hypothesis?
No

Have there been any changes to the hypothesis scope or timeline?
No

Progress update on the hypothesis for the week
The report has been updated to incorporate feedback from various stakeholders. The hypothesis stated:

If we develop heuristics-based data pipelines for measuring patrolling activity on Wikipedia, we can prototype a model to detect moderation gaps at scale.

To test this, data pipelines were developed to generate the dataset, which has been made accessible via a Superset dashboard, as specified here. Analysis with the dashboard has revealed moderation gaps, which have been shared with stakeholders and documented in the revised report. The dataset is expected to provide data on the retention rate of patrollers using FlaggedRevs and of reverting editors at T396493, as a comparable moderator retention rate metric to inform targets. Furthermore, additional opportunities to leverage this data have been identified at T398071. Consequently, the hypothesis is supported.

Hypothesis lifecycle stage
Research and Discovery

Any updates on metrics related to this hypothesis
No

Any emerging blockers or risks
No

Any unresolved dependencies
No

Have there been any new lessons from the hypothesis?
Major lessons contained in the report are:

  • If data on article maintenance template usage is needed for additional time periods, it will be necessary that Data Platform Engineering take ownership of the productionization process (T380874).
  • Consistent with the Q3 analysis of article maintenance template practices, editors with high edit counts tend to be more actively engaged in patrolling (through reverts in this case).
  • Self-reviewed and autopatrolled revisions generally exhibit very low revert risk.
  • Reviewed revisions exhibiting high risk scores could present an opportunity to retrain revert risk models.
  • 68% of revisions from October 2024 in Swedish Wikipedia remain unreviewed, although these revisions tend to have low revert risk scores.
  • A noticeable percentage of unreviewed revisions with elevated revert risk scores exist on German and Polish Wikipedia.
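The last two lessons combine review status with revert risk scores. A minimal sketch of that per-wiki breakdown in pandas, assuming illustrative column names (`wiki`, `status`, `revert_risk`) and threshold rather than the dataset's actual schema:

```python
import pandas as pd

def high_risk_unreviewed_share(df: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """For each wiki, the share of unreviewed revisions whose revert-risk
    score exceeds `threshold`. Column names are illustrative placeholders."""
    unreviewed = df[df['status'] == 'unreviewed']
    return unreviewed.groupby('wiki')['revert_risk'].apply(
        lambda scores: float((scores > threshold).mean()))
```

A wiki like svwiki could then show many unreviewed revisions but a low high-risk share, while dewiki or plwiki could show a smaller unreviewed backlog with a larger high-risk fraction.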

Have there been any changes to the hypothesis scope or timeline?
No