
Crowdsourced content moderation metrics
Closed, Resolved, Public

Description

Moderators play a crucial role in sustaining Wikimedia communities by handling social, technical, and governance work. In T376684, the WMF Research team identified a large number of moderation actions across language editions of Wikipedia. Some of these actions focus on content rather than users and are crowdsourced: they involve a large number of users who do not need high levels of wiki-specific expertise or hard-to-acquire user rights. We refer to them as crowdsourced content moderation.

In coordination with the qualitative work described in T383365, we want to quantitatively analyze crowdsourced content moderation actions, with a particular emphasis on exploring the use of templates added to articles needing cleanup. We aim to understand how article issue templates can effectively draw attention from both readers and editors to engage in the moderation process.

In this task, we will address some of these questions to prototype metrics that can serve Product Teams.

Details

Due Date
Apr 7 2025, 4:00 AM

Event Timeline

Isaac triaged this task as High priority. Jan 23 2025, 3:30 PM
Isaac added a project: Research.
Isaac moved this task from Backlog to In Progress on the Research board.

The GitLab repository originally created for T377324 has been updated with:

  • Code to generate the dataset of moderation actions in October 2024 on arzwiki, dewiki, enwiki, eswiki, frwiki, itwiki, jawiki, nlwiki, plwiki, ruwiki, svwiki, zhwiki (the notebook expands on @Isaac's original approach).
  • Resulting dataset. It is a sample of one month only, but an expanded version could be utilized for understanding the use of maintenance templates to develop models to support editors (ping @MGerlach).
  • Code with the following metrics:
    • Number of templates added/removed
    • Most common templates added/removed
    • Number of edits with a template added/removed
    • Distribution of edits with a template added/removed by user edit bucket
    • Distribution of edits with a template added/removed by user group
    • Number of editors adding/removing templates
    • Distribution of editors adding/removing templates by user edit bucket
    • Distribution of editors adding/removing templates by user group
    • Number of articles with a template added/removed
    • Most common articles with a template added/removed
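
As a rough illustration of how several of the metrics listed above could be computed, here is a minimal sketch over edit-level records. The record layout (`rev_id`, `user`, `user_edit_count`, `template`, `change`), the sample values, and the bucket boundaries are assumptions made for this example; they are not the notebook's actual schema or data.

```python
from collections import Counter

# Hypothetical records: one row per template added or removed in a revision.
# Field names and values are illustrative only.
records = [
    {"rev_id": 1, "user": "A", "user_edit_count": 5, "template": "cn", "change": "add"},
    {"rev_id": 1, "user": "A", "user_edit_count": 5, "template": "unreferenced", "change": "add"},
    {"rev_id": 2, "user": "B", "user_edit_count": 2500, "template": "cn", "change": "remove"},
    {"rev_id": 3, "user": "C", "user_edit_count": 40, "template": "cn", "change": "add"},
]


def edit_bucket(edit_count):
    """Map a user's edit count to a coarse bucket (boundaries are an assumption)."""
    if edit_count < 10:
        return "1-9"
    if edit_count < 100:
        return "10-99"
    if edit_count < 1000:
        return "100-999"
    if edit_count < 10000:
        return "1000-9999"
    return "10000+"


def summarize(records, change):
    """Compute template, edit, and editor metrics for one change type ('add' or 'remove')."""
    rows = [r for r in records if r["change"] == change]
    # One entry per revision and per editor, so multi-template edits are not double counted.
    bucket_by_revision = {r["rev_id"]: edit_bucket(r["user_edit_count"]) for r in rows}
    bucket_by_editor = {r["user"]: edit_bucket(r["user_edit_count"]) for r in rows}
    return {
        "template_count": len(rows),
        "top_templates": Counter(r["template"] for r in rows).most_common(3),
        "revision_count": len(bucket_by_revision),
        "edits_by_bucket": dict(Counter(bucket_by_revision.values())),
        "editor_count": len(bucket_by_editor),
        "editors_by_bucket": dict(Counter(bucket_by_editor.values())),
    }


added = summarize(records, "add")
removed = summarize(records, "remove")
```

Keeping separate per-revision and per-editor dictionaries is what distinguishes "number of templates added" from "number of edits with a template added" when a single edit adds more than one template.
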
Pablo renamed this task from Distributed moderation: metrics to Crowdsourced Content Moderation: data and metrics. Mar 7 2025, 2:15 PM
Pablo renamed this task from Crowdsourced Content Moderation: data and metrics to Crowdsourced content moderation metrics.
Pablo updated the task description.

With this notebook, I have created CSV files of template stats for each wiki at: https://gitlab.wikimedia.org/repos/research/who-are-moderators/-/tree/main/data/templates. The files include the following fields:

  • wiki_db: Database name of the wiki (in this notebook: arzwiki, dewiki, enwiki, eswiki, frwiki, itwiki, jawiki, nlwiki, plwiki, ruwiki, svwiki, zhwiki).
  • snapshot: Timestamp of the dataset (in this notebook: 2024-10).
  • template_name: Name of the template.
  • template_type: Type of template (mbox, inline).
  • template_change: Type of change (add, remove).
  • template_count: Number of times the template was added/removed.
  • revision_count: Number of revisions in which the template was added/removed.
  • page_count: Number of pages where the template was added/removed.
  • page_namespace_0_count: Number of pages in namespace 0 where the template was added/removed.
  • page_namespace_2_count: Number of pages in namespace 2 where the template was added/removed.
  • page_namespace_102_count: Number of pages in namespace 102 where the template was added/removed.
  • page_namespace_118_count: Number of pages in namespace 118 where the template was added/removed.
  • page_namespace_other_count: Number of pages in other namespaces where the template was added/removed.
  • editor_count: Number of editors who added/removed the template.
  • editor_bot_count: Number of bot users who added/removed the template.
  • editor_bot_perc: Percentage of bot users who added/removed the template.
  • editor_sysop_count: Number of sysop users who added/removed the template.
  • editor_sysop_perc: Percentage of sysop users who added/removed the template.
  • editor_editor_count: Number of editors belonging to the editor user group who added/removed the template.
  • editor_editor_perc: Percentage of editors belonging to the editor user group who added/removed the template.
  • editor_patroller_count: Number of patroller users who added/removed the template.
  • editor_patroller_perc: Percentage of patroller users who added/removed the template.
  • editor_with_rights_count: Number of editors belonging to any user group who added/removed the template.
  • editor_with_rights_perc: Percentage of editors belonging to any user group who added/removed the template.
  • editor_without_rights_count: Number of editors not belonging to any user group who added/removed the template.
  • editor_without_rights_perc: Percentage of editors not belonging to any user group who added/removed the template.
  • editor_1_9_count: Number of editors with an edit count between 1 and 9 who added/removed the template.
  • editor_1_9_perc: Percentage of editors with an edit count between 1 and 9 who added/removed the template.
  • editor_10_99_count: Number of editors with an edit count between 10 and 99 who added/removed the template.
  • editor_10_99_perc: Percentage of editors with an edit count between 10 and 99 who added/removed the template.
  • editor_100_999_count: Number of editors with an edit count between 100 and 999 who added/removed the template.
  • editor_100_999_perc: Percentage of editors with an edit count between 100 and 999 who added/removed the template.
  • editor_1000_9999_count: Number of editors with an edit count between 1,000 and 9,999 who added/removed the template.
  • editor_1000_9999_perc: Percentage of editors with an edit count between 1,000 and 9,999 who added/removed the template.
  • editor_10000_inf_count: Number of editors with an edit count of 10,000 or more who added/removed the template.
  • editor_10000_inf_perc: Percentage of editors with an edit count of 10,000 or more who added/removed the template.
  • editor_age_mean: Mean number of years since registration for editors who added/removed the template.
  • editor_age_median: Median number of years since registration for editors who added/removed the template.
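
To show how these files might be consumed, here is a minimal sketch that parses rows with a subset of the columns above, ranks templates by `template_count`, and recomputes a bot share from the two bot-related counts. The sample rows are made up for illustration (they are not values from the published CSVs), and the 0-100 percentage scale is an assumption about how the `_perc` columns are expressed.

```python
import csv
import io

# Made-up rows using a subset of the documented columns; not real data.
sample_csv = """wiki_db,template_name,template_type,template_change,template_count,editor_count,editor_bot_count
examplewiki,template a,mbox,add,120,40,2
examplewiki,template b,inline,add,300,75,0
examplewiki,template a,mbox,remove,90,30,3
"""

INT_FIELDS = ("template_count", "editor_count", "editor_bot_count")


def load_rows(text):
    """Parse CSV text into dictionaries, converting count columns to integers."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        for field in INT_FIELDS:
            row[field] = int(row[field])
        rows.append(row)
    return rows


def top_templates(rows, change, n=10):
    """Return the n rows with the highest template_count for a change type."""
    selected = [r for r in rows if r["template_change"] == change]
    return sorted(selected, key=lambda r: r["template_count"], reverse=True)[:n]


def bot_share(row):
    """Share of distinct editors that are bots, on an assumed 0-100 scale."""
    if row["editor_count"] == 0:
        return 0.0
    return 100.0 * row["editor_bot_count"] / row["editor_count"]


rows = load_rows(sample_csv)
top_added = top_templates(rows, "add")
for row in top_added:
    print(row["template_name"], row["template_count"], bot_share(row))
```

To run this against a real file, replace `io.StringIO(text)` with an open file handle for one of the per-wiki CSVs.
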

All files have been compiled into a spreadsheet to assist @cwylo in generating the taxonomy of templates.

Isaac set Due Date to Apr 7 2025, 4:00 AM. Mar 26 2025, 9:17 PM

To assist @cwylo in categorizing templates, a notebook was created that links the templates added or removed in a revision to the policies invoked in that revision's edit comment. A sample of the dataset is shown below.

wiki_db | template_action | policy | count
enwiki | mbox:afc submission-remove | WP:AFCH | 549
enwiki | mbox:afc submission-remove | Wikipedia:Articles for creation | 410
enwiki | mbox:proposed deletion-add | WP:PROD | 324
svwiki | mbox:robotskapad-remove | Project:AWB | 250
enwiki | inline:cn-add | WP:RS | 164
enwiki | mbox:article for deletion-remove | WP:XFDC#4.0.16 | 147
enwiki | mbox:orphan-remove | Wikipedia:ORPHAN | 143
enwiki | mbox:article for deletion-remove | WP:XFDC#4.0.16-beta | 114
enwiki | inline:cn-add | WP:UGC | 94
enwiki | mbox:orphan-add | WP:AWB/T | 83
eswiki | mbox:referencias-add | WP:TL | 50
enwiki | inline:cn-add | WP:PLANESPOTTERS | 45
enwiki | mbox:unreferenced-remove | WP:URA | 45
enwiki | mbox:afc submission-add | WP:AFCH | 44
enwiki | inline:cn-add | WP:ICTFSOURCES | 43
eswiki | mbox:destruir-add | WP:Twinkle Lite | 39
enwiki | mbox:unreferenced section-remove | WP:V | 35
enwiki | inline:cn-add | WP:CIRCULAR | 31
enwiki | inline:cn-add | WP:SPS | 30
enwiki | inline:citation needed-add | WP:UGC | 25
enwiki | mbox:uncategorized-add | WP:AWB/T | 22
enwiki | mbox:one source-remove | WP:SORTKEY | 18
eswiki | mbox:referencias adicionales-add | WP:TL | 18
eswiki | mbox:en desarrollo-add | WP:TL | 18
jawiki | inline:要出典-add | Wikipedia:検証可能性#ウィキペディア自身及びウィキペディアの転載サイト | 18
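
One way the linking step behind the sample above could work is sketched here: pull project-namespace wikilinks out of each edit comment and count them per (wiki_db, template_action, policy) triple. The regular expression, the record layout, and the example comments are assumptions made for illustration; the notebook's actual matching rules may differ (for instance, it may also catch unlinked shortcuts or localized namespace names).

```python
import re
from collections import Counter

# Matches wikilinks such as [[WP:RS]] or [[Wikipedia:ORPHAN|orphan]] and
# captures the link target without any display text after the pipe.
POLICY_LINK = re.compile(
    r"\[\[\s*((?:WP|Wikipedia|Project):[^\]\|]+?)\s*(?:\|[^\]]*)?\]\]"
)


def extract_policies(comment):
    """Return the project-namespace link targets found in an edit comment."""
    return [match.group(1) for match in POLICY_LINK.finditer(comment)]


# Hypothetical revisions; the comments are invented for this example.
revisions = [
    {"wiki_db": "enwiki", "template_actions": ["inline:cn-add"],
     "comment": "unreliable source, see [[WP:RS]]"},
    {"wiki_db": "enwiki", "template_actions": ["inline:cn-add"],
     "comment": "per [[WP:RS]] and [[WP:UGC|user-generated content]]"},
    {"wiki_db": "enwiki", "template_actions": ["mbox:orphan-remove"],
     "comment": "de-orphaned ([[Wikipedia:ORPHAN]])"},
    {"wiki_db": "enwiki", "template_actions": ["mbox:orphan-remove"],
     "comment": "no policy mentioned"},
]


def count_links(revisions):
    """Count revisions per (wiki_db, template_action, policy) triple."""
    counts = Counter()
    for rev in revisions:
        # Deduplicate so a policy linked twice in one comment counts once.
        policies = set(extract_policies(rev["comment"]))
        for action in rev["template_actions"]:
            for policy in policies:
                counts[(rev["wiki_db"], action, policy)] += 1
    return counts


counts = count_links(revisions)
for (wiki_db, action, policy), n in counts.most_common():
    print(wiki_db, action, policy, n)
```

Because the link target is kept verbatim, fragments such as tool version anchors stay part of the policy string, which would explain why WP:XFDC#4.0.16 and WP:XFDC#4.0.16-beta appear as separate rows in the sample.
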

If you choose to create a separate task for the paper writing, feel free to resolve this one (thanks!). Otherwise, let's keep it open for the next week and then resolve it when the paper is submitted. I did a quick read-through of the paper as of Friday morning my time. A few thoughts:

  • I know you plan to do some clean-up of the paper so I didn't pay too much attention to particulars. One thing: Footnote 8 about the HTML dataset. It was actually collected from the APIs by Fabian and not from the dumps (because we needed individual revisions, not a snapshot in time). The Enterprise Dumps endpoint is also being deprecated in favor of their Snapshot APIs, so I'd either link there (https://enterprise.wikimedia.com/api/) or to the documentation about Parsoid's APIs (https://www.mediawiki.org/wiki/RESTBase/service_migration#Parsoid_endpoints).
  • My major point would be to motivate, upfront in the Introduction, why maintenance tagging is important to understand (beyond it not receiving much study). A few potential ideas:
    • You highlight usage of the templates within ML research and it might be worth raising that up -- i.e. these templates are a source of labels for training classification models and so it's important to understand how they're used in practice and whether this tagging extends across many language communities. We also provide a more scalable data collection approach that could be used for those ends.
    • These templates are also used as a source of tasks for editors themselves via recommender systems. You could cite SuggestBot as well as Newcomer Tasks.
    • Highlight the importance of having these templates as a pathway separate from reverting content. This offers Wikipedians the ability to flag issues without reverting edits. This is an important alternative remedy given that reverting edits can have a negative impact on newcomer retention (Rise and decline). You might note too though that other work has not shown that tagging necessarily leads to change in the editors who are being flagged: https://dl.acm.org/doi/abs/10.1145/3274406.
    • I think it's worthwhile to mention the similarity between this particular system and the turn to crowd-sourced moderation on X/Facebook. Just something like: "Given the growing interest in crowd-sourced moderation of this style in social media platforms like X (maybe cite something like https://dl.acm.org/doi/abs/10.1145/3686967) or Facebook (not sure if there's a research paper yet but citing a news article could work), it's important to understand how it works on platforms with a long history of community moderation like Wikipedia."

Thanks, @Isaac! I re-submitted a new version of the manuscript addressing your suggestions, along with some minor edits from @cwylo and me, so I will resolve this task.