Q1 FY2025-26 Goal: Task generation engine for Revise Tone task
Open, Needs Triage · Public

Description

Hypothesis

If we develop a task generation engine for the Revise Tone structured task, integrate our recent learnings about which content to include or filter out, and provide pipelines that automatically refresh the task list, we'll enable a qualitative evaluation of the tasks generated and an A/B experiment that tests whether this type of task helps newcomer editors to make more constructive edits.

Scoping details

Use case:

This model will support a new Suggested Edit task that invites contributors, especially newcomers, to improve the neutrality of existing Wikipedia articles by identifying and rewriting biased, promotional, or peacock language. The intended audience includes users engaging with Suggested Edits via the Newcomer Homepage. The model’s outputs will be surfaced as highlighted sentences or paragraphs within articles, accompanied by calls to action encouraging users to revise them to align with Wikipedia's neutral point of view (NPOV) policy.

This task explores the broader hypothesis that Edit Checks and Suggested Edits can share underlying detection logic. If successful, this approach could improve efficiency, consistency, and scalability across structured editing workflows.

Related tasks:

Model purpose:

The model should analyze article content and detect instances of biased tone or peacock language at a sentence or paragraph level. These detections will inform Suggested Edits, guiding contributors to revise non-neutral phrasing.

Goal:

This project aims to improve article quality by encouraging neutral, policy-aligned contributions. Specific goals include:

  • Increasing the number of constructive Suggested Edits
  • Reducing the burden on moderators by proactively addressing biased language
  • Supporting newcomers in learning and applying Wikipedia’s NPOV guidelines

Key success metrics include:

  • Accuracy of model detections (precision/recall)
  • Revert rate and/or qualitative review of resulting edits
  • Completion rate of "neutral tone" Suggested Edits
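As a rough illustration of the precision/recall metric, both values can be computed from a hand-labelled sample of detections. The sketch below is only an assumption about how such an evaluation might be set up; the function name and the sample labels are hypothetical and are not project data.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for two equal-length lists of booleans.

    predicted[i] is True when the model flagged item i.
    actual[i] is True when a reviewer judged item i to have a tone issue.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    pairs = list(zip(predicted, actual))
    true_pos = sum(1 for p, a in pairs if p and a)
    false_pos = sum(1 for p, a in pairs if p and not a)
    false_neg = sum(1 for p, a in pairs if not p and a)
    # Guard against division by zero when nothing was flagged or nothing is positive.
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall


# Hypothetical labels for five paragraphs (illustrative only).
flagged = [True, True, False, True, False]
has_issue = [True, False, False, True, True]
precision, recall = precision_recall(flagged, has_issue)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A low false positive rate corresponds to high precision, so a review sample like this would mainly be watching the first number.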
Prior art:

This project builds on that work by adapting UX for existing Suggested Edits and Edit Checks:

Prioritization details

Timing:

We are hoping to run an experiment in November 2025.

KR impact:

FY25/26 WE1.1 KR:
Increasing newcomer constructive activation and retention:

Increase constructive edits [i] by X% for editors with less than 100 cumulative edits, as measured by experiments by the end of Q2.
i. "Constructive edits" = edits that are not reverted within 48 hours of being published
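The definition of a constructive edit can be written as a small predicate. This is a minimal sketch, not the measurement code used for the KR: the function name is hypothetical, and treating a revert at exactly 48 hours as falling inside the window is an assumption the definition does not settle.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# The 48-hour window comes from the KR definition of "constructive edits".
REVERT_WINDOW = timedelta(hours=48)


def is_constructive(published_at: datetime, reverted_at: Optional[datetime]) -> bool:
    """Return True if the edit was not reverted within 48 hours of publication."""
    if reverted_at is None:
        return True  # never reverted
    # Assumption: a revert at exactly 48 hours still counts as "within" the window.
    return reverted_at - published_at > REVERT_WINDOW


published = datetime(2025, 11, 1, 12, 0, tzinfo=timezone.utc)
print(is_constructive(published, None))
print(is_constructive(published, published + timedelta(hours=47)))
print(is_constructive(published, published + timedelta(hours=49)))
```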

Other comments

Model requirements:

  • Detection should be precise enough (sentence or paragraph level) to support actionable user suggestions
  • Low false positive rate is essential to maintain user trust and minimize disruption
  • Ideally, the suggestion queue should be built to allow for Community Configuration (e.g., ability for admins to define rules to exclude certain pages, sections, or words), which would improve usefulness and community adoption
  • The model should be efficient and scalable for use across many articles and languages
  • The model should ideally exclude suggestions that target direct quotes, as peacock language or non-neutral tone may be appropriate in these contexts (e.g., when quoting historical texts, public statements, or notable quotations).
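The exclusion requirements above could be applied as a filter in front of the suggestion queue. The sketch below is one possible shape under stated assumptions: `ExclusionConfig`, `should_suggest`, and the quote-detection regex are all hypothetical, and real Community Configuration settings may look quite different.

```python
import re
from dataclasses import dataclass, field

# Matches text wrapped in straight or curly double quotes (a crude heuristic).
QUOTE_RE = re.compile(r'"[^"]+"|“[^”]+”')


@dataclass
class ExclusionConfig:
    """Hypothetical community-configurable exclusion rules."""
    excluded_pages: set = field(default_factory=set)
    excluded_sections: set = field(default_factory=set)
    excluded_words: set = field(default_factory=set)


def should_suggest(page: str, section: str, text: str, config: ExclusionConfig) -> bool:
    """Return True if a flagged passage may be surfaced as a suggestion."""
    if page in config.excluded_pages or section in config.excluded_sections:
        return False
    # Whole-word, case-insensitive match against the excluded word list.
    for word in config.excluded_words:
        if re.search(r"\b" + re.escape(word) + r"\b", text, re.IGNORECASE):
            return False
    # Skip passages that contain a direct quotation.
    if QUOTE_RE.search(text):
        return False
    return True


config = ExclusionConfig(
    excluded_pages={"Main Page"},
    excluded_sections={"Reception"},
    excluded_words={"legendary"},
)
print(should_suggest("Example", "History", "The band is widely adored.", config))
print(should_suggest("Example", "History", 'Critics called it "a stunning triumph".', config))
```

Dropping any passage that contains a quotation is deliberately conservative: it trades some recall for fewer false positives, which matches the low false positive requirement.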

Reporting format

Progress update on the hypothesis for the week, including if something has shipped:

Any updates on metrics related to this hypothesis (including baseline, target, or actuals, if applicable):

Any emerging blockers or risks:

Any unresolved dependencies:

New lessons from the hypothesis:

Changes to the hypothesis scope or timeline:

Event Timeline

Weekly Report

Progress update on the hypothesis for the week, including if something has shipped:

  • Cassandra design review with Data Persistence T401021
  • Started working on Revise Tone Task Generator (update-pipeline) in LiftWing T408538 and initial task generation T408533

Any updates on metrics related to this hypothesis (including baseline, target, or actuals, if applicable):

  • N/A

Any emerging blockers or risks:

  • N/A

Any unresolved dependencies:

New lessons from the hypothesis:

  • N/A

Changes to the hypothesis scope or timeline:

  • We'll do a small ingestion for testwiki using mock articles once Cassandra and Data Gateway are ready. This will allow Growth to build and test their implementation.
  • The initial ingestion of production data will happen when the update-pipeline in Lift Wing is ready, so that there is no gap where old revisions are not purged.
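To illustrate why the update pipeline needs to be in place before production data is loaded, here is a minimal sketch of "latest revision wins" storage, where each update replaces whatever was stored for an older revision of the same page. It uses an in-memory dict as a stand-in for Cassandra; the class, method names, and row layout are hypothetical and do not reflect the actual schema.

```python
class ToneTaskStore:
    """In-memory stand-in for the real task storage (hypothetical)."""

    def __init__(self):
        # (wiki, page_id) -> (rev_id, flagged paragraphs for that revision)
        self._rows = {}

    def apply_update(self, wiki, page_id, rev_id, flagged_paragraphs):
        """Store results for a revision, purging results from older revisions."""
        key = (wiki, page_id)
        current = self._rows.get(key)
        if current is not None and current[0] >= rev_id:
            return  # stale or duplicate event: keep the newer data
        # Overwriting the row drops the previous revision's suggestions.
        self._rows[key] = (rev_id, list(flagged_paragraphs))

    def tasks_for(self, wiki, page_id):
        """Return flagged paragraphs for the latest known revision."""
        row = self._rows.get((wiki, page_id))
        return row[1] if row else []


store = ToneTaskStore()
store.apply_update("testwiki", 1, 100, ["A truly amazing city."])
store.apply_update("testwiki", 1, 101, [])  # the issue was fixed in a later edit
print(store.tasks_for("testwiki", 1))
```

Without the update step, the row written for revision 100 would stay in place after the article was fixed, which is the kind of gap the ordering above avoids.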

Weekly Report

Progress update on the hypothesis for the week, including if something has shipped:

  • Cassandra + Data Gateway are ready T401021#11341970
  • Loaded mock data to Staging Cassandra and Search weighted tags, allowing Growth to build and test their implementation T401021#11349324
  • Started working on the Cassandra <--> LiftWing integration T409414
  • Continued work on Initial task generation T408533
  • Continued work on Revise Tone Task Generator T408538

Any updates on metrics related to this hypothesis (including baseline, target, or actuals, if applicable):

  • Growth confirmed that Japanese Wikipedia is not being considered. They're still deciding between English and Portuguese Wikipedia

Any emerging blockers or risks:

  • We realized ChangeProp currently cannot consume mediawiki.page_content_change.v1, which is required by the Revise Tone Task Generator
    • Created T409469 to address this issue

Any unresolved dependencies:

  • N/A

New lessons from the hypothesis:

  • N/A

Changes to the hypothesis scope or timeline:

  • N/A

Weekly Report

Progress update on the hypothesis for the week, including if something has shipped:

  • Initial task generation T408533
    • Testwiki and frwiki datasets are ready
  • Revise Tone Task Generator T408538
    • Deployed on experimental ml-staging
    • Debugging and testing the Cassandra connection
    • Setting up the script for initial ingestion
  • Cassandra <-> LiftWing integration T409414
    • Cassandra role & grants for Lift Wing T409850
    • Pending AQS/Cassandra/ferm: Add ML k8s cluster pod IPs to client list (patch)

Any updates on metrics related to this hypothesis (including baseline, target, or actuals, if applicable):

  • N/A

Any emerging blockers or risks:

  • N/A

Any unresolved dependencies:

New lessons from the hypothesis:

  • N/A

Changes to the hypothesis scope or timeline:

  • We will not meet the November 14th deadline due to ongoing Cassandra and LiftWing integration work and the ChangeProp dependency.

Weekly Report

Progress update on the hypothesis for the week, including if something has shipped:

  • Cassandra <-> LiftWing connection is working
  • Full workflow (Cassandra + LiftWing + ChangeProp) tested in staging and works as expected
  • Initial dataset ready for pilot wikis (en, fr, ar, pt)
  • Two remaining tasks:
    • Finishing the script for initial ingestion via LiftWing
    • Finishing the paragraph extraction code for pilot wikis
  • We should be ready to move to production and run initial ingestion early next week.

Any updates on metrics related to this hypothesis (including baseline, target, or actuals, if applicable):

  • N/A

Any emerging blockers or risks:

  • N/A

Any unresolved dependencies:

  • Pod-to-pod communication on LiftWing for topic filtering isn't solved yet, but this does not block the move to production. Details in T408538#11394844

New lessons from the hypothesis:

  • N/A

Changes to the hypothesis scope or timeline:

  • N/A

Weekly Report

Progress update on the hypothesis for the week, including if something has shipped:

  • We shipped the pilot wikis (en, fr, ar, pt) to production!
    • This includes the initial ingestion and update pipeline
    • Articles with tone issues can be found via search on enwiki, frwiki, arwiki, and ptwiki
    • The number of available "hasrecommendation:tone" results increases in real time as tone issues are detected in new edits to articles from the selected topics
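For anyone who wants to check the results programmatically, the same "hasrecommendation:tone" query can be sent through the MediaWiki search API. The sketch below only builds the request URL and makes no network call; the helper name and the wiki-to-language mapping are assumptions for illustration.

```python
from urllib.parse import urlencode

# Pilot wikis from the report, mapped to their Wikipedia language subdomains.
PILOT_WIKIS = {"enwiki": "en", "frwiki": "fr", "arwiki": "ar", "ptwiki": "pt"}


def tone_search_url(wiki: str, limit: int = 10) -> str:
    """Build a MediaWiki API search URL for articles tagged with tone issues."""
    lang = PILOT_WIKIS[wiki]  # raises KeyError for wikis outside the pilot
    params = {
        "action": "query",
        "list": "search",
        "srsearch": "hasrecommendation:tone",
        "srlimit": limit,
        "format": "json",
    }
    return f"https://{lang}.wikipedia.org/w/api.php?{urlencode(params)}"


for wiki in PILOT_WIKIS:
    print(tone_search_url(wiki))
```

In the JSON response, the `query.searchinfo.totalhits` field gives the number of matching articles, which is the count that should grow as new tone issues are detected.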

Any updates on metrics related to this hypothesis (including baseline, target, or actuals, if applicable):

  • N/A

Any emerging blockers or risks:

  • N/A

Any unresolved dependencies:

  • N/A

New lessons from the hypothesis:

  • N/A

Changes to the hypothesis scope or timeline:

  • N/A