Q2 FY2025-26 Goal: Generate a list of edit suggestions using machine learning
Open, Needs Triage, Public

Description

Hypothesis

If we prompt readily available GenAI models to generate and rank a set of edit suggestions for a diversified sample of 150 English Wikipedia articles, then we will learn what types of editing tasks these generic models can produce at scale and gain a rough, anecdotal understanding of the usefulness of these suggestions. This early signal will help us assess whether some task types could plausibly be generated at scale with generic models (with or without fine-tuning), or whether they would require more specialized approaches - ultimately helping us validate whether pursuing this "single model many suggestions" direction is worthwhile.

Full project details: Project Doc
Note: This ticket describes an early and learning-focused exploration, not the development of a production feature. This exploration will inform the trajectory of work described in T399611

Deliverables

  1. Spreadsheet of suggestions generated per prompt-model combination
  2. Documentation of process and decisions

Goals

Produce a dataset of tasks generated via an LLM, and gain confidence about:

  • What model(s) are best suited for this kind of workflow
  • What prompt(s) are most effective in generating tasks

Qualitatively evaluate the dataset along the following dimensions:

  • Scale: What kinds of tasks are produced (e.g. copyediting, tone, verifiability, citation formatting, reference reliability, etc.)
  • Scale: How many tasks are produced
  • Viability: Roughly/anecdotally, how good or bad the tasks are
  • Risk: Roughly/anecdotally, what are the risks of this approach

Classify task types based on our qualitative evaluation, starting with the following classes:

  • relatively low-risk tasks that we can generate at scale using this approach (e.g. copyedit)
  • nuanced and specialized tasks that require a dedicated model (e.g. factual inconsistencies)
  • tasks that fall in between, which could use a generic model but would require some fine-tuning on top (e.g. Tone Check)

Reporting format

Progress update on the hypothesis for the week, including if something has shipped:

Any updates on metrics related to this hypothesis (including baseline, target, or actuals, if applicable):

Any emerging blockers or risks:

Any unresolved dependencies:

New lessons from the hypothesis:

Changes to the hypothesis scope or timeline:

Event Timeline

Sucheta-Salgaonkar-WMF renamed this task from Q2 FY2025-26 Goal: to Q2 FY2025-26 Goal: Generate a list of edit suggestions using machine learning. Nov 11 2025, 6:37 PM
Sucheta-Salgaonkar-WMF updated the task description.

@OKarakaya-WMF could you please add a weekly summary for me to pull into Asana?

Reporting format
Progress update on the hypothesis for the week, including if something has shipped:

  • We have updated the article format produced during HTML-to-text conversion.
  • We have re-generated the tables of most frequent edit types after these updates (article-level suggestions).
  • We have generated edit types at the section level. The findings are shared in the scratchpad.
  • We have started creating a dataset for MOS and grammar related edits.
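For readers unfamiliar with the HTML-to-text step mentioned above, the following is a minimal sketch of what such a conversion could look like. It is not the pipeline used in this task: the class name `SectionTextExtractor`, the function `html_to_text`, and the `== Heading ==` output convention are all assumptions made for illustration, and only the Python standard library is used.

```python
from html.parser import HTMLParser


class SectionTextExtractor(HTMLParser):
    """Hypothetical helper: collect visible text and mark section headings."""

    SKIP = {"script", "style"}          # content that is never shown to readers
    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
    BLOCKS = {"p", "li"}                # elements that start a new line

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.HEADINGS:
            self.parts.append("\n\n== ")
        elif tag in self.BLOCKS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.HEADINGS:
            self.parts.append(" ==\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    """Convert an HTML fragment to plain text with one block per line."""
    parser = SectionTextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace inside each line and drop empty lines.
    lines = [" ".join(line.split()) for line in "".join(parser.parts).splitlines()]
    return "\n".join(line for line in lines if line)


sample = "<h2>History</h2><p>Founded in <b>1990</b>.</p><style>x{}</style>"
print(html_to_text(sample))
```

Keeping headings as explicit markers in the text is one way to make section-level prompting possible later, since the text can be split on those markers.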

Any updates on metrics related to this hypothesis (including baseline, target, or actuals, if applicable):

Any emerging blockers or risks:

Any unresolved dependencies:

New lessons from the hypothesis:

Changes to the hypothesis scope or timeline:

  • We have organized the work into three phases: Preparing the edit suggestion types, Evaluation + Iteration, and Data Access. The scope of each phase is described in the Timeline.

Reporting (20/03/2026)

Progress update on the hypothesis for the week, including if something has shipped:

  • We have identified a list of edit suggestion types that we aim to work on. The list draws on frequent edits from users, frequent suggestions from LLMs, and suggestions from colleagues and the community. We will share it with the editing team.
  • Our first evaluation approach relies on user edits as much as possible. We have created a dataset to validate grammar and MOS (Manual of Style) related edit suggestions, and will use it once the corresponding suggestions have been generated.
  • We are working on the second evaluation method, LLM-as-a-judge, and are currently testing it on a small subset of suggestions.
  • We are working on generating edit suggestions based on a pre-defined list of edit suggestion types.
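As an illustration of how LLM-as-a-judge verdicts on a small subset might be compared with human labels, here is a minimal sketch. The label values (`"useful"`, `"not_useful"`) and the function names are assumptions, not part of the actual evaluation setup. It computes raw agreement and Cohen's kappa, which corrects agreement for chance.

```python
from collections import Counter


def agreement_rate(human, judge):
    """Fraction of items where the human label equals the judge label."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human, judge):
    """Chance-corrected agreement between two label sequences."""
    n = len(human)
    p_observed = agreement_rate(human, judge)
    human_counts, judge_counts = Counter(human), Counter(judge)
    labels = set(human) | set(judge)
    # Expected agreement if both raters labeled independently at their own rates.
    p_expected = sum(human_counts[l] * judge_counts[l] for l in labels) / (n * n)
    if p_expected == 1:
        return 1.0  # both raters used a single identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical labels for four suggestions.
human = ["useful", "useful", "not_useful", "not_useful"]
judge = ["useful", "not_useful", "not_useful", "not_useful"]
print(agreement_rate(human, judge), cohens_kappa(human, judge))
```

On very small samples both numbers are noisy, so they are better read as a rough signal than as a stable estimate.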

Any updates on metrics related to this hypothesis (including baseline, target, or actuals, if applicable):

Any emerging blockers or risks:
N/A
Any unresolved dependencies:
N/A
New lessons from the hypothesis:
N/A
Changes to the hypothesis scope or timeline:
N/A

Reporting (27/03/2026)

Progress update on the hypothesis for the week, including if something has shipped:

  • We have generated edit suggestions for the pre-defined MOS and grammar related edit types. Qualitatively, we find useful suggestions on grammar, MOS, and neutrality.
  • We have evaluated the pre-defined suggestions using LLM-as-a-judge. On a small sample of 20 suggestions, the LLM-as-a-judge scores show 62% alignment with human judgments. We believe annotating a larger dataset can increase both the human alignment and our confidence in the LLM-as-a-judge scores.
  • We are working on edit-based evaluation for grammar edit types, and find similarity-based comparison useful. We will expand the edit-based evaluation to calculate precision and recall.
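The edit-based evaluation with similarity comparison could be sketched roughly as follows. This is an assumption-laden illustration rather than the method used here: it uses `difflib.SequenceMatcher` from the Python standard library as the similarity measure, a hypothetical threshold of 0.8, and greedy one-to-one matching between suggested text and the text of real user edits.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def precision_recall(suggested, reference, threshold=0.8):
    """Greedily match each suggestion to its most similar unused reference edit.

    precision = matched suggestions / all suggestions
    recall    = matched reference edits / all reference edits
    """
    unmatched = list(reference)
    matched = 0
    for s in suggested:
        best = max(unmatched, key=lambda r: similarity(s, r), default=None)
        if best is not None and similarity(s, best) >= threshold:
            unmatched.remove(best)  # each reference edit is matched at most once
            matched += 1
    precision = matched / len(suggested) if suggested else 0.0
    recall = matched / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical data: two model suggestions, one real user edit.
suggested = ["The team won its first title in 1998.", "He was born in Paris."]
reference = ["The team won its first title in 1998."]
print(precision_recall(suggested, reference))
```

The threshold controls how close a suggestion must be to a real edit to count as a match, so precision and recall should be reported together with the threshold used.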

Any updates on metrics related to this hypothesis (including baseline, target, or actuals, if applicable):

Any emerging blockers or risks:
N/A
Any unresolved dependencies:
N/A
New lessons from the hypothesis:
N/A
Changes to the hypothesis scope or timeline:
N/A