
Q2 FY2025-26 Goal: Generate a list of edit suggestions using machine learning
Open, Needs Triage, Public

Description

Hypothesis

If we prompt readily available GenAI models to generate and rank a set of edit suggestions for a diversified sample of 150 English Wikipedia articles, then we will learn what types of editing tasks these generic models can produce at scale and gain a rough, anecdotal understanding of how useful these suggestions are. This early signal will help us assess whether some task types could plausibly be generated at scale with generic models (with or without fine-tuning) or would require more specialized approaches, ultimately helping us validate whether the "single model, many suggestions" direction is worth pursuing.
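The workflow the hypothesis describes can be sketched roughly as follows. This is a minimal illustration, not the project's actual harness: `call_model` is a hypothetical stand-in for a real LLM API client, and the prompts, model names, and article text are placeholder values.

```python
# Sketch of the "single model, many suggestions" workflow: for each
# (model, prompt, article) combination, ask a generic LLM for ranked
# edit suggestions and collect the results into one flat table that
# can be exported as the deliverable spreadsheet.
import csv
import io

PROMPTS = {"copyedit": "List copyedit suggestions for this article."}
MODELS = ["generic-model-a"]  # placeholder model name
ARTICLES = {"Example article": "Teh quick brown fox jumps over the lazy dog."}

def call_model(model: str, prompt: str, text: str) -> list[str]:
    """Placeholder for a real LLM API call; returns canned suggestions."""
    return ["Fix 'Teh' -> 'The'"]

def generate_rows():
    """Cross every model, prompt, and article; keep the model's ranking."""
    rows = []
    for model in MODELS:
        for prompt_name, prompt in PROMPTS.items():
            for title, text in ARTICLES.items():
                suggestions = call_model(model, prompt, text)
                for rank, suggestion in enumerate(suggestions, start=1):
                    rows.append({"article": title, "model": model,
                                 "prompt": prompt_name, "rank": rank,
                                 "suggestion": suggestion})
    return rows

def to_csv(rows) -> str:
    """Serialize the rows to CSV for the suggestions spreadsheet."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["article", "model", "prompt", "rank", "suggestion"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

One flat table keyed on article, model, and prompt makes it easy to compare the same article across prompt-model combinations during the qualitative evaluation.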

Full project details: Project Doc
Note: This ticket describes an early, learning-focused exploration, not the development of a production feature. This exploration will inform the trajectory of work described in T399611.

Deliverables

  1. Spreadsheet of suggestions generated per prompt-model combination
  2. Documentation of process and decisions

Goals

Produce a dataset of tasks generated via an LLM, and gain confidence about:

  • What model(s) are best suited for this kind of workflow
  • What prompt(s) are most effective in generating tasks

Qualitatively evaluate the dataset along the following dimensions:

  • Scale: What kinds of tasks are produced (e.g. copyediting, tone, verifiability, citation formatting, reference reliability)
  • Scale: How many tasks are produced
  • Viability: Roughly/anecdotally, how good or bad the tasks are
  • Risk: Roughly/anecdotally, what are the risks of this approach

Classify task types based on our qualitative evaluation, starting with the following classes:

  • Relatively low-risk tasks that we can generate at scale using this approach (e.g. copyedit)
  • Nuanced, specialized tasks that require a dedicated model (e.g. factual inconsistencies)
  • Tasks that fall in between, which could use a generic model but would require some kind of fine-tuning on top (e.g. Tone Check)
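The three-class scheme above could be captured in code along these lines. The class assignments here are only the examples from the list, not conclusions from the evaluation, and the names are hypothetical.

```python
# Hypothetical encoding of the three task classes; assignments mirror
# the examples in the list above and would be revised as the
# qualitative evaluation produces real findings.
GENERIC_AT_SCALE = "generic model, low risk"
NEEDS_FINE_TUNING = "generic model + fine-tuning"
NEEDS_DEDICATED_MODEL = "dedicated model"

TASK_CLASS = {
    "copyedit": GENERIC_AT_SCALE,
    "tone": NEEDS_FINE_TUNING,                    # cf. Tone Check
    "factual_inconsistency": NEEDS_DEDICATED_MODEL,
}

def classify(task_type: str) -> str:
    """Look up a task type; unseen types default to the most cautious class."""
    return TASK_CLASS.get(task_type, NEEDS_DEDICATED_MODEL)
```

Defaulting unknown task types to the most cautious class keeps unclassified suggestions out of the at-scale bucket until they have been evaluated.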

Reporting format

Progress update on the hypothesis for the week, including whether something has shipped:

Any updates on metrics related to this hypothesis (including baseline, target, or actuals, if applicable):

Any emerging blockers or risks:

Any unresolved dependencies:

New lessons from the hypothesis:

Changes to the hypothesis scope or timeline: