Hypothesis
If we prompt readily available GenAI models to generate and rank edit suggestions for a diversified sample of 150 English Wikipedia articles, then we will learn what types of editing tasks these generic models can produce at scale and gain a rough, anecdotal sense of how useful those suggestions are. This early signal will help us assess whether some task types could plausibly be generated at scale with generic models (with or without fine-tuning), or whether they would require more specialized approaches, ultimately helping us validate whether the "single model, many suggestions" direction is worth pursuing.
Full project details: Project Doc
Note: This ticket describes an early, learning-focused exploration, not the development of a production feature. This exploration will inform the trajectory of work described in T399611.
Deliverables
- Spreadsheet of suggestions generated per prompt-model combination
- Documentation of process and decisions
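The generation loop behind the spreadsheet deliverable could be sketched roughly as below. This is a minimal, hypothetical illustration, not the actual pipeline: `generate_suggestions`, `build_dataset`, and all model/prompt/article names are placeholders, and the stubbed model call just returns canned suggestions so the sketch runs without an API key.

```python
import csv
import io
from itertools import product

def generate_suggestions(model, prompt, article_title):
    """Hypothetical stand-in for a real GenAI API call.

    A real version would send `prompt` plus the article text to `model`
    and parse ranked suggestions out of the response.
    """
    return [
        {"task_type": "copyedit",
         "suggestion": f"Example edit for {article_title}",
         "rank": 1},
    ]

def build_dataset(models, prompts, articles):
    """Collect one row per suggestion, per prompt-model-article combination."""
    rows = []
    for model, prompt, article in product(models, prompts, articles):
        for s in generate_suggestions(model, prompt, article):
            rows.append({"model": model, "prompt": prompt, "article": article, **s})
    return rows

rows = build_dataset(["model-a", "model-b"], ["prompt-1"], ["Article X", "Article Y"])

# Serialize to CSV for the spreadsheet deliverable.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=rows[0].keys())
writer.writeheader()
writer.writerows(rows)
```

Keeping one row per suggestion (keyed by model, prompt, and article) makes it easy to pivot the resulting spreadsheet by prompt-model combination during the qualitative evaluation.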
Goals
Produce a dataset of tasks generated via an LLM, and gain confidence about:
- What model(s) are best suited for this kind of workflow
- What prompt(s) are most effective in generating tasks
Qualitatively evaluate the dataset along the following dimensions:
- Scale: What kinds of tasks are produced (e.g. copyediting, tone, verifiability, citation formatting, reference reliability)
- Scale: How many tasks are produced
- Viability: Roughly/anecdotally, how good or bad the tasks are
- Risk: Roughly/anecdotally, what are the risks of this approach
Classify task types based on our qualitative evaluation, starting with the following classes:
- Relatively low-risk tasks that we can generate at scale using this approach (e.g. copyedit)
- Nuanced and specialized tasks that require a dedicated model (e.g. factual inconsistencies)
- Tasks that fall in between: those that could use a generic model but would require some kind of fine-tuning on top (e.g. Tone Check)
Reporting format
Progress update on the hypothesis for the week, including whether something has shipped:
Any updates on metrics related to this hypothesis (including baseline, target, or actuals, if applicable):
Any emerging blockers or risks:
Any unresolved dependencies:
New lessons from the hypothesis:
Changes to the hypothesis scope or timeline: