###__**Hypothesis**__
If we prompt readily available GenAI models to generate and rank/score a set of edit suggestions for a diversified set of 300 English Wikipedia articles, we will learn which kinds of editing tasks (e.g. copyediting, tone, verifiability, reference formatting, reference reliability) a generic model can produce suggestions for, and we will enable a deeper evaluation of suggestion quality across various subsets of articles.
Full project details: [[ https://docs.google.com/document/d/19tOyArAzCrSbLIiOKRJFwFWc9E9VYTvaQSfEryMNh4k/edit?usp=sharing | Project Doc ]]
Note: This ticket describes an early, learning-focused exploration, not the development of a production feature. This exploration will inform the trajectory of the work described in T399611.
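As a rough illustration of the generate-and-rank step in the hypothesis, the sketch below builds a per-task prompt and ranks a model's suggestions by its self-reported score. The task-type names, prompt wording, and JSON reply shape are all assumptions for illustration, not the project's actual prompts; the model call itself is stubbed with a canned reply.

```python
import json

# Hypothetical task types drawn from the examples in the hypothesis (assumption).
TASK_TYPES = ["copyediting", "tone", "verifiability", "reference formatting"]

def build_prompt(article_title: str, article_text: str, task_type: str) -> str:
    """Assemble a generation prompt for one article/task-type pair.

    The instruction wording and JSON schema here are illustrative assumptions.
    """
    return (
        f"Suggest edits of type '{task_type}' for the English Wikipedia "
        f"article '{article_title}'. Reply with a JSON list of objects with "
        f"keys 'suggestion' and 'score' (0-1).\n\n"
        f"Article text:\n{article_text}"
    )

def rank_suggestions(model_output: str) -> list[dict]:
    """Parse the model's JSON reply and rank suggestions, highest score first."""
    suggestions = json.loads(model_output)
    return sorted(suggestions, key=lambda s: s["score"], reverse=True)

# Canned model reply standing in for a real API call (assumption).
reply = (
    '[{"suggestion": "Fix comma splice in lead", "score": 0.9},'
    ' {"suggestion": "Neutralize promotional wording", "score": 0.6}]'
)
ranked = rank_suggestions(reply)
```

In practice each prompt–model combination would be sent to a real model endpoint and the ranked output written to the deliverable spreadsheet, one row per suggestion.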
###__**Deliverables**__
# Spreadsheet of suggestions generated per prompt-model combination
# Documentation
## Why we selected the models we used
## How we arrived at the prompts we used
## What articles we used and why
# (Optional) Demo interface in VisualEditor (VE)
###__**Reporting format**__
Progress update on the hypothesis for the week, including whether something has shipped:
-
Any updates on metrics related to this hypothesis (including baseline, target, or actuals, if applicable):
-
Any emerging blockers or risks:
-
Any unresolved dependencies:
-
New lessons from the hypothesis:
-
Changes to the hypothesis scope or timeline:
-