End goal
Write a report recommending at least one AI model that we can use for further tuning in support of strategic product investments.
Steps
- Collect evaluation data for two or more high-priority AI use-cases (T377423)
- Evaluate the accuracy of three large language models on these use-cases (T377425)
- Build baselines against which to compare the large language models (T380569)
- Stress-test our infrastructure to understand which models we can host internally (T377496)
- Write a final report
Scope for this task
- Consider use-cases that are high-priority based on the AI Strategy work (T340693). These include:
  - Automatic Article Categorization
  - NPOV (Neutral Point of View) detection
  - Peacock Detection
- Test on at least 10 languages other than English
- Test on a variety of Wikipedia article topics
- Test model performance on the existing internal infrastructure and model accuracy on external services
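Since accuracy will be reported per use-case and per language, a minimal sketch of that bookkeeping may be useful. The snippet below is a hypothetical illustration, not code from the llm_evaluation repo: `accuracy_by_language`, `predict`, and the toy data are all assumed names, and a real run would substitute actual model calls and annotated datasets.

```python
from collections import defaultdict

def accuracy_by_language(examples, predict):
    """Per-language accuracy for a binary classification use-case
    (e.g. peacock detection). `examples` is a list of dicts with
    "lang", "text", and "label" keys; `predict` is any callable
    mapping text -> predicted label. Both names are hypothetical."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        total[ex["lang"]] += 1
        if predict(ex["text"]) == ex["label"]:
            correct[ex["lang"]] += 1
    return {lang: correct[lang] / total[lang] for lang in total}

# Toy stand-in for a model: flags text containing one peacock term.
def toy_predict(text):
    return "legendary" in text.lower()

data = [
    {"lang": "en", "text": "A legendary performer", "label": True},
    {"lang": "en", "text": "Born in 1950", "label": False},
    {"lang": "es", "text": "Un cantante famoso", "label": True},
]
print(accuracy_by_language(data, toy_predict))  # {'en': 1.0, 'es': 0.0}
```

The same loop generalizes to the multi-language requirement above: with at least 10 non-English languages in the test set, the per-language breakdown makes it visible when a model's accuracy holds up only for English.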
Desired output
By delivery time, we should be able to recommend which AI model(s) we can use in upcoming AI product features, based on accuracy and performance experiments.
Relevant code and notebooks will be made available in the llm_evaluation GitLab repository.
Confirmed list of direct contributors
@Miriam as the hypothesis owner
@diego as the delegate
@Aitolkyn will create datasets and baselines
@Trokhymovych will run the evaluation steps
@fkaelin and @MunizaA will work on evaluating performance and infrastructure constraints
Confirmed list of folks available for consulting
@MGerlach and @Isaac on models/data
@isarantopoulos @AikoChou and/or @klausman on ML infrastructure
@Aroraakhil on experimental protocol