
[SDS 1.2.1 B] Test existing AI models for internal use-cases
Closed, Resolved · Public

Description

End goal
Write a report recommending at least one AI model that we can use for further tuning towards strategic product investments

Steps

  • Collect Evaluation Data for 2 or more high-priority AI use-cases T377423
  • Evaluate the accuracy of 3 large language models on said use-cases T377425
  • Build baselines to compare large language models T380569
  • Stress-test our infrastructure to understand what models we can host internally T377496
  • Write a final report
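The accuracy-evaluation step above can be sketched as a small harness that scores each candidate model on the same labeled examples. This is an illustrative sketch only: the `predict` callables, the toy data, and the binary labels are assumptions standing in for the real LLM clients and evaluation datasets from T377423.

```python
# Hypothetical sketch: comparing model predictions against labeled
# evaluation data for a binary use-case such as NPOV or peacock detection.
# Each entry in `models` is a stand-in for an LLM client under test.

def accuracy(predictions, labels):
    """Fraction of predictions that match the gold labels."""
    assert len(predictions) == len(labels)
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(labels)

def evaluate(models, eval_data):
    """Score each candidate model on the same labeled examples.

    eval_data: list of (text, label) pairs; models: name -> predict fn.
    """
    labels = [label for _, label in eval_data]
    results = {}
    for name, predict in models.items():
        preds = [predict(text) for text, _ in eval_data]
        results[name] = accuracy(preds, labels)
    return results

# Toy usage with stub "models" (illustrative only):
data = [("neutral text", 0), ("peacock text", 1), ("plain text", 0)]
models = {
    "always_zero": lambda t: 0,
    "keyword": lambda t: int("peacock" in t),
}
print(evaluate(models, data))  # the keyword stub scores 1.0 on this toy data
```

Running every model over identical examples keeps the comparison fair; per-language and per-topic breakdowns would be layered on top of the same loop.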

Scope for this task

  • Consider use-cases that are high-priority based on the AI Strategy work T340693. These include
    1. Automatic Article Categorization
    2. NPOV detection
    3. Peacock Detection
  • Test on at least 10 languages other than English
  • Test on a variety of Wikipedia article topics
  • Test model performance on the existing internal infrastructure, and model accuracy on external services.
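The performance half of the last scope item can be sketched as a latency benchmark against a hosted model endpoint. Everything here is an assumption for illustration: `call_model` stands in for the actual internal-infrastructure or external-service client, and the percentile choice is arbitrary.

```python
# Hypothetical sketch: measure request latency for a model endpoint.
# `call_model` is a stand-in for the real client (internal hosting or
# an external API); only the timing logic is shown.
import time
import statistics

def benchmark(call_model, prompts, warmup=1):
    """Return median and ~p95 latency in seconds over the prompts."""
    for p in prompts[:warmup]:
        call_model(p)  # warm up connections / caches before timing
    latencies = []
    for p in prompts:
        start = time.perf_counter()
        call_model(p)
        latencies.append(time.perf_counter() - start)
    latencies.sort()
    p95 = latencies[min(len(latencies) - 1, int(0.95 * len(latencies)))]
    return {"median": statistics.median(latencies), "p95": p95}
```

Reporting a tail percentile alongside the median matters for stress-testing, since hosting feasibility is usually decided by worst-case rather than average latency.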

Desired output
At delivery time, we should be able to recommend which AI model(s) we can use in upcoming product AI features, based on accuracy and performance experiments.
Relevant code and notebooks will be made available in the llm_evaluation gitlab repo.

Confirmed list of direct contributors
@Miriam as the hypothesis owner
@diego as the delegate
@Aitolkyn will create datasets and baselines
@Trokhymovych will run the evaluation steps
@fkaelin and @MunizaA will work on evaluating performance and infrastructure constraints

Confirmed list of folks available for consulting
@MGerlach and @Isaac on models/data
@isarantopoulos @AikoChou and/or @klausman on ML infrastructure
@Aroraakhil on experimental protocol

Details

Due Date
Jan 3 2025, 12:00 AM

Related Objects

Event Timeline

leila triaged this task as High priority.Oct 17 2024, 3:21 PM

Weekly updates:

  • We defined the languages we are hoping to target: the 23 languages in the AYA23 model (see table below, taken from the paper). Note that we might need to reduce the number of languages if resources are limited. In this case, we will use the taxonomy of NLP language resources to define a sample of languages.
| Code | Language | Script | Family | Subgrouping | Native speakers |
|------|----------|--------|--------|-------------|-----------------|
| ar | Arabic | Arabic | Afro-Asiatic | Semitic | 380 million |
| cs | Czech | Latin | Indo-European | Balto-Slavic | 10.7 million |
| de | German | Latin | Indo-European | Germanic | 95 million |
| el | Greek | Greek | Indo-European | Graeco-Phrygian | 13.5 million |
| en | English | Latin | Indo-European | Germanic | 500 million |
| es | Spanish | Latin | Indo-European | Italic | 500 million |
| fa | Persian | Arabic | Indo-European | Iranian | 72 million |
| fr | French | Latin | Indo-European | Italic | 74 million |
| he | Hebrew | Hebrew | Afro-Asiatic | Semitic | 5 million |
| hi | Hindi | Devanagari | Indo-European | Indo-Aryan | 350 million |
| id | Indonesian | Latin | Austronesian | Malayo-Polynesian | 43 million |
| it | Italian | Latin | Indo-European | Italic | 65 million |
| jp | Japanese | Japanese | Japonic | Japanesic | 120 million |
| ko | Korean | Hangul | Koreanic | Korean | 81 million |
| nl | Dutch | Latin | Indo-European | Germanic | 25 million |
| pl | Polish | Latin | Indo-European | Balto-Slavic | 40 million |
| pt | Portuguese | Latin | Indo-European | Italic | 230 million |
| ro | Romanian | Latin | Indo-European | Italic | 25 million |
| ru | Russian | Cyrillic | Indo-European | Balto-Slavic | 150 million |
| tr | Turkish | Latin | Turkic | Common Turkic | 84 million |
| uk | Ukrainian | Cyrillic | Indo-European | Balto-Slavic | 33 million |
| vi | Vietnamese | Latin | Austroasiatic | Vietic | 85 million |
| zh | Chinese | Han & Hant | Sino-Tibetan | Sinitic | 1.35 billion |
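If the language set does need to be reduced, the selection could be done reproducibly from the table's codes. A minimal sketch, assuming we always keep English and pick the rest at random with a fixed seed (the real plan above uses the taxonomy of NLP language resources, which this does not implement):

```python
# The 23 AYA23 language codes from the table above.
import random

AYA23_LANGS = [
    "ar", "cs", "de", "el", "en", "es", "fa", "fr", "he", "hi", "id",
    "it", "jp", "ko", "nl", "pl", "pt", "ro", "ru", "tr", "uk", "vi", "zh",
]

def sample_languages(langs, k, seed=0):
    """Deterministically pick k target languages, always keeping English."""
    rest = [l for l in langs if l != "en"]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return ["en"] + sorted(rng.sample(rest, k - 1))

print(sample_languages(AYA23_LANGS, 11))  # English plus 10 other languages
```

A `k` of 11 matches the scope requirement of testing on at least 10 languages other than English.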
  • We defined and started working on the work streams needed to prove the hypothesis:
    • Collect evaluation data (T377423)
      • Eval data for Task 1 is already collected and available at aitolkyn/ai_use_cases/categories/sample_articles/seed_articles
    • Evaluate accuracy of AI models (T377425)
      • We drafted an experiment book for Article Categorization, available here
    • Evaluate infrastructure constraints for AI models (T377496)
      • Starting next week.
  • Relevant code and notebooks will be made available in the llm_evaluation gitlab repo.
leila set Due Date to Dec 18 2024, 12:00 AM.Nov 5 2024, 6:06 PM
leila subscribed.

context for the due date: Miriam will prepare the report by December 4th. There will likely be a need for iterations, which @diego will pick up at that point as Miriam's delegate. Diego will bring the report to the finish line by the due date as relevant.

leila changed Due Date from Dec 18 2024, 12:00 AM to Jan 3 2025, 12:00 AM.

(Moving to In progress b/c we're closing the quarterly lane today. Re-assigning to the new task owner. Updating the deadline per what we agreed outside of the task. Please resolve when done with link to output and other relevant info. Thanks.)

diego updated the task description. (Show Details)