FY25-26 WE3.1.6: If we produce a prototype for in-article Q&A, delivered as a demo interface, then the Reader teams will be able to qualitatively evaluate how the approach performs across different user journeys and surface gaps or opportunities for further iteration.
- Starting with a small dataset of 10 articles (2 per quality class):
[x] Generate questions and answers using at least two LLMs. Answers are only for evaluation, to confirm that the questions are relevant.
[x] Develop a ranking strategy.
[x] Pick at most the top 5 questions/answers per article.
[x] Develop a strategy for correctness checks/evaluation and run on the dataset.
[x] Iterate prompts based on the small-dataset results. Human annotation could be useful.
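The ranking and top-5 selection steps above could be sketched as follows. This is a minimal illustration, not the project's actual ranking strategy: the score fields (`relevance`, `answerability`) and the weights are hypothetical placeholders for whatever signals the chosen LLMs or heuristics produce.

```python
def rank_questions(candidates, top_k=5):
    """Rank candidate Q&A pairs for one article and keep at most top_k.

    Each candidate is a dict with hypothetical score fields in [0, 1];
    the weighting below is illustrative only.
    """
    def score(c):
        # Weighted combination of relevance and answerability.
        return 0.6 * c["relevance"] + 0.4 * c["answerability"]

    # Sort best-first and truncate to the per-article cap.
    return sorted(candidates, key=score, reverse=True)[:top_k]
```

With more than five candidates per article, the lowest-scoring ones are simply dropped, which keeps the per-article output bounded regardless of how many questions the LLMs generate.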
- Enlarging the experiment to a large dataset: a stratified random sample of 500 articles from English Wikipedia:
[x] The sampling method should account for:
1. content length diversity
2. topic diversity
3. content age diversity
4. content quality diversity
[x] Generate questions on the larger dataset using the LLMs selected in the previous iteration.
[ ] Share scores.
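The stratified sampling across the four diversity dimensions could be sketched like this. The bucketed attributes (`length_bucket`, `topic`, `age_bucket`, `quality`) are assumed field names, and the even per-stratum allocation is one simple choice, not necessarily the method actually used.

```python
import random
from collections import defaultdict

def stratified_sample(articles, n=500, seed=0):
    """Draw a stratified random sample of up to n articles.

    Each article is a dict whose bucketed attributes (hypothetical names)
    cover length, topic, age, and quality diversity. Articles are grouped
    by the combined stratum key, then sampled evenly per stratum.
    """
    strata = defaultdict(list)
    for a in articles:
        key = (a["length_bucket"], a["topic"], a["age_bucket"], a["quality"])
        strata[key].append(a)

    rng = random.Random(seed)  # fixed seed for reproducibility
    per_stratum = max(1, n // len(strata))
    sample = []
    for group in strata.values():
        # A stratum smaller than the quota contributes all of its articles.
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample[:n]
```

Even allocation per stratum trades exact proportionality for guaranteed coverage of rare strata (e.g. high-quality or very old articles), which matters when some combinations are scarce.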
- Prototype interface:
[x] Allows people to select an article from the predefined list (shown as a dropdown menu, as a search bar with auto-complete, or something else)
[x] Fetches the top 3 questions (ranked and filtered) for each article
[ ] Shows answers from semantic search.
[x] Displays questions and results in a table
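The prototype's "fetch the top 3 questions per article" step could be reduced to a small lookup-and-rank function like the one below. The `qa_index` shape and the `score` field are assumptions for illustration; the real prototype presumably reads from wherever the ranked/filtered Q&A output is stored.

```python
def top_questions_for_article(title, qa_index, top_k=3):
    """Return the top_k ranked questions for one article.

    qa_index maps article title -> list of already-scored Q&A dicts
    (field names here are hypothetical). Unknown titles yield an
    empty list rather than an error, which suits a dropdown-driven UI.
    """
    candidates = qa_index.get(title, [])
    ranked = sorted(candidates, key=lambda q: q["score"], reverse=True)
    return ranked[:top_k]
```

A dropdown or auto-complete front end would call this per selection and render the returned rows in the results table.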