FY25-26 WE3.1.6: If we produce a prototype for in-article Q&A, delivered as a demo interface, then the Reader teams will be able to qualitatively evaluate how the approach performs across different user journeys and surface gaps or opportunities for further iteration.
- Start with a small dataset of 10 articles (2 per quality class):
- Generate questions/answers using at least two LLMs. Answers are used only for evaluation and for verifying that the questions are relevant.
- Develop a ranking strategy.
- Keep at most the top 5 question/answer pairs per article.
- Develop a strategy for correctness checks/evaluation and run on the dataset.
- Iterate prompts based on the small-dataset results. Human annotation could be useful.
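The small-dataset steps above could be sketched as a single pipeline: pool candidate question/answer pairs from two models, deduplicate, rank, and keep at most the top 5. This is a minimal illustration, not the project's actual implementation; the `generate_with_model_*` functions are stubs standing in for real LLM calls, and the term-overlap ranking is just one plausible relevance proxy.

```python
def generate_with_model_a(article_text):
    # Stub: a real implementation would call an LLM API here.
    return [("What is the main topic?", "…"), ("When was it founded?", "…")]

def generate_with_model_b(article_text):
    # Stub for the second LLM; using two models diversifies candidates.
    return [("What is the main topic?", "…"), ("Who founded it?", "…")]

def term_overlap(question, article_text):
    # Simple relevance proxy: fraction of question terms found in the article.
    q_terms = set(question.lower().split())
    a_terms = set(article_text.lower().split())
    return len(q_terms & a_terms) / max(len(q_terms), 1)

def top_questions(article_text, k=5):
    candidates = generate_with_model_a(article_text) + generate_with_model_b(article_text)
    # Deduplicate identical questions, keeping the first answer seen.
    seen, unique = set(), []
    for q, a in candidates:
        if q not in seen:
            seen.add(q)
            unique.append((q, a))
    # Rank by the relevance proxy and keep at most k pairs.
    ranked = sorted(unique, key=lambda qa: term_overlap(qa[0], article_text), reverse=True)
    return ranked[:k]
```

A real ranking strategy would likely combine model agreement, answerability from the article text, and the correctness checks described above, but the pool-dedupe-rank-truncate shape stays the same.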
- Scale the experiment to a larger dataset: a stratified random sample of 500 articles from English Wikipedia:
- Sampling method should account for:
- content length diversity
- topic diversity
- content age diversity
- content quality diversity
- Generate questions on the larger dataset using the LLMs selected in the previous iteration.
- Share scores.
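One way to realize the stratified sample sketched above: bucket articles by a stratum key (e.g. a tuple of length bucket, topic, age bucket, and quality class) and draw evenly from each bucket. This is a hedged sketch under assumed data shapes; the function and field names are illustrative, not from the source.

```python
import random
from collections import defaultdict

def stratified_sample(articles, key, n_total, seed=0):
    """Sample n_total articles, allocating evenly across strata.

    `key` maps an article to a stratum label, e.g. a
    (length_bucket, topic, age_bucket, quality_class) tuple.
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    strata = defaultdict(list)
    for a in articles:
        strata[key(a)].append(a)
    per_stratum = max(n_total // len(strata), 1)
    sample = []
    for members in strata.values():
        rng.shuffle(members)
        sample.extend(members[:per_stratum])
    return sample[:n_total]
```

Equal allocation per stratum is one choice; proportional allocation is another, depending on whether the goal is coverage of rare strata or fidelity to the overall distribution.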
- Prototype interface:
- Allows people to select an article from the predefined list (shown as a dropdown menu, as a search bar with auto-complete, or something else)
- Fetches the top 3 questions (ranked and filtered) for each article
- Shows answers retrieved via semantic search; for consistency, answers are loaded via a button in the prototype UI.
- Displays questions and results in a table
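The prototype's data flow could look like the sketch below: questions are precomputed and ranked offline, the UI fetches the top 3 for the selected article, and the results are rendered as a table. All names here (`QUESTION_STORE`, `fetch_top_questions`, the sample scores) are hypothetical placeholders, not the actual prototype code.

```python
# Precomputed (question, rank score) pairs per article, assumed to be
# produced offline by the generation and ranking pipeline.
QUESTION_STORE = {
    "Alan Turing": [
        ("Who was Alan Turing?", 0.92),
        ("What is the Turing test?", 0.88),
        ("Where did Turing work during WWII?", 0.81),
        ("What is a Turing machine?", 0.77),
    ],
}

def fetch_top_questions(title, k=3):
    # Return the k highest-ranked questions for the selected article.
    ranked = sorted(QUESTION_STORE.get(title, []), key=lambda q: q[1], reverse=True)
    return [q for q, _ in ranked[:k]]

def render_table(title):
    # Plain-text stand-in for the prototype's results table.
    rows = fetch_top_questions(title)
    header = f"{'#':<3}Question"
    lines = [header] + [f"{i:<3}{q}" for i, q in enumerate(rows, 1)]
    return "\n".join(lines)
```

In the demo, selecting an article from the predefined list would trigger `fetch_top_questions`, and the semantic-search answers would load only when the button is pressed.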
