FY25-26 WE3.1.6: If we produce a prototype for in-article Q&A, delivered as a demo interface, then the Reader teams will be able to qualitatively evaluate how the approach performs across different user journeys and surface gaps or opportunities for further iteration.
[ ] Start with a small dataset of 10 articles (2 per quality class):
- Generate questions and answers using at least two LLMs. Answers are used only for evaluation and for checking that the questions are relevant.
- Develop a ranking strategy.
- Develop a strategy for correctness checks/evaluation and run it on the dataset.
- Iterate on prompts based on the small-dataset results. Human annotation could be useful.
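One possible ranking signal for the step above is cross-model agreement: a question that both LLMs generate (in some variant) is more likely to be central to the article. The sketch below is a hypothetical illustration, not the chosen strategy; the Jaccard-overlap scoring and the function name are assumptions.

```python
def _tokens(question: str) -> set[str]:
    """Lowercase, strip trailing punctuation, and split into a token set."""
    return set(question.lower().strip("?!. ").split())

def rank_by_agreement(model_a: list[str], model_b: list[str]) -> list[tuple[str, float]]:
    """Rank model A's questions by their best Jaccard overlap with any of
    model B's questions (a simple cross-model agreement score)."""
    ranked = []
    for qa in model_a:
        best = max(
            (len(_tokens(qa) & _tokens(qb)) / len(_tokens(qa) | _tokens(qb))
             for qb in model_b),
            default=0.0,
        )
        ranked.append((qa, best))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)
```

On a 10-article dataset this kind of heuristic is cheap to compute and easy to sanity-check against human annotation before committing to a fancier ranker.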
[ ] Scale the experiment to a larger dataset: a stratified random sample of 500 articles from English Wikipedia:
- The sampling method should account for:
1. content length diversity
2. topic diversity
3. content age diversity
4. content quality diversity
- Generate questions on the larger dataset using the LLMs selected in the previous iteration.
- Run correctness checks and iterate based on results.
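The stratification above can be sketched as proportional sampling over the cross-product of the four dimensions. This is a minimal sketch under assumed field names (`length_bucket`, `topic`, `age_bucket`, `quality`); the real pipeline would define the bucketing itself.

```python
import random
from collections import defaultdict

def stratified_sample(articles: list[dict], n: int = 500, seed: int = 0) -> list[dict]:
    """Stratify articles by (length, topic, age, quality) buckets and draw a
    proportional random sample of roughly n articles (at least 1 per stratum)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for article in articles:
        key = (article["length_bucket"], article["topic"],
               article["age_bucket"], article["quality"])
        strata[key].append(article)
    sample = []
    total = len(articles)
    for key, members in sorted(strata.items()):
        # Proportional allocation, but keep every stratum represented.
        k = max(1, round(n * len(members) / total))
        sample.extend(rng.sample(members, min(k, len(members))))
    rng.shuffle(sample)
    return sample[:n]
```

The `max(1, ...)` floor guards against rare strata (e.g. old, high-quality, long articles on a niche topic) being dropped entirely from the 500-article sample.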
[ ] Prototype interface:
- Allows people to select an article from the predefined list (via a dropdown menu, a search bar with auto-complete, or something else)
- Fetches the top 3 questions (ranked and filtered) for each article
- Runs a search using the method identified in the earlier experiments
- Displays questions and results in a table
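The interface flow above can be sketched end-to-end in a few lines. Everything here is a placeholder: the in-memory question store, the `search_fn` callback, and the plain-text table stand in for whatever question storage, search backend, and UI framework the prototype actually uses.

```python
def build_rows(article: str, question_store: dict[str, list[str]],
               search_fn) -> list[tuple[str, str]]:
    """Fetch the top-3 ranked questions for an article and run each through
    the search backend, returning (question, result) rows for display."""
    return [(q, search_fn(article, q))
            for q in question_store.get(article, [])[:3]]

def render_table(rows: list[tuple[str, str]]) -> str:
    """Render (question, result) rows as a simple plain-text table."""
    width = max([len(q) for q, _ in rows] + [len("Question")])
    lines = [f"{'Question'.ljust(width)} | Result"]
    for question, result in rows:
        lines.append(f"{question.ljust(width)} | {result}")
    return "\n".join(lines)
```

Keeping the search behind a callback means the demo interface can swap search methods without touching the question-fetching or display code.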