In T414569 ([SPIKE] Explore search query datasets for Q&A question generation), we identified a private Google dataset as a potential fit for sourcing natural language questions.
This spike has two goals:
- determine whether section-level Wikipedia URLs are good enough to serve as answers to their paired queries
- understand coverage of the target Wikipedias: English, but ideally also Arabic, Chinese, French, Indonesian, and Vietnamese
Tasks
- manually evaluate a sample of (query, URL) pairs
- compare URLs with vector search results: run each paired query against our vector search prototype, check whether the paired URL shows up in the results, and grab the matching text snippet
- coverage: for how many articles in a given Wikipedia do we have no queries?
- general shape:
- how many queries do we lose by constraining to natural language ones?
- how many queries do we lose by constraining to section-level URLs?
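The vector-search comparison task could look roughly like the sketch below. This is a toy stand-in only: the real prototype's API isn't modeled here, and the corpus, the `search` function (bag-of-words cosine), and the example pair are all made up for illustration.

```python
import math
from collections import Counter

# Toy in-memory "vector search": bag-of-words cosine over section snippets.
# Stands in for the real vector search prototype (hypothetical data).
corpus = {
    "https://en.wikipedia.org/wiki/Honey_bee#Honey_production":
        "Honey bees convert nectar into honey by regurgitation and evaporation",
    "https://en.wikipedia.org/wiki/Bee#Sociality":
        "Bees range from solitary species to highly social colonies",
}

def vec(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def search(query: str, k: int = 5):
    """Return top-k (url, snippet, score) tuples, best first."""
    q = vec(query)
    ranked = sorted(((u, s, cosine(q, vec(s))) for u, s in corpus.items()),
                    key=lambda t: -t[2])
    return ranked[:k]

# For one (query, URL) pair from the dataset: run the query, check whether
# the paired URL appears in the results, and grab its text snippet.
query = "how do bees make honey"
paired_url = "https://en.wikipedia.org/wiki/Honey_bee#Honey_production"
hits = search(query)
match = next(((u, s) for u, s, _ in hits if u == paired_url), None)
```

With the real prototype, only the last four lines matter: issue the paired query, scan the result list for the paired URL, and keep the snippet for manual evaluation.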
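The coverage and "general shape" counts could be computed in one pass over the dataset, along these lines. Everything here is hypothetical: the sample pairs, the article set, and especially `is_natural_language`, which is a crude stand-in heuristic for whatever classifier we actually use.

```python
from urllib.parse import urlparse, unquote

# Hypothetical sample of (query, URL) pairs; the real dataset is private.
pairs = [
    ("how do bees make honey", "https://en.wikipedia.org/wiki/Honey_bee#Honey_production"),
    ("bee", "https://en.wikipedia.org/wiki/Bee"),
    ("why is the sky blue", "https://en.wikipedia.org/wiki/Rayleigh_scattering"),
]

# Hypothetical set of all article titles in the target Wikipedia.
all_articles = {"Honey_bee", "Bee", "Rayleigh_scattering", "Mead"}

def is_natural_language(query: str) -> bool:
    """Crude placeholder heuristic: multi-word queries opening with a wh-word/aux."""
    starters = ("how", "what", "why", "when", "where", "who", "which",
                "is", "are", "can", "does", "do")
    words = query.lower().split()
    return len(words) > 1 and words[0] in starters

def is_section_level(url: str) -> bool:
    """A section-level URL carries a #fragment pointing at a section anchor."""
    return bool(urlparse(url).fragment)

def article_title(url: str) -> str:
    return unquote(urlparse(url).path.rsplit("/", 1)[-1])

# How many queries survive each constraint?
nl = [p for p in pairs if is_natural_language(p[0])]
sec = [p for p in nl if is_section_level(p[1])]

# Coverage: which articles have no query at all?
covered = {article_title(u) for _, u in pairs}
uncovered = all_articles - covered

print(f"natural-language filter keeps {len(nl)}/{len(pairs)}")
print(f"section-level filter keeps {len(sec)}/{len(nl)}")
print(f"articles with no query: {len(uncovered)}/{len(all_articles)}")
```

The fragment check is the one part that should transfer directly: section-level Wikipedia URLs are distinguished by their `#Section_anchor` fragment, so the section-level constraint reduces to filtering on it.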