The Q&A use case for semantic search requires a set of (natural-language question, Wikipedia article segment) pairs as input.
We currently have two approaches in mind for sourcing these pairs:
- question-to-question - machine-generated, see demo
- real-world search queries - user-generated
In this spike, we should explore available Google datasets of user search queries tied to URLs of Wikimedia projects.
The main goal is to understand whether these datasets yield enough data at section-level granularity or finer, i.e., (query, URL with section fragment) pairs.
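
To make the target data shape concrete, here is a minimal Python sketch of how a (query, URL) record could be split into project, page title, and section fragment, and how the share of section-level records in a sample could be measured. The names `QueryTarget`, `parse_pair`, and `section_coverage` are hypothetical, and the sketch assumes URLs of the form `https://<project host>/wiki/<Title>#<Section>`; the actual datasets may use a different layout, and fragments that are not section anchors would need extra filtering.

```python
from typing import Iterable, NamedTuple, Optional, Tuple
from urllib.parse import unquote, urlsplit


class QueryTarget(NamedTuple):
    """Hypothetical record for one (query, URL) pair."""
    query: str
    project: str            # host name, e.g. "en.wikipedia.org"
    title: str              # page title as it appears in the URL path
    section: Optional[str]  # URL fragment, or None if the URL has none


def parse_pair(query: str, url: str) -> QueryTarget:
    """Split a URL into project, title, and section fragment.

    Assumes the "/wiki/<Title>" path layout; other paths are kept as-is
    (minus the leading slash) rather than rejected.
    """
    parts = urlsplit(url)
    prefix = "/wiki/"
    if parts.path.startswith(prefix):
        raw_title = parts.path[len(prefix):]
    else:
        raw_title = parts.path.lstrip("/")
    # An empty fragment means the URL points at the whole page.
    section = unquote(parts.fragment) or None
    return QueryTarget(query, parts.netloc, unquote(raw_title), section)


def section_coverage(pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of pairs whose URL carries a fragment."""
    targets = [parse_pair(q, u) for q, u in pairs]
    if not targets:
        return 0.0
    return sum(t.section is not None for t in targets) / len(targets)


# Illustrative input only, not taken from any real dataset.
sample = [
    ("how tall is mount everest",
     "https://en.wikipedia.org/wiki/Mount_Everest#Elevation"),
    ("mount everest", "https://en.wikipedia.org/wiki/Mount_Everest"),
]
coverage = section_coverage(sample)
```

Running `section_coverage` over a sample of each candidate dataset would give a quick first estimate of how much section-level data it contains.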