To develop models that improve search, we need a dataset of queries annotated with relevant results. Since we currently lack such a dataset, the goal of this task is to collect one. For the first version, we will restrict ourselves to English (potential follow-up work could expand to other languages, e.g., via translation).
The details of this task still need to be determined.
Things to figure out:
- Which level of annotation to use: passage level or paragraph level
- Which query types to include (T407603)
- Collect set of queries for the benchmark dataset (T408121)
- Collect candidate search results for queries (T409559)
- Annotate queries with relevant results (T409561)
Additional information:
- An example of what such a dataset could look like is Google's Natural Questions dataset, which contains natural-language queries, aggregated and anonymized, that users issued to the Google search engine. These queries are annotated with relevant paragraphs from Wikipedia articles that contain the answer.
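
As a starting point for discussion, a single record in such a dataset could be sketched as below. This is only an illustration under assumptions: the class and field names (`Candidate`, `AnnotatedQuery`, `passage_index`, `relevant`) are hypothetical, the binary relevance label is one possible choice, and the example query and passages are invented rather than taken from real data. The actual schema depends on the open questions listed above.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Candidate:
    """One candidate search result for a query (hypothetical fields)."""
    page_title: str      # article the passage was taken from
    passage_index: int   # position of the passage within that article
    text: str            # passage text shown to the annotator
    relevant: Optional[bool] = None  # None means "not yet annotated"


@dataclass
class AnnotatedQuery:
    """A query together with its candidate results and relevance labels."""
    query: str
    language: str = "en"  # the first version is restricted to English
    candidates: List[Candidate] = field(default_factory=list)

    def relevant_passages(self) -> List[Candidate]:
        # Keep only candidates explicitly judged relevant.
        return [c for c in self.candidates if c.relevant is True]


def to_jsonl_line(record: AnnotatedQuery) -> str:
    """Serialize one record as a single JSON line."""
    return json.dumps(asdict(record), ensure_ascii=False)


# Invented example record, for illustration only.
example = AnnotatedQuery(
    query="who wrote on the origin of species",
    candidates=[
        Candidate("On the Origin of Species", 0,
                  "On the Origin of Species is a work by Charles Darwin.",
                  relevant=True),
        Candidate("Species", 3,
                  "A species is a basic unit of classification.",
                  relevant=False),
        Candidate("Charles Darwin", 12,
                  "Darwin travelled on HMS Beagle."),  # not yet annotated
    ],
)

if __name__ == "__main__":
    print(to_jsonl_line(example))
```

Storing one record per line (JSON Lines) keeps the three steps above (collecting queries, collecting candidates, annotating) separable: candidates can be added with `relevant` left unset and filled in later during annotation.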