In T406207 (Create a dataset for evaluation of search on Wikipedia), we generated a benchmark dataset for offline evaluation of search.
The goal of this task is to use that benchmark dataset to evaluate two or more search models:
- current Wikipedia search as a baseline
- one of the semantic search prototypes
- (optional) other models
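As a rough illustration of what an offline comparison could look like, here is a minimal sketch. It assumes the benchmark can be loaded as graded relevance judgments per query (`query -> {page_title: grade}`) and that each model's output can be collected as a ranked list of page titles per query. It also assumes nDCG@k and recall@k as the metrics. The data format, metric choice, function names, and toy queries are all hypothetical and are not taken from T406207.

```python
import math
from typing import Dict, List

# Hypothetical benchmark format: query -> {page_title: graded relevance (0 = not relevant)}
Judgments = Dict[str, Dict[str, int]]
# Hypothetical run format: query -> ranked list of page titles returned by one model
Run = Dict[str, List[str]]


def dcg(gains: List[int]) -> float:
    """Discounted cumulative gain with a log2 position discount (rank 1 -> log2(2))."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(ranked: List[str], judged: Dict[str, int], k: int) -> float:
    """nDCG@k for one query; results without a judgment count as non-relevant."""
    gains = [judged.get(doc, 0) for doc in ranked[:k]]
    ideal_dcg = dcg(sorted(judged.values(), reverse=True)[:k])
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def recall_at_k(ranked: List[str], judged: Dict[str, int], k: int) -> float:
    """Fraction of relevant (grade > 0) pages that appear in the top k results."""
    relevant = {doc for doc, grade in judged.items() if grade > 0}
    if not relevant:
        return 0.0
    return len(relevant.intersection(ranked[:k])) / len(relevant)


def evaluate(run: Run, judgments: Judgments, k: int = 10) -> Dict[str, float]:
    """Average both metrics over all benchmark queries; a query missing from the run scores 0."""
    queries = list(judgments)
    ndcg = sum(ndcg_at_k(run.get(q, []), judgments[q], k) for q in queries) / len(queries)
    recall = sum(recall_at_k(run.get(q, []), judgments[q], k) for q in queries) / len(queries)
    return {f"ndcg@{k}": ndcg, f"recall@{k}": recall}


# Toy data for illustration only; not from the real benchmark.
judgments: Judgments = {
    "capital of france": {"Paris": 2, "France": 1},
    "largest planet": {"Jupiter": 2},
}
runs: Dict[str, Run] = {
    "baseline": {"capital of france": ["France", "Paris"], "largest planet": ["Saturn", "Jupiter"]},
    "semantic": {"capital of france": ["Paris", "France"], "largest planet": ["Jupiter", "Saturn"]},
}

results = {name: evaluate(run, judgments, k=10) for name, run in runs.items()}
for name, metrics in results.items():
    print(name, {m: round(v, 3) for m, v in metrics.items()})
```

Scoring every model against the same judgments with the same metric function keeps the baseline and the semantic prototype directly comparable. Additional models only need to produce a run in the same format.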