Get search results for queries from benchmark dataset for semantic search model
Closed, Resolved, Public

Description

The goal of this task is to run the offline evaluation of the semantic search model developed in T412338 (Q2 FY2025-26 Goal: Semantic Search - Embeddings Service for MVP), using the benchmark dataset from T406207 (Create a dataset for evaluation of search on Wikipedia). For this, we need the top-k search results (k is probably 10) for each of the 600 queries (see this list). Crucially, the model must be trained/indexed on the same fixed corpus that was used to create the benchmark dataset. The snapshot of that corpus is located at /user/trokhymovych/wikimedia_processed_snapshot_20260125.

(stretch goal) Get the top-k search results for other search models.
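The top-k retrieval step described above can be sketched as follows. This is only an illustration under stated assumptions, not the embeddings service from T412338: it assumes each query and each passage is already available as a plain list of floats, it uses exhaustive cosine similarity rather than an approximate index, and the names `cosine`, `top_k`, `corpus`, and `queries` are hypothetical.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, corpus, k=10):
    """Return the k (passage_id, score) pairs with the highest cosine similarity.

    `corpus` is assumed to map a passage id to its embedding vector.
    """
    scored = ((pid, cosine(query_vec, vec)) for pid, vec in corpus.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy data standing in for the fixed corpus snapshot and the benchmark queries.
corpus = {"p1": [1.0, 0.0], "p2": [0.0, 1.0], "p3": [1.0, 1.0]}
queries = {"q1": [1.0, 0.1]}

# One ranked result list per query, all computed against the same corpus.
results = {qid: top_k(vec, corpus, k=2) for qid, vec in queries.items()}
```

Every query is scored against the same `corpus` object, which mirrors the requirement that results come from the fixed snapshot used to build the benchmark.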

Event Timeline

Please find query-result pairs at: hdfs:///user/dcausse/semantic_search/T417242/. The folder contains a set of outputs:

  • pure_knn_10.json: the top 10 of the bare vector search
  • rerank_at_3_no_context.json: same as above, but with the top-3 re-ranked using the bare passage
  • rerank_at_10_no_context.json: same as above but the top-10 are re-ranked
  • rerank_at_3_full_context.json: top-3 re-ranked using the passage with additional context (title, parent sections and section)
  • rerank_at_10_full_context.json: same as above but the top-10 are re-ranked
  • rerank_at_3_with_lead.json: top-3 re-ranked using the passage with additional context (title, parent sections, section, and the lead paragraph of the page if it differs from the best passage)
  • rerank_at_10_with_lead.json: same as above but the top-10 are re-ranked
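To see how much a re-ranking variant changes the ranking relative to the bare vector search, two of the files above could be compared as in the sketch below. The JSON schema is not described in this task, so the code assumes each file is a single JSON object mapping a query string to an ordered list of result identifiers; adjust the loader if the actual layout differs. It also assumes the files have first been copied from HDFS to the local filesystem. The helper names `load_results` and `top1_agreement` are hypothetical.

```python
import json


def load_results(path):
    """Load one output file; ASSUMED schema: {query: [result_id, ...]} in rank order."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def top1_agreement(run_a, run_b):
    """Fraction of shared queries for which both runs return the same first result."""
    shared = [q for q in run_a if q in run_b and run_a[q] and run_b[q]]
    if not shared:
        return 0.0
    same = sum(1 for q in shared if run_a[q][0] == run_b[q][0])
    return same / len(shared)


# Toy data in the assumed schema, standing in for
# load_results("pure_knn_10.json") and load_results("rerank_at_3_no_context.json").
baseline = {"q1": ["a", "b", "c", "d"], "q2": ["x", "y", "z", "w"]}
reranked = {"q1": ["b", "a", "c", "d"], "q2": ["x", "y", "z", "w"]}

agreement = top1_agreement(baseline, reranked)
```

Top-1 agreement is used here instead of set overlap because re-ranking the top-3 only reorders results within the top-3, so the set of the first three results would be unchanged by construction.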

@dcausse Thanks for generating this dataset.
We successfully used it to run the offline evaluation with the benchmark dataset.

Closing as the task is completed.