
Run evaluation of 2 or more search models using benchmark dataset
Open, Needs Triage, Public

Description

In T406207: Create a dataset for evaluation of search on Wikipedia we generated a benchmark dataset for offline evaluation of search.
The goal of this task is to use the benchmark dataset to evaluate 2 or more search models:

  • current Wikipedia search as a baseline
  • one of the semantic search prototypes
  • (optional) other models

Details

Other Assignee
Trokhymovych

Event Timeline

weekly update:

  • confirmed with Jazmin that this should be captured as a hypothesis under WE3.10
  • as we have collected the search result relevance annotation, I am starting to think about the best approach to evaluation.
    • metric: likely we will use nDCG@10, as this is the main metric in retrieval benchmarks such as MTEB https://arxiv.org/pdf/2210.07316
    • coordinating with Search to make sure our approach is meaningful
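To make the choice of metric concrete, here is a minimal pure-Python sketch of nDCG@k using the standard log2 discount (this is the textbook definition, not WMF code; the function names are illustrative):

```python
import math

def dcg(relevances):
    """Discounted cumulative gain over a ranked list of graded relevances."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg_at_k(ranked_relevances, k=10):
    """nDCG@k: DCG of the top-k results, normalized by the ideal ordering."""
    top_k = ranked_relevances[:k]
    ideal = sorted(ranked_relevances, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(top_k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy query: graded relevance of the results in retrieved order (3 = highly relevant).
print(ndcg_at_k([3, 2, 0, 1], k=10))
```

A perfectly ordered result list scores 1.0; swapping a relevant result below an irrelevant one lowers the score, which is why nDCG rewards ranking quality and not just retrieval.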

weekly update:

  • We refined the set of metrics for evaluation: nDCG@k, precision@k, recall@k, MAP@k, bpref@k for both paragraph and article level
  • We collected search results for the 600 queries of the benchmark dataset for Wikipedia search and the semantic search MVP (qwen-model) with different variations (adding re-ranker, additional context) T417242#11636952
  • Next step: calculating metrics
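For the set-based metrics in the list above, a minimal sketch of precision@k and recall@k on binary relevance judgments (standard definitions; the document IDs and helper names are illustrative, and in practice pytrec_eval computes these from qrels/run files):

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved items that are relevant."""
    top_k = retrieved[:k]
    return sum(1 for doc in top_k if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant items that appear in the top-k."""
    if not relevant:
        return 0.0
    top_k = retrieved[:k]
    return sum(1 for doc in top_k if doc in relevant) / len(relevant)

# Toy example: 2 of the top-5 results are relevant, out of 4 relevant items overall.
retrieved = ["d1", "d7", "d3", "d9", "d4", "d2"]
relevant = {"d1", "d2", "d3", "d8"}
print(precision_at_k(retrieved, relevant, 5))  # 2/5 = 0.4
print(recall_at_k(retrieved, relevant, 5))     # 2/4 = 0.5
```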

weekly update:

  • Compiled results for the offline evaluation of semantic search on English Wikipedia using the benchmark dataset, and compared them with our current lexical search.
  • Specifically, we evaluate different search models on the new benchmark dataset. We consider Wikipedia search (lexical search) and semantic search (the current MVP model) with different variations (e.g. re-ranking results after retrieval). For each model, we retrieve the top-10 search results for each query and calculate different evaluation metrics to quantify the relevance of the results using the pytrec_eval package: nDCG, Precision, Recall, and Binary preference (bpref). We evaluate the relevance of the retrieved results at both the article and the paragraph level by comparing against the annotations in the benchmark dataset.
  • Results can be found in this doc: https://docs.google.com/document/d/1xgdzD0TFIqyAw45mf9uHjzdpeMauugMRefQBlEu8i6I/edit?tab=t.x3x7obtlsqmn

Main takeaways:

  • Semantic search provides a substantial improvement over lexical search only for long queries (8+ words) and natural language questions. Note that this subset currently constitutes only around 5% of our users' queries.
  • Semantic search does not provide better search results than Wikipedia search for the majority of queries
    • For article-level retrieval, Wikipedia search is better than semantic search for short and medium-length queries, which make up ~80% of all queries.
    • For paragraph-level retrieval, Wikipedia search performs similarly to semantic search for short queries (56% of queries), and semantic search only provides small improvements for medium-length queries (25% of queries).
  • IMO, these results suggest that we should come up with a query routing strategy; e.g. short & medium-length queries --> lexical search, long queries + natural language questions --> semantic search.
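The routing idea above could be sketched as a simple length- and question-based dispatcher. This is a hypothetical illustration only: the threshold, the question-word list, and the function name are assumptions, not an implemented WMF component.

```python
def route_query(query, long_threshold=8):
    """Hypothetical router: send long or question-like queries to semantic
    search, everything else to the existing lexical search."""
    words = query.split()
    question_words = {"who", "what", "when", "where", "why", "how", "which"}
    is_question = bool(words) and (
        words[0].lower() in question_words or query.rstrip().endswith("?")
    )
    if len(words) >= long_threshold or is_question:
        return "semantic"
    return "lexical"

print(route_query("barack obama"))                   # -> lexical
print(route_query("how does photosynthesis work?"))  # -> semantic
```

A production router would more likely use a lightweight classifier than hand-written rules, but the rule-based version makes the proposed split (~80% lexical, ~5% semantic) easy to reason about.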

Some additional things to capture:

  • Different metrics yield qualitatively similar results across models. Thus, we can use simpler but more interpretable metrics such as recall@5 (i.e., the fraction of relevant items that were successfully retrieved in the first 5 results).
  • The semantic search model can be improved by adding a re-ranker, yielding a moderate increase of up to 5pp. Other re-ranking strategies might allow for further improvements.
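The re-ranking step mentioned above can be sketched generically: retrieve candidates with the fast model, then re-score them with a stronger (typically cross-encoder) model. The scorer below is a word-overlap placeholder standing in for a real model; all names here are illustrative, not the actual MVP code.

```python
def rerank(query, candidates, score_fn, top_k=10):
    """Re-rank retriever candidates with a (hypothetical) stronger scorer,
    e.g. a cross-encoder, and keep only the top_k."""
    return sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)[:top_k]

# Placeholder scorer: word overlap stands in for a cross-encoder model.
def overlap_score(query, doc):
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d)

docs = ["history of search engines", "semantic search on Wikipedia", "cats"]
print(rerank("semantic search", docs, overlap_score, top_k=2))
```

Because only the retrieved candidates are re-scored, re-ranking can improve ordering (and thus nDCG/precision) but cannot recover relevant items the retriever missed, which is consistent with the moderate gains observed.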

weekly update:

  • Collecting feedback on the first round of analysis to decide where to dig deeper.
  • We identified additional models for testing different variations of the semantic search, to assess whether observations from the first round are due to the specific underlying model (qwen-3-0.6B) or hold generally for semantic search.
  • We identified a multilingual benchmark dataset to i) test variation across languages; and ii) compare our current model's performance with results reported in the literature.
    • MIRACL is available in EN, DE, ES, FR, ID (but not in IT, NL, PT) among the languages relevant to the current semantic search project.
    • It only covers natural language questions (not drawn from actual search logs, so not representative of Wikipedia queries), but our first round of results indicated that these are the queries where semantic search offers the most advantage over our current search.
