Get search results from semantic search using MIRACL benchmark dataset
Closed, Resolved, Public

Description

In {T414795#11658502} we obtained results for the semantic search model using our benchmark dataset. However, that setup is restricted to English Wikipedia. To understand how our semantic search model performs in other languages, we will use the dev-set of the MIRACL benchmark data. This data contains queries (natural-language questions) and annotated text passages from Wikipedia articles in EN, DE, ES, FR, and ID (chosen from the languages we are currently considering).
As an example, for English Wikipedia:

While the dataset has some limitations (only natural-language queries, a different corpus, collected some years ago), it will still help us gain insights into:

  • performance of our semantic search model across 5 languages. Does model performance remain stable when considering other languages?
  • performance of our semantic search model in comparison to SOTA models. Previous work has developed passage-retrieval models and evaluated them on the MIRACL corpus, reporting nDCG@10 on the dev-sets (for example this paper). Evaluating our model in the same setup will help us assess how much room for improvement there is relative to existing approaches.
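
For reference, nDCG@10 for a single query can be sketched as follows. This is a minimal illustration and not the evaluation code used in this task: it assumes linear gain (which equals the exponential-gain variant when labels are binary) and a `qrels` dict mapping passage ids to relevance labels. The function and argument names are made up for the example.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k relevance labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_ids, qrels, k=10):
    """nDCG@k for one query.

    ranked_ids: passage ids in the order returned by the retriever.
    qrels: dict mapping passage id -> relevance label (0 = not relevant).
    Passages missing from qrels are treated as not relevant (an assumption).
    """
    gains = [qrels.get(pid, 0) for pid in ranked_ids]
    ideal = sorted(qrels.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant passage annotated for this query
    return dcg_at_k(gains, k) / ideal_dcg
```

The benchmark score is then the mean of `ndcg_at_k` over all dev-set queries of a language.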

The goal of the task

  • prepare corpus and queries (needs to be consistent with modeling pipeline)
  • generate embeddings of the MIRACL corpus in 5 languages using the qwen-3-0.6B model
  • generate top-10 results of the queries in the dev set in 5 languages (see hdfs:///user/dcausse/T419409-miracl/query_result_pairs).
  • calculate offline evaluation metrics

Ideally, we would like to re-use as much of the pipeline from the previous experiments run by @dcausse - please advise how to best use the existing pipeline for this task.
In principle, the only thing that changes in comparison to T417242: Get search results for queries from benchmark dataset for semantic search model is the underlying corpus of text passages. However, for the MIRACL benchmark data, we will take the pre-processed set of passages.

Event Timeline

Hi @dcausse!

I have prepared the corpus and queries for the MIRACL dataset for EN, DE, ES, FR, ID.

You can find the corpus here: /user/trokhymovych/mteb/MIRACLRetrieval -> partitioned by snapshot and wiki as in the modeling pipeline.
Queries are saved here: /user/trokhymovych/mteb/queries -> it includes two columns: query (corresponds to the query text) and language (language code)

I hope it works. Please let me know if any changes are needed or if you have any questions.
Thank you!

CC: @MGerlach

@Trokhymovych thanks! I don't have access to these folders. Could you update the perms, or possibly upload them to /user/dcausse/T419409-miracl if you don't want to open the perms on your user folder?

Hi @dcausse! I have moved the /mteb folder to /user/dcausse/T419409-miracl, so you should have access now.
Please let me know if it works. Sorry for the initial inconvenience.

Thank you!

@Trokhymovych it's working well now, no problem, thanks for the data!

Started the job at https://yarn.wikimedia.org/cluster/app/application_1773845446826_8538. The dataset is quite big, so it will probably take quite some time to complete (hopefully finished early next week).

Unfortunately the job failed a couple of times last week and the embeddings extraction only finished last night. Recording some timings (1000 cores with llama and qwen3-0.6B-Q8):

  • de: 15.8M, 19h
  • en: 32.8M, 28h
  • fr: 14.6M, 13h
  • id: 1.4M, 2h
  • es: 10.3M, 11h

Noting a large imbalance in the number of passages: id is an order of magnitude smaller than the others and thus might benefit from less noise when running vector search.
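
To compare the runs on a common scale, the timings above can be turned into a rough throughput figure. This is only a back-of-the-envelope sketch: it assumes the counts are passages, that all 1000 cores were in use for the whole reported duration, and that the hours do not include failed attempts.

```python
# Timings reported above: (passages in millions, wall-clock hours).
timings = {
    "de": (15.8, 19),
    "en": (32.8, 28),
    "fr": (14.6, 13),
    "id": (1.4, 2),
    "es": (10.3, 11),
}
CORES = 1000  # as reported; assumed constant across all runs


def passages_per_core_hour(millions, hours, cores=CORES):
    """Average number of passages embedded per core per hour."""
    return millions * 1_000_000 / (hours * cores)


for lang, (millions, hours) in sorted(timings.items()):
    rate = passages_per_core_hour(millions, hours)
    print(f"{lang}: {rate:.0f} passages per core-hour")
```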

@MGerlach @Trokhymovych query result pairs should be available in hdfs:///user/dcausse/T419409-miracl/query_result_pairs, files are named pure_knn_10_miracl_$lang.json.

Hi @dcausse! Thank you for the results!

I have one clarification question:

  • Would it be possible to obtain the results while allowing multiple passages per article?

I understand that this differs from the Wikipedia search configuration, which returns only the single best passage per article. However, for this particular experiment, having multiple passages is necessary to enable a fair comparison with other systems from the literature on this dataset.

Thanks again for the help.
CC: @Miriam

@Trokhymovych sure, please find those in hdfs:///user/dcausse/T419409-miracl/query_result_pairs with names pure_knn_10_top_10_per_article_miracl_$lang.json.
I collected the top-10 passages per article so that even in the worst case (if the best matches all belong to the same page) you should have them. If you flatten the passage arrays and re-order by score, you should be able to infer the top-10 passages per query. Please let me know if you spot issues and I can try to re-shape my query and output format to ignore the per-article breakdown if it is problematic.
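
The flattening step described above could look roughly like this. The record layout and field names (`page_id`, `passages`, `passage_id`, `score`) are hypothetical, since the actual schema of the JSON files is not shown in this task, and the sketch assumes that a higher score means a better match.

```python
import heapq


def top_k_passages(article_results, k=10):
    """Flatten per-article passage lists and keep the k best-scoring passages.

    article_results: list of dicts shaped like
        {"page_id": ..., "passages": [{"passage_id": ..., "score": ...}, ...]}
    NOTE: these field names are hypothetical; adapt them to the real files.
    Assumes a higher score means a better match.
    """
    flat = (
        {"page_id": article["page_id"], **passage}
        for article in article_results
        for passage in article["passages"]
    )
    # nlargest returns the k items with the highest score, best first.
    return heapq.nlargest(k, flat, key=lambda p: p["score"])
```

Applied to the results of one query, `top_k_passages(results, k=10)` yields a single ranked list that can be scored against the MIRACL relevance judgments.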

@Trokhymovych is there anything left to be done on this task?

Work was finished. Thanks!