In {T414795#11658502} we obtained results for the semantic search model using our benchmark dataset. However, that setup is restricted to English Wikipedia. To understand how our semantic search model performs in other languages, we will use the dev set of the MIRACL benchmark data. This dataset contains queries (natural-language questions) and relevance-annotated text passages from Wikipedia articles in EN, DE, ES, FR, and ID, which are among the languages we are currently considering.
As an example, for English Wikipedia:
- corpus: https://huggingface.co/datasets/miracl/miracl-corpus/viewer/en
- queries: https://huggingface.co/datasets/miracl/miracl/blob/main/miracl-v1.0-en/topics/topics.miracl-v1.0-en-dev.tsv
- annotations: https://huggingface.co/datasets/miracl/miracl/blob/main/miracl-v1.0-en/qrels/qrels.miracl-v1.0-en-dev.tsv
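A minimal sketch of how the query and annotation files could be read once downloaded. It assumes the topics file is tab-separated `query_id<TAB>query` and the qrels file follows the TREC layout `query_id Q0 doc_id relevance`; both layouts should be verified against the actual files before use. The function names and the sample rows are hypothetical and purely illustrative.

```python
import io
from collections import defaultdict


def read_topics(lines):
    """Map query_id -> query text (assumed layout: query_id<TAB>query)."""
    topics = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        query_id, query = line.split("\t", 1)
        topics[query_id] = query
    return topics


def read_qrels(lines):
    """Map query_id -> {doc_id: relevance} (assumed TREC layout, 4 columns)."""
    qrels = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        query_id, _unused, doc_id, relevance = parts
        qrels[query_id][doc_id] = int(relevance)
    return dict(qrels)


# Made-up rows standing in for the real files.
sample_topics = io.StringIO("q1\tWhat is a toy query?\n")
sample_qrels = io.StringIO("q1\tQ0\t12#0\t1\nq1\tQ0\t34#2\t0\n")

topics = read_topics(sample_topics)
qrels = read_qrels(sample_qrels)
```

With real data, the `io.StringIO` objects would be replaced by open file handles for the two TSV files.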
While the dataset has some limitations (it only contains natural-language queries, it uses a different corpus, and it is a few years old), it will still give us insight into:
- performance of our semantic search model across 5 languages. Does model performance remain stable when considering other languages?
- performance of our semantic search model in comparison to SOTA models. Previous works have developed passage-retrieval models and evaluated them on the MIRACL corpus, reporting nDCG@10 on the dev sets (for example this paper). Evaluating our model in the same setup will help us assess how much room for improvement there is relative to existing approaches.
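For reference, nDCG@10 for a single query can be sketched as follows. This is a minimal illustration, not the evaluation code used in the cited works: it assumes linear gain with a log2 rank discount, and the function names are hypothetical. The reported benchmark figure is the mean of this value over all dev-set queries.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k relevance grades."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_doc_ids, judged, k=10):
    """nDCG@k for one query.

    ranked_doc_ids: passage ids in the order returned by the model.
    judged: {doc_id: relevance} for this query; unjudged passages count as 0.
    """
    gains = [judged.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal = sorted(judged.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant passage annotated for this query
    return dcg_at_k(gains, k) / ideal_dcg


# Toy example: the only relevant passage is returned at rank 2.
score = ndcg_at_k(["x#0", "a#1"], {"a#1": 1})
```

For comparable numbers it is probably safer to score with the same tooling as the published results rather than a hand-written function.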
The goals of this task are:
- prepare corpus and queries (needs to be consistent with modeling pipeline)
- generate embeddings of the MIRACL corpus in 5 languages using the qwen-3-0.6B model
- generate top-10 results for the queries in the dev set in 5 languages (see hdfs:///user/dcausse/T419409-miracl/query_result_pairs)
- calculate offline evaluation metrics
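The retrieval step can be illustrated with a toy sketch. It assumes passages are ranked by cosine similarity between query and passage embeddings, which may not match how the existing pipeline scores results; the vectors, passage ids, and function names are made up, and a real run would use the model's embeddings and an index rather than a brute-force scan.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec, passage_vecs, k=10):
    """Return the ids of the k passages most similar to the query, best first."""
    scored = ((cosine(query_vec, vec), doc_id) for doc_id, vec in passage_vecs.items())
    return [doc_id for _, doc_id in heapq.nlargest(k, scored)]


# Made-up 2-dimensional "embeddings" keyed by hypothetical passage ids.
passages = {"12#0": [1.0, 0.0], "12#1": [0.7, 0.7], "34#0": [0.0, 1.0]}
ranking = top_k([1.0, 0.1], passages, k=2)
```

The resulting ranking per query is what the offline metrics are then computed on.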
Ideally, we would like to re-use as much of the pipeline from the previous experiments run by @dcausse - please advise how to best use the existing pipeline for this task.
In principle, the only thing that changes in comparison to T417242: Get search results for queries from benchmark dataset for semantic search model is the underlying corpus of text passages. For the MIRACL benchmark data, however, we will use the pre-processed set of passages from the MIRACL corpus.