
Get search results for different embedding models from semantic search
Closed, Resolved | Public | 13 Estimated Story Points

Description

We obtained search results for semantic search using the qwen-3-0.6B model (T417242: Get search results for queries from benchmark dataset for semantic search model). To make the results more generalizable, we would like to evaluate different embedding models in the same setup. The task is thus to get the top-10 search results for the queries from the benchmark dataset using the following models:

The additional models would be nice to have but are not crucial.

Event Timeline

pfischer set the point value for this task to 13. Mar 16 2026, 4:31 PM

Update:

  • jina-embeddings-v5-text-nano is based on EuroBERT. llama.cpp added support for it only recently and spark-nlp does not support it yet. I tried to rebuild spark-nlp with it but hit a blocker, so I am skipping it for now. An alternative might be to drop spark-nlp and fall back to hf sentence transformers (this requires some adaptation of the current pipeline).
  • pplx-embed-v1-0.6b might be a bit complicated, especially if we want to benefit from the context model, for which I need to adapt the pipeline to emit meaningful batches. I suspect I'll have some questions about how to build the batches so that they keep a meaningful context.
  • Started to extract embeddings with multilingual-e5-large-instruct, using a quantized (Q8) version that I built with llama.cpp tooling and uploaded to /user/analytics-search/spark-nlp/models/multilingual_e5_large_instruct_Q8_0_gguf

Next steps: depending on answers from the spark-nlp maintainers, I might try jina-embeddings-v5-text-small instead of nano. For pplx-embed I'm not sure how much effort is required to get the context right, but I could fall back to the simpler pplx-embed-v1 model.

Hi @dcausse! Thank you for the update!

I think that pplx-embed-v1 is a good alternative to pplx-embed-context-v1 if it is not feasible to process context.

As for pplx-embed-context-v1, my experience is that the input should be structured not as list[str] but as list[list[str]]. My understanding is that for each page or section, we should group the chunks into a list and encode them as a group. I think it is more feasible to encode sections, as a page might be too long.

Thanks so much @dcausse for this work!

  • jina-embeddings-v5-text-nano is based on EuroBERT. llama.cpp added support for it only recently and spark-nlp does not support it yet. I tried to rebuild spark-nlp with it but hit a blocker, so I am skipping it for now. An alternative might be to drop spark-nlp and fall back to hf sentence transformers (this requires some adaptation of the current pipeline).

Question: how long would it take to modify the pipeline to use hf sentence transformers? Thank you!

Quick update

  • multilingual-e5-large-instruct is done: 53M passages in 29h with 1000 cores
  • jina-embeddings-v5-text-small is running (started yesterday and is 40% done)

Currently exploring pplx-embed-context-v1; I will probably go with what @Trokhymovych suggested (combine individual sections rather than the whole page).

@Miriam to answer your question: I'm not sure. For a single run in a notebook it might not be too hard (unless I hit a wall trying to scale the job in hadoop). If it proves difficult in hadoop, someone with access to ml-studio could attempt to extract the embeddings.
For the semantic search "production" pipeline this is different:

  • for offline content embeddings extraction we want a reproducible self-contained pipeline (package the hf model in hdfs so that nothing is downloaded externally)
  • for online query embeddings extraction we need to check with the ML team (I think they expect the model to be vllm friendly)

jina-embeddings-v5-text-small is done: 53M passages in 51.3h with 1000 cores
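
For comparing the two runs reported so far, a rough per-core throughput can be derived from the figures above. This is only a back-of-envelope sketch: "53M" is a rounded passage count, and the helper name `throughput` is made up for illustration.

```python
from typing import Tuple

# Rounded passage count quoted in the thread, so results are approximate.
PASSAGES = 53_000_000

runs = {
    # model: (wall-clock hours, cores), as reported in the updates above
    "multilingual-e5-large-instruct": (29.0, 1000),
    "jina-embeddings-v5-text-small": (51.3, 1000),
}


def throughput(passages: int, hours: float, cores: int) -> Tuple[float, float]:
    """Return (core-hours, passages embedded per second per core)."""
    core_hours = hours * cores
    per_core_per_sec = passages / (core_hours * 3600)
    return core_hours, per_core_per_sec


for model, (hours, cores) in runs.items():
    core_hours, rate = throughput(PASSAGES, hours, cores)
    print(f"{model}: {core_hours:,.0f} core-hours, {rate:.2f} passages/s/core")
```

With these inputs the e5 run comes out at roughly 0.5 passages per second per core and the jina small run at roughly 0.3, which only says something about this particular cluster setup and quantization, not about the models in general.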

@Trokhymovych I think I have something barely working with pplx-embed-context and am now wondering how best to shape the passages.
I have something that groups passages belonging to the same section together and prepends the context (title, parent sections, section) to the lead paragraph, like this:

[$title]
[$parent_section1]
[$parent_section2]
[$section]

$text

The following paragraphs will not contain any info about the title/parent sections/section; examples at P90180.
The logic looks like this:

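from typing import List, Mapping


def contextual_passages(title: str, passages: List[Mapping]) -> List[List[str]]:
    """Group consecutive passages of the same section; only the first one carries the context."""
    sections = []
    current_section = []
    head_paragraph = None
    for p in passages:
        # Same section as the current head paragraph: append the bare text.
        if (head_paragraph is not None
                and head_paragraph["section"] == p["section"]
                and head_paragraph["parent_sections"] == p["parent_sections"]):
            current_section.append(p["text"])
        else:
            # New section: flush the previous group and start a new one.
            if current_section:
                sections.append(current_section)
                current_section = []

            # Head paragraph: prepend title, parent sections and section name.
            passage = f"[{title}]"
            if p["parent_sections"]:
                passage += "\n" + " ".join(f"[{s}]" for s in p["parent_sections"])
            if p["section"]:
                passage += "\n" + f"[{p['section']}]"
            passage += f"\n\n{p['text']}"
            head_paragraph = p
            current_section.append(passage)
    if current_section:
        sections.append(current_section)
    # Sanity check: every input passage must end up in exactly one group.
    if len(passages) != sum(len(s) for s in sections):
        raise ValueError(f"{passages} != {sections}")
    return sections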

Would you have suggestions to shape the input differently? thanks!

Hi @dcausse! Thank you for the updates.

As for your question, I think the proposed approach should work. That said, I would recommend including the title and section context (e.g., title, parent sections, and section name) in all paragraphs, not just the first one.

In principle, contextual embeddings are designed to capture this structure implicitly. However, I think that explicitly providing that context can still improve performance, particularly when chunks are retrieved independently.

This is the kind of decision that can be validated through an experiment on a benchmark dataset if needed.
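
To make the recommendation concrete, here is a minimal sketch of the "context in every paragraph" variant. It assumes passages are mappings with "parent_sections", "section" and "text" keys, as in the function posted earlier in this thread; the names `context_header` and `contextual_passages_all` are hypothetical, and the header puts each parent section on its own line as in the template shown earlier. It is an illustration, not the code that was actually run.

```python
from itertools import groupby
from typing import List, Mapping


def context_header(title: str, passage: Mapping) -> str:
    """Build the bracketed header: title, parent sections, then section."""
    lines = [f"[{title}]"]
    lines += [f"[{s}]" for s in passage.get("parent_sections") or []]
    if passage.get("section"):
        lines.append(f"[{passage['section']}]")
    return "\n".join(lines)


def contextual_passages_all(title: str, passages: List[Mapping]) -> List[List[str]]:
    """Group consecutive passages by section and prefix each one with its header."""
    def section_key(p: Mapping):
        return (tuple(p.get("parent_sections") or ()), p.get("section"))

    return [
        [f"{context_header(title, p)}\n\n{p['text']}" for p in group]
        for _, group in groupby(passages, key=section_key)
    ]


# Made-up sample data, only to show the output shape (list[list[str]]).
sample = [
    {"parent_sections": [], "section": None, "text": "Lead paragraph."},
    {"parent_sections": ["History"], "section": "Origins", "text": "First."},
    {"parent_sections": ["History"], "section": "Origins", "text": "Second."},
]
groups = contextual_passages_all("Example", sample)
print(groups[1][1])
```

The grouping still yields one inner list per section, so the contextual model sees the section as a unit, while every chunk also carries its own header for the case where it is retrieved on its own.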

As for your question, I think the proposed approach should work. That said, I would recommend including the title and section context (e.g., title, parent sections, and section name) in all paragraphs, not just the first one.

Sounds good. I started a job that re-uses the same "context" strategy used for the other, non-context-aware models, prepending the title, parent sections and section to the beginning of every passage.
Given the current progress in hadoop, I estimate the job will take around 40h, so hopefully early next week I can start extracting the query-result pairs for the following three models:

  • multilingual-e5-large-instruct
  • jina-embeddings-v5-text-small
  • pplx-embed-context-v1-0.6b

A couple of questions:

  • for pplx-embed-context-v1-0.6b my understanding is that it must be aligned with pplx-embed-v1 and I should use it for query embeddings without any instruction?
  • for multilingual-e5-large-instruct, what instruction would you suggest testing? I was going to pick "Given a web search query, retrieve relevant passages that answer the query", but there seem to be other plausible options. Would you have suggestions on this?

Great news, thank you for the update.

As for your questions:

  1. As for pplx-embed-context-v1-0.6b, you should use the same model (pplx-embed-context-v1-0.6b) to calculate query embeddings. Here is the reference for that: official documentation.
  2. I think your suggested instruction is good to go.
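
For reference, a small sketch of how the agreed instruction would be attached to a query for multilingual-e5-large-instruct. It assumes the "Instruct: ...\nQuery: ..." layout described on the model card, with passages embedded without any instruction; the helper name `format_e5_query` is made up here, and the exact layout should be checked against the model card before use.

```python
# Instruction text discussed in this thread.
TASK = "Given a web search query, retrieve relevant passages that answer the query"


def format_e5_query(query: str, task: str = TASK) -> str:
    """Prefix a query with the one-line task instruction (documents get no prefix)."""
    return f"Instruct: {task}\nQuery: {query}"


print(format_e5_query("who invented the telescope"))
```

Only the query side changes when a different instruction is tested, so the document embeddings would not need to be recomputed for such a comparison.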

Hi @dcausse! I wanted to check in on the progress of extracting the query-result pairs. Could you please share an estimated timeline for when this might be ready?

If you have any questions or need anything from my side, please let me know.

Thank you in advance!

@Trokhymovych I think I'll have the data by the end of day.
I have the embeddings extracted and indexed for

  • pplx-embed-context-v1-0.6b: 53M passages in 36h (1024 cores with beefier workers: 64*4G+16G, 16 cores)
  • embbedings-jinaai-embeddings-v5-nano: 53M passages in 19h (1024 cores, workers: 64*4G+8G, 16 cores)

pplx-embed was a bit fiddly and required some time to fine-tune...

I'm re-indexing multilingual-e5-large-instruct because for some reason the resulting index is smaller than expected, so something went wrong.
Unfortunately I have to drop the embbedings-jinaai-embeddings-v5-small embeddings because I forgot to add the "Document: " instruction...

In short I should have the query-result pairs for the three models we initially planned to test:

  • multilingual-e5-large-instruct (int8 quantization)
  • pplx-embed-context-v1-0.6b
  • embbedings-jinaai-embeddings-v5-nano

Regarding the output format, do you want the top 10 paragraphs per page, like what we did for the miracl dataset, or only the best one?

Great, thank you for the update and for the question.

For this experiment, I need the results aligned with the original system (one chunk per page). It will allow us to have comparable results with previous experiments (Qwen model) on our dataset.
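
As an illustration of the "one chunk per page" shape, the sketch below collapses a ranked list of passage hits to the best passage of each page and keeps the top 10 pages. The `(page_id, passage_id, score)` tuple layout and the function name are assumptions made for this example, and it assumes a higher score means a better match; for a distance-based score the comparison and sort order would need to be flipped.

```python
from typing import Dict, Iterable, List, Tuple

# (page_id, passage_id, score); hypothetical field layout for illustration.
Hit = Tuple[str, str, float]


def best_chunk_per_page(hits: Iterable[Hit], k: int = 10) -> List[Hit]:
    """Keep the highest-scoring passage of each page, then return the top k pages."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        page_id, _, score = hit
        if page_id not in best or score > best[page_id][2]:
            best[page_id] = hit
    return sorted(best.values(), key=lambda h: h[2], reverse=True)[:k]


# Made-up hits: page A matches twice, only its best passage survives.
hits = [("A", "A#1", 0.91), ("B", "B#4", 0.88), ("A", "A#3", 0.95), ("C", "C#2", 0.70)]
print(best_chunk_per_page(hits, k=2))
```

Note that collapsing after retrieval can return fewer than k pages if the retrieved passages are concentrated on a few pages, so the underlying passage query may need to fetch more than k hits.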

Thank you!

@Trokhymovych please find the query-result pairs in /user/dcausse/semantic_search/T419397/query_result_pairs/ (file names should be self-explanatory).
My problem with multilingual-e5-large-instruct was due to its small context size: around 300k passages were affected. I reran those, allowing the model to ignore extra tokens.

Recording a couple of technical points:

  • embbedings-jinaai-embeddings-v5-nano
    • quantization: model defaults (bf16)
    • vector dimension 768, normalized
    • index size: 178.9gb (+/- 10%) across 15 shards
    • vector engine, faiss ondisk, hnsw, l2 space
    • spark setup: (hf transformers API, 1024 cores, workers: 64*4G+8G, 16 cores)
    • time for doc embeddings: 19h
  • multilingual-e5-large-instruct
    • quantization: INT8
    • vector dimension 1024, normalized
    • index size: 231.1gb (+/- 10%) across 15 shards
    • vector engine, faiss ondisk, hnsw, l2 space
    • spark setup: (spark-nlp/llama.cpp, 1000 cores, workers: 125*8G+0G, 8 cores)
    • time for doc embeddings: 29h
  • pplx-embed-context-v1-0.6b
    • quantization: model defaults
    • vector dimension 1024 quantized byte vectors (int8)
    • index size: 68gb (+/- 10%) across 15 shards
    • vector engine, lucene ondisk, hnsw, cosinesim space (faiss does not support cosine space in our version for byte vectors)
    • spark setup: (hf transformers, 1024 cores, workers: 64*4G+16G, 16 cores)
    • time for doc embeddings: 36h
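
As a sanity check on the index sizes listed above, the raw vector payload can be estimated from the passage count and vector dimensions. This is a back-of-envelope sketch under stated assumptions: "53M" is a rounded count, the two faiss indexes are assumed to store 4-byte float components (the thread does not say so), the pplx index stores 1-byte components as noted above, and "gb" is read as decimal gigabytes.

```python
# Rounded passage count from the thread.
PASSAGES = 53_000_000

indexes = {
    # name: (dimensions, assumed bytes per component, reported index size in GB)
    "embbedings-jinaai-embeddings-v5-nano": (768, 4, 178.9),
    "multilingual-e5-large-instruct": (1024, 4, 231.1),
    "pplx-embed-context-v1-0.6b": (1024, 1, 68.0),
}


def raw_vector_gb(n_vectors: int, dims: int, bytes_per_component: int) -> float:
    """Size of the bare vectors in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_vectors * dims * bytes_per_component / 1e9


for name, (dims, width, reported) in indexes.items():
    raw = raw_vector_gb(PASSAGES, dims, width)
    print(f"{name}: raw vectors ~{raw:.1f} GB, reported index {reported} GB")
```

Under these assumptions the raw vectors come to roughly 163 GB, 217 GB and 54 GB respectively, each somewhat below the reported index size. The remainder would plausibly be the HNSW graph and other index data, but that split is a guess, not something measured here.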

hacks required:

  • pplx-embed (hardcode trust_remote_code=True to the tokenizer to avoid a prompt at load time)
diff --git a/modeling.py b/modeling.py
index cf83677..4689c9b 100644
--- a/modeling.py
+++ b/modeling.py
@@ -118,7 +118,7 @@ class PPLXQwen3ContextualModel(PPLXQwen3Model):
                 f"Did you forget to load with trust_remote_code=True?"
             )
 
-        self.tokenizer = AutoTokenizer.from_pretrained(config._name_or_path)
+        self.tokenizer = AutoTokenizer.from_pretrained(config._name_or_path, trust_remote_code=True)
         self._flexible_quantizer = FlexibleQuantizer()
 
     @staticmethod
  • jina-embeddings-v5-nano (add config_class static property to JinaEmbeddingsV5Model, used by hf auto_factory.py when the model code is already loaded):
diff --git a/modeling_jina_embeddings_v5.py b/modeling_jina_embeddings_v5.py
index 123068a..03dcdfb 100644
--- a/modeling_jina_embeddings_v5.py
+++ b/modeling_jina_embeddings_v5.py
@@ -13,6 +13,8 @@ from .modeling_eurobert import EuroBertModel
 
 
 class JinaEmbeddingsV5Model(PeftMixedModel):
+    config_class = JinaEmbeddingsV5Config
+
     @classmethod
     def register_for_auto_class(cls, auto_class="AutoModel"):
         return PreTrainedModel.register_for_auto_class.__func__(cls, auto_class)