
Vector Search PoC
Closed, ResolvedPublic

Description

We would like to get a better understanding of vector search capabilities offered by OpenSearch.

The goal of this task is to spike a PoC, running locally on a small wiki, that integrates OpenSearch with the outlink model developed by Research.

Some questions we want to answer:

  • What will mappings look like for docs?
  • Is vector search part of the vanilla OpenSearch, or would we need additional plugins?
  • What would a query look like? Where do we retrieve embeddings from?
  • Update relforge indices with an embeddings field to enable vector search
  • How does vector search compare to "more like"?
    • Can we leverage LLMs to compare sets of recommendations?

WIP code to support this work is available at: https://gitlab.wikimedia.org/gmodena/vector_search

Event Timeline

gmodena updated Other Assignee, added: dcausse.
gmodena renamed this task from [NEEDS GROOMING] Vector Search PoC to Vector Search PoC.Mar 25 2025, 9:22 PM
gmodena updated the task description. (Show Details)

In this spike we wanted to answer the following questions:

What will mappings look like for docs?

index.py contains an example mapping for 50-dimensional embeddings, and HNSW
search. See https://gitlab.wikimedia.org/gmodena/vector_search/-/tree/main?ref_type=heads#embeddings as an example
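As a sketch of what such a mapping looks like (field names and parameter values here are illustrative, not the exact ones used in index.py), a 1.x-style index with the k-NN plugin could be configured like this:

```python
# Illustrative OpenSearch 1.x mapping for 50-dimensional embeddings with
# HNSW approximate nearest-neighbour search via the k-NN plugin.
# Field name "embedding" and the method parameters are examples.
index_body = {
    "settings": {
        "index": {
            "knn": True,  # enable the k-NN plugin for this index
        }
    },
    "mappings": {
        "properties": {
            "embedding": {
                "type": "knn_vector",
                "dimension": 50,
                "method": {
                    "name": "hnsw",              # HNSW graph-based ANN
                    "space_type": "cosinesimil",  # cosine similarity
                    "engine": "nmslib",
                },
            }
        }
    },
}

# With opensearch-py the index would then be created with something like:
# client.indices.create(index="enwiki_content_vectors", body=index_body)
```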

Is vector search part of the vanilla OpenSearch, or would we need additional plugins?

Vector search is part of vanilla OpenSearch as of 2.x. 1.x (the version we currently target) requires
the k-NN plugin.

What would a query look like? Where do we retrieve embeddings from?

OpenSearch provides a Neural Search API that can map tokens to embeddings, but this API is not available in version 1.x.
In 1.x, we would need to run inference with an embedding model ourselves before storing and searching embeddings.
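A minimal sketch of what such a 1.x query could look like, assuming the embedding is computed client-side (e.g. with the outlink model) and a `knn_vector` field named "embedding" (both the field name and `k` are illustrative):

```python
# Illustrative k-NN query for OpenSearch 1.x. The query vector must be
# produced by an external embedding model before searching; the plugin
# only matches raw vectors, it does not run inference.
query_vector = [0.1] * 50  # placeholder 50-d embedding from an external model

knn_query = {
    "size": 10,
    "query": {
        "knn": {
            "embedding": {
                "vector": query_vector,
                "k": 10,  # nearest neighbours to retrieve per shard/segment
            }
        }
    },
}

# With opensearch-py the search would then be issued with something like:
# client.search(index="enwiki_content_vectors", body=knn_query)
```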

How does vector search compare to "more like"?
Can we leverage LLMs to compare sets of recommendations?

We explore both of these questions in the LLM judge section at https://gitlab.wikimedia.org/gmodena/vector_search/-/tree/main?ref_type=heads#experiment

Unfortunately we have not yet been able to test on relforge, but once the k-NN plugin is available it should not be too hard to adjust the
indexing and query scripts.

Gehel triaged this task as Medium priority.Mar 28 2025, 2:38 PM

@gmodena thanks!

I started to look into the mapping config and stumbled on this:

"space_type": "cosinesimil", # TODO: does it make sense for article topic vectors?

Cosine similarity will ignore vector norms, and I'm wondering whether in that case we should instead normalize the vectors at index time and use the inner product (allowing us to use faiss instead of the deprecated nmslib).
Looking at the articletopic embeddings, they don't appear to be normalized, so I'm wondering what kind of information we're losing here and what the difference would be using inner product on the raw vectors.

Relforge is now ready, I'll try to import the embeddings in some indices using your tool with some variation of the mapping to see if we can spot any difference.

@gmodena thanks!

I started to look into the mapping config and stumbled on this:

"space_type": "cosinesimil", # TODO: does it make sense for article topic vectors?

Cosine similarity will ignore vector norms, and I'm wondering whether in that case we should instead normalize the vectors at index time and use the inner product (allowing us to use faiss instead of the deprecated nmslib).
Looking at the articletopic embeddings, they don't appear to be normalized, so I'm wondering what kind of information we're losing here and what the difference would be using inner product on the raw vectors.

In the few experiments I've run, I did not notice any significant difference in results across metrics. But it's really hard to draw conclusions from just a couple of examples.
Thinking out loud about normalization:

the current setup captures both direction and magnitude of the vectors, and magnitude can sometimes encode additional information. I'm unsure how relevant (if at all) this would be in the context of article topic embeddings.
Switching to normalized vectors would drop that signal and focus only on direction, i.e., semantic similarity regardless of how "strong" the embedding is.

Does this track?
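To make the trade-off concrete, a small self-contained sketch (random 50-d vectors, pure Python) of why index-time normalization plus inner product reproduces cosine similarity exactly, while inner product on the raw vectors also mixes in the magnitudes:

```python
import math
import random

random.seed(42)
a = [random.gauss(0, 1) for _ in range(50)]
b = [random.gauss(0, 1) for _ in range(50)]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

# Cosine similarity on the raw vectors...
cosine = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# ...equals the inner product of the L2-normalized vectors, so an
# inner-product space over normalized vectors reproduces cosinesimil
# rankings exactly.
inner = dot(l2_normalize(a), l2_normalize(b))
assert math.isclose(cosine, inner)

# Inner product on the RAW vectors is a different quantity: it scales the
# cosine by both magnitudes (dot(a, b) == cosine * |a| * |b|), which is
# the extra "strength" signal that normalization drops.
raw_inner = dot(a, b)
assert math.isclose(
    raw_inner,
    cosine * math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)),
)
```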

Relforge is now ready, I'll try to import the embeddings in some indices using your tool with some variation of the mapping to see if we can spot any difference.

Cool! Let me know if I can help.

Change #1135430 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/puppet@production] opensearch: allow setting LD_LIBRARY_PATH

https://gerrit.wikimedia.org/r/1135430

Change #1135441 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/puppet@production] cirrussearch: enable knn native lib

https://gerrit.wikimedia.org/r/1135441

Change #1135430 merged by Btullis:

[operations/puppet@production] opensearch: allow setting LD_LIBRARY_PATH

https://gerrit.wikimedia.org/r/1135430

Wrote a small demo available on stat1009; you need a tunnel there with ssh -L12222:localhost:12222 stat1009.eqiad.wmnet, and then open http://localhost:12222/.

Inner product on raw vectors is completely off the rails (I kept it in the demo just out of curiosity, but it makes no sense). I'd be curious to understand why...
nmslib appears faster even though it's the one being deprecated... I'd be curious to test the Lucene engine, but that is only available in OpenSearch 2.
My understanding is that native k-NN libraries like faiss use a post-filtering approach, while the Lucene engine might allow pre-filtering.

The stats API reports memory usage for the 3 vector fields on the 3 indices of 2.5G to 6.2G per node:

curl -s localhost:9200/_plugins/_knn/*/stats/graph_memory_usage_percentage,graph_memory_usage,indices_in_cache | jq .

{
  "_nodes": {
    "total": 4,
    "successful": 4,
    "failed": 0
  },
  "cluster_name": "relforge-eqiad",
  "nodes": {
    "pbxphvgwTgWKZO4nA2w7dg": {
      "indices_in_cache": {
        "gmodena_enwiki_content_20250321": {
          "graph_memory_usage_percentage": 7.6853685,
          "graph_memory_usage": 3850782,
          "graph_count": 696
        },
        "gmodena_frwiki_content_20250321": {
          "graph_memory_usage_percentage": 3.1588252,
          "graph_memory_usage": 1582741,
          "graph_count": 309
        },
        "gmodena_itwiki_content_20250321": {
          "graph_memory_usage_percentage": 1.6908349,
          "graph_memory_usage": 847199,
          "graph_count": 261
        }
      },
      "graph_memory_usage_percentage": 12.535028,
      "graph_memory_usage": 6280722
    },
    "i4UJCIDrRAGqQ4bRzqCltw": {
      "graph_memory_usage": 5401842,
      "indices_in_cache": {
        "gmodena_enwiki_content_20250321": {
          "graph_memory_usage": 3368414,
          "graph_memory_usage_percentage": 6.722661,
          "graph_count": 609
        },
        "gmodena_itwiki_content_20250321": {
          "graph_memory_usage": 847709,
          "graph_memory_usage_percentage": 1.6918526,
          "graph_count": 243
        },
        "gmodena_frwiki_content_20250321": {
          "graph_memory_usage": 1185719,
          "graph_memory_usage_percentage": 2.366451,
          "graph_count": 240
        }
      },
      "graph_memory_usage_percentage": 10.780965
    },
    "-i4KxzEaT1uk46rhLcX45A": {
      "graph_memory_usage": 3376597,
      "indices_in_cache": {
        "gmodena_enwiki_content_20250321": {
          "graph_memory_usage": 1926679,
          "graph_memory_usage_percentage": 1.6590183,
          "graph_count": 369
        },
        "gmodena_frwiki_content_20250321": {
          "graph_memory_usage": 790710,
          "graph_memory_usage_percentage": 0.6808619,
          "graph_count": 189
        },
        "gmodena_itwiki_content_20250321": {
          "graph_memory_usage": 659208,
          "graph_memory_usage_percentage": 0.5676286,
          "graph_count": 195
        }
      },
      "graph_memory_usage_percentage": 2.9075089
    },
    "OeSFnFn4SKyE3vlcEGHu0g": {
      "graph_memory_usage": 2401924,
      "indices_in_cache": {
        "gmodena_enwiki_content_20250321": {
          "graph_memory_usage": 1441558,
          "graph_memory_usage_percentage": 1.241292,
          "graph_count": 267
        },
        "gmodena_itwiki_content_20250321": {
          "graph_memory_usage": 565362,
          "graph_memory_usage_percentage": 0.48682,
          "graph_count": 156
        },
        "gmodena_frwiki_content_20250321": {
          "graph_memory_usage": 395004,
          "graph_memory_usage_percentage": 0.34012872,
          "graph_count": 87
        }
      },
      "graph_memory_usage_percentage": 2.0682406
    }
  }
}

I've been playing with grouping top results by cluster, this is at http://localhost:12222/clustered. Could be interesting in the context of diversity search.
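As a hypothetical sketch of that grouping step (the field name "cluster" and the hit shapes are illustrative, not the demo's actual data model), bucketing top hits by a per-document cluster label looks like:

```python
from collections import defaultdict

# Illustrative top-k hits, already sorted by score, each carrying a
# hypothetical cluster label computed from its embedding.
hits = [
    {"title": "A", "score": 0.95, "cluster": 3},
    {"title": "B", "score": 0.91, "cluster": 1},
    {"title": "C", "score": 0.88, "cluster": 3},
    {"title": "D", "score": 0.84, "cluster": 2},
]

# Group hits by cluster; insertion order preserves the score ordering
# within each bucket.
clusters = defaultdict(list)
for hit in hits:
    clusters[hit["cluster"]].append(hit)

# One simple diversity heuristic: interleave one hit per cluster in turn,
# so no single cluster dominates the top of the result list.
diversified = []
buckets = list(clusters.values())
while any(buckets):
    for bucket in buckets:
        if bucket:
            diversified.append(bucket.pop(0))
```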

Change #1135441 merged by Bking:

[operations/puppet@production] cirrussearch: enable knn native lib

https://gerrit.wikimedia.org/r/1135441

@gmodena thanks for working on this!

We discussed this experiment in our last Wednesday meeting, and the conclusions are:

  • the retrieval speed is very interesting compared to morelike: nmslib ~40ms, faiss ~80ms, morelike ~300ms
  • the quality of suggestions is definitely hard to judge and highly dependent on user intent/needs. We still have a small bias toward morelike, as it seems to return a wider variety of results: e.g., for a scientist you might get events, works & theories related to that person, whereas the articletopic embeddings might prefer other scientists. But we agreed that this should be put in front of users via an A/B test to let them decide.
  • we haven't explored the possibility of combining multiple vectors to see if we can expand the variety of results (e.g., take the page vector and a couple of vectors from its outgoing links, combine them, and search)
  • we have a few concerns about the relatively high memory usage for a 50-d vector
  • we would like to explore the impact on an index that is updated in real time
  • we're interested in knowing what the Lucene engine (available in OpenSearch 2) might bring us, esp. for a vector field on the existing search indices
  • the use of the articletopic embeddings to diversify search results could be interesting, but it raises a lot of questions about how to include this in our training pipeline
  • regarding the use of LLMs to judge different engines, we're not yet clear on how to proceed and need more discussion