
Chunk, trim and generate passage embeddings from enterprise structured content snapshots
Open, Needs Triage, Public, 8 Estimated Story Points

Description

Prior to indexing a knn field in OpenSearch, we want to extract passage embeddings from the enterprise structured dumps.

For this we need a job that parses the structured content JSON dumps:

  • flatten the passages out of the sections tree
  • trim some content (very long paragraphs, very short paragraphs, long lists; all of this is still up for discussion)
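Since the trimming heuristics are still up for discussion, here is a minimal sketch of length-based filtering; the character thresholds are illustrative assumptions, not agreed values:

```python
def keep_passage(text: str, min_chars: int = 40, max_chars: int = 2000) -> bool:
    """Drop passages too short to carry meaning or too long to embed well.

    The thresholds are placeholders pending the discussion mentioned above.
    """
    n = len(text.strip())
    return min_chars <= n <= max_chars
```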

The metadata to keep for every passage should include:

  • the section name
  • the list of parent sections
  • the text
  • the type of passage (paragraph or list item)
  • the index of the passage in the document
  • the depth of the section the passage is extracted from
  • the depth of the list the list item is extracted from
  • the position in the list the list item is extracted from
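The flattening step above could look like the following sketch. The input shape (keys "name", "paragraphs", "lists", "sections") is an assumption for illustration, not the actual snapshot schema, and nested lists are simplified to depth 1:

```python
def flatten_sections(section, parents=(), counter=None):
    """Yield one flat record per passage, carrying the metadata listed above.

    `counter` is a one-element list used as a mutable document-wide index.
    """
    counter = counter if counter is not None else [0]
    depth = len(parents)
    for text in section.get("paragraphs", []):
        yield {"section": section["name"], "parents": list(parents),
               "text": text, "type": "paragraph", "index": counter[0],
               "section_depth": depth, "list_depth": None, "list_position": None}
        counter[0] += 1
    for items in section.get("lists", []):
        for pos, text in enumerate(items):
            yield {"section": section["name"], "parents": list(parents),
                   "text": text, "type": "list_item", "index": counter[0],
                   "section_depth": depth, "list_depth": 1, "list_position": pos}
            counter[0] += 1
    for child in section.get("sections", []):
        yield from flatten_sections(child, (*parents, section["name"]), counter)
```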

Embeddings have to be extracted using a model, possibly via spark-nlp. The text to analyse should include the context in which the passage appears; a possible approach is:
[Title] [Section 1] [Sub section 1] text
All of this is also up for discussion and could change if the model context window is large enough.
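Building that context-prefixed input could be as simple as the following sketch (the bracketed-prefix format follows the scheme above; the function name is illustrative):

```python
def embedding_input(title, parent_sections, section, text):
    """Prefix the passage text with its title/section path, following the
    "[Title] [Section 1] [Sub section 1] text" scheme sketched above."""
    context = [title, *parent_sections, section]
    return " ".join(f"[{c}]" for c in context) + " " + text
```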

The model evaluated at the moment is Qwen3 0.6B and for spark-nlp we were able to extract the embeddings in the hadoop cluster using the quantized GGUF version.

To optimize the weekly extraction of all embeddings we must reuse as much as we can from previous runs. Numbers show that for enwiki only 0.5M of the 70M passages have to be recomputed weekly.
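One way to implement that reuse is to key each passage by a hash of the exact model input, so an unchanged passage (including its section context) maps to the same cached vector. The in-memory dict below is an illustrative stand-in for a join against the previous run's output table:

```python
import hashlib

def passage_key(model_input: str) -> str:
    """Stable key over the exact text fed to the model."""
    return hashlib.sha256(model_input.encode("utf-8")).hexdigest()

def split_work(current_inputs, previous_vectors):
    """Split inputs into vectors reusable from the last weekly run and
    inputs whose embeddings still have to be computed."""
    reused, to_compute = {}, []
    for text in current_inputs:
        k = passage_key(text)
        if k in previous_vectors:
            reused[k] = previous_vectors[k]
        else:
            to_compute.append(text)
    return reused, to_compute
```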

To avoid relying on the spark-nlp model store, we will have to download the model manually to HDFS and use AutoGGUFEmbeddings.load("/model_path_in_hdfs") instead of pretrained().
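A pipeline fragment sketching that load-from-HDFS approach; the HDFS path and column names are placeholders, and this assumes a running Spark session with spark-nlp on the classpath:

```python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import AutoGGUFEmbeddings

document = DocumentAssembler() \
    .setInputCol("model_input") \
    .setOutputCol("document")

# Load the manually-downloaded GGUF model from HDFS instead of calling
# pretrained(), which would go through the spark-nlp model store.
embeddings = AutoGGUFEmbeddings \
    .load("hdfs:///path/to/qwen3-0.6b-gguf") \
    .setInputCols(["document"]) \
    .setOutputCol("embeddings")

pipeline = Pipeline(stages=[document, embeddings])
```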

AC:

  • structured content snapshots are parsed and flattened into a set of passages
  • decide on a path in HDFS to store spark-nlp embedding models and download Qwen3 in a format suitable for spark-nlp
    • document this manual process somewhere
  • new passage embeddings are extracted using a model (Qwen3 0.6B for now)
    • evaluate whether we can leverage the engine's KV cache by running passages with similar contexts close to each other on the same node
  • resulting vectors are stored in HDFS, ready to be picked up by a process that indexes a knn field in OpenSearch
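For the KV-cache evaluation in the acceptance criteria, the minimal version of the idea is simply ordering passages so that those sharing a context prefix are consecutive. A sketch, assuming records carrying a "context" string and the document-wide "index":

```python
def order_for_kv_cache(records):
    """Sort passages so that those with identical context prefixes are
    adjacent, letting consecutive prompts reuse the engine's KV cache
    for the shared "[Title] [Section] ..." prefix."""
    return sorted(records, key=lambda r: (r["context"], r["index"]))
```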

Event Timeline

pfischer set the point value for this task to 8.Jan 12 2026, 4:52 PM