
Semantic search prototype
Closed, Resolved, Public

Description

The goal is to deliver two semantic search prototypes that surface section text, with a path to compare full-section vs sentence-level results.

  • Prototype 1 (simple wiki): Modify the current interface to surface the full section text (text already indexed).
  • Prototype 2 (tbd wiki): Choose and index a second wiki; implement a pipeline using the same section level embeddings approach as for prototype 1.
  • Investigate whether we can use sentence embeddings to output a more specific selection of text per search result.

Out of scope: deeper research on optimal snippets and advanced IR/LLM methods beyond the sentence-level baseline; production-grade user experiments.
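To make the sentence-level baseline concrete, here is a minimal sketch of picking the single most relevant sentence out of a section that was already retrieved. It is an illustration only, not the prototype's code: `split_sentences`, `toy_embed`, `cosine` and `best_sentence` are hypothetical names, the regex sentence splitter is naive, and `toy_embed` is a bag-of-words stand-in so the example runs without a model. A real pipeline would pass an actual sentence-embedding function instead.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def toy_embed(text):
    """Hypothetical stand-in for a sentence-embedding model: word counts."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_sentence(query, section_text, embed=toy_embed):
    """Return the sentence of section_text most similar to the query, or None."""
    sentences = split_sentences(section_text)
    if not sentences:
        return None
    query_vec = embed(query)
    return max(sentences, key=lambda s: cosine(query_vec, embed(s)))


section = (
    "The Moon is Earth's only natural satellite. "
    "It orbits Earth every 27 days. "
    "Tides are caused by the Moon's gravity."
)
print(best_sentence("what causes tides", section))
```

Comparing full-section and sentence-level results then amounts to showing either the whole `section` or only the value returned by `best_sentence` for the same hit.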

Details

Due Date
Sep 26 2025, 4:00 AM

Event Timeline

fkaelin set Due Date to Sep 26 2025, 4:00 AM.
fkaelin moved this task from Backlog to In Progress on the Research board.

Semantic search prototype updates:

  • An updated semantic-search prototype is available on https://semantic-search.wmcloud.org, hosted on CloudVPS (it is very slow)
  • There is a dropdown to choose which index to use, with the following options:
    • section level search for the following wikis: simple, en_space (en pages that exist in simple), Turkish (tr) and Greek (el)
    • paragraph level search for simple and en_space
    • paragraph level results return a paragraph as "section_text", while the hyperlink points to the section that contains the paragraph. The index number identifies the paragraph, so multiple results can point to the same section but to different paragraphs.
  • Performance:
    • The prototype on CloudVPS is very slow, and slower still when several people use it at once; a query can take more than 10 s. The instance is I/O bound, i.e. the data is read at the speed of a spinning disk (there is a phab task requesting a quota increase), but ideally there would be more RAM. Another option is to host the service on the DSE cluster.
    • The prototype runs much faster on a stat machine, http://stat1010.eqiad.wmnet:8000/ (requires an ssh tunnel). That instance also hosts additional indices, e.g. en lead sections for all pages, de all sections, etc. There are 185 GB of index files there, vs 65 GB on CloudVPS.
  • Observations:
    • The embeddings were recomputed so that the page title and section name are always prepended to the embedding text; this makes a difference for the paragraph level embeddings.
    • I did not experiment with models other than e5 large instruct, nor with prompts other than f"""Instruct: Given a natural language query, retrieve relevant sections of wikipedia articles that answer the query Query: {query}""".
    • These prototypes are hopefully useful for the next phase of product discussions. I think we are still some way from a concrete product / user story that makes use of semantic search / embeddings.
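For readers who want a picture of the text handling described above, the following is a rough sketch, not the actual implementation. It assumes the query prompt puts a newline between the instruction and the query (the convention on the e5 instruct model card; the update above does not show the exact layout), and it assumes a newline-separated layout for the prepended page title and section name. `build_query_text`, `build_passage_text`, `ParagraphHit` and `group_hits_by_section` are hypothetical names.

```python
from collections import defaultdict
from dataclasses import dataclass

# Task description quoted from the update above.
TASK = (
    "Given a natural language query, retrieve relevant sections of "
    "wikipedia articles that answer the query"
)


def build_query_text(query):
    """Format a query for the instruct model (newline layout is an assumption)."""
    return f"Instruct: {TASK}\nQuery: {query}"


def build_passage_text(page_title, section_name, text):
    """Prepend page title and section name to the text that gets embedded.

    The newline separator is an assumption; the source only says both are prepended.
    """
    return f"{page_title}\n{section_name}\n{text}"


@dataclass
class ParagraphHit:
    """One paragraph-level result; 'section_text' holds the paragraph itself."""
    page_title: str
    section_name: str
    paragraph_index: int
    section_text: str
    score: float


def group_hits_by_section(hits):
    """Group paragraph hits that link to the same (page, section)."""
    grouped = defaultdict(list)
    for hit in hits:
        grouped[(hit.page_title, hit.section_name)].append(hit)
    return dict(grouped)


hits = [
    ParagraphHit("Moon", "Orbit", 0, "The Moon orbits Earth.", 0.91),
    ParagraphHit("Moon", "Orbit", 2, "Its orbit is slightly elliptical.", 0.88),
    ParagraphHit("Tide", "Causes", 1, "Tides are caused by gravity.", 0.85),
]
for (page, section), members in group_hits_by_section(hits).items():
    print(page, section, [h.paragraph_index for h in members])
```

Grouping by section is one way a UI could collapse several paragraph hits that link to the same section; whether the prototype does this is not stated here.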

Hey @fkaelin ,

Can you share the implementation (dataset generation and application)?
I'm curious to know how it works in more detail, and it should help with the QA part as well.

Thank you!