
Collect candidate search results for set of sample queries
Closed, Resolved, Public

Description

Once we have identified a sample of queries (T408121), we need to collect candidate search results that raters can later annotate for relevance.

Some considerations

  • how to generate candidates? Wikipedia search, or external search engines?
  • how many results per query?

Event Timeline

  • Explored the options for external search engines. Most of them cannot be used because our use case might not comply with the law or with their Terms of Service.
  • Created an initial pipeline to search for candidates using Wikipedia's internal search. The results are served in the following format:
{
    "title": "<title>",
    "snippet": "<snippet>",
    "pageid": <pageid>
}
  • Explored possible logic for reranking paragraphs from the selected candidates' pages. I recommend proceeding with pretrained cross-encoder models. In particular, I tested a pipeline based on jina-reranker (pipeline notebook with usage example). Better models, such as Qwen3-Reranker, could potentially improve performance, but I currently face infrastructure constraints.
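
For illustration, the Wikipedia search step from the list above could be sketched as follows. This is a minimal sketch, not the actual pipeline: it assumes English Wikipedia and the public MediaWiki Action API (`action=query&list=search`), and the names `to_candidates` and `search_wikipedia`, the User-Agent string, and the decision to strip highlight markup from snippets are all illustrative assumptions.

```python
import html
import json
import re
import urllib.parse
import urllib.request

# Assumption: English Wikipedia; the task does not say which wiki is queried.
API_URL = "https://en.wikipedia.org/w/api.php"


def to_candidates(payload):
    """Convert a MediaWiki search response into title/snippet/pageid records."""
    hits = payload.get("query", {}).get("search", [])
    return [
        {
            "title": hit["title"],
            # Snippets contain HTML highlight markup; removing it is our choice here.
            "snippet": html.unescape(re.sub(r"<[^>]+>", "", hit.get("snippet", ""))),
            "pageid": hit["pageid"],
        }
        for hit in hits
    ]


def search_wikipedia(query, limit=10):
    """Fetch up to `limit` page-level search results for `query` (network call)."""
    params = urllib.parse.urlencode({
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srlimit": limit,
        "format": "json",
    })
    request = urllib.request.Request(
        f"{API_URL}?{params}",
        # Placeholder User-Agent; replace with real contact details before use.
        headers={"User-Agent": "candidate-collection-sketch/0.1 (placeholder contact)"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return to_candidates(json.load(response))
```

Keeping the parsing in `to_candidates` separate from the network call makes the record format easy to check against a saved response.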

Implemented the search results selection logic with the following workflow:

  • Collect search results from the search engine at the page level (currently limited to Wikipedia; this should be extended to additional sources to avoid selection bias).
  • Compute ranking scores for all paragraphs within the retrieved pages using a cross-encoder reranker model.
  • Select the top five ranked paragraphs, with a maximum of two paragraphs per source page.
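
The selection step in this workflow can be sketched as below. This is only an illustration under stated assumptions: `select_paragraphs` and `overlap_score` are hypothetical names, paragraphs are assumed to arrive as `(pageid, text)` pairs, and `overlap_score` is a toy stand-in for the cross-encoder reranker, which would be passed in as `score_fn` in a real run.

```python
from collections import Counter


def overlap_score(query, text):
    """Toy stand-in scorer: fraction of query tokens that appear in the text."""
    query_tokens = set(query.lower().split())
    return len(query_tokens & set(text.lower().split())) / max(len(query_tokens), 1)


def select_paragraphs(query, paragraphs, score_fn, top_k=5, max_per_page=2):
    """Pick the top_k highest-scoring paragraphs, capped at max_per_page per page.

    `paragraphs` is an iterable of (pageid, text) pairs; `score_fn(query, text)`
    returns a relevance score where higher means more relevant.
    """
    scored = [(score_fn(query, text), pageid, text) for pageid, text in paragraphs]
    # Sort by score only; the sort is stable, so ties keep their input order.
    scored.sort(key=lambda item: item[0], reverse=True)

    per_page = Counter()
    selected = []
    for score, pageid, text in scored:
        if per_page[pageid] >= max_per_page:
            continue  # this page already contributed its quota
        selected.append({"pageid": pageid, "paragraph": text, "score": score})
        per_page[pageid] += 1
        if len(selected) == top_k:
            break
    return selected
```

Applying the per-page cap while walking the ranked list, rather than filtering first, means a skipped paragraph is replaced by the next best paragraph from another page.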

Final update: the task is completed.

  • We collect the top-10 article results from Wikipedia search and from an external search engine, respectively.
  • We identify the top-10 paragraphs from the selected articles (with at most 2 paragraphs from the same article).
  • We collected candidate results for the 600 final selected queries.
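
As a rough illustration of the first step above, the article lists from the two sources could be pooled before paragraph selection. The sketch below is an assumption, not the implemented logic: the task does not say how overlapping articles are handled, so merging by title, the `pool_articles` name, and the source labels are all hypothetical.

```python
def pool_articles(results_by_source, per_source=10):
    """Merge ranked article lists from several sources into one candidate pool.

    `results_by_source` maps a source label to a ranked list of dicts that each
    have a "title" key. Articles returned by more than one source are kept once,
    with the rank they had in each source (merging by title is an assumption).
    """
    pooled = {}
    for source, results in results_by_source.items():
        for rank, result in enumerate(results[:per_source], start=1):
            entry = pooled.setdefault(
                result["title"], {"title": result["title"], "sources": {}}
            )
            entry["sources"][source] = rank
    return list(pooled.values())
```

Recording the per-source rank keeps it possible to check later whether relevance judgments differ by the engine that surfaced an article.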