
Evaluate vector search models for article recommendation using reading lists as ground truth
Closed, ResolvedPublic

Description

The aim of this task is to use reading lists to evaluate (and potentially train) models for recommending related articles that are interesting for readers.

We want to run an experiment to recommend related articles:

  • Define experiment setup (ground truth data and evaluation metric)
  • Evaluate morelike as a baseline
  • Evaluate vector search with topics embeddings
  • Evaluate vector search with text embeddings using LLMs
  • (stretch) fine-tune (train) text embedding model with data from reading lists

Context: We conducted exploratory analysis on reading lists in T382493. This suggested that articles appearing together in the same reading list can be considered relevant recommendations as related articles, since they were specifically curated by readers. This allows us to use reading lists as a ground-truth dataset to systematically evaluate different models for recommending related articles. The current baseline for generating related articles in Wikipedia is CirrusSearch's morelike. Research has developed tooling to generate embeddings for similarity search (aka vector search). This can be used to develop an alternative (and hopefully better) model to generate related articles. Furthermore, we can fine-tune these models using data from existing reading lists to further improve the recommendations.

The experiment around recommending related articles thus provides a testing ground for a specific use case to demonstrate whether and how much vector search approaches can improve on traditional search approaches.
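At its core, vector search for this use case amounts to ranking candidate articles by embedding similarity to the source article. A minimal sketch of that retrieval step (plain NumPy nearest-neighbor lookup; the actual pipeline and embedding models are not specified here, and the toy embeddings below are invented for illustration):

```python
import numpy as np

def recommend(source_id, embeddings, k=5):
    """Return indices of the k articles most similar to source_id,
    ranked by cosine similarity (the source itself is excluded)."""
    # Normalize rows so dot products equal cosine similarities.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit[source_id]
    sims[source_id] = -np.inf  # never recommend the source article
    return np.argsort(sims)[::-1][:k].tolist()

# Toy example: 4 articles in a 3-d embedding space.
emb = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],   # close to article 0
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
print(recommend(0, emb, k=2))  # article 1 ranks first
```

In production this brute-force scan would be replaced by an approximate nearest-neighbor index, but the ranking logic is the same.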

Event Timeline

Update:

  • Defined experiment: recommending related articles. For each reading list, we randomly choose one article as the source article and treat all other articles in the list as relevant related articles (targets).
  • Defined evaluation metric: hits@k. For a source article, if the first k recommended articles contain at least one of the target articles, we count the recommendation as relevant ("hit"). Hits@k is the fraction of source articles for which the recommendation is a hit.
  • Ran morelike baseline and one vector search model (topics-embeddings)
  • Next: run text-based embeddings as comparison.
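The evaluation protocol above can be sketched in a few lines. This is an illustrative reimplementation, not the actual evaluation code; the `recommend` callable and the toy data are hypothetical stand-ins:

```python
import random

def hits_at_k(reading_lists, recommend, k=10, seed=0):
    """Fraction of reading lists where the top-k recommendations for a
    randomly chosen source article contain at least one other list member."""
    rng = random.Random(seed)
    hits = 0
    for articles in reading_lists:
        source = rng.choice(articles)
        targets = set(articles) - {source}
        recs = recommend(source, k)
        if targets & set(recs[:k]):
            hits += 1
    return hits / len(reading_lists)

# Toy recommender (hypothetical): a fixed lookup table of recommendations.
fake_recs = {"A": ["B", "X"], "B": ["A", "Y"], "C": ["Z", "W"], "D": ["Z", "W"]}
lists = [["A", "B"], ["C", "D"]]
score = hits_at_k(lists, lambda s, k: fake_recs[s], k=2)
print(score)  # first list is a hit, second is not -> 0.5
```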

Update:

  • managed to fine-tune one of the embedding models using training data from the reading lists
  • currently putting together documentation with results from the different models for the English Wikipedia experiments; will post a summary next week (and close out this task then)
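Fine-tuning an embedding model on reading-list pairs typically uses a contrastive objective: a (source, target) pair from the same list should score higher than pairings across lists. The task does not say which loss was used; below is a minimal NumPy sketch of one common choice, an InfoNCE-style loss with in-batch negatives:

```python
import numpy as np

def in_batch_contrastive_loss(source_emb, target_emb, temperature=0.05):
    """InfoNCE-style loss over a batch of (source, target) pairs drawn
    from the same reading list: each source should score its own target
    higher than the targets of the other pairs (in-batch negatives)."""
    # Cosine similarity matrix between all sources and all targets.
    s = source_emb / np.linalg.norm(source_emb, axis=1, keepdims=True)
    t = target_emb / np.linalg.norm(target_emb, axis=1, keepdims=True)
    logits = (s @ t.T) / temperature
    # Softmax cross-entropy with the diagonal as the correct class.
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

Minimizing this loss pulls same-list pairs together in embedding space while pushing apart pairs from different lists, which directly optimizes the retrieval ranking evaluated by hits@k.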

Update:

  • finished experiments
  • wrote up documentation with findings as well as methodology (internal Google Docs)
  • closing task as the work is completed.