EPIC: Plan to enable BM25 on fulltext search
Closed, Resolved · Public

Description

Reflections on BM25 started a few months ago while we were reviewing scoring techniques (T125603).
We concluded that the use of the Lucene ClassicSimilarity (a very simple TF/IDF) is what prevents us from moving forward and implementing new scoring techniques in cirrus.
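
For context, the two scoring functions can be sketched roughly as below. This is a simplified per-term illustration, not the Lucene code: it leaves out query normalization, coord factors, boosts and the lossy encoding of length norms. The function names are made up for this sketch, and `k1=1.2`, `b=0.75` are the usual BM25 defaults, assumed here rather than taken from this task.

```python
import math


def classic_tfidf_term_score(freq, doc_freq, doc_count, doc_len):
    """Simplified ClassicSimilarity-style score for one term in one document."""
    tf = math.sqrt(freq)                                    # grows without bound
    idf = 1.0 + math.log((doc_count + 1) / (doc_freq + 1))
    length_norm = 1.0 / math.sqrt(doc_len)                  # shorter fields score higher
    # idf is applied twice in the classic formula (query weight and field weight).
    return tf * idf * idf * length_norm


def bm25_term_score(freq, doc_freq, doc_count, doc_len, avg_doc_len,
                    k1=1.2, b=0.75):
    """Simplified BM25 score for one term in one document."""
    idf = math.log(1.0 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization is relative to the average document length.
    length_factor = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    # The tf component saturates: it can never exceed (k1 + 1).
    return idf * freq * (k1 + 1.0) / (freq + length_factor)


if __name__ == "__main__":
    # Hypothetical corpus statistics, chosen only for illustration.
    for freq in (1, 10, 100):
        print(freq,
              classic_tfidf_term_score(freq, 10, 1000, 100),
              bm25_term_score(freq, 10, 1000, 100, 100))
```

One well-known difference visible here is term-frequency saturation: the classic score keeps growing with the square root of the term frequency, while the BM25 term component levels off.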

Plan to enable BM25:

  1. T139576: Enable BM25 by default in cirrus and evaluate its impact with relcomp on relforge servers
  2. T128073: Implement a new fulltext query to drop the allfield
  3. T139577: Switch to a weighted sum for incoming links and possibly include pageviews
  4. T139579: Evaluation, use discernatron data and Paul's score to run an offline evaluation
  5. T139584: Possibly reindex enwiki on eqiad and run an A/B test between eqiad and codfw
  6. T139585: If the A/B test is successful: reindex all wikis
  7. T139586: Remove old code in cirrus and actually drop the allfield to save space
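
As a rough illustration of step 1 at the Elasticsearch level, switching the default similarity is an index-settings change, which is why the later steps involve reindexing. The sketch below only builds the settings body. It assumes plain Elasticsearch index settings and does not reflect how CirrusSearch itself generates its configuration; the helper name `bm25_index_settings` and the parameter values are illustrative assumptions.

```python
import json


def bm25_index_settings(k1=1.2, b=0.75):
    """Build an index-creation body that makes BM25 the default similarity.

    k1 and b are the common BM25 defaults, used here as placeholders.
    """
    return {
        "settings": {
            "index": {
                "similarity": {
                    # Naming the similarity "default" applies it to every field
                    # that does not declare its own similarity.
                    "default": {"type": "BM25", "k1": k1, "b": b}
                }
            }
        }
    }


if __name__ == "__main__":
    # The resulting JSON would be sent as the body of an index-creation request.
    print(json.dumps(bm25_index_settings(), indent=2))
```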

Optional tasks that could be nice to implement before we reindex anything:

  1. T107006: Add a "reverse" suggestion field to work around the prefix length limitation (typo suggestions)
  2. T137830: Use the icu_folding filter if available instead of asciifolding
  3. T134978: Add DEFAULTSORT keys to wiki search autocomplete
    • Seems to be a good idea but requires some discussion first (it actually requires a full reindex, but we could at least add this field to the mapping while we reindex)
NOTE: some of these tasks require mapping/analysis config changes, so it would be nice to merge the ongoing refactoring (T89733) before starting work on them.
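
The "use icu_folding if available" idea from the optional list could look something like the following when building the analysis config. This is a minimal sketch under stated assumptions: the set of installed plugins is passed in by the caller, the analyzer name `text_folded` and both helper names are hypothetical, and the real CirrusSearch analysis config builder is organized differently.

```python
def folding_filter(installed_plugins):
    """Pick the folding token filter based on the plugins that are installed.

    icu_folding comes from the analysis-icu plugin; asciifolding is built in.
    """
    return "icu_folding" if "analysis-icu" in installed_plugins else "asciifolding"


def build_analysis_config(installed_plugins):
    """Return a hypothetical analysis section using the best available folding."""
    return {
        "analysis": {
            "analyzer": {
                "text_folded": {  # hypothetical analyzer name
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", folding_filter(installed_plugins)],
                }
            }
        }
    }


if __name__ == "__main__":
    print(build_analysis_config({"analysis-icu"}))
    print(build_analysis_config(set()))
```

Because the chosen filter changes how tokens are indexed, a change like this only takes effect for documents indexed after it, which fits the note about doing it before a reindex.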


Event Timeline

dcausse created this task. Jul 7 2016, 9:37 AM
Restricted Application added a project: Discovery. Jul 7 2016, 9:37 AM
Restricted Application added subscribers: Zppix, Aklapper.
dcausse updated the task description. Jul 7 2016, 10:55 AM
dcausse updated the task description. Jul 7 2016, 11:00 AM
dcausse updated the task description. Jul 7 2016, 5:44 PM
TJones added a subscriber: TJones. Jul 7 2016, 5:52 PM
debt moved this task from This Quarter to Up Next on the Discovery-Search board. Jul 19 2016, 10:07 PM
debt triaged this task as Medium priority. Jul 20 2016, 4:01 PM

This is done for the top ten wikis (roughly 85% of our traffic). The rest is waiting on the analysis of BM25 for zh/ja/th (T147500). It's tedious to configure a bunch of wikis different ways... but if that analysis takes a long time, we'll have to do the tedious thing so our users can benefit. :-)

Deskana closed this task as Resolved. Jan 17 2017, 6:15 PM
Deskana claimed this task.

We did all of this except for the spaceless languages (e.g. zh, ja, th), because BM25 seems to be worse for those languages. That's close enough to call this resolved.