Reflections on BM25 started a few months ago, while we were reviewing scoring techniques (T125603).
We concluded that relying on the Lucene ClassicSimilarity (a very simple tf/idf) is what prevents us from moving forward and implementing new scoring techniques in cirrus.
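To make the difference concrete, here is a minimal sketch comparing a simplified ClassicSimilarity-style tf/idf with the textbook BM25 formula for a single term. It is an illustration under assumptions, not Lucene's actual code: query normalization, coord and norm encoding are ignored, the idf shown is one common form of the classic formula, and `k1=1.2`, `b=0.75` are the usual BM25 defaults. Function and parameter names are hypothetical.

```python
import math


def classic_score(freq, doc_freq, doc_count, field_len):
    """Simplified ClassicSimilarity-style tf/idf score for one term.

    Assumption: queryNorm, coord and norm encoding are left out.
    """
    tf = math.sqrt(freq)  # grows without bound as freq grows
    idf = 1.0 + math.log((doc_count + 1) / (doc_freq + 1))
    length_norm = 1.0 / math.sqrt(field_len)
    # idf is applied twice: once for the query weight, once for the field weight.
    return tf * idf * idf * length_norm


def bm25_score(freq, doc_freq, doc_count, field_len, avg_field_len,
               k1=1.2, b=0.75):
    """Textbook BM25 score for one term (k1 and b are the usual defaults)."""
    idf = math.log(1.0 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization is relative to the average field length.
    length_factor = k1 * (1.0 - b + b * field_len / avg_field_len)
    # The tf component saturates: it can never exceed (k1 + 1).
    return idf * freq * (k1 + 1.0) / (freq + length_factor)


if __name__ == "__main__":
    # Hypothetical corpus statistics, for illustration only.
    for freq in (1, 4, 16, 64):
        print(freq,
              round(classic_score(freq, 10, 1000, 100), 3),
              round(bm25_score(freq, 10, 1000, 100, 100), 3))
```

The point of the comparison is term-frequency saturation: in the classic formula the score keeps growing with the square root of the frequency, while in BM25 repeating a term many times brings quickly diminishing returns.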
Plan to enable BM25:
- T139576: Enable BM25 by default in cirrus and evaluate its impact with relcomp on relforge servers
- T128073: Implement a new fulltext query to drop the allfield
- T139577: Switch to a weighted sum for incoming links and possibly include pageviews
- T139579: Evaluation: use Discernatron data and Paul's score to run an offline evaluation
- T139584: Possibly reindex enwiki on eqiad and run an A/B test between eqiad and codfw
- T139585: If the A/B test is successful, reindex all wikis
- T139586: Remove old code in cirrus and actually drop the allfield to save space
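The weighted sum mentioned in T139577 could look roughly like the sketch below. Only the idea of a weighted sum over incoming links (and possibly pageviews) comes from the plan; the saturation function, the weights and the half-point values are hypothetical placeholders that would have to be tuned during the evaluation step.

```python
def saturate(value, half_point):
    """Map a non-negative count into [0, 1); equals 0.5 at value == half_point.

    Hypothetical choice of normalization, used so that raw counts do not
    dominate the text score.
    """
    return value / (value + half_point)


def combined_score(text_score, incoming_links, pageviews=None,
                   w_text=1.0, w_links=0.5, w_views=0.3,
                   links_half=100.0, views_half=1000.0):
    """Weighted sum of the text score and popularity signals.

    All weights and half-points are illustrative assumptions.
    """
    score = w_text * text_score
    score += w_links * saturate(incoming_links, links_half)
    if pageviews is not None:  # pageviews are optional in the plan
        score += w_views * saturate(pageviews, views_half)
    return score
```

With a sum, each signal contributes a bounded, independently weighted amount, so adding pageviews later only means adding one more term.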
Optional tasks that would be nice to implement before we reindex anything:
- T107006: Add a "reverse" suggestion field to work around the prefix length limitation (typo suggestions)
- T137830: Use the icu_folding filter if available instead of asciifolding
- T134978: Add DEFAULTSORT keys to wiki search autocomplete
  - This seems to be a good idea but requires some discussion first (it actually requires a full reindex, but we could at least add this field to the mapping while we reindex)
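For the icu_folding task (T137830) listed above, the "if available" condition could be sketched as a small helper that picks the folding filter when the analysis settings are built. This is an assumption-laden illustration rather than cirrus code: the helper names and the `folded_text` analyzer name are hypothetical, and `installed_plugins` stands in for however plugin availability is actually detected. `asciifolding` is a built-in Elasticsearch token filter, while `icu_folding` is provided by the `analysis-icu` plugin.

```python
def folding_filter(installed_plugins):
    """Return the token filter name to use for character folding.

    Prefers icu_folding when the analysis-icu plugin is present,
    otherwise falls back to the built-in asciifolding filter.
    """
    if "analysis-icu" in installed_plugins:
        return "icu_folding"
    return "asciifolding"


def analysis_settings(installed_plugins):
    """Build index analysis settings with a hypothetical 'folded_text' analyzer."""
    return {
        "analysis": {
            "analyzer": {
                "folded_text": {  # hypothetical analyzer name
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", folding_filter(installed_plugins)],
                }
            }
        }
    }
```

Because the filter is baked into the analysis chain at index time, switching it only takes effect after a reindex, which is why this task is best done before the reindex planned above.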