Reflections on BM25 started a few months ago while we reviewed scoring techniques (T125603).
We concluded that the use of the Lucene ClassicSimilarity (a very simple tf/idf) is what prevents us from moving forward and implementing new scoring techniques in cirrus.
Plan to enable BM25:
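To make the difference concrete, here is a minimal sketch of the two per-term scoring shapes. It is a simplification, not the Lucene code: query normalization, boosts, and norm encoding are left out, and the function names are made up for illustration. The BM25 defaults (k1=1.2, b=0.75) are the usual ones.

```python
import math


def classic_tf_idf(freq, doc_freq, doc_count, doc_len):
    """Simplified ClassicSimilarity-style score for one term in one field."""
    tf = math.sqrt(freq)                                # grows without bound
    idf = 1.0 + math.log(doc_count / (doc_freq + 1.0))
    length_norm = 1.0 / math.sqrt(doc_len)              # penalizes long fields
    return tf * idf * length_norm


def bm25(freq, doc_freq, doc_count, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Simplified BM25 score for one term in one field."""
    idf = math.log(1.0 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization is relative to the average field length.
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    # The tf component saturates: it can never exceed (k1 + 1).
    return idf * freq * (k1 + 1.0) / (freq + norm)


if __name__ == "__main__":
    for freq in (1, 10, 100, 400):
        print(freq,
              round(classic_tf_idf(freq, 10, 1000, 100), 3),
              round(bm25(freq, 10, 1000, 100, 100), 3))
```

The point of the sketch is term-frequency saturation: with the classic formula a term repeated 400 times still scores twice as high as one repeated 100 times, while the BM25 score barely moves.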
1. Enable BM25 by default in cirrus and evaluate its impact with relcomp on relforge servers
2. Implement a new fulltext query to drop the allfield
3. Switch to a weighted sum for incoming links and possibly include pageviews
4. Evaluation: use Discernatron data and Paul's score to run an offline evaluation
5. Possibly reindex enwiki on eqiad and run an A/B test between eqiad and codfw
6. If the A/B test is successful: reindex all wikis
7. Remove old code in cirrus and actually drop the allfield to save space
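As an illustration of step 3, a weighted sum could look roughly like the sketch below. Everything here is an assumption made for the example: the saturation function, the weights, the constants, and the function names are hypothetical and would have to come out of the evaluation steps.

```python
def saturate(value, k):
    """Map a non-negative count into [0, 1); k is the count that yields 0.5."""
    return value / (value + k)


def combined_score(text_score, incoming_links, pageviews=None,
                   w_text=1.0, w_links=0.5, w_views=0.5,
                   k_links=100.0, k_views=1000.0):
    """Hypothetical weighted sum of text relevance and popularity signals."""
    score = w_text * text_score + w_links * saturate(incoming_links, k_links)
    if pageviews is not None:
        # Pageviews are optional, matching "possibly include pageviews".
        score += w_views * saturate(pageviews, k_views)
    return score


if __name__ == "__main__":
    print(combined_score(2.0, incoming_links=100))
    print(combined_score(2.0, incoming_links=100, pageviews=1000))
```

With a sum, each signal contributes a bounded amount that can be tuned independently, so a heavily linked page cannot overwhelm the text score.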
Optional tasks that could be nice to implement before we reindex anything:
1. T107006: Add a "reverse" suggestion field to work around the prefix length limitation (typo suggestions)
2. T137830: Use the icu_folding filter if available instead of asciifolding
3. T134978: Add DEFAULTSORT keys to wiki search autocomplete
- Seems to be a good idea but requires some discussion first (it actually requires a full reindex, but we could at least add this field to the mapping while we reindex)
NOTE: some of these tasks require mapping/analysis config changes, so it would be nice to merge the ongoing refactoring before starting to work on them.
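For T137830, the analysis change might look roughly like this sketch, which builds the analysis settings as a plain dict. The analyzer name `text_folded` and the helper `build_analysis` are hypothetical, and this is not the actual cirrus config builder. `icu_folding` is only available when the ICU analysis plugin is installed, hence the fallback to the built-in `asciifolding`.

```python
import json


def build_analysis(icu_available):
    """Return hypothetical analysis settings with the best folding filter."""
    # icu_folding handles far more scripts than asciifolding, but it
    # requires the ICU analysis plugin on every node.
    folding = "icu_folding" if icu_available else "asciifolding"
    return {
        "analyzer": {
            "text_folded": {            # hypothetical analyzer name
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", folding],
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_analysis(icu_available=True), indent=2))
```

Switching the filter changes the indexed tokens, which is why a change like this has to ship together with a reindex.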