The phrase suggester is a feature used by CirrusSearch to provide "Did You Mean" (DYM) suggestions.
For perf/size reasons the field used by this suggester is populated only with title & redirect texts.
It is believed that this type of suggester works better on a relatively large corpus containing more than just titles.
We added the option to also feed this suggest field with the opening_text. Unfortunately we haven't been able to test this behavior: like all features that depend on index-time config, it is very hard to A/B test. Additionally, the suggest field is part of the MLR features, and changing it could have negative consequences if the models are not re-trained appropriately.
To make testing easier and more flexible, we could consider creating a dedicated index per language, fed from the various text fields available in the cirrus dump in hive.
CirrusSearch would have to be adapted so that it can send a separate suggest query to this index.
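As a rough illustration of what such a separate suggest query could look like, here is a minimal Python sketch that builds an Elasticsearch search body running only a phrase suggester. The index naming scheme (`<wiki>_dym_suggest`), the field name `suggest`, the suggestion name `dym`, and the default parameter values are assumptions made for the example, not the actual CirrusSearch configuration.

```python
import json


def suggest_index_name(wiki):
    """Hypothetical naming scheme for the dedicated suggest index."""
    return f"{wiki}_dym_suggest"


def build_dym_request(query_text, field="suggest", max_errors=2.0):
    """Build a search body that only asks for phrase suggestions.

    `field` and `max_errors` defaults are illustrative assumptions.
    """
    return {
        "size": 0,  # no search hits needed, only the suggestions
        "suggest": {
            "text": query_text,
            "dym": {
                "phrase": {
                    "field": field,
                    "size": 1,  # keep only the best suggestion
                    "max_errors": max_errors,
                    "direct_generator": [
                        {"field": field, "suggest_mode": "always"}
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    print(suggest_index_name("enwiki"))
    print(json.dumps(build_dym_request("albert einstien"), indent=2))
```

The body would be sent to the `_search` endpoint of the dedicated index instead of the main content index, which keeps the suggest field of the main index (and the MLR features that depend on it) untouched.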
The nature of the text that has to be pulled is up for discussion, but using a separate index should let us iterate a lot quicker.
A proof-of-concept could perhaps be tested before automating this pipeline by manually creating an index. We could consider re-using the glent pipeline to automate it.
AC:
- Glent is able to construct a dataset fit to build an index dedicated to run suggest queries with the phrase suggester
- Quick study about what content is appropriate (e.g. title+opening_text, title+redirects+opening_text, ...)
- Create an index fit for the phrase_suggester for a couple of languages
- Adapt CirrusSearch to be able to use a separate index to fetch its DYM suggestions from the phrase suggester
- Run an A/B test on a set of wikis
- Depending on the outcome, automate the pipeline with glent (or something else)
- Test & expand the feature to more languages/wikis
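
For the content study, the candidate field combinations could be expressed as simple variants applied to each page of the dump. The following Python sketch is only illustrative: the `VARIANTS` table and `suggest_texts` helper are hypothetical, and it assumes each page is a dict with `title`, `opening_text`, and a `redirect` list of objects carrying a `title` key.

```python
# Candidate field combinations to compare (names are illustrative).
VARIANTS = {
    "title+opening_text": ("title", "opening_text"),
    "title+redirects+opening_text": ("title", "redirect", "opening_text"),
}


def suggest_texts(page, fields):
    """Collect the texts that would feed the suggest field for one page."""
    texts = []
    for name in fields:
        value = page.get(name)
        if not value:
            continue  # skip missing or empty fields
        if name == "redirect":
            # Assumed shape: a list of {"namespace": ..., "title": ...}
            texts.extend(r["title"] for r in value if r.get("title"))
        else:
            texts.append(value)
    return texts


if __name__ == "__main__":
    page = {
        "title": "Albert Einstein",
        "redirect": [{"namespace": 0, "title": "Einstein"}],
        "opening_text": "Albert Einstein was a physicist.",
    }
    for variant, fields in VARIANTS.items():
        print(variant, suggest_texts(page, fields))
```

Building one index per variant from the same dump would make it possible to compare the combinations side by side before committing to one in the automated pipeline.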