
Improve CirrusSearch DYM suggestions using the phrase suggester on more content
Open, Medium, Public

Description

The phrase suggester is a feature used by CirrusSearch to provide Did You Mean suggestions.
For perf/size reasons the field used by this suggester is populated with title & redirect texts.
It is believed that this type of suggester works better on a relatively large corpus containing more than just titles.
We added the option to feed this suggest field with the opening_text as well. Unfortunately, we haven't been able to test this behavior because, like all features depending on index-time config, it is very hard to A/B test. Additionally, the suggest field is part of the MLR features, and changing it could have negative consequences if the models are not re-trained appropriately.

To ease flexibility & testing we could consider creating a dedicated index per language that would be fed from the various text fields available from the cirrus dump in hive.
CirrusSearch would have to be adapted to allow creating a separate suggest query to this index.

The nature of the text that has to be pulled is up for discussion, but using a separate index would certainly let us iterate a lot quicker.

A proof-of-concept could perhaps be tested before automating this pipeline by manually creating an index. We could consider re-using the glent pipeline to automate it.

AC:

  • Glent is able to construct a dataset fit to build an index dedicated to run suggest queries with the phrase suggester
  • Quick study about what content is appropriate (e.g. title+opening_text, title+redirects+opening_text, ...)
  • Create an index fit for the phrase_suggester for a couple languages
  • Adapt CirrusSearch to be able to use a separate index to fetch its DYM suggestions from the phrase suggester
  • Run an A/B test on a set of wikis
  • Depending on the outcome automate the pipeline with glent (or something else)
  • Test & expand the feature to more languages/wikis

Event Timeline

Restricted Application added a subscriber: Aklapper.
Gehel triaged this task as Medium priority. (Apr 14 2025, 3:29 PM)
Gehel moved this task from needs triage to Next Projects on the Discovery-Search board.

Per the findings in T396779, I think we can greatly simplify this. The initial premise was that the phrase suggester indices would be too large, but current analysis says we have plenty of headroom. An alternate implementation:

Initial test:

  • Add a new top level suggest field, suggest_variant
  • Perform brief study, per initial ticket, into which fields to use for suggest. Use copy_to to bring those fields into the suggest_variant field.
  • Run an AB test comparing use of the phrase suggester against suggest vs suggest_variant fields
  • Use the same metrics and mostly the same analysis as the Glent Method 1 AB test
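
A minimal sketch of what the initial test could look like at the Elasticsearch level, not the actual CirrusSearch change. It assumes `title` and `opening_text` as the copied fields (the real choice is left to the brief study, and redirects are omitted for simplicity), and the helper names, the `dym` suggestion name, and the bucket names are hypothetical. `copy_to`, `direct_generator`, `suggest_mode` and `prefix_length` are standard Elasticsearch options, but the values shown are illustrative and are not the deployed profiles.

```python
import json

# Assumed source fields for illustration only; the study decides the real list.
SOURCE_FIELDS = ["title", "opening_text"]


def build_mapping(source_fields, target="suggest_variant"):
    """Mapping fragment where every source field is copied into `target`."""
    properties = {target: {"type": "text"}}
    for name in source_fields:
        properties[name] = {"type": "text", "copy_to": [target]}
    return {"properties": properties}


def build_phrase_suggest(query, field, prefix_length=1):
    """Search body running the phrase suggester against a single field."""
    return {
        "suggest": {
            "text": query,
            "dym": {  # hypothetical suggestion name
                "phrase": {
                    "field": field,
                    "size": 1,
                    "direct_generator": [
                        {
                            "field": field,
                            "suggest_mode": "always",
                            "prefix_length": prefix_length,
                        }
                    ],
                }
            },
        }
    }


def field_for_bucket(bucket):
    """Control keeps the existing field; any other bucket uses the variant."""
    return "suggest" if bucket == "control" else "suggest_variant"


if __name__ == "__main__":
    print(json.dumps(build_mapping(SOURCE_FIELDS), indent=2))
    body = build_phrase_suggest("exmaple query", field_for_bucket("variant"))
    print(json.dumps(body, indent=2))
```

Because both buckets send the same request shape and differ only in the field name, the AB comparison isolates the effect of the extra content in suggest_variant.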

Deployment:

  • Make some one-off adjustments to mjolnir to exclude specific features, with the intent of training a model that does not use the suggest field. Update prod with the restricted model.
  • Once mjolnir is not using the suggest field, remove the suggest_variant field used for AB testing and apply those settings to the suggest field
  • Once deployed everywhere let mjolnir train against the suggest field again

Upsides:

  • No new indexes to manage.
  • No new update process to manage, updates flow as they always have.

Downsides:

  • No specific control over what content goes into the suggester, it's whatever the wiki has. Having a dedicated index gives lots of flexibility to introduce additional content.
  • No ability to share language statistics between wikis in the same language

We were reminded by the recent Glent M1 AB test that the primary impact of query suggestions is in saving zero-result queries via automatic rewrites. Did-you-mean suggestions for queries that have results are generally not great: we see clickthrough rates of < 1%, meaning more than 99 out of 100 query suggestions are ignored. There is of course a chicken-and-egg problem there, maybe if we had better suggestions we would see better rates, but I suspect we should scale our effort to where we can expect to see the most change.

My suggestion would be to go with the simpler implementation. If successful, maybe we can revisit a more flexible solution.

Change #1184899 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@master] Add profiles for suggest_variant

https://gerrit.wikimedia.org/r/1184899

Change #1184899 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Add profiles for suggest_variant

https://gerrit.wikimedia.org/r/1184899

Change #1187104 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@master] Adjust new dym profiles to match ab test

https://gerrit.wikimedia.org/r/1187104

Change #1187108 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/mediawiki-config@master] cirrus: Start AB test of did-you-mean profiles

https://gerrit.wikimedia.org/r/1187108

Change #1187104 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Adjust new dym profiles to match ab test

https://gerrit.wikimedia.org/r/1187104

Change #1187108 merged by jenkins-bot:

[operations/mediawiki-config@master] cirrus: Start AB test of did-you-mean profiles

https://gerrit.wikimedia.org/r/1187108

Mentioned in SAL (#wikimedia-operations) [2025-10-02T20:21:08Z] <ebernhardson@deploy2002> Started scap sync-world: Backport for [[gerrit:1187108|cirrus: Start AB test of did-you-mean profiles (T390858)]]

Mentioned in SAL (#wikimedia-operations) [2025-10-02T20:25:49Z] <ebernhardson@deploy2002> ebernhardson: Backport for [[gerrit:1187108|cirrus: Start AB test of did-you-mean profiles (T390858)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-10-02T20:30:37Z] <ebernhardson@deploy2002> Finished scap sync-world: Backport for [[gerrit:1187108|cirrus: Start AB test of did-you-mean profiles (T390858)]] (duration: 09m 29s)

Change #1196127 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/mediawiki-config@master] Revert "cirrus: Start AB test of did-you-mean profiles"

https://gerrit.wikimedia.org/r/1196127

Preliminary reports. They might become final, but they haven't been reviewed by anyone else yet:

In both breakdowns default_1 outperforms the other two variations. Top level stats for the change from control to the prefix_len=1 variant:

!! - statistically significant change

ALL WIKIS

display rate: 23.0 -> 24.5 !!
dym selection rate: 0.60 -> 0.61
auto rewrite rate: 9.26 -> 9.99 !!
user selected clickthrough rate: 26.2 -> 26.0
auto rewrite clickthrough rate: 19.9 -> 20.6 !!
all query clickthrough rate: 33.03 -> 33.13

EN, DE, FR

display rate: 28.6 -> 29.8 !!
dym selection rate: 0.52 -> 0.53
auto rewrite rate: 10.23 -> 10.86 !!
user selected clickthrough rate: 31.2 -> 30.9
auto rewrite clickthrough rate: 24.5 -> 25.3 !!
all query clickthrough rate: 37.90 -> 38.03
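
For a rough sense of effect sizes, the sketch below computes the absolute and relative change for each ALL WIKIS metric listed above. It assumes the reported values are percentages, and the function names are mine. It says nothing about statistical significance, which needs the underlying session counts from the reports.

```python
# (control, prefix_len=1 variant) pairs copied from the ALL WIKIS block.
ALL_WIKIS = {
    "display rate": (23.0, 24.5),
    "dym selection rate": (0.60, 0.61),
    "auto rewrite rate": (9.26, 9.99),
    "user selected clickthrough rate": (26.2, 26.0),
    "auto rewrite clickthrough rate": (19.9, 20.6),
    "all query clickthrough rate": (33.03, 33.13),
}


def change(control, variant):
    """Return (absolute change in points, relative change in percent)."""
    absolute = variant - control
    relative = absolute / control * 100
    return absolute, relative


def summarize(report):
    """Map each metric name to its (absolute, relative) change."""
    return {name: change(*pair) for name, pair in report.items()}


if __name__ == "__main__":
    for name, (absolute, relative) in summarize(ALL_WIKIS).items():
        print(f"{name}: {absolute:+.2f} points ({relative:+.2f}% relative)")
```

The same function can be pointed at the EN, DE, FR numbers by building a second dictionary in the same shape.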

Thanks for running all the individual language reports—very interesting to look at them all. I skimmed them, comparing the charts for each to the all_wikis charts.

I'm surprised srwiki has so few search sessions! I thought hiwiki was bigger than it is. Also too bad swwiki doesn't have more data, but I was no longer surprised by the time I got to that one.

Clearly English dominates the all_wikis stats, but a lot of other wikis have very similar patterns of results.

It looks like default_1v is much more variable across wikis, which kind of makes sense because the amount of opening text available may vary a lot by wiki based on conventions of the wiki and maturity of the wiki. Those with very little opening text for whatever reason could do worse—I can imagine a younger, smaller wiki with some significant amount of opening text being boilerplate, which would not help make a quality DYM language model.

Hopefully 1v2 will allow opening text to help when it can, but titles & redirects will provide a sort of backstop with some baseline level of quality.

(It's based on very few queries, but the Selection Rate of “Did You Mean” Suggestions for Swahili (swwiki) is kind of funny... "uh, no thanks" (all zeros).)

The spaceless languages are suffering. Japanese and Chinese have Automatic Query Rewriting Rates hovering around 2%. Do we know if any analysis happens during the DYM internal processing? If not, I wonder if we could introduce some tokenization (maybe even adding spaces to queries) to help find reasonable suggestions.

Overall, it looks like default_1 is usually better than default, and, on larger samples, never statistically worse. default_1v is less reliable—sometimes better, sometimes worse. So, I'm still happy with deploying default_1 sooner rather than later, though waiting for the next AB test is okay, too.

We could look into configuring 1/1v (and maybe 1v2) by wiki, but we'd need more data for smaller wikis (A/B tests lasting 10 weeks to 1 year would be kinda nuts), and I think default_1 is generally an inherent structural improvement (with the possible exception of languages using Chinese characters, where one character can carry so much meaning, but even then it doesn't seem worse, per se, just not better). I worry about what causes those big differences between languages for default_1v, and whether they could change over time (though it would likely be over years), making it better or worse than default_1 as the conventions and maturity (or other features we haven't considered) of a given wiki change.

My hope is that 1v2 will be so obviously better all around (and probably occasionally just not worse) that it'll be the clear best choice... but that's what A/B tests are for!

Change #1196127 merged by jenkins-bot:

[operations/mediawiki-config@master] Revert "cirrus: Start AB test of did-you-mean profiles"

https://gerrit.wikimedia.org/r/1196127

Mentioned in SAL (#wikimedia-operations) [2025-10-16T20:06:13Z] <ebernhardson@deploy2002> Started scap sync-world: Backport for [[gerrit:1196127|Revert "cirrus: Start AB test of did-you-mean profiles" (T390858)]]

Mentioned in SAL (#wikimedia-operations) [2025-10-16T20:10:56Z] <ebernhardson@deploy2002> ebernhardson: Backport for [[gerrit:1196127|Revert "cirrus: Start AB test of did-you-mean profiles" (T390858)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-10-16T20:15:49Z] <ebernhardson@deploy2002> Finished scap sync-world: Backport for [[gerrit:1196127|Revert "cirrus: Start AB test of did-you-mean profiles" (T390858)]] (duration: 09m 36s)