
Deploy and test new MLR models
Closed, Resolved · Public

Description

T377128: Import recent MLR models built by MjoLniR in production and test them showed that recent mjolnir models outperform the currently deployed one. T383048: Investigate current MLR models for Search and identify improvements showed that mjolnir models perform well on both easy and hard query classes.

We should deploy recent models and track their performance on the projects where LTR is enabled (minus jawiki for now).

AC:

  • Deploy a recent Mjolnir model.
  • Run an A/B test experiment for a week.
  • Document what would be required to automate deployments.

Event Timeline

Change #1118782 had a related patch set uploaded (by Gmodena; author: Gmodena):

[operations/mediawiki-config@master] cirrus: deploy new mlr models

https://gerrit.wikimedia.org/r/1118782

Change #1118783 had a related patch set uploaded (by Gmodena; author: Gmodena):

[operations/mediawiki-config@master] cirrus: create buckets for mlr 2025 experiment

https://gerrit.wikimedia.org/r/1118783

Change #1118785 had a related patch set uploaded (by Gmodena; author: Gmodena):

[operations/mediawiki-config@master] cirrus: enable mlr-2024 for select wikis

https://gerrit.wikimedia.org/r/1118785

Change #1118782 merged by jenkins-bot:

[operations/mediawiki-config@master] cirrus: deploy new mlr models

https://gerrit.wikimedia.org/r/1118782

Change #1118783 merged by jenkins-bot:

[operations/mediawiki-config@master] cirrus: create buckets for mlr 2025 experiment

https://gerrit.wikimedia.org/r/1118783

Mentioned in SAL (#wikimedia-operations) [2025-02-12T08:08:18Z] <dcausse@deploy2002> Started scap sync-world: Backport for [[gerrit:1118783|cirrus: create buckets for mlr 2025 experiment (T385972)]], [[gerrit:1118782|cirrus: deploy new mlr models (T385972)]]

Mentioned in SAL (#wikimedia-operations) [2025-02-12T08:11:21Z] <dcausse@deploy2002> dcausse, gmodena: Backport for [[gerrit:1118783|cirrus: create buckets for mlr 2025 experiment (T385972)]], [[gerrit:1118782|cirrus: deploy new mlr models (T385972)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Change #1119045 had a related patch set uploaded (by Gmodena; author: Gmodena):

[operations/mediawiki-config@master] cirrus: update ltr model on enwiki

https://gerrit.wikimedia.org/r/1119045

Mentioned in SAL (#wikimedia-operations) [2025-02-12T08:25:22Z] <dcausse@deploy2002> Finished scap sync-world: Backport for [[gerrit:1118783|cirrus: create buckets for mlr 2025 experiment (T385972)]], [[gerrit:1118782|cirrus: deploy new mlr models (T385972)]] (duration: 17m 03s)

Change #1119045 merged by jenkins-bot:

[operations/mediawiki-config@master] cirrus: update ltr model on enwiki

https://gerrit.wikimedia.org/r/1119045

Mentioned in SAL (#wikimedia-operations) [2025-02-12T08:28:57Z] <dcausse@deploy2002> Started scap sync-world: Backport for [[gerrit:1119045|cirrus: update ltr model on enwiki (T385972)]]

Mentioned in SAL (#wikimedia-operations) [2025-02-12T08:31:58Z] <dcausse@deploy2002> gmodena, dcausse: Backport for [[gerrit:1119045|cirrus: update ltr model on enwiki (T385972)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2025-02-12T08:42:07Z] <dcausse@deploy2002> Finished scap sync-world: Backport for [[gerrit:1119045|cirrus: update ltr model on enwiki (T385972)]] (duration: 13m 10s)

Change #1118785 merged by jenkins-bot:

[operations/mediawiki-config@master] cirrus: enable mlr-2025 for select wikis

https://gerrit.wikimedia.org/r/1118785

Mentioned in SAL (#wikimedia-operations) [2025-02-13T08:42:22Z] <dcausse@deploy2002> Started scap sync-world: Backport for [[gerrit:1118785|cirrus: enable mlr-2025 for select wikis (T385972)]]

Mentioned in SAL (#wikimedia-operations) [2025-02-13T08:45:24Z] <dcausse@deploy2002> dcausse, gmodena: Backport for [[gerrit:1118785|cirrus: enable mlr-2025 for select wikis (T385972)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2025-02-13T09:01:28Z] <dcausse@deploy2002> Finished scap sync-world: Backport for [[gerrit:1118785|cirrus: enable mlr-2025 for select wikis (T385972)]] (duration: 19m 06s)

TL;DR: Everything looks quite reasonable, with the usual expected variation here and there. The one exception is Japanese, where the new model is clearly (but without a clear reason) outperforming the old model. If the stats were reversed, I wouldn't recommend deploying the new model. I'm not sure how to distill my preference there into something that could be automatically approved or blocked.

Great job on getting all the config and data and reports working!


I love bullet points—so here we go!

Background thoughts and ideas:

  • Some of the measures in the report are not affected by the test/control status, since MLR only affects ranking, and not recall.
    • Zero-results rate in particular is not affected by MLR, so the ZRR results are "irrelevant" to the test/control condition. If we see a big difference in ZRR, that means the samples in the buckets may actually be different from each other, and one or both probably aren't perfect representatives of the underlying "truth".
  • Both "relevant" and "irrelevant" measures can tell us something about the overall quality/health of search on a given wiki, or hint at differences in the way that community uses their wiki. A wiki with a ~5% ZRR is probably very different from one with a ~50% ZRR, and the effect of any change on such wikis could be different. Same for very high or very low clickthrough rates, dwell times, etc.
  • Back in 2016 we did some null A/B tests on clickthrough rates—one a day for a week. The test and control conditions were the same, and the only difference was the random sampling used to divide up the groups. Looking at both individual searches and search sessions, we saw that 1% variances weren't uncommon, and we had one day with a 4% delta in clickthrough rates (with a 95% confidence interval of about 3% to 5.6%), despite there being no actual difference. (See the simulation sketch after this list.)
    • I finally moved this experiment from email to on-wiki for future reference.
  • A 95% confidence interval or 95% credible interval means there is still a 5% chance that the true values lie outside the 95% interval. That's 1 in 20.
  • I consider measures in the report to be "fairly independent" if they don't directly affect or relate to each other. Overall search quality affects all the measures, but pairs like the same measure "per query" and "per session" are more closely related, as are "only zero results" and "at least one zero result", and the "first-clicked" and "last-clicked" positions and all the specific positions measured.
  • In general, though, there are a good number of "fairly independent" measures—e.g., zero-results rate is only indirectly related to clickthrough rate. (If ZRR were 100% then clickthrough would have to be 0% because there's nothing to click on, but away from that extreme end, "there are results" and "the results are pretty good and well-sorted" are different things.)
    • To me, that means we have enough 95% intervals to suggest a couple of true values could easily be outside the interval. When we see relevant measures with, for example, almost touching intervals, they aren't necessarily that different. As above, if irrelevant measures (like ZRR in the case of MLR) are wildly different, something could be up!
    • In general, though, I'd want to see several discrepancies in different relevant measures (or some wild divergence in one measure) that weren't at least hinted at by differences in irrelevant measures before I'd say one model was clearly less good than another when, as with MLR retraining over time, we expect them to be pretty similar.
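
To make the 2016 null-test point concrete, here's a minimal simulation sketch; the clickthrough rate and bucket sizes are made-up numbers, not the 2016 data. With a few thousand sessions per bucket, random assignment alone routinely produces clickthrough-rate deltas on the order of a percent.

```
import random

TRUE_CTR = 0.35            # assumed underlying clickthrough rate (illustrative)
SESSIONS_PER_BUCKET = 5000

def null_ab_delta() -> float:
    """Split identical traffic into two buckets and return the observed CTR delta."""
    clicks_a = sum(random.random() < TRUE_CTR for _ in range(SESSIONS_PER_BUCKET))
    clicks_b = sum(random.random() < TRUE_CTR for _ in range(SESSIONS_PER_BUCKET))
    return abs(clicks_a - clicks_b) / SESSIONS_PER_BUCKET

deltas = sorted(null_ab_delta() for _ in range(1000))
print(f"median null delta: {deltas[500]:.2%}")   # typically ~0.6-0.8%
print(f"95th percentile:   {deltas[950]:.2%}")   # can approach 2%
```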

Observations and thoughts on the reports:

  • Gabriele and I talked about this briefly in the Wednesday meeting: it's definitely important to look at the axes of the various plots. Sometimes there are several labels on the y axis that are the same because the scale is actually so small. Visually distinct differences are not the same as numerically meaningful differences. It's important to pay attention to the context when reading the reports!
  • Including mlr-2025-02i (the interleaved sample) in the non-interleaved part of the report is a bit odd in some places, but it is still interesting to compare to the control and test samples.
    • The interleaved results should have the same ZRR. Clickthrough and some other gestalt quality measures could lean more toward the better bucket, and some could split the difference. (e.g., if the control gives bad results, and the test gives good results, the interleaved results will have both mixed together, and you'd expect clickthrough rates to be more similar to the better results, since a good result should be either first or second in the interleaved results.)
    • That said, interleaving can dramatically affect the details of the clicks in the first-clicked and last-clicked measures. There is a strong general effect of position (esp. first, second, and third) on clickthrough, but if a clearly better result is interleaved into second position behind a less-good result, I'd expect to see fewer clicks in first position and more clicks in second position. For some of these wiki samples, that's exactly what happens! (See the interleaving sketch after this list.)
      • Interestingly, the it, nl, pl, ru, sv, vi, and zh mlr-2025-02i models do not stick out in the number of second-place clicks.
      • The no mlr-2025-02i model rates slightly higher in 1st- and 2nd-place clicks, and only sticks out as having fewer clicks in 4th+ place! (But, really, all the 95% intervals overlap a lot.)
  • For a bit more than half of the wikis (en, he, it, ko, nl, no, pl, ru, vi, zh), it looks like things are pretty much the same between the old and new models, and the interleaving results are essentially the same.
    • Based on ZRR, it looks like ...
      • ... the new de, fi, fr models got slightly tougher samples ...
      • ... the old fa model got a moderately tougher sample ...
      • ... the old id model got a slightly easier sample ...
      • ... the interleaved vi model got a moderately easier sample ...
        • ... but in these cases it didn't have much effect on the results.
    • Based on the interleaving results ...
      • ... the nl model has a 95% confident preference for the B interleaved results (which is probably the new model?), but just barely.
      • ... the sv model has a 95% confident preference for the B interleaved results (which is probably the new model?); it is small but definite (modulo 95% credible intervals).
  • The results for ja stick out as unusual compared to all the rest.
    • Based on ZRR, it looks like the old ja model got a moderately tougher sample.
    • The clickthrough rates and satisfaction rates are very different between the old and new model—roughly 4% in all cases. The interleaved model is similar to the new model, but we'd sort of expect that even if there is a difference between the models, because the good new model results would be in first or second place in the interleaved model.
    • The click position data is also very different. The new model has much higher (+6%) first-position clicks, and notably lower (-2%) second-position clicks. The interleaved model has notably higher second-position clicks (+2% over the old model, +4% over the new model), which indicates to me that the new model really is outperforming the old model, and not just that the old model got a tougher sample to work with.
    • I'm not sure what to make of the number of searches per session. The 95% credible intervals are very far from each other, but the absolute magnitude of the difference seems small (1.50 for the old model, 1.46/1.47 for the new/interleaved models). An increase of 0.04 searches per session means that roughly 1 in 25 people did an extra search with the old model. That could reflect poorer quality results from the old model.
    • There is a big difference in the interleaved results, with a strong preference for the B model (which is probably the new model?). The preference has a magnitude of 0.12, while sv—the interleaved test with the second strongest preference—was only 0.03, with error bars down to 0.01. ja's error bars don't show on the graph, but definitely don't cross 0.10, so the effect is 4x to 10x bigger.
    • I'd love to say that changes to Japanese language analysis deserve the credit, but those haven't been deployed or activated yet!
    • The Japanese samples are plenty big—over 170K searches and over 60K sessions per bucket.
    • Since the preference is in favor of the new model, it isn't much of an issue here, but if the stats were reversed, I don't think I'd want to deploy the new model.
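
As a concrete illustration of the click-position points above (see the forward reference in the interleaving bullet), here's a minimal sketch of team-draft interleaving, the standard scheme for comparing two rankers on live traffic. Whether CirrusSearch's interleaving matches this exactly is an assumption on my part, and all the result names are made up.

```
import random

def team_draft_interleave(ranking_a, ranking_b, k=10):
    """Team-draft interleaving: A and B take turns drafting their best not-yet-shown result."""
    interleaved, credit = [], []   # credit[i] = which ranker contributed position i
    picks = {"A": 0, "B": 0}
    while len(interleaved) < k:
        # The ranker with fewer picks so far drafts next; coin flip on ties.
        if picks["A"] < picks["B"]:
            turn = "A"
        elif picks["B"] < picks["A"]:
            turn = "B"
        else:
            turn = random.choice(["A", "B"])
        source = ranking_a if turn == "A" else ranking_b
        candidate = next((doc for doc in source if doc not in interleaved), None)
        if candidate is None:
            break
        interleaved.append(candidate)
        credit.append(turn)
        picks[turn] += 1
    return interleaved, credit

# If B ranks the genuinely better page first and A does not, that page lands in
# position 1 or 2 of the interleaved list depending on the coin flip, which is why
# interleaving can shift clicks from first position to second position.
better = "Good article"
a = ["So-so article", better, "Other 1", "Other 2"]
b = [better, "So-so article", "Other 1", "Other 2"]
print(team_draft_interleave(a, b, k=4))
```

Clicks on the interleaved list are then credited to whichever ranker drafted the clicked result, which is (I assume) what the "Preference of Interleaved Rankers" statistic summarizes.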

Minor design and formatting thoughts:

  • In or near the "Preference of Interleaved Rankers" graph, it'd be nice to know which is the A group and which is the B group. I don't think that is actually spelled out anywhere in the data or code of the report. I guess it's in the config of the A/B test itself. Adding INTERLEAVED_A = "Control" and INTERLEAVED_B = "Test" to the #Injected Parameters section and using them in or near the "Preference of Interleaved Rankers" graph would be very helpful. (See the sketch after this list.)
  • It would be nice if the reports had the name of the wiki in more conspicuous places.
    • Right now it's in the third code block WIKI = "jawiki"—though they all have WIKI = "idwiki" in the second code block.
    • The only other place it occurs is in the first subheading under "Data-Gathering".
    • It would be nice in the main heading ("AB Test of Mjolnir 2025-02 jawiki Models" or "jawiki: AB Test of Mjolnir 2025-02 Models") or the page title. It would be better at the beginning of the file name (the URL is so long it is hidden). It would be awesome if it was in the TOC ("Table of contents (jawiki)") because then it would always be visible.
    • This is a very minor thing (despite how much I've written)... it was just a little confusing when I had 6 or 8 reports open at once!
  • This is very, very minor, and may not be worth trying to do automatically, but when the range of the y axis on some plots is so small that multiple labels on the y axis are the same, or the labels in the plots themselves are the same (but visually one is clearly above the other), I'd like another digit of precision. On the other hand, maybe I shouldn't get it, and it's a sign that there's really not much distinction to be made in those measures. I dunno.
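
For the injected-parameters suggestion in the first bullet above, a minimal sketch of what I have in mind, assuming the #Injected Parameters cell works like a papermill-style parameter cell; INTERLEAVED_A and INTERLEAVED_B are my proposed names, not variables that exist in the report today:

```
import matplotlib.pyplot as plt

# Injected Parameters (hypothetical additions; WIKI already exists in the report)
WIKI = "jawiki"
INTERLEAVED_A = "Control"   # proposed: which bucket the interleaving code treats as A
INTERLEAVED_B = "Test"      # proposed: which bucket it treats as B

# ...later, near the interleaved-preference plot, surface the mapping in the title:
fig, ax = plt.subplots()
ax.set_title(f"{WIKI}: Preference of Interleaved Rankers "
             f"(A = {INTERLEAVED_A}, B = {INTERLEAVED_B})")
```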

TL;DR: I had the wrong mental map of what the control and test cases were for Japanese, and all the differences we see actually make sense—and look really good! Ridiculous session lengths (30+ years) have been detected and explained. We still have questions on how to automate A/B test acceptance criteria.


This week we have been looking at the reports and some of the data behind the reports; we ended up with a few more questions, and—fortunately—a whole lot of answers! I'm summarizing recent discussions and discoveries here for reference.

I noticed that the maximum session lengths for some wikis were very long. 10^6 seconds is 11 days. 10^9 seconds is almost 32 years! It's probably actually 10^9.2, which takes us all the way back to the beginning of the Unix epoch—which at least makes some kind of sense—if a start time was reported as "0". In theory, we shouldn't have sessions over 500 minutes (the maximum of 50 queries, each at most 10 minutes apart) ≅ 10^4.5 seconds.

Gabriele and Erik investigated and verified that one element of the recorded session time is supplied by the browser, which is normally reasonably accurate, but which can be set to anything. Gabriele added a filter for the extreme sessions, and there were only a handful.
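
A minimal sketch of that kind of sanity filter, using the theoretical ceiling from the arithmetic above; the column name and DataFrame shape are assumptions of mine, not the report's actual code:

```
import pandas as pd

# Theoretical ceiling: at most 50 queries per session, each at most 10 minutes
# after the previous one, so ~490-500 minutes ≈ 10^4.5 seconds.
MAX_QUERIES = 50
MAX_GAP_SECONDS = 10 * 60
MAX_SESSION_SECONDS = (MAX_QUERIES - 1) * MAX_GAP_SECONDS

def drop_impossible_sessions(sessions: pd.DataFrame) -> pd.DataFrame:
    """Drop sessions whose client-reported length exceeds the theoretical maximum."""
    ok = sessions["session_seconds"].between(0, MAX_SESSION_SECONDS)
    print(f"dropping {(~ok).sum()} of {len(sessions)} sessions as impossible")
    return sessions[ok]

# A 1.7-billion-second "session" (a browser clock reporting 0 as the start time)
# gets dropped; ordinary sessions pass through untouched.
demo = pd.DataFrame({"session_seconds": [42.0, 310.5, 1.7e9]})
print(drop_impossible_sessions(demo))
```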

I thought eliminating a 1.7 billion–second session or two would change the average session length, but the average was already the median (not the mean) and the outliers had no noticeable effect, whether included or excluded. Duh.

I also had forgotten that after the previous A/B tests that Erik performed last fall, we had decided to enable MLR for Korean and Chinese (these and Japanese had been excluded because the spaceless languages didn't perform particularly well in the very early tests). I thought that was still in the future.

Somehow I also (incorrectly) deduced that the current A/B tests were comparing Erik's built-but-not-deployed Japanese MLR model from last fall against Gabriele's new built-but-not-deployed Japanese MLR model. It turns out the Japanese control is the current production non-MLR configuration. That explains why the old "model" and the new model differ so much: the Japanese MLR model now really is clearly better than the non-MLR config.

(Note: Deploying any Japanese MLR model is still on hold because upcoming changes to the Japanese analysis chain will shake up Japanese processing and may not be compatible with a model trained on data using the current analysis config. Deploying the new Japanese analysis chain is unfortunately also delayed until we upgrade to OpenSearch 2.x, because that process is already underway and introducing new plugins is too much added complexity.)

Based on all this, the results of the ja/Japanese A/B test make a lot more sense, because we aren't comparing two MLR models that we expect to be roughly similar in ranking quality. Erik's older A/B tests show the same preference for the MLR model over the non-MLR config for zh/Chinese, ko/Korean, and ja/Japanese.

We still have the open question of how a difference in an A/B test at least as big as the one between the non-MLR config and the MLR models could be detected in a more automated system.
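
For reference, here is one possible shape such an automated check could take, purely as a sketch of the heuristics discussed in the earlier comment (sanity-check an "irrelevant" measure like ZRR first, then look at whether the interleaved preference interval excludes zero); the thresholds, names, and structure are all made up and not an actual proposal for the pipeline:

```
from dataclasses import dataclass

@dataclass
class Interval:
    """A 95% confidence/credible interval for some measure."""
    low: float
    high: float

    def overlaps(self, other: "Interval") -> bool:
        return self.low <= other.high and other.low <= self.high

def acceptance_check(zrr_control: Interval, zrr_test: Interval,
                     interleaved_preference: Interval) -> str:
    """Sketch: flag suspicious buckets, then read the interleaved preference
    (positive values taken to favor the test model)."""
    if not zrr_control.overlaps(zrr_test):
        return "flag: buckets differ on a measure MLR cannot affect; review the samples"
    if interleaved_preference.low > 0:
        return "accept: credible preference for the test model"
    if interleaved_preference.high < 0:
        return "block: credible preference for the control model"
    return "neutral: no clear winner; needs a human look or a longer test"

# Hypothetical numbers, not taken from any of the reports:
print(acceptance_check(Interval(0.18, 0.20), Interval(0.19, 0.21), Interval(0.01, 0.05)))
```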