Evaluate features provided by `query_explorer` functionality of ltr plugin
Closed, Resolved, Public

Description

The plugin can generate a variety of term statistic based features, try them out and see which are useful to include in training.

Event Timeline

EBernhardson triaged this task as Normal priority. Feb 13 2018, 4:03 AM
EBernhardson created this task.

Built out feature collection for these; the dataset sizes are a bit larger than expected. The xgboost dataset for enwiki, when maintaining 35M observations, is roughly 50GB. Interestingly, the lightgbm dataset for the same data is only 4GB. Similarly, while xgboost needed 10 executors with 16G of memory each to run training, lightgbm looks like it will only require 1 executor with 10G of memory. Might be worth evaluating whether the ltr plugin can correctly evaluate lightgbm trees (if we convert them to xgboost format).

Training is being done against the 20180118 dataset, for which we have baselines and estimates of training gain variance in T186134. The baseline there is an ndcg@10 of 0.8391 +- 0.0005.

On a first attempt at training with xgboost I've also found it's extremely slow: 10 executors with 10 cores each take about 30 minutes to train a model. On the upside, the first two CV jobs came back with ndcg@10 of 0.8525 and 0.8510, or approximately 8% of the possible improvement. This is only two CV runs, so better models may be found. Due to the compute necessary to train a model, we will not be able to train anywhere near the 150 hyperparameter rounds we typically use.

For next steps I'm going to evaluate some of the trained models and see if there is a large number of features we can drop and still retain most of the improvement.

After evaluating, found the previous iteration was missing all the features related to the plain fields due to a bug. Re-collected the data and came up with 0.8627. Unfortunately there is something odd going on at evaluation time. I uploaded the 0.8628 model as [[ https://en.wikipedia.org/wiki/Special:Search?search=kennedy&fulltext=1&cirrusMLRModel=20180118-query_explorer_v2-enwiki-v1 | 20180118-query_explorer_v2-enwiki-v1 ]] and it takes on average around 2s to return results. Taking the query being issued here and running it against codfw comes back at 300ms. 300ms is still probably too expensive, but 2s is amazingly slow. The eqiad cluster load doesn't seem high enough (~25% cpu, minimal io) to make the queries 6x slower. Unfortunately it looks like the ltr plugin deployed in prod doesn't work with the elasticsearch profile api and throws an exception.

Even if we fix the 2s query, 300ms is still much more expensive than the current query, so I investigated feature selection. To evaluate feature selection I used the spark-infotheoretic-feature-selection package. This required discretizing all the features into integers 0-255. The package suggests a companion package for doing this, but I let it run for over 2 hours against a dataset of 35Mx250 and it gave no signs of completing any time soon. Instead I applied spark's QuantileDiscretizer to make 255 approximately (boundary accuracy of 1/2550) equally sized bins and discretized that way. This took only a few minutes to run rather than hours. The package offers multiple feature selection algorithms. I applied them all, asking for the top 50 features (20%), and trained an xgboost model for 10 hyperparameter iterations, giving the following results:
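As a rough illustration of the equal-frequency binning described above, here is a minimal pure-Python sketch. It is not Spark's QuantileDiscretizer (which computes approximate quantiles over distributed data); it uses exact quantiles from the standard library, and the `raw` column and the function names are hypothetical.

```python
from bisect import bisect_right
from statistics import quantiles


def fit_quantile_bins(values, num_bins=255):
    """Return the num_bins - 1 cut points that split values into equal-frequency bins."""
    return quantiles(values, n=num_bins)


def discretize(value, cut_points):
    """Map a raw value to an integer bin index between 0 and len(cut_points)."""
    return bisect_right(cut_points, value)


# Hypothetical raw feature column; a real column would come from the training data.
raw = [float(i) for i in range(10_000)]
cuts = fit_quantile_bins(raw)
binned = [discretize(v, cuts) for v in raw]
print(min(binned), max(binned))
```

Each bin ends up with roughly the same number of observations, which is what makes the integer codes usable as discrete inputs for the mutual-information estimates.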

| strategy | cv-test-ndcg@10 |
| --- | --- |
| Mutual Information Maximization (MIM) | 0.8509 |
| Mutual Information FS (MIFS) | 0.8501 |
| Joint Mutual Information (JMI) | 0.8513 |
| Minimum Redundancy Maximum Relevance (MRMR) | 0.8624 |
| Conditional Mutual Information Maximization (CMIM) | 0.8509 |
| Informative Fragments (IF) | 0.8507 |

I find it interesting that they are mostly clustered around 0.85, except MRMR, which at 0.8624 manages to retain almost the same score as the model with 250 features. This is also not just an anomaly of the hyperparameter randomness: all except 1 iteration of MRMR were > 0.86.
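For readers unfamiliar with why MRMR can pick a different set than plain relevance ranking, here is a minimal sketch of a greedy MRMR selection over discretized features. It assumes the common "difference" form (relevance minus mean redundancy with already-selected features) and is not the spark-infotheoretic-feature-selection implementation; the toy feature names and data are hypothetical.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())


def mrmr(features, target, k):
    """Greedily pick k features maximizing relevance minus mean redundancy."""
    relevance = {name: mutual_information(col, target) for name, col in features.items()}
    selected = []
    remaining = set(features)
    while remaining and len(selected) < k:
        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(
                mutual_information(features[name], features[s]) for s in selected
            ) / len(selected)
            return relevance[name] - redundancy

        # sorted() makes tie-breaking deterministic (alphabetical).
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: the label depends on x1 and x2; x1_copy duplicates x1.
x1 = [0, 0, 1, 1, 0, 0, 1, 1]
x2 = [0, 1, 0, 1, 0, 1, 0, 1]
label = [2 * a + b for a, b in zip(x1, x2)]
features = {"x1": x1, "x1_copy": list(x1), "x2": x2}
print(mrmr(features, label, k=2))
```

All three toy features are equally relevant to the label on their own, but the redundancy term steers the second pick away from the duplicate, which is the behavior that distinguishes MRMR from a pure relevance ranking such as MIM.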

Some investigation into the selected feature sets, as a pairwise feature count matrix (0 means the two strategies selected identical sets):

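| | mim | mifs | jmi | mrmr | cmim | if |
| --- | --- | --- | --- | --- | --- | --- |
| mim | 0 | 0 | 12 | 30 | 9 | 9 |
| mifs | 0 | 0 | 12 | 30 | 9 | 9 |
| jmi | 12 | 12 | 0 | 31 | 17 | 17 |
| mrmr | 30 | 30 | 31 | 0 | 35 | 35 |
| cmim | 9 | 9 | 17 | 35 | 0 | 0 |
| if | 9 | 9 | 17 | 35 | 0 | 0 |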

From here we can see that IF and CMIM returned the same feature sets, as did MIM and MIFS. MRMR clearly selected a substantially different set of features than everything else. At least in this round MRMR is a clear winner. Will need to repeat for other wikis and against other datasets (drawn from a different time period) to see if we can simply set it to MRMR going forward, or if we need to try and select a feature selection algorithm for each individual model.
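A matrix like the one above can be produced with plain set operations. The sketch below assumes each cell counts the features selected by the row strategy that the column strategy did not select (consistent with the zero diagonal and the zeros for identical sets); the tiny feature sets are hypothetical, not the real selections.

```python
def difference_matrix(selected):
    """For each pair (a, b), count features chosen by a but not by b."""
    names = list(selected)
    return {a: {b: len(selected[a] - selected[b]) for b in names} for a in names}


# Hypothetical selections; the real sets each contain 50 features.
selected = {
    "mim": {"f1", "f2", "f3", "f4"},
    "mifs": {"f1", "f2", "f3", "f4"},
    "mrmr": {"f1", "f5", "f6", "f7"},
}
matrix = difference_matrix(selected)
for name, row in matrix.items():
    print(name, [row[other] for other in matrix])
```

When every strategy returns the same number of features, the matrix is symmetric, as it is in the table above.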

The feature set chosen (in no particular order) was: title_match, title_stddev_raw_df, title_min_raw_ttf, title_plain_match, title_dismax_plain, title_plain_min_raw_df, title_plain_stddev_classic_idf, title_plain_min_raw_ttf, redirect_title_match, redirect_title_plain_match, redirect_title_dismax_plain, redirect_title_plain_min_raw_df, redirect_title_plain_min_raw_ttf, heading_match, heading_stddev_classic_idf, heading_plain_match, heading_dismax_plain, heading_plain_min_raw_df, heading_plain_min_raw_ttf, opening_text_mean_raw_ttf, opening_text_plain_match, opening_text_dismax_plain, text_sum_classic_idf, text_min_raw_ttf, text_plain_match, text_dismax_plain, category_match, category_min_raw_ttf, category_plain_match, category_dismax_plain, category_plain_min_raw_df, category_plain_mean_classic_idf, category_plain_min_raw_ttf, auxiliary_text_plain_match, auxiliary_text_dismax_plain, auxiliary_text_plain_sum_classic_idf, auxiliary_text_plain_min_classic_idf, auxiliary_text_plain_stddev_classic_idf, suggest_match, all_near_match, redirect_or_suggest_dismax, all_phrase_match, all_plain_phrase_match, all_phrase_match_dismax_plain, title_unique_terms, title_plain_unique_terms, title_unique_terms_diff_plain, popularity_score, incoming_links, text_word_count

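Breakdown of features by document field:

| field | features |
| --- | --- |
| title | 8 |
| category | 7 |
| heading | 6 |
| redirect.title | 5 |
| text | 5 |
| auxiliary_text | 5 |
| opening_text | 3 |
| all_phrase | 3 |
| query term counts | 3 |
| suggest | 1 |
| all_near_match | 1 |
| redirect or suggest | 1 |
| popularity | 1 |
| incoming_links | 1 |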

By feature type:

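| feature type | features |
| --- | --- |
| match | 15 |
| dismax | 9 |
| raw_ttf | 8 |
| classic_idf | 7 |
| raw_df | 5 |
| unique term counts | 3 |
| popularity score | 1 |
| incoming links | 1 |
| text word count | 1 |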

By feature relationship to query/doc. Perhaps notable: we have only 3 doc-only features, and it took all of them.

| query only | doc only | q + d |
| --- | --- | --- |
| 10 | 3 | 37 |

I've also now created a limited feature set with just what the MRMR model needs and uploaded it to the prod clusters. Query time on the codfw cluster looks really good, around 125ms. Query time on the eqiad cluster, on the other hand, is hanging out around 750ms, which is unacceptable. Going to have to dig into why the queries take significantly different amounts of time on the different clusters. I suspect it's not actually feature collection, but maybe model caching. This is because if I adjust the query to only have 6 results, so that only 6 possible items could have been provided to feature collection across 7 shards, I still see ~750ms. This suggests some sort of constant cost rather than a per-item cost like feature collection.

After some testing, my theory is that the performance on eqiad is due to our ltr model cache size. It's currently set to the default of 10MB. If I run the ltr model in a series of requests at only 1 concurrent query, I get poor performance. Ramping up to 5 concurrent queries, per-query performance is still poor but improving. At 15 concurrent queries things seem to generally improve, with nothing over 500ms in the last few dozen of the 1k. My interpretation is that the more likely the model is to be in the cache the faster the request, and the model is being evicted very quickly. Unfortunately the cache size is not dynamically updatable; we have to do a rolling restart to update it. I've filed a bug upstream for this. Additionally, I don't have stronger proof the cache is undersized, because we only report number of items and size in bytes for cache stats, nothing about loads or invalidations from which trends over time could indicate churn. I've filed an upstream bug about that as well.
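The concurrency ramp described above can be sketched as a small latency probe. This is an assumption-laden illustration, not the tooling actually used: `run_query` is a hypothetical callable standing in for one search request against the cluster, and `fake_query` just sleeps so the example runs anywhere.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import median


def probe(run_query, total_requests, concurrency):
    """Issue total_requests calls with a fixed concurrency; return per-call latencies in seconds."""
    def timed(_):
        start = time.perf_counter()
        run_query()
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(timed, range(total_requests)))


def fake_query():
    """Hypothetical stand-in for a real search request."""
    time.sleep(0.001)


for concurrency in (1, 5, 15):
    latencies = probe(fake_query, total_requests=30, concurrency=concurrency)
    print(concurrency, f"median={median(latencies) * 1000:.1f}ms", f"max={max(latencies) * 1000:.1f}ms")
```

If per-request latency falls as concurrency rises, that is consistent with a shared resource (such as a cached model) staying warm under load, though it does not by itself prove cache churn.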

Change 415754 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[operations/mediawiki-config@master] Setup Cirrus AB test

https://gerrit.wikimedia.org/r/415754

Change 415776 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/WikimediaEvents@master] Start Cirrus query explorer AB test

https://gerrit.wikimedia.org/r/415776

Change 415754 merged by jenkins-bot:
[operations/mediawiki-config@master] Setup Cirrus AB test

https://gerrit.wikimedia.org/r/415754

Mentioned in SAL (#wikimedia-operations) [2018-03-02T00:09:50Z] <ebernhardson@tin> Synchronized wmf-config/: SWAT: T187148 Configure Cirrus AB test (duration: 01m 00s)

Mentioned in SAL (#wikimedia-operations) [2018-03-02T00:12:34Z] <ebernhardson@tin> Synchronized wmf-config/: REVERT SWAT: T187148 Configure Cirrus AB test (duration: 00m 59s)

Mentioned in SAL (#wikimedia-operations) [2018-03-02T00:23:58Z] <ebernhardson@tin> Synchronized wmf-config/CirrusSearch-common.php: SWAT: T187148 Configure Cirrus AB test (step 1) (second try) (duration: 00m 57s)

Mentioned in SAL (#wikimedia-operations) [2018-03-02T00:25:47Z] <ebernhardson@tin> Synchronized wmf-config/InitialiseSettings.php: SWAT: T187148 Configure Cirrus AB test (step 2) (second try) (duration: 00m 57s)

Change 415776 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] Start Cirrus query explorer AB test

https://gerrit.wikimedia.org/r/415776

Change 415784 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/WikimediaEvents@wmf/1.31.0-wmf.23] Start Cirrus query explorer AB test

https://gerrit.wikimedia.org/r/415784

Change 415784 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@wmf/1.31.0-wmf.23] Start Cirrus query explorer AB test

https://gerrit.wikimedia.org/r/415784

Mentioned in SAL (#wikimedia-operations) [2018-03-02T00:46:26Z] <ebernhardson@tin> Synchronized php-1.31.0-wmf.23/extensions/WikimediaEvents/modules/all/ext.wikimediaEvents.searchSatisfaction.js: SWAT T187148: Start cirrus query explorer AB test (duration: 00m 57s)

I ended up having enough time so I started up the test for enwiki, which was the only model trained. I've started up training of the rest and perhaps we can ship a test for them as well.

Mentioned in SAL (#wikimedia-operations) [2018-03-15T23:20:42Z] <ebernhardson@tin> Synchronized php-1.31.0-wmf.25/extensions/WikimediaEvents/modules/all/ext.wikimediaEvents.searchSatisfaction.js: SWAT: T187148: Turn off Cirrus AB test (duration: 00m 58s)

Ran feature selection for all the wikis we currently support, some perhaps interesting numbers:

  • 19 wikis
  • 12 features are chosen by all wikis (title_match, redirect_title_match, title_dismax_plain,
  • Out of 244 input features, 123 were used across all wikis. Should look into what the unused ones are to see if we can prune from the original feature set.
  • Comparing the intersection of features between wikis, we have intersection rates of min: 19 mean: 34 median: 35 max: 45 std: 7.0
  • popularity score and incoming links were each selected by only 17 wikis
  • Only 12 wikis selected any phrase match features. 11 of those 12 take all 3 phrase match features. Could this be an indication of problems with phrase matching in general?
  • The exact same 12 wikis selected one of the unique terms counts (de, en, fa, fr, id, it, ja, no, pt, ru, vi, zh, leaving a remainder of: ar, fi, he, ko, nl, pl, sv). None of the others use term count features. I'm not sure a feature like term count

I find it interesting that the same wikis that accepted phrase features were also the only ones to take token count features. Not sure what it means though.
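The cross-wiki numbers above (features common to all wikis, features used at all, and pairwise intersection sizes) reduce to a few set operations. A minimal sketch follows, with hypothetical wiki names and feature sets; whether the reported std is a population or sample standard deviation is not stated, so `pstdev` here is an assumption.

```python
from itertools import combinations
from statistics import mean, median, pstdev


def intersection_stats(selected_by_wiki):
    """Summarize the sizes of pairwise feature-set intersections between wikis."""
    sizes = [len(a & b) for a, b in combinations(selected_by_wiki.values(), 2)]
    return {
        "min": min(sizes),
        "mean": mean(sizes),
        "median": median(sizes),
        "max": max(sizes),
        "std": pstdev(sizes),
    }


# Hypothetical selections; the real sets have 50 features for each of 19 wikis.
selected_by_wiki = {
    "wiki_a": {"title_match", "heading_match", "f3", "f4"},
    "wiki_b": {"title_match", "heading_match", "f3", "f5"},
    "wiki_c": {"title_match", "heading_match", "f6", "f7"},
}
chosen_by_all = set.intersection(*selected_by_wiki.values())
used_by_any = set.union(*selected_by_wiki.values())
print(sorted(chosen_by_all), len(used_by_any), intersection_stats(selected_by_wiki))
```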

EBjune added a subscriber: EBjune. Mar 20 2018, 9:17 PM
> 12 features are chosen by all wikis (title_match, redirect_title_match, title_dismax_plain,

Interesting, @EBernhardson, what are the other nine features?

Ahh, it got cut off. The full list is: heading_match, redirect_title_plain_min_raw_df, redirect_title_dismax_plain, redirect_title_plain_match, title_match, title_plain_min_raw_ttf, title_plain_match, redirect_title_match, title_dismax_plain, title_plain_min_raw_df, title_min_raw_ttf, redirect_title_plain_min_raw_ttf

So, almost entirely limited to title and redirect fields. The lack of text_* fields is slightly worrying, but checking the selected features, every model uses at least 1, usually multiple, features of the text field, just not always the same one.

> So, almost entirely limited to title and redirect fields.

That kinda makes sense. Title matches are good matches.

> The lack of text_* fields is slightly worrying, but checking the selected features, every model uses at least 1, usually multiple, features of the text field, just not always the same one.

I wonder if they are all roughly similar in terms of coverage and value, and so each wiki picks a small subset with a slight edge over the others. It would be interesting—but maybe too much work—to try to figure out whether a smaller set of features to choose from gives more overlap in feature sets and similar performance. Having said that, unless some of the features are a lot more expensive to calculate, it's probably a very unnecessary optimization.

Neat info all around!

Change 422347 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[operations/mediawiki-config@master] Upgrade enwiki search ranking model to prod

https://gerrit.wikimedia.org/r/422347

Change 422347 merged by jenkins-bot:
[operations/mediawiki-config@master] Upgrade enwiki search ranking model to prod

https://gerrit.wikimedia.org/r/422347

Mentioned in SAL (#wikimedia-operations) [2018-03-27T23:08:01Z] <ebernhardson@deploy1001> Synchronized wmf-config/InitialiseSettings.php: SWAT: T187148: Update enwiki search ranking model (duration: 00m 54s)

EBernhardson added a comment. Edited Mar 27 2018, 11:24 PM

Trained all wikis except enwiki, dewiki and arwiki against the full feature set to determine if MRMR feature selection returns good results on wikis other than enwiki. The results overall look pretty good. Baseline here refers to the estimated ndcg@10 of the results served to users historically in the click logs. The a_vs_b columns are calculated as (a - b) / (1 - b). MRMR does miss out on some noticeable improvement. For example, on fawiki MRMR only achieves 0.6912 while the full dataset achieves 0.7051. The worst performer is zhwiki, which through feature selection misses out on about half of the possible improvement. Not sure why. Overall though, MRMR still captures a significant portion of the improvement of the full dataset at a fifth the number of features on most datasets, so it seems reasonable to move forward with it.

| wiki | baseline | minimal | MRMR | ALL | baseline_vs_minimal | baseline_vs_MRMR | baseline_vs_ALL |
| --- | --- | --- | --- | --- | --- | --- | --- |
| frwiki | 0.831111 | 0.8606 | 0.8758 | 0.8762 | 0.174604 | 0.264604 | 0.266973 |
| itwiki | 0.841333 | 0.8615 | 0.8782 | 0.8789 | 0.127101 | 0.232353 | 0.236764 |
| ptwiki | 0.742055 | 0.7672 | 0.7828 | 0.7808 | 0.097482 | 0.157960 | 0.150206 |
| ruwiki | 0.808509 | 0.8346 | 0.8547 | 0.8557 | 0.136253 | 0.241219 | 0.246441 |
| arwiki | 0.647704 | 0.7083 | 0.7162 | | 0.172003 | 0.194427 | |
| fawiki | 0.626354 | 0.6884 | 0.6912 | 0.7051 | 0.166056 | 0.173550 | 0.210751 |
| fiwiki | 0.833571 | 0.8523 | 0.8518 | 0.8589 | 0.112536 | 0.109532 | 0.152193 |
| hewiki | 0.800844 | 0.8372 | 0.8452 | 0.8540 | 0.182552 | 0.222722 | 0.266908 |
| idwiki | 0.767053 | 0.7833 | 0.7933 | 0.7907 | 0.069745 | 0.112673 | 0.101512 |
| jawiki | 0.758667 | 0.8830 | 0.8941 | 0.8937 | 0.515192 | 0.561186 | 0.559529 |
| kowiki | 0.801410 | 0.8205 | 0.8315 | 0.8404 | 0.096127 | 0.151518 | 0.196334 |
| nlwiki | 0.852818 | 0.8765 | 0.8792 | 0.8856 | 0.160903 | 0.179248 | 0.222731 |
| nowiki | 0.868054 | 0.8822 | 0.8818 | 0.8828 | 0.107212 | 0.104181 | 0.111760 |
| plwiki | 0.863320 | 0.8803 | 0.8850 | 0.8899 | 0.124231 | 0.158618 | 0.194468 |
| svwiki | 0.860518 | 0.8563 | 0.8594 | 0.8649 | -0.030243 | -0.008018 | 0.031414 |
| viwiki | 0.736602 | 0.7646 | 0.7901 | 0.7884 | 0.106297 | 0.203108 | 0.196654 |
| zhwiki | 0.809037 | 0.8435 | 0.8363 | 0.8693 | 0.180471 | 0.142767 | 0.315575 |
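The ratio columns express how much of the remaining headroom up to a perfect ndcg@10 of 1.0 a model recovers. A minimal sketch of that calculation, checked against the frwiki and svwiki rows; the function name is made up for illustration, and the inputs are the rounded values from the table, so results can differ from the published columns in the last digits.

```python
def share_of_possible_improvement(score, baseline):
    """Fraction of the gap between baseline and a perfect score of 1.0 that score closes."""
    return (score - baseline) / (1 - baseline)


# frwiki: minimal feature set vs. the historical baseline.
print(round(share_of_possible_improvement(0.8606, 0.831111), 4))
# svwiki: a model scoring below the baseline yields a negative value.
print(round(share_of_possible_improvement(0.8563, 0.860518), 4))
```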

Updated the above table with results from training the same dataset using the older minimal feature set. Some things I notice:

  • On average, the minimal feature set achieves 60% of the improvement of the full feature set
  • On average, wikis retain 80% of the improvement of the full feature set with MRMR feature selection
  • fiwiki, nowiki and zhwiki all do worse with MRMR than with the minimal feature set. nowiki and fiwiki are reasonably close to equal.
  • nowiki sees very little benefit from additional features, even up to the full 244 feature set.
  • zhwiki shows a healthy improvement with the full feature set over minimal, but MRMR manages to do much worse than even the minimal feature set.
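One way to read "retains X% of the improvement" is as the reduced model's gain over baseline divided by the full model's gain over baseline. The sketch below assumes that reading (the exact averaging used for the 60% and 80% figures is not stated) and uses the fawiki and zhwiki rows from the table; the function name is hypothetical.

```python
def retained_improvement(reduced, full, baseline):
    """Share of the full feature set's gain over baseline kept by a reduced feature set."""
    return (reduced - baseline) / (full - baseline)


# (baseline, MRMR, ALL) values copied from the table above.
rows = {
    "fawiki": (0.626354, 0.6912, 0.7051),
    "zhwiki": (0.809037, 0.8363, 0.8693),
}
for wiki, (baseline, mrmr_score, full_score) in rows.items():
    print(wiki, round(retained_improvement(mrmr_score, full_score, baseline), 3))
```

Because both gains share the same (1 - baseline) denominator, this is the same as dividing the corresponding ratio columns of the table.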

Change 422585 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[operations/mediawiki-config@master] Configure next Cirrus AB test

https://gerrit.wikimedia.org/r/422585

Change 422585 merged by jenkins-bot:
[operations/mediawiki-config@master] Configure next Cirrus AB test

https://gerrit.wikimedia.org/r/422585

Mentioned in SAL (#wikimedia-operations) [2018-03-28T23:38:18Z] <ebernhardson@tin> Synchronized wmf-config/InitialiseSettings.php: SWAT: T187148: Configure next Cirrus AB test (duration: 01m 16s)

I'm a bit behind in my reading! This is very cool stuff.

> The a_vs_b columns are calculated as (a - b) / (1 - b)

> baseline_vs_minimal, baseline_vs_MRMR, baseline_vs_ALL

I think you've got a and b switched in the labels (or, alternatively the formula, but I think it's the labels).

It's great that you can do this on so many different data sets at once; it gives a much better sense of how effective and reliable it is.

A random thought for future explorations: it'd be interesting to see the mRMR values that went into the feature selection for each wiki. Instead of choosing 50 features, maybe choose an mRMR threshold and take features that score above that threshold (looking at the existing scores for the top 50 features for wikis that do well could be a guide). You could of course place some absolute limit (say 75 or 100 features, depending on the performance implications).

My hunch is that some wikis would stop with fewer than 50 features and still get similar scores, while others would take more than 50 and improve. It might also smooth out some of the discrepancies between wikis in terms of features that seem intuitively valuable. For example, maybe one was ranked 53rd in a wiki that didn't use it, but based on an mRMR threshold value, it would be included.

Looking at a ranked list may also highlight new discrepancies. If a feature is in the top 10 in most wikis but ranked 47th in one, it's obviously less valuable to that wiki. That said, the way mRMR works, it isn't necessarily inherently not valuable, it could just be redundant with another feature that scores a bit better.
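The threshold idea can be sketched in a few lines. This assumes each feature has a single comparable score available after selection; the scores, threshold and cap below are hypothetical, since the per-feature mRMR values are not shown in this task.

```python
def select_by_threshold(scores, threshold, max_features):
    """Keep features scoring at or above threshold, best first, up to max_features."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, score in ranked if score >= threshold][:max_features]


# Hypothetical per-feature scores for one wiki.
scores = {"title_match": 0.41, "text_word_count": 0.07, "heading_match": 0.22, "suggest_match": 0.02}
print(select_by_threshold(scores, threshold=0.05, max_features=75))
```

With a shared threshold, a wiki with many strong features keeps more than 50 and a wiki with few keeps fewer, while the cap bounds query-time cost.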

Again, this is just spitballing for future consideration. Overall, these results look great!

Change 423056 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/WikimediaEvents@master] Start cirrus query_explorer AB test on 19 wikis

https://gerrit.wikimedia.org/r/423056

Change 423063 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[operations/mediawiki-config@master] Configure 5 buckets for next Cirrus AB test

https://gerrit.wikimedia.org/r/423063

Change 423063 merged by jenkins-bot:
[operations/mediawiki-config@master] Configure 5 buckets for next Cirrus AB test

https://gerrit.wikimedia.org/r/423063

Mentioned in SAL (#wikimedia-operations) [2018-03-29T23:12:11Z] <ebernhardson@tin> Synchronized wmf-config/InitialiseSettings.php: SWAT: T187148: Configure 5 buckets for cirrus AB test (duration: 01m 17s)

Change 423056 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] Start cirrus query_explorer AB test on 19 wikis

https://gerrit.wikimedia.org/r/423056

Change 423067 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/WikimediaEvents@wmf/1.31.0-wmf.27] Start cirrus query_explorer AB test on 19 wikis

https://gerrit.wikimedia.org/r/423067

Change 423070 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/WikimediaEvents@wmf/1.31.0-wmf.26] Start cirrus query_explorer AB test on 19 wikis

https://gerrit.wikimedia.org/r/423070

Change 423070 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@wmf/1.31.0-wmf.26] Start cirrus query_explorer AB test on 19 wikis

https://gerrit.wikimedia.org/r/423070

Change 423067 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@wmf/1.31.0-wmf.27] Start cirrus query_explorer AB test on 19 wikis

https://gerrit.wikimedia.org/r/423067

Mentioned in SAL (#wikimedia-operations) [2018-03-29T23:37:54Z] <ebernhardson@tin> Synchronized php-1.31.0-wmf.26/extensions/WikimediaEvents/modules/all/ext.wikimediaEvents.searchSatisfaction.js: SWAT: T187148: Start cirrus AB test (duration: 01m 16s)

Mentioned in SAL (#wikimedia-operations) [2018-03-29T23:40:22Z] <ebernhardson@tin> Synchronized php-1.31.0-wmf.27/extensions/WikimediaEvents/modules/all/ext.wikimediaEvents.searchSatisfaction.js: SWAT: T187148: Start cirrus AB test (duration: 01m 16s)

Change 427824 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/WikimediaEvents@master] Revert "Start cirrus query_explorer AB test on 19 wikis"

https://gerrit.wikimedia.org/r/427824

Change 427824 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] Revert "Start cirrus query_explorer AB test on 19 wikis"

https://gerrit.wikimedia.org/r/427824

Change 427829 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/WikimediaEvents@wmf/1.31.0-wmf.29] Revert "Start cirrus query_explorer AB test on 19 wikis"

https://gerrit.wikimedia.org/r/427829

Change 427830 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/WikimediaEvents@wmf/1.31.0-wmf.30] Revert "Start cirrus query_explorer AB test on 19 wikis"

https://gerrit.wikimedia.org/r/427830

Change 427829 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@wmf/1.31.0-wmf.29] Revert "Start cirrus query_explorer AB test on 19 wikis"

https://gerrit.wikimedia.org/r/427829

Change 427830 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@wmf/1.31.0-wmf.30] Revert "Start cirrus query_explorer AB test on 19 wikis"

https://gerrit.wikimedia.org/r/427830

Mentioned in SAL (#wikimedia-operations) [2018-04-19T23:13:30Z] <ebernhardson@tin> Synchronized php-1.31.0-wmf.30/extensions/WikimediaEvents/modules/all/ext.wikimediaEvents.searchSatisfaction.js: SWAT: T187148: Turn off cirrus ab test (duration: 01m 17s)

Mentioned in SAL (#wikimedia-operations) [2018-04-19T23:16:56Z] <ebernhardson@tin> Synchronized php-1.31.0-wmf.29/extensions/WikimediaEvents/modules/all/ext.wikimediaEvents.searchSatisfaction.js: SWAT: T187148: Turn off cirrus ab test (duration: 01m 18s)

Change 430797 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[operations/mediawiki-config@master] Promote MLR models from AB test to prod

https://gerrit.wikimedia.org/r/430797

debt closed this task as Resolved. Jun 1 2018, 1:47 PM

Change 430797 merged by jenkins-bot:
[operations/mediawiki-config@master] Promote MLR models from AB test to prod

https://gerrit.wikimedia.org/r/430797