The plugin can generate a variety of term-statistic-based features; try them out and see which are useful to include in training.
|Open||None||T174064 [FY 2017-18 Objective] Implement advanced search methodologies|
|Resolved||EBernhardson||T161632 [Epic] Improve search by researching and deploying machine learning to re-rank search results|
|Resolved||EBernhardson||T162279 Collect ideas for feature engineering of LTRank|
|Resolved||EBernhardson||T187148 Evaluate features provided by `query_explorer` functionality of ltr plugin|
|Resolved||EBernhardson||T188015 Increase ltr.cache.max_size in Cirrus elasticsearch clusters|
Built out feature collection for these; the dataset sizes are a bit larger than expected. The xgboost dataset for enwiki, retaining 35M observations, is roughly 50GB. Interestingly, the lightgbm dataset for the same data is only 4GB. Similarly, while xgboost needed 10 executors with 16G of memory each to run training, lightgbm looks like it will only require 1 executor with 10G of memory. Might be worth evaluating whether the ltr plugin can correctly evaluate lightgbm trees (if we convert them to xgboost format).
Training is being done against the 20180118 dataset, for which we have baselines and estimates of training gain variance in T186134. This gives an ndcg@10 of 0.8391 ± 0.0005.
On the first attempt at training xgboost I've also found it's extremely slow: 10 executors with 10 cores each take about 30 minutes to train a model. On the upside, the first two CV jobs run came back with ndcg@10 of 0.8525 and 0.8510, or approximately 8% of the possible improvement. This is only two CV runs, so better models may be found. Due to the compute necessary to train a model, we will not be able to train anywhere near the 150 hyperparameter rounds we typically use.
For next steps I'm going to evaluate some of the trained models and see if there are large numbers of features we can drop while still retaining most of the improvement.
After evaluating, I found the previous iteration was missing all the features related to the plain fields due to a bug. Re-collected the data and came up with 0.8627. Unfortunately there is something odd going on at evaluation time. I uploaded the 0.8627 model as [[ https://en.wikipedia.org/wiki/Special:Search?search=kennedy&fulltext=1&cirrusMLRModel=20180118-query_explorer_v2-enwiki-v1 | 20180118-query_explorer_v2-enwiki-v1 ]] and it takes around 2s on average to return results. Taking the query being issued here and running it against codfw comes back at 300ms. 300ms is still probably too expensive, but 2s is absurd. The eqiad cluster load doesn't seem high enough (~25% cpu, minimal io) to make the queries 6x slower. Unfortunately it looks like the ltr plugin deployed in prod doesn't work with the elasticsearch profile API and throws an exception.
Even if we fix the 2s query, 300ms is still much more expensive than the current query, so I investigated feature selection. To evaluate feature selection I used the spark-infotheoretic-feature-selection package. This requires discretizing all the features into integers 0-255. The package has a suggested companion package for doing this, but I let it run for over 2 hours against a 35M x 250 dataset and it gave no signs of completing any time soon. Instead I applied Spark's QuantileDiscretizer to make 255 approximately equally sized bins (boundary accuracy of 1/2550) and discretized that way. This took only a few minutes to run rather than hours. The package offers multiple feature selection algorithms; I applied them all, asking for the top 50 features (20%), and trained an xgboost model for 10 hyperparameter iterations, giving the following results:
|Mutual Information Maximization (MIM)||0.8509|
|Mutual Information FS (MIFS)||0.8501|
|Joint Mutual Information (JMI)||0.8513|
|Minimum Redundancy Maximum Relevance (MRMR)||0.8624|
|Conditional Mutual Information Maximization (CMIM)||0.8509|
|Informative Fragments (IF)||0.8507|
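The discretization step described above can be sketched outside Spark as plain equal-frequency binning. This is an illustrative stand-in for QuantileDiscretizer, not the actual Spark API; the data here is synthetic.

```python
import numpy as np

def quantile_discretize(values, num_bins=255):
    """Approximate equal-frequency binning: map each value to an integer
    bin id in [0, num_bins - 1]. Bin boundaries sit at evenly spaced
    quantiles of the data, so bins hold roughly equal counts."""
    quantiles = np.quantile(values, np.linspace(0, 1, num_bins + 1))
    # Keep only interior boundaries; searchsorted assigns each value a bin.
    bins = np.searchsorted(quantiles[1:-1], values, side="right")
    return bins.astype(np.uint8)

# Synthetic skewed feature, e.g. a raw term frequency.
feature = np.random.default_rng(0).exponential(size=10_000)
binned = quantile_discretize(feature)
```

Spark's implementation computes the quantiles approximately (controlled by `relativeError`, hence the 1/2550 boundary accuracy above), which is what makes it fast on 35M rows.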
I find it interesting that they are mostly clustered around 0.85, except MRMR, which at 0.8624 manages to retain almost the same score as the model with all 250 features. This is also not just an anomaly of the hyperparameter randomness: all but one iteration of MRMR scored > 0.86.
Some investigation into the selected feature sets, as a shared feature count matrix:
From here we can see that IF and CMIM returned the same feature sets, as did MIM and MIFS. MRMR clearly selected a completely different set of features than everything else. At least in this round MRMR is a clear winner; we will need to repeat for other wikis and against other datasets (drawn from a different time period) to see if we can simply use MRMR going forward or if we need to select a feature selection algorithm for each individual model.
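For reference, the shared feature count matrix is just pairwise intersection sizes over the selected sets. A minimal sketch, using made-up three-feature selections rather than the real 50-feature sets:

```python
from itertools import combinations

# Hypothetical selections; the real ones are the 50-feature sets
# produced by each algorithm.
selections = {
    "mim":  {"title_match", "text_match", "heading_match"},
    "mifs": {"title_match", "text_match", "heading_match"},
    "mrmr": {"title_match", "popularity_score", "incoming_links"},
}

# Count of features shared by each pair of algorithms.
shared = {
    (a, b): len(selections[a] & selections[b])
    for a, b in combinations(selections, 2)
}
```

Identical sets show up as full-count cells (here mim/mifs share all 3), while a divergent algorithm like mrmr shares only a handful with everything else.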
The feature set chosen (in no particular order) was: title_match, title_stddev_raw_df, title_min_raw_ttf, title_plain_match, title_dismax_plain, title_plain_min_raw_df, title_plain_stddev_classic_idf, title_plain_min_raw_ttf, redirect_title_match, redirect_title_plain_match, redirect_title_dismax_plain, redirect_title_plain_min_raw_df, redirect_title_plain_min_raw_ttf, heading_match, heading_stddev_classic_idf, heading_plain_match, heading_dismax_plain, heading_plain_min_raw_df, heading_plain_min_raw_ttf, opening_text_mean_raw_ttf, opening_text_plain_match, opening_text_dismax_plain, text_sum_classic_idf, text_min_raw_ttf, text_plain_match, text_dismax_plain, category_match, category_min_raw_ttf, category_plain_match, category_dismax_plain, category_plain_min_raw_df, category_plain_mean_classic_idf, category_plain_min_raw_ttf, auxiliary_text_plain_match, auxiliary_text_dismax_plain, auxiliary_text_plain_sum_classic_idf, auxiliary_text_plain_min_classic_idf, auxiliary_text_plain_stddev_classic_idf, suggest_match, all_near_match, redirect_or_suggest_dismax, all_phrase_match, all_plain_phrase_match, all_phrase_match_dismax_plain, title_unique_terms, title_plain_unique_terms, title_unique_terms_diff_plain, popularity_score, incoming_links, text_word_count
Breakdowns of feature by document field:
|title||category||heading||redirect.title||text||auxiliary_text||opening_text||all_phrase||query term counts||suggest||all_near_match||redirect or suggest||popularity||incoming_links|
By feature type:
|match||dismax||raw_ttf||classic_idf||raw_df||unique term counts||popularity score||incoming links||text word count|
By feature relationship to query/doc. Perhaps notably, we have only 3 doc-only features and selection took all of them.
|query only||doc only||q + d|
I've also now created a limited feature set with just what the MRMR model needs and uploaded it to the prod clusters. Query time on the codfw cluster looks really good, around 125ms. Query time on the eqiad cluster, on the other hand, is hanging around 750ms, which is unacceptable. Going to have to dig into why the queries take significantly different amounts of time on the different clusters. I'm suspicious it's not actually feature collection but maybe model caching: if I adjust the query to return only 6 results, so only 6 possible items could have been provided to feature collection across 7 shards, I still see ~750ms. This suggests some sort of constant cost rather than a per-item cost like feature collection.
After some testing, my theory is the performance on eqiad is due to our ltr model cache size, which is currently set to the default of 10MB. If I run the ltr model in a series of requests at only 1 concurrent query I get poor performance. Ramping up to 5 concurrent queries, per-query performance is still poor but improving. At 15 concurrent queries things generally improve, with nothing over 500ms in the last few dozen of the 1k requests. My interpretation is that the more likely the model is to be in the cache, the faster the request, and the model is being evicted very quickly. Unfortunately the cache size is not dynamically updatable; we have to do a rolling restart to update it. I've filed a bug upstream for this. Additionally, I can't get stronger proof the cache is undersized because we only report the number of items and size in bytes for cache stats, nothing about loads or evictions from which trends over time could indicate churn. I've filed an upstream bug about that as well.
I ended up having enough time, so I started the test for enwiki, which was the only model trained. I've started training the rest, and perhaps we can ship a test for them as well.
Ran feature selection for all the wikis we currently support, some perhaps interesting numbers:
- 19 wikis
- 12 features are chosen by all wikis (title_match, redirect_title_match, title_dismax_plain,
- Out of 244 input features between all wikis 123 of them were used. Should look into what the unused ones are to see if we can prune from the original feature set.
- Comparing the pairwise intersections of feature sets between wikis, we have intersection sizes of min: 19, mean: 34, median: 35, max: 45, std: 7.0
- popularity score and incoming links only selected by 17 wikis
- Only 12 wikis selected any phrase match features. 11 of those 12 take all 3 phrase match features. Could this be an indication of problems with phrase matching in general?
- The exact same 12 wikis selected one of the unique terms counts (de, en, fa, fr, id, it, ja, no, pt, ru, vi, zh, leaving a remainder of: ar, fi, he, ko, nl, pl, sv). None of the others use term count features. I'm not sure a feature like term count
I find it interesting that the same wikis that accepted phrase features were also the only ones to take token count features. Not sure what it means though.
Ahh, it got cut off. The full list is: heading_match, redirect_title_plain_min_raw_df, redirect_title_dismax_plain, redirect_title_plain_match, title_match, title_plain_min_raw_ttf, title_plain_match, redirect_title_match, title_dismax_plain, title_plain_min_raw_df, title_min_raw_ttf, redirect_title_plain_min_raw_ttf
So, almost entirely limited to title and redirect fields. The lack of text_* fields is slightly worrying, but checking the selected features, every model uses at least 1, usually multiple, features of the text field, just not always the same one.
> So, almost entirely limited to title and redirect fields.
That kinda makes sense. Title matches are good matches.
> The lack of text_* fields is slightly worrying, but checking the selected features every model uses at least 1, usually multiple, features of the text field, just not always the same one.
I wonder if they are all roughly similar in terms of coverage and value, and so each wiki picks a small subset with a slight edge over the others. It would be interesting—but maybe too much work—to try to figure out whether a smaller set of features to choose from gives more overlap in feature sets and similar performance. Having said that, unless some of the features are a lot more expensive to calculate, it's probably a very unnecessary optimization.
Neat info all around!
Trained all wikis except enwiki, dewiki and arwiki against the full feature set to determine whether MRMR feature selection returns good results on wikis other than enwiki. The results overall look pretty good. Baseline here refers to the estimated ndcg@10 of the results served to users historically in the click logs. The a_vs_b columns are calculated as (a - b) / (1 - b). MRMR does miss out on some noticeable improvement; for example, on fawiki MRMR only achieves 0.6912 while the full dataset achieves 0.7051. The worst performer is zhwiki, which misses out on about half of the possible improvement through feature selection. Not sure why. Overall, though, MRMR still captures a significant portion of the improvement of the full dataset at a fifth the number of features on most datasets, so it seems reasonable to move forward with it.
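As a worked example of the (a - b) / (1 - b) column formula, which measures the fraction of the remaining ndcg@10 headroom captured (ndcg@10 is bounded by 1):

```python
def relative_improvement(a, b):
    """Fraction of the headroom above score b that is captured by
    moving to score a, where scores are bounded by 1 (e.g. ndcg@10)."""
    return (a - b) / (1 - b)

# The enwiki numbers from earlier: baseline 0.8391, first CV model 0.8525.
print(round(relative_improvement(0.8525, 0.8391), 3))  # 0.083, i.e. ~8%
```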
Updated the above table with training against the same dataset using the older minimal feature set. Some things I notice:
- On average the minimal feature set achieves 60% of the improvement of the full feature set
- On average wikis retain 80% of the improvement of the full feature set with MRMR feature selection
- fiwiki, nowiki and zhwiki all do worse with MRMR than the baseline feature set. no and fi are reasonably close to equal.
- nowiki sees very little benefit from additional features, even up to the full 244-feature set.
- zhwiki shows a healthy improvement with the full feature set over minimal, but MRMR manages to do much worse than even the minimal feature set.
I'm a bit behind in my reading! This is very cool stuff.
> The a_vs_b columns are calculated as (a - b) / (1 - b)
> baseline_vs_minimal baseline_vs_MRMR baseline_vs_ALL
I think you've got a and b switched in the labels (or, alternatively, in the formula, but I think it's the labels).
It's great that you can do this on so many different data sets at once; it gives a much better sense of how effective and reliable it is.
A random thought for future explorations: it'd be interesting to see the mRMR values that went into the feature selection for each wiki. Instead of choosing 50 features, maybe choose an mRMR threshold and take features that score above it (looking at the existing scores for the top 50 features for wikis that do well could be a guide). You could of course place some absolute limit (say, 75 or 100 features, depending on the performance implications).
My hunch is that some wikis would stop with fewer than 50 features and still get similar scores, while others would take more than 50 and improve. It might also smooth out some of the discrepancies between wikis in terms of features that seem intuitively valuable. For example, maybe a feature was ranked 53rd in a wiki that didn't use it, but based on an mRMR threshold value it would be included.
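The threshold idea might look something like this; `select_by_threshold`, the feature names, and the scores are all hypothetical, and the real mRMR scores would come out of the selection package:

```python
def select_by_threshold(scores, threshold, max_features=100):
    """Keep every feature whose mRMR score clears the threshold,
    best-scoring first, capped at max_features."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, score in ranked if score >= threshold][:max_features]

# Made-up scores for illustration.
scores = {"title_match": 0.9, "text_match": 0.4, "heading_match": 0.1}
print(select_by_threshold(scores, threshold=0.3))
# ['title_match', 'text_match']
```

Different wikis would then naturally end up with differently sized feature sets, with the cap guarding the performance budget.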
Looking at a ranked list may also highlight new discrepancies. If a feature is in the top 10 in most wikis but ranked 47th in one, it's obviously less valuable to that wiki. That said, given the way mRMR works, it isn't necessarily inherently less valuable; it could just be redundant with another feature that scores a bit better.
Again, this is just spitballing for future consideration. Overall, these results look great!