
Investigate tuning CirrusSearch parameters with optimisation algorithms
Open, Medium, Public


I was just randomly looking over some things related to optimization techniques, and found that particle swarm optimization is a potential way to optimize non-differentiable functions (like our search relevance metrics). Might be interesting to investigate whether this could be useful.

Event Timeline

Restricted Application added a subscriber: Aklapper. · Oct 24 2016, 9:51 PM
EBernhardson added a comment (edited). Oct 24 2016, 9:59 PM

I ran a relatively basic test using pyswarm, as it was easy (~15 minutes) to hack into the existing minimization code we have in relevance forge.

I used the following config:

"wgCirrusSearchFullTextQueryBuilderProfiles": {
    "relforge": {
        "builder_class": "CirrusSearch\\Query\\FullTextSimpleMatchQueryBuilder",
        "settings": {
            "default_min_should_match": "1",
            "default_query_type": "most_fields",
            "default_stem_weight": 3,
            "fields": {
                "title": %(x0)f,
                "redirect.title": {
                    "boost": %(x1)f,
                    "in_dismax": "redirects_or_shingles"
                "suggest": {
                    "is_plain": true,
                    "boost": %(x2)f,
                    "in_dismax": "redirects_or_shingles"
                "category": %(x3)f,
                "heading": %(x4)f,
                "text": {
                    "boost": %(x5)f,
                    "in_dismax": "text_and_opening_text"
                "opening_text": {
                    "boost": %(x6)f,
                    "in_dismax": "text_and_opening_text"
                "auxiliary_text": %(x7)f,
                "file_text": %{x8}f
            "phrase_rescore_fields": {
                "all": %(x9)f,
                "all.plain": %(x10)f

I ran using mostly default values, although I set minstep=1e-4 and minfunc=1e-4, which set the minimum step size of the input variables and the minimum change in the minimized function before it quits. The algorithm also quits when it has completed 100 iterations (the default value). Unfortunately I didn't turn on the debug output, so I'm not sure whether it quit because it hit one of those limits or because it ran out of iterations...
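For reference, pyswarm exposes all of this as a single `pso(objective, lb, ub, ...)` call. Below is a minimal, self-contained sketch of the algorithm and of the two stopping criteria mentioned above (minstep/minfunc); this is not the actual pyswarm code, and the swarm parameters and toy objective are illustrative only:

```python
import random

def pso(objective, lb, ub, swarmsize=20, omega=0.5, phip=0.5, phig=0.5,
        maxiter=100, minstep=1e-4, minfunc=1e-4):
    """Minimal particle swarm optimizer sketch with pyswarm-style
    minstep/minfunc termination. lb/ub are per-dimension bounds."""
    dim = len(lb)
    # initialize particles uniformly within bounds, velocities at zero
    x = [[random.uniform(lb[d], ub[d]) for d in range(dim)]
         for _ in range(swarmsize)]
    v = [[0.0] * dim for _ in range(swarmsize)]
    pbest = [xi[:] for xi in x]
    pbest_f = [objective(xi) for xi in x]
    g = min(range(swarmsize), key=lambda i: pbest_f[i])
    gbest, gbest_f = pbest[g][:], pbest_f[g]
    for _ in range(maxiter):
        for i in range(swarmsize):
            for d in range(dim):
                rp, rg = random.random(), random.random()
                # velocity update: inertia + pull toward personal/global best
                v[i][d] = (omega * v[i][d]
                           + phip * rp * (pbest[i][d] - x[i][d])
                           + phig * rg * (gbest[d] - x[i][d]))
                # move the particle, clamped to the bounds
                x[i][d] = min(max(x[i][d] + v[i][d], lb[d]), ub[d])
            f = objective(x[i])
            if f < pbest_f[i]:
                pbest[i], pbest_f[i] = x[i][:], f
        g = min(range(swarmsize), key=lambda i: pbest_f[i])
        if pbest_f[g] < gbest_f:
            # quit when the best position moved less than minstep, or the
            # objective improved by less than minfunc
            step = max(abs(a - b) for a, b in zip(pbest[g], gbest))
            improvement = gbest_f - pbest_f[g]
            gbest, gbest_f = pbest[g][:], pbest_f[g]
            if step <= minstep or improvement <= minfunc:
                break
    return gbest, gbest_f
```

In the real run the objective would be relevance forge evaluating ERR@10 for a candidate set of boosts (negated, since PSO minimizes).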

| variable | current value | minimized value | bounds used |
| --- | --- | --- | --- |
| title boost | 0.3 | 0.51603304 | [0.1, 0.6] |
| redirect.title boost | 0.27 | 0.49507159 | [0.01, 0.6] |
| suggest boost | 0.20 | 0.39897446 | [0.01, 0.40] |
| category boost | 0.05 | 0.13269138 | [0.01, 0.20] |
| heading boost | 0.05 | 0.08133649 | [0.01, 0.20] |
| text boost | 0.6 | 0.55966426 | [0.1, 1.2] |
| opening text boost | 0.5 | 0.337877 | [0.1, 1.0] |
| auxiliary text boost | 0.05 | 0.09787259 | [0.01, 0.20] |
| file text boost | 0.5 | 0.16166894 | [0.01, 1.0] |
| phrase rescore all | 0.03 | 0.06453687 | [0.01, 0.10] |
| phrase rescore all.plain | 0.05 | 0.08298389 | [0.01, 0.20] |

This optimization increased ERR@10 from 0.4126 to 0.4245. This is a relatively small increase that I'm not sure we could confirm in an AB test at p=0.05, but it's interesting.

I've got another run going which widens the bounds on suggest to [0.01, 0.80], removes file_text from the optimization (because our discernatron data probably doesn't use it), and adds wgCirrusSearchIncLinksAloneW to the values being optimized. It's still running, but I've already seen ERR@10 values as high as 0.4334 go by.

Restricted Application added projects: Discovery, Discovery-Search. · Oct 24 2016, 10:17 PM
EBernhardson added a comment (edited). Oct 24 2016, 10:45 PM

Apparently the last round did say why it stopped; I simply didn't notice...

This round finished after the objective (ERR@10) changed less than 0.0001 in an iteration.

Config optimized:

"wgCirrusSearchFullTextQueryBuilderProfiles": {
        "relforge": {
            "builder_class": "CirrusSearch\\Query\\FullTextSimpleMatchQueryBuilder",
            "settings": {
                "default_min_should_match": "1",
                "default_query_type": "most_fields",
                "default_stem_weight": 3,
                "fields": {
                    "title": %(x0)f,
                    "redirect.title": {
                        "boost": %(x1)f,
                        "in_dismax": "redirects_or_shingles"
                    "suggest": {
                        "is_plain": true,
                        "boost": %(x2)f,
                        "in_dismax": "redirects_or_shingles"
                    "category": %(x3)f,
                    "heading": %(x4)f,
                    "text": {
                        "boost": %(x5)f,
                        "in_dismax": "text_and_opening_text"
                    "opening_text": {
                        "boost": %(x6)f,
                        "in_dismax": "text_and_opening_text"
                    "auxiliary_text": %(x7)f,
                    "file_text": 0.5
                "phrase_rescore_fields": {
                    "all": %(x8)f,
                    "all.plain": %(x9)f
    "wgCirrusSearchIncLinksAloneW": %(x10)f

Current ERR@10: 0.4126
Optimized ERR@10: 0.4354

| variable | current value | minimized value | bounds used |
| --- | --- | --- | --- |
| title boost | 0.3 | 0.55271783 | [0.1, 0.6] |
| redirect.title boost | 0.27 | 0.33213544 | [0.01, 0.6] |
| suggest boost | 0.20 | 0.57204283 | [0.01, 0.80] |
| category boost | 0.05 | 0.09392002 | [0.01, 0.20] |
| heading boost | 0.05 | 0.02322809 | [0.01, 0.20] |
| text boost | 0.6 | 0.25845596 | [0.1, 1.2] |
| opening text boost | 0.5 | 0.1 | [0.1, 1.0] |
| auxiliary text boost | 0.05 | 0.12883963 | [0.01, 0.20] |
| phrase rescore all | 0.03 | 0.02518373 | [0.01, 0.10] |
| phrase rescore all.plain | 0.05 | 0.02788314 | [0.01, 0.20] |
| inclinks alone W | | 6.20877494 | [2, 8] |
debt triaged this task as Low priority. Oct 25 2016, 5:47 PM
debt moved this task from needs triage to Up Next on the Discovery-Search board.
MaxSem moved this task from Needs triage to Search on the Discovery board. Oct 26 2016, 9:23 PM
Deskana renamed this task from Investigate tuning CirrusSearch parameters with particle swarm optimization to Investigate tuning CirrusSearch parameters with optimisation algorithms. Dec 6 2016, 6:21 PM
Deskana raised the priority of this task from Low to Medium. Dec 6 2016, 6:44 PM
Deskana added a subscriber: Deskana.

Bumping up priority; this is a relatively experimental thing which is worth exploring more since we're making excellent progress on our goals.

Getting the necessary Hive query into an acceptable state was a giant pain, but it finally looks to be done. It required a little help from the analytics team, and a bunch of trial and error to figure out why Hive wasn't reading the files it created (there seems to be a bug where array<struct<timestamp:timestamp,...>> errored out; changing the timestamp to an integer unix timestamp finally fixed it). The individual Hive scripts all seem to work well, and the Oozie integration to collect the necessary data on a regular basis is mostly done. The final step for the above patch is to actually run the Oozie workflows and find all the parameters I forgot to set across the 3 files that control an Oozie coordinator+workflow.

Next steps:

  • Update the scripts that calculate DBN relevance to use the new format. This code is mostly written already in my batch learn 2 rank code (in the code/ and code/ scripts).
  • Temporarily install the l2r plugin on relforge. Set up features for each piece of our standard fulltext query, then run a few thousand queries that were scored by the DBN to collect the feature vectors. Will need to write some custom code to read the generated logs and filter down to page ids we have DBN relevance results for, but that's relatively easy.
  • Feed those results into a linear ML model, either an SVM from SVMRank or Coordinate Ascent from RankLib. Both use the same input format, which I've also already written code to handle in my batch learn 2 rank code referenced above.

Both of those models spit out linear weights, so they can be used directly as boosts. The final step will be to configure those boosts in relforge and run an existing metric on them, probably NDCG from our discernatron data. We could also run Kendall's Tau using the DBN data, or normalize the DBN data into 4-point relevance grades. As long as the data used for these final tests is a hold-out from the initial training, it should be a valid test.
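For the evaluation step, NDCG from graded relevance is a short computation; here is a sketch using the common exponential-gain form (relforge's exact gain/discount choices may differ):

```python
import math

def ndcg_at_k(graded_relevance, k=10):
    """NDCG@k for one query. `graded_relevance` is the list of relevance
    grades (e.g. 0-3 from Discernatron) in the order the engine ranked
    the results. Exponential gain, log2 positional discount."""
    def dcg(grades):
        return sum((2 ** g - 1) / math.log2(i + 2)
                   for i, g in enumerate(grades[:k]))
    # ideal DCG: the same grades sorted best-first
    ideal = dcg(sorted(graded_relevance, reverse=True))
    return dcg(graded_relevance) / ideal if ideal > 0 else 0.0
```

A corpus-level score is then just the mean of `ndcg_at_k` over the hold-out queries.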

Change 317019 had a related patch set uploaded (by EBernhardson):
[WIP] Calculate click data for top queries

Change 327855 had a related patch set uploaded (by EBernhardson):
UDF for extracting primary full text search request

The full pipeline is built, and I've tested it on a set of 10k normalized queries. I trained Coordinate Ascent models (because they are linear and can be applied directly as boosts in our ES queries), along with two LambdaMART models just to get a comparison. The complete dataset was split into 2/3 train, 1/6 test, 1/6 validation. The validation set was not seen at all during training and was used for this final evaluation. Not sure those splits are the best, but it's a first start. Because CA has randomness to it, I ran 7 training sessions (with 20 random restarts each). The two LambdaMART models are far from perfect; I likely need to spend some time on hyper-parameter optimization to get something reasonable.

There are some potential problems here:

  • Each unique query we saw from users is represented once. For our 10k normalized queries, this is ~95k total queries. On average each normalized query is represented 9.5 times, but there is variance.
  • It might be nice to apply weighting, but RankLib doesn't natively support that. We would have to duplicate the feature vectors with a new query id for each instance, which would take quite a bit more memory. Currently training takes ~5GB of memory, and duplicating with weights would mean at least 35x more data. Perhaps we could do 1/10th weighting or something, though.
  • We could probably come up with a number of query-independent features to add here that may, or may not, improve the scoring. Query-independent features can be calculated at index time, but that requires coming up with appropriate mappings (not too hard), reindexing relforge (time consuming, but doable), and coming up with ES queries that extract the appropriate values (mostly done once the mappings are worked out). Some examples:
    • wp10 scores
    • # of redirects (log2, log10, sqrt?)
    • # of headings
    • bm25 of query against outgoing links (needs mapping changes)
    • bm25 of query against external links
    • there are probably plenty of useful derivative features as well (ratio of feature A to feature B, like (# headings / words in document), etc.)
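The duplication workaround for weighting (second bullet above) could look something like the sketch below. The helper name and qid handling are hypothetical; it assumes the RankLib/SVMRank `label qid:N feat:value ...` input format, and caps copies to bound the memory blow-up (the "1/10th weighting" idea):

```python
def duplicate_for_weighting(lines, weights, max_copies=10):
    """Approximate per-query weights in RankLib/SVMRank input by
    duplicating each query's rows under fresh qids, since RankLib has no
    native instance weighting. `lines` are 'label qid:Q f:v ...' rows;
    `weights` maps original qid -> desired copy count."""
    out = []
    next_qid = 1
    by_qid = {}
    # group rows by their original qid
    for line in lines:
        qid = line.split()[1].split(':')[1]
        by_qid.setdefault(qid, []).append(line)
    for qid, rows in by_qid.items():
        copies = min(max_copies, max(1, weights.get(qid, 1)))
        for _ in range(copies):
            # each copy gets a fresh qid so RankLib treats it as a new query
            out.extend(r.replace('qid:' + qid, 'qid:%d' % next_qid, 1)
                       for r in rows)
            next_qid += 1
    return out
```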

Running a set of Coordinate Ascent models currently; I'll run some analysis and report on overall changes in NDCG, plus some stats comparing individual queries that RankLib can calculate. I'll also take the best model or two from that analysis, set the weights on relforge-search, and run our discernatron data through.

The following is the output (lightly massaged into phabricator tables) of a number of different algorithms in RankLib with our prod feature set. CA is Coordinate Ascent, a linear model we should be able to translate directly into ES boost values. It's non-deterministic, so I ran it 20x to see if there would be much variance; there isn't. It came up with pretty much the same score each time, with slight differences.

Features used

  • all_near_match
  • auxiliary_text_match
  • category_match
  • file_text_match
  • heading_match
  • incoming_links_satu
  • phrase_match
  • popularity_score_satu
  • redirect_and_suggest_dis_max
  • template_featured_article
  • template_featured_list
  • template_featured_picture
  • template_featured_sound
  • template_good_article
  • text_and_opening_text_dis_max
  • text_word_count_log2p (not a prod feature, but easy to add if useful)
  • title_match

Overall comparison

| model | NDCG@10 | improvement | win | loss | p-value |
| --- | --- | --- | --- | --- | --- |
| CA_PROD.model.txt [baseline] | 0.7733 | | | | |
| AdaRank_NDCG@10_001.model.txt | 0.737 | -0.0363 (-4.69%) | 3305 | 6111 | 0.0 |
| AdaRank_NDCG@10_linearnorm_001.model.txt | 0.7918 | +0.0185 (+2.39%) | 5559 | 4441 | 0.0 |
| AdaRank_NDCG@10_sumnorm_001.model.txt | 0.7835 | +0.0102 (+1.31%) | 5156 | 5055 | 0.0 |
| AdaRank_NDCG@10_zscorenorm_001.model.txt | 0.799 | +0.0257 (+3.32%) | 5685 | 4502 | 0.0 |
| AdaRank_NDCG@10_zscorenorm__tol0.0001_001.model.txt | 0.7932 | +0.0199 (+2.57%) | 5613 | 5023 | 0.0 |
| AdaRank_NDCG@10_zscorenorm__tol0.0005_001.model.txt | 0.7932 | +0.0199 (+2.57%) | 5613 | 5023 | 0.0 |
| CA_NDCG@10_000.model.txt | 0.8027 | +0.0293 (+3.79%) | 5565 | 3879 | 0.0 |
| CA_NDCG@10_001.model.txt | 0.8024 | +0.0291 (+3.76%) | 5598 | 3872 | 0.0 |
| CA_NDCG@10_002.model.txt | 0.8025 | +0.0291 (+3.77%) | 5580 | 3783 | 0.0 |
| CA_NDCG@10_003.model.txt | 0.8021 | +0.0288 (+3.72%) | 5638 | 3839 | 0.0 |
| CA_NDCG@10_004.model.txt | 0.8025 | +0.0292 (+3.77%) | 5584 | 3794 | 0.0 |
| CA_NDCG@10_005.model.txt | 0.8019 | +0.0286 (+3.7%) | 5630 | 3931 | 0.0 |
| CA_NDCG@10_006.model.txt | 0.8025 | +0.0292 (+3.77%) | 5610 | 3974 | 0.0 |
| CA_NDCG@10_007.model.txt | 0.8022 | +0.0289 (+3.73%) | 5559 | 3949 | 0.0 |
| CA_NDCG@10_008.model.txt | 0.8025 | +0.0292 (+3.77%) | 5571 | 3804 | 0.0 |
| CA_NDCG@10_009.model.txt | 0.8026 | +0.0292 (+3.78%) | 5568 | 3777 | 0.0 |
| CA_NDCG@10_010.model.txt | 0.8023 | +0.0289 (+3.74%) | 5570 | 3868 | 0.0 |
| CA_NDCG@10_011.model.txt | 0.802 | +0.0286 (+3.7%) | 5591 | 3852 | 0.0 |
| CA_NDCG@10_012.model.txt | 0.8027 | +0.0293 (+3.79%) | 5609 | 3861 | 0.0 |
| CA_NDCG@10_013.model.txt | 0.8026 | +0.0293 (+3.78%) | 5563 | 3769 | 0.0 |
| CA_NDCG@10_014.model.txt | 0.8022 | +0.0288 (+3.73%) | 5549 | 3814 | 0.0 |
| CA_NDCG@10_015.model.txt | 0.802 | +0.0286 (+3.7%) | 5541 | 3819 | 0.0 |
| CA_NDCG@10_016.model.txt | 0.8023 | +0.029 (+3.75%) | 5621 | 3854 | 0.0 |
| CA_NDCG@10_017.model.txt | 0.8024 | +0.029 (+3.75%) | 5569 | 3805 | 0.0 |
| CA_NDCG@10_018.model.txt | 0.8026 | +0.0292 (+3.78%) | 5605 | 3836 | 0.0 |
| CA_NDCG@10_019.model.txt | 0.8023 | +0.029 (+3.75%) | 5595 | 3762 | 0.0 |
| LambdaMART_125tree_linearnorm_NDCG@10_001.model.txt | 0.8265 | +0.0531 (+6.87%) | 6340 | 4062 | 0.0 |
| LambdaMART_125tree_linearnorm_NDCG@10_002.model.txt | 0.8265 | +0.0531 (+6.87%) | 6340 | 4062 | 0.0 |
| LambdaMART_125tree_sumnorm_NDCG@10_001.model.txt | 0.8038 | +0.0305 (+3.94%) | 5766 | 4660 | 0.0 |
| LambdaMART_125tree_zscorenorm_NDCG@10_001.model.txt | 0.8228 | +0.0495 (+6.4%) | 6260 | 4158 | 0.0 |
| ListNet_NDCG@10_zscorenorm_001.model.txt | 0.7956 | +0.0223 (+2.88%) | 5685 | 4571 | 0.0 |
| RankBoost_NDCG@10_001.model.txt | 0.7874 | +0.014 (+1.81%) | 5359 | 4986 | 0.0 |
| RankNet_NDCG@10_1layer_20node_zscorenorm_001.model.txt | 0.7997 | +0.0263 (+3.4%) | 5720 | 4540 | 0.0 |
| RankNet_NDCG@10_linearnorm_001.model.txt | 0.7822 | +0.0088 (+1.14%) | 5238 | 4990 | 0.0 |
| RankNet_NDCG@10_zscorenorm_001.model.txt | 0.7993 | +0.026 (+3.36%) | 5707 | 4589 | 0.0 |
| RankNet_NDCG@10_zscorenorm_2layer_001.model.txt | 0.8025 | +0.0292 (+3.77%) | 5701 | 4364 | 0.0 |


The CA models are similar enough that choosing one in particular doesn't seem easy. Below are the weights for CA_NDCG@10_013.model.txt, which has the smallest number of 'loser' queries that got worse with the new weights. These are a little hard to compare to our current prod weights, because the model's weights are normalized such that they sum to 1, so I summed the prod weights (11.93) and divided each by that sum to get the norm prod weight column. Note that while popularity and incoming links look to have high weights, they aren't as high as they seem, because their maximum value is much lower than the other features' (and no feature normalization is applied, to match using the weights as boosts).
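The normalization described above is just dividing each boost by the total; a small sketch with a hypothetical helper name, using a subset of the prod boosts from the tables above for illustration:

```python
def normalize(weights):
    """Scale a dict of boosts so they sum to 1, matching how the CA
    model reports its weights."""
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}

# illustrative subset of the prod boosts; the real sum over all
# features is 11.93 as noted above
prod_subset = {"title": 0.3, "redirect.title": 0.27, "text": 0.6}
norm = normalize(prod_subset)
```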

| feature | weight | norm prod weight | prod weight |
| --- | --- | --- | --- |

The weighting on category_match is very surprising to me, but I triple-checked and that's what keeps coming out. file_text_match should be disregarded: because we only pulled queries from the content index, nothing has file_text to match.

EBernhardson added a comment (edited). Jan 5 2017, 6:39 AM

For curiosity's sake, I worked up some code to output the data in a format suitable for LightGBM (a more advanced gradient boosting library than RankLib that also does LambdaMART) to see if there was much difference. Unfortunately I forgot to put a random seed on the train/validation/test splits in the previous run, so I couldn't recreate the exact same sets to run LightGBM against. Instead I re-ran the prod weights and re-trained a RankLib LambdaMART model against the new splits.
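A seeded, query-grouped split makes the folds recreatable across runs (the step forgotten above). A sketch with hypothetical naming, using the 2/3, 1/6, 1/6 fractions from earlier, and splitting by normalized query so all results for one query land in the same fold:

```python
import random

def split_queries(qids, seed=12345, train=2 / 3, test=1 / 6):
    """Reproducible train/test/validation split over query ids.
    Fixing the seed makes the shuffle, and hence the folds, identical
    on every run."""
    rng = random.Random(seed)
    qids = sorted(set(qids))   # deterministic starting order
    rng.shuffle(qids)
    n = len(qids)
    n_train = round(n * train)
    n_test = round(n * test)
    return (qids[:n_train],
            qids[n_train:n_train + n_test],
            qids[n_train + n_test:])
```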

The problem with this result, though, is that LightGBM doesn't offer any native feature normalization. I'm going to have to work up something to pre-process the features similarly to RankLib before feeding them into LightGBM.

Tangentially, it's perhaps interesting that on a different data split LambdaMART pulled a 15% improvement over the prod weights. The CA models take significantly longer to train (they are basically brute-force searches), but are showing around 0.806 NDCG@10, so it's not that a linear combination can't improve; this data split just happens to have poorly performing queries in the test set.

Overall comparison

| model | NDCG@10 | improvement | win | loss | p-value |
| --- | --- | --- | --- | --- | --- |
| CA_PROD.model.txt [baseline] | 0.718 | | | | |
| LambdaMART_125tree_linearnorm_NDCG@10_001.model.txt | 0.8263 | +0.1083 (+15.08%) | 8417 | 3074 | 0.0 |
| LightGBM_100tree_NDCG@10_001.model.txt | 0.7686 | +0.0506 (+7.04%) | 8657 | 3602 | 0.0 |


Re-ran the above LightGBM training with a simple re-implementation of RankLib's linear normalization of features, (value - min) / (max - min). Surprisingly the results are almost the same, actually very slightly worse. At least in this case RankLib seems to be out-performing LightGBM (tuning hyper-parameters may change that, since they can make a big difference in the resulting trees, but since this ticket is about training a linear model I'm not going to investigate further).
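The re-implemented normalization, as a sketch applied per feature column (RankLib's actual implementation may differ in edge cases, e.g. constant columns):

```python
def linear_normalize(columns):
    """RankLib-style linear feature normalization,
    (value - min) / (max - min), applied to each feature column before
    handing the data to LightGBM."""
    out = []
    for col in columns:
        lo, hi = min(col), max(col)
        if hi == lo:
            # constant feature: map everything to 0 to avoid div-by-zero
            out.append([0.0] * len(col))
        else:
            out.append([(v - lo) / (hi - lo) for v in col])
    return out
```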

Applied the above weights (CA_NDCG@10_013.model.txt) and ran them through relevance forge to get PaulScore (using 10k sessions with satisfied clicks), along with nDCG and ERR numbers using 150 queries from Discernatron. I'll go back and review to make sure I have all the weights set appropriately and that they mean something, but the differences here are relatively minor and go both ways. I found the report from RankLib interesting, so I formatted the results from relevance forge such that they could be run through the same analysis.

The Discernatron data showed a minor increase in NDCG/ERR scores, but the p-values suggest the result isn't significant. I'm not sure how big a sample would be required to measure small changes in NDCG, but 150 queries obviously isn't enough.

PaulScore goes the other way with an equally minor change, although this time, due to the larger (10k session) sample size, the p-values show significance across the board. These all show the new values as slightly worse. We've never been able to validate PaulScore as having predictive power for online testing, but this isn't particularly promising.
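For reference, a sketch of one standard way to compute paired p-values like those below over per-query metric scores: a two-sided sign-flip permutation test. This is an illustration of the general technique, not necessarily the exact test relevance forge runs:

```python
import random

def paired_permutation_p(a, b, n_perm=2000, seed=0):
    """Two-sided paired permutation (sign-flip) test. `a` and `b` are
    per-query scores for the two configurations, aligned by query.
    Returns an estimated p-value for the observed mean difference."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(n_perm):
        # under the null, each per-query difference is equally likely
        # to have either sign
        s = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(s) >= observed:
            hits += 1
    return hits / n_perm
```

With 150 Discernatron queries a small NDCG shift rarely clears this test, while the 10k-session PaulScore samples easily do, matching the pattern in the tables.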

Overall comparison

| metric | value | improvement | win | loss | p-value |
| --- | --- | --- | --- | --- | --- |
| old_ndcg@1 [baseline] | 0.461 | | | | |
| new_ndcg@1 | 0.4678 | +0.0067 (+1.46%) | 13 | 9 | 0.7667 |
| old_ndcg@2 [baseline] | 0.4142 | | | | |
| new_ndcg@2 | 0.4141 | -1.0E-4 (-0.02%) | 29 | 26 | 0.9964 |
| old_ndcg@3 [baseline] | 0.3936 | | | | |
| new_ndcg@3 | 0.3994 | +0.0058 (+1.48%) | 36 | 35 | 0.6016 |
| old_ndcg@4 [baseline] | 0.3907 | | | | |
| new_ndcg@4 | 0.3884 | -0.0023 (-0.58%) | 44 | 40 | 0.821 |
| old_ndcg@5 [baseline] | 0.3857 | | | | |
| new_ndcg@5 | 0.3767 | -0.0091 (-2.36%) | 44 | 45 | 0.3066 |
| old_paulscore@0.10 [baseline] | 0.4881 | | | | |
| new_paulscore@0.10 | 0.4755 | -0.0126 (-2.58%) | 1449 | 1663 | 0.0 |
| old_paulscore@0.50 [baseline] | 0.5459 | | | | |
| new_paulscore@0.50 | 0.5328 | -0.013 (-2.39%) | 1632 | 1874 | 0.0 |
| old_paulscore@0.70 [baseline] | 0.5921 | | | | |
| new_paulscore@0.70 | 0.5789 | -0.0132 (-2.23%) | 1623 | 1885 | 0.0 |
| old_paulscore@0.90 [baseline] | 0.6717 | | | | |
| new_paulscore@0.90 | 0.6584 | -0.0134 (-1.99%) | 1623 | 1885 | 0.0 |

TJones added a subscriber: TJones. Feb 15 2017, 4:35 PM
debt added a subscriber: debt. Mar 8 2017, 8:28 PM