
Evaluate the "boosted all field"
Closed, ResolvedPublic

Description

The "all field" has been introduced to resolve performance issues because it allows a full text query to run on 2 fields (all and all.plain) vs 14 (title, title.plain, redirect ...) when disabled.
It has been identified as problematic with some queries:

  • kennedy: JFK ranked #12 with the all field and #2 with the all field disabled.
  • T107666: check this ticket for more technical details
  • T116706: disabling the all field resolves most of the examples listed in that ticket

The problem is that the all field does not play well with the core Lucene scoring functions. Cirrus tries to boost the score if the searched terms appear in the title, but this boost is somewhat cancelled by the all field under certain conditions (when the searched term is not very rare).

We should re-evaluate the benefits of the all field. If performance is still a concern and the all field cannot be avoided, we could work on workarounds to compensate for the title boosting issue.

Event Timeline

We should carefully check that not using the all field does not introduce problems:

  • performance
  • carefully check that coord is not the reason why the all field was considered better at the time. Coord may well have been an issue back then, which could explain why the default weight for title is so high.
    • Not using the all field generates more query clauses, and coord is a factor that ranks pages matching most of the query clauses higher. E.g. a page with a match in category, template and auxiliary_text could be ranked higher than a page with a match in the title and content because of coord (see the sketch below).
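
A minimal sketch (with made-up clause scores, not from a real query) of how coord can flip a ranking: Lucene multiplies a boolean query's score by matchedClauses/totalClauses, so a page matching three weak clauses can beat a page matching two strong ones.

  # A minimal sketch of Lucene's coord factor (illustrative numbers):
  # the boolean score is scaled by matched_clauses / total_clauses.
  def bool_score_with_coord(clause_scores, total_clauses):
      matched = [s for s in clause_scores if s > 0]
      coord = len(matched) / total_clauses
      return coord * sum(matched)

  # Page A: weak matches in category, template and auxiliary_text.
  # Page B: stronger matches in title and text only.
  page_a = bool_score_with_coord([1.5, 1.2, 1.0], total_clauses=4)  # 0.75 * 3.7 = 2.775
  page_b = bool_score_with_coord([3.0, 2.4], total_clauses=4)       # 0.50 * 5.4 = 2.70
  print(page_a > page_b)  # True: coord flips the ranking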

This task is also going to serve as a first proof-of-concept and bug-finding exercise for the more general process of testing these configuration changes in the Relevance Lab.

We're going to pull together three corpora for testing: two focused on the problem queries and one more general set for regression testing:

  • ~1K one-word queries with 500+ results
  • ~1K multiword queries with 500+ results
  • ~1K regression test set

Queries with 500+ results are the main target of this change, and one-word and multiword queries are processed differently internally, so we want query sets that let us see the impact of the change on each kind of query.

The regression test set is a collection of full text queries randomly sampled from a full day of enwiki queries. We'll use this to gauge the overall impact of the change, and to look for unexpected side effects outside the target queries (those with 500+ results).

We're starting with enwiki, but we'll generate several regression test sets of different sizes and for different wikis, and targeted test sets for the larger wikis, too, depending in part on where we can run tests (suggesty vs elsewhere) and what indexes are available.

Our metric for what qualifies as a change will be queries that have results moving into or out of the top 3 results. If there aren't many of those for some reason, we might expand to the top 5.
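
For concreteness, a minimal sketch of the metric (the result lists below are hypothetical):

  # A query counts as "changed" if any result moves into or out of the
  # top n. Result lists here are made up for illustration.
  def query_changed(before, after, n=3):
      return set(before[:n]) != set(after[:n])

  before = ["JFK", "JFK (film)", "John F. Kennedy"]
  after = ["John F. Kennedy", "JFK", "JFK Airport"]
  print(query_changed(before, after))  # True: "JFK (film)" left the top 3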

We'll manually review 100 randomly sampled changed queries and assess whether the changes are better, worse, or neither and make recommendations (e.g., deploy, A/B test, abandon). While that won't give precise proportions, the confidence interval is small enough to get a sense of overall performance (e.g., 1% worse, 70% better is great; 15% worse, 5% better is terrible; 1% worse and 1% better might not be worth pursuing.)
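
To make the "confidence interval is small enough" claim concrete, a rough normal-approximation check for a review of 100 queries (a back-of-the-envelope sketch, not a rigorous analysis):

  import math

  # Rough 95% confidence interval (normal approximation) for a proportion
  # estimated from a manual review of n=100 changed queries.
  def ci95(successes, n):
      p = successes / n
      half = 1.96 * math.sqrt(p * (1 - p) / n)
      return (p - half, p + half)

  print(ci95(15, 100))  # "15% worse" is really somewhere in (0.08, 0.22)
  print(ci95(70, 100))  # "70% better" is somewhere in (0.61, 0.79)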

@ksmith asked me to look into this task for the possibility of running an A/B test. Right now, I'm not certain what the A/B test here would be for, so I'm not comfortable spinning off tasks for an A/B test just yet. Let's discuss this at some point so I understand this better.

tl;dr - I think that the all field resolved obvious scoring problems, but the way it handles boosts is sub-optimal, resulting in pages with a title match being ranked badly.

I'm not sure that we should run an A/B test; after looking at some queries, the all field seems to be a good idea (both for performance and scoring):

  • performance: because we run a query over 2 fields vs 14 fields
  • scoring: because the way the query is built when the all field is disabled is very hazardous

(jump to conclusion if you don't care about the details)

Full details:

1/ Why scoring can be really bad without the all field
QueryString uses DisMax by default to build the Lucene query. DisMax simply takes the best score over all fields for each word. For example, for a query like history france:
The best score for history is taken across the 7*2 fields (title, redirect, opening_text, text, category, auxiliary_text, ... and the corresponding .plain fields), and likewise the best score for france in those same fields.
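
A minimal sketch of that behavior (the per-field scores below are made up; real Lucene DisMax also supports a tie_breaker that mixes in the non-best scores):

  # DisMax with tie_breaker=0: each word contributes only its best
  # per-field score. The per-field scores here are hypothetical.
  def dismax_sum(per_word_field_scores):
      return sum(max(scores.values()) for scores in per_word_field_scores)

  history_france = [
      {"title": 2.1, "redirect.title.plain": 8.5, "text.plain": 1.3},   # history
      {"title": 3.0, "redirect.title.plain": 18.1, "text.plain": 2.2},  # france
  ]
  # The heavily boosted, un-normalized redirect field wins for both words.
  print(dismax_sum(history_france))  # 26.6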

The problem is that some of these fields have length normalization disabled:

  • title: yes
  • redirect: no
  • text: yes
  • category: no
  • ...

This is problematic because redirect has a very high boost, and without norms its score can be very high; DisMax will probably choose redirect over other fields frequently.
Scores are thus likely to be very dependent on the number of redirects, with very high variation between documents.
Ignoring norms is a bit dangerous in our case: redirects can be used for many different purposes, and if the query includes stopwords the consequences can be very bad. Articles like Ecclesiastical_Province_of_British_Columbia_and_Yukon can be overboosted because of their number of redirects:

For the query the province:

Ecclesiastical Province of British Columbia and Yukon - 2235.7048
The Ecclesiastical Province of British Columbia and Yukon is one of four ecclesiastical provinces in the Anglican Church of Canada. It was founded in 1914
2235.7048 = Rescore product primW=1 secW=1
  911.29834 = Rescore sum primW=1 secW=10
    13.328976 = Bool coord=0.5
      26.657951 = Bool
        8.536828 = DisMax best=redirect.title.plain:the
-->       8.536828 = TFIDF term=redirect.title.plain:the^15.0 tf=17(freq=289) idf=4.5915995 qNorm=4.5915995 fNorm=1 <-- no norms, with ridiculously high freq!
        18.121124 = DisMax best=redirect.title.plain:province
          18.121124 = TFIDF term=redirect.title.plain:province^15.0 tf=10.34408(freq=107) idf=8.57604 qNorm=8.57604 fNorm=1

Because of its very high number of redirects this doc even beats "The Province", which has a match on all_near_match...
See: the province (without the all field) vs. the province (with the all field)

2/ Why scoring is sometimes bad with the all field
By combining all the fields into a single field we needed a way to boost words that appear in a specific field (a word in the title should have more impact than a word in the body text). The trick was to copy the boosted field's content multiple times: if we configure the title field with a boost of 20, we copy its content 20 times into the all field. This inflates the raw tf value in the similarity formula, but it also affects the length norm: the duplicated content makes the all field artificially longer, resulting in a lower length norm.
It appears that this length norm effect can outweigh the increased raw tf.
For example, with the query einstein, the score of the Albert Einstein page on enwiki for the all.plain field is 0.639 (freq: 602 but a very low norm of 0.0068); that is way lower than the same score for the page Einstein refrigerator, 1.1656908 (freq=125 but a higher norm of 0.027). Despite a raw frequency almost 5 times higher, Albert Einstein scores half of what Einstein refrigerator does on the all field.
The page Albert Einstein is the top result only because of the redirect Einstein and the all_near_match query, which boost the score up to 46:

Albert Einstein - 273.16702
"Einstein" redirects here. For other uses, see Albert Einstein (disambiguation) and Einstein (disambiguation). Albert Einstein (/ˈaɪnstaɪn/; German: [ˈalbɛɐ̯t
273.16702 = Rescore product primW=1 secW=1
  46.218018 = Bool
    0.6395384 = DisMax best=all.plain:einstein
-->  0.6395384 = TFIDF term=all.plain:einstein tf=24.535688(freq=602) idf=7.7079268 qNorm=0.06417932 fNorm=0.0068359375 <-- Very low norm!
    45.57848 = TFIDF term=all_near_match:einstein tf=3.8729835(freq=15) idf=13.541274 qNorm=0.06417932 fNorm=1

Einstein refrigerator - 1.2544514
The Einstein–Szilard or Einstein refrigerator is an absorption refrigerator which has no moving parts, operates at constant pressure, and requires only
1.2544514 = Rescore product primW=1 secW=1
  0.5828454 = Bool coord=0.5
    1.1656908 = DisMax best=all.plain:einstein
-->  1.1656908 = TFIDF term=all.plain:einstein tf=11.18034(freq=125) idf=7.7079268 qNorm=0.06417932 fNorm=0.02734375
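
The two all.plain lines above can be reproduced with the classic Lucene TF/IDF formula (tf = sqrt(freq), with the query weight folded into qNorm), which makes the norm effect explicit:

  import math

  # Reproduces the two all.plain explain lines above with classic Lucene
  # TF/IDF: score = sqrt(freq) * idf^2 * queryNorm * fieldNorm.
  def tfidf(freq, idf, query_norm, field_norm):
      return math.sqrt(freq) * idf ** 2 * query_norm * field_norm

  idf, qnorm = 7.7079268, 0.06417932
  print(tfidf(602, idf, qnorm, 0.0068359375))  # ~0.6395 (Albert Einstein)
  print(tfidf(125, idf, qnorm, 0.02734375))    # ~1.1657 (Einstein refrigerator)
  # The 4x higher fieldNorm outweighs the ~2.2x higher sqrt(freq).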

The same comparison on the text.plain field shows better normalization (no longer a factor of 2 between the two scores):

Albert Einstein - 110.43612
"Einstein" redirects here. For other uses, see Albert Einstein (disambiguation) and Einstein (disambiguation). Albert Einstein (/ˈaɪnstaɪn/; German: [ˈalbɛɐ̯t
110.43612 = Rescore product primW=1 secW=1
  18.685284 = Bool
    9.675292 = DisMax best=redirect.title.plain:einstein
-->  0.014188072 = TFIDF term=text.plain:einstein tf=18.681541(freq=349) idf=7.9261518 qNorm=0.001547376 fNorm=0.0078125
    9.009991 = DisMax best=redirect.title.near_match:einstein

Einstein refrigerator - 8.231416
The Einstein–Szilard or Einstein refrigerator is an absorption refrigerator which has no moving parts, operates at constant pressure, and requires only
8.231417 = Rescore product primW=1 secW=1
  3.824495 = Bool coord=0.5
    7.64899 = DisMax best=redirect.title.plain:einstein
-->   0.013691541 = TFIDF term=text.plain:einstein tf=3.6055512(freq=13) idf=7.9261518 qNorm=0.001547376 fNorm=0.0390625

It's still unclear to me what the real reason behind the score difference with the all field is; it could be because we append a lot of content (like auxiliary_text), or because we duplicate some field content...

Conclusion?
It seems to me that it's all about proper normalization: without the all field, the absence of normalization on certain fields can cause weird behaviors; with the all field, length normalization seems to have too much impact.

Easy (sub-optimal) fixes:

  • We could add another query clause on the title directly. It does not resolve the root of the problem but can help in some cases.
  • Compensate the norm impact by adding a rescore clause on article size (see the sketch below).
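
For the rescore idea, a rough, untested sketch of what the clause could look like (expressed as a Python dict for the Elasticsearch query DSL; the text_bytes field, the log1p modifier and the weights are placeholders, not a tuned profile):

  # Hypothetical rescore boosting larger articles to compensate the
  # all field's length norm; all values below are placeholders.
  rescore = {
      "window_size": 8192,
      "query": {
          "rescore_query": {
              "function_score": {
                  "functions": [
                      {"field_value_factor": {"field": "text_bytes", "modifier": "log1p"}}
                  ]
              }
          },
          "query_weight": 1.0,
          "rescore_query_weight": 0.1,
      },
  }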

Proper fix:
We should look into proper normalization. The problem with the similarity we currently use is that we can't really control the norm impact. This is a nice feature of BM25, where we can control and optimize the parameter b, which governs the norm impact (Lucene defaults vs BM25).
We could also in theory try adding norms to redirect and see how that plays out, but I think it'd be better to investigate BM25.
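
For reference, a minimal per-term BM25 sketch showing the b knob (the stats below are illustrative, k1 left at its usual default):

  # b (0..1) controls how strongly document length normalizes the score,
  # a knob classic Lucene TF/IDF lacks. Stats here are illustrative.
  def bm25_term(freq, idf, doc_len, avg_len, k1=1.2, b=0.75):
      norm = k1 * (1 - b + b * doc_len / avg_len)
      return idf * freq * (k1 + 1) / (freq + norm)

  # Same term stats in a long document: lowering b reduces the length penalty.
  print(bm25_term(602, 7.7, doc_len=120000, avg_len=600, b=0.75))  # ~13.0
  print(bm25_term(602, 7.7, doc_len=120000, avg_len=600, b=0.30))  # ~15.1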

NOTE: this is the first time I've looked into our scores, I may have overlooked something...

So what should our next tasks be?

I'd like to go for proper fixes where possible. It seems we could test changing the similarity module on codfw to BM25 and run an A/B test where we ship users in one group to codfw, maybe (probably) running it through the Relevance Lab first to get an idea of the effect. One blocker on a test with user queries shipped to codfw is that we are not (yet) encrypting that traffic, but that is a work in progress. Changing the similarity module doesn't require a reindex, just closing and reopening the index. Should be easy enough to do.

The sub-optimal fixes seem pretty easy, but are they worthwhile when we have proper fixes available? IMO, create a task for BM25 and try it out if that seems like the best option.

@EBernhardson: talking to Trey yesterday, we decided on this short-term plan:

  • Fix the non-all-field usage by enabling norms on all fields (redirects, ...). The WMF installation won't benefit from this fix, but it seems to be an obvious problem that we need to address. Since norms are a big in-memory array, we should maybe look into lazy loading so we don't consume memory on our cluster.
  • For the all field problem we could bundle some tests with pageviews and/or boostlinks by adding a rescore clause on the number of tokens; it could help to work around the impact of norms.

Mid-term:

  • Concerning the proper fix, I'm still unsure whether we are ready to go for BM25; my concern is that I'm not really sure BM25 will help right now. Boosting fields with the copy_to hack does not seem to be a good idea; some queries are really bad when they don't match all_near_match.

I think that we can't really tune the system with the all field: it requires a reindex whenever we change boost values, and it seems pretty hazardous to play with the raw tf.
My plan would be to rework the main full text search query (a rough sketch follows the list):

  • Use the all field as a preliminary filter for fast retrieval, and maybe drop positions to save space? (a phrase rescore on the all field seems to have problems that would be hard to address)
  • Add secondary clauses on the specific fields we want to boost
  • Review the phrase rescore on the all field
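
A rough sketch of the proposed query shape (a Python dict for the ES query DSL; the field names are CirrusSearch's, but the boosts and the exact clause layout are illustrative, not a tested profile):

  # The all field only retrieves/filters; per-field should clauses carry
  # the boosts and can be tuned at query time without reindexing.
  query = {
      "bool": {
          "filter": [
              {"match": {"all": {"query": "the province", "operator": "and"}}}
          ],
          "should": [
              {"match": {"title": {"query": "the province", "boost": 20}}},
              {"match": {"redirect.title": {"query": "the province", "boost": 15}}},
              {"match": {"text": {"query": "the province", "boost": 1}}},
          ],
      }
  }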

This would allow us to fine-tune the system at query time. I've made a quick test by rewriting the fulltext query (using the shingles in the suggest field):

Note that this query builder also has obvious problems; it certainly overboosts titles... e.g. the province newspaper will drop interesting results like Hewitt Bostock (founder).

Basically my problem is: with the all field I'm unable to tune these queries, and I'm pretty sure that BM25 + the all field won't fix "history france".
Another problem I encountered was due to the phrase rescore, which seems to be overboosted: e.g. if it matches the name of a category, then most articles within this category are likely to be top results. Again, with an all-in-one field I don't know how we can fine-tune the system :/

I'm sorry but it's still unclear in my mind :)

I've created a few tasks to follow up on what we've learned here:
Short-term:

  • T128061: Enable norms on all fields that are used for scoring (fixes non-WMF installations that do not use the all field)
  • T128070: Experiment with a rescore profile that boosts articles according to their size (works around the length norm with the all field)
  • T128071: Evaluate the phrase rescore (it can cause weird behaviors in some cases; we should maybe reconsider the phrase rescore on the all field)

Mid/Long-term EPICS (not added to the sprint):

  • T128073: EPIC: experiment with a new fulltext query (try to address the all field problem and regain control over boost configuration)
  • T128076: EPIC: evaluate the indexing strategy and try to get more benefit from the semi-structured content we have (not directly related, but can be an issue when we want to e.g. add specific boosts to categories)

I haven't added anything yet concerning BM25...