
Analyze results of the second BM25 test
Closed, Resolved · Public · 10 Estimated Story Points

Event Timeline

The test for ja and zh has been turned off today - we can start looking at that data.

However, the test for th is going to run a bit longer to gather enough data.

In the analysis for the first BM25 test, @mpopov used Levenshtein (edit) distance, adjusted by overlapping results, to identify query reformulation. However, since most Chinese words contain only 1-3 characters, Levenshtein distance is not suitable for measuring the similarity of Chinese queries (Japanese has the same issue; I'm not sure about Thai).

Using lexical similarity to identify query reformulation is another option, but I'm not sure whether it makes sense for our BM25 test. I'm also thinking about using the number of queries within each search session to approximate the number of query reformulations. @mpopov @TJones @EBernhardson any comments or suggestions? Thanks!
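
(Purely as an illustration of the short-query problem, not part of the analysis: a minimal sketch of plain Levenshtein distance on made-up two-character Chinese queries, showing how little resolution a threshold tuned on long English queries would have here.)

# Toy illustration: edit distance between very short CJK queries.
# Standard dynamic-programming Levenshtein; the example strings are made up.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# A 2-character query can only be at distance 0, 1, or 2 from another
# 2-character query, so a threshold tuned on English queries (often 10+
# characters long) has almost no resolution here.
print(levenshtein("北京", "南京"))    # 1 -- a plausible reformulation
print(levenshtein("北京", "天气"))    # 2 -- an unrelated query, barely "farther" away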

@chelsyx: Good catch on the edit distance and Chinese. I believe @mpopov had to play around with the parameters a bit to find good ones for English query reformulation detection, so the English params may or may not make sense for other languages in general, and certainly not for Chinese and Japanese. Edit distance might work with Thai in general because of the way the characters are encoded in Unicode, but the best settings would not be the same as those for English.

I suggest skipping the whole query reformulation aspect of the report for now. It was a very cool new metric that @mpopov added, and in general we should think about ways to make it meaningful for other languages and scripts, but we are in a bit of a hurry for this analysis. Earlier today in the Sprint Planning meeting, @Deskana suggested we move forward with Deployment Plan B if the analysis wasn't going to be done this week—and we'd prefer to know whether Plan A is viable; also Plan B is a ton more work.

(FYI: Plan A is to turn on BM25 for all projects. Plan B is to turn it on everywhere except for spaceless languages. Plan B is a real pain because, as I understand it, we can do configs by project—all Wikipedias, all Wiktionaries, etc—but not by language, so there'd be a whole lot of configs to configure to make Plan B work.)

If ZRR, PaulScore, User Engagement, and First Clicked Result’s Position all point in the same direction, the Query Reformulation info, while very interesting, isn't necessary to choose between Plan A and Plan B.

Thanks @TJones! I will finish the analysis as soon as possible.

For query reformulation, indeed everything is different in these three languages compared to English in our first test. From the various literature I've read, here are a few somewhat generic rules I've seen used:

  • If two queries from the same user share at least one result (you need to join against CirrusSearchRequestSet to get the list of returned results), consider it a reformulation
  • If two queries share at least one "word", consider it a reformulation

For determining whether two queries share a word, you should be able to use the Elasticsearch termvectors[1] API to tokenize, which you can query from stat1002. Unfortunately you won't have access to our load balancer from there, but you can query the servers directly.
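
(A rough Python sketch of that idea, assuming the requests library is available on stat1002; the host and index are the same ones used in the curl example below, and the helper names are mine.)

import requests

ES = "http://elastic2020.codfw.wmnet:9200"

def tokens(text, index="zhwiki_content", field="title"):
    # Tokenize `text` with the wiki's own analysis chain via the
    # _mtermvectors "artificial document" API; returns the set of tokens.
    body = {"docs": [{"doc": {"title": text},
                      "fields": [field],
                      "positions": False, "offsets": False,
                      "term_statistics": False, "field_statistics": False}]}
    resp = requests.post(ES + "/" + index + "/page/_mtermvectors", json=body).json()
    tv = resp["docs"][0].get("term_vectors", {}).get(field, {})
    return set(tv.get("terms", {}))

def shares_a_word(q1, q2, index="zhwiki_content"):
    # Rule 2 above: treat q2 as a reformulation of q1 if they share a token.
    return bool(tokens(q1, index) & tokens(q2, index))

print(shares_a_word("历史上的今天", "今天的天气"))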

To be honest though, this will give mixed results. I tried a few examples copy/pasted from headings on zh/ja/th, and the output looks hit-or-miss (though I don't know any of these languages):

Example:

curl -XPOST http://elastic2020.codfw.wmnet:9200/zhwiki_content/page/_mtermvectors -d '{
        "docs": [
                {
                        "doc": { "title": "は5戦して未勝利" },
                        "fields": ["title", "title.plain"],
                        "positions": true,
                        "offsets": false,
                        "term_statistics": false,
                        "field_statistics": false
                }
        ]
}'

Simplified results; values in parentheses are via Google Translate. Also note that the tokens below are in a bit of a random order, but Elasticsearch returns the original positions so you can reorder if desired.

zh: 历史上的今天 (today in history)
comments: looks like a simple unigram model, but the per-character translations seem sane-ish.
tokens: [
  "上", (on)
  "今", (this)
  "历", (calendar)
  "史", (history)
  "天", (day)
  "的" (of)
]

zh: https://zh.wikipedia.org/wiki/2008年夏季奥林匹克运动会
comments: not showing the tokens here, but ran the entire text content through and only got unigrams back here as well

ja: 新しい画像 (new image)
comments: looks like a bigram model, may also not be so useful
Tokens: [
  "い画", (painting)
  "しい", (asthma)
  "新し", (new)
  "画像" (image)
]

ja: ジャマイカ国立図書館 (national library of jamaica)
comments: bigrams again
tokens: 
    "イカ", (squid)
    "カ国", (country)
    "ジャ", (ja)
    "マイ", (my)
    "ャマ", (huma)
    "図書", (book)
    "国立", (national)
    "書館", (library)
    "立図" (statue)

thai: เรื่องจากข่าว (the news)
comments: looks better, try a more complex one
tokens: [
  "ข่าว", (news)
  "เรื่อง" (subject
]

thai: บทความคัดสรรเดือนนี้ (this month's featured article)
comments: not bad
tokens: [
  "คัด", (cull)
  "บทความ", (the article)
  "สรร", (appropriations)
  "เดือน" (month)
]

skipping query reformulation is also reasonable :)

FWIW, I saw recently that spaCy (Python NLP) has added the jieba library to tokenize Chinese, which may work better (but this is just because I was interested; you don't need to do all this :)

zh: 历史上的今天
tokens:
    历史
    上
    的
    今天

zh: 2008年夏季奥林匹克运动会
tokens:
    2008
    年
    夏季
    奥林匹克运动会
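
(A minimal standalone sketch of the same thing, using jieba directly rather than through spaCy; the two strings are the examples above.)

import jieba

# Segment the two example strings with jieba's default dictionary.
for text in ["历史上的今天", "2008年夏季奥林匹克运动会"]:
    print(text, "->", " / ".join(jieba.cut(text)))

# The output should be along the lines of the segmentations listed above,
# e.g. 历史 / 上 / 的 / 今天 (exact segments depend on the dictionary version).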

Thank you so much @EBernhardson ! Let me do the analysis first without query reformulation as @TJones suggested, in order to provide some information for the decision between Plan A and Plan B, then see what I can do with the tokenizer. :)

mpopov set the point value for this task to 10. (Nov 16 2016, 11:22 PM)

I replicated @mpopov's analysis without query reformulation for this second test, and the results are here: https://wikimedia-research.github.io/Discovery-Search-2ndTest-BM25_jazhth/
For the test group, ZRR is lower, but Clickthrough rate is significantly lower and PaulScores are slightly lower.

I replicated @mpopov's analysis without query reformulation for this second test, and the results are here: https://wikimedia-research.github.io/Discovery-Search-2ndTest-BM25_jazhth/
For the test group, ZRR is higher, but Clickthrough rate is significantly lower and PaulScores are slightly lower.

Thanks for the analysis. It sounds like BM25 is worse on every front for those languages, then. I guess we can turn it on everywhere except for zh, th, and ja.

@Deskana, oops sorry I made a mistake in the last comment: ZRR of test group is actually lower. Fixed it.

No worries. :-)

What is your recommendation with BM25 on these wikis? The report does not state this explicitly.

ZRR of test group is actually lower. Fixed it.

@dcausse predicted the ZRR would go down because, if I recall correctly, we aren't searching for phrases anymore, so any page that randomly has all the same characters as the query in it will be returned.

It sounds like BM25 is worse on every front for those languages, then. I guess we can turn it on everywhere except for zh, th, and ja.

It would be all the spaceless languages, actually. David's already started configuring it; it includes bo, dz, gan, ja, km, lo, my, th, wuu, zh, zh-classical, and zh-yue, plus the mixed space/spaceless languages: bug, cdo, cr, hak, jv, and zh-min-nan. More details in T149717.

@chelsyx, would it be possible to break down the stats by language? I have some hope for Thai because ElasticSearch does have a Thai tokenizer and we do use it.

What is your recommendation with BM25 on these wikis? The report does not state this explicitly.

We could make it the PM's decision. ;)

On the one hand, we have about 15% of queries getting crappy results instead of nothing, a small but noticeable decrease in the overall quality of results (PaulScore), and a moderate drop in user engagement. First clicked result is "better"—but that may be because people click on the first result of a crappy result set, just hoping for something. (Alternatively, the ZRR decrease is a good thing—but the PaulScore and user engagement numbers say that it is not.)

On the other hand, we have a complex and potentially brittle configuration, since we have to configure "not BM25" for every project for the spaceless languages. If a new project comes up in a spaceless language (say, a new Wiktionary), it may get BM25 turned on anyway because we didn't configure it not to, and there's no per-language config.

On the other other hand, we could just deploy BM25 everywhere and take the moderate but not devastating hit in search quality for these languages. It hurts, but it's easy to do.

On the fourth hand, we could hold off on reindexing projects in these languages right away, on the assumption that @dcausse (or someone) will come up with a reasonable BM25-related approach for these languages—whether that's a better search plan, or refactoring the config to allow per-language configs—in a reasonable amount of time (a quarter?).

I dunno, just spitballing.

PS/EDIT: And of course—thanks for the quick write up, Chelsy!

@chelsyx, @EBernhardson and I chatted about this briefly after the backlog grooming meeting. Putting my analyst hat on for a minute...

Chelsy informs me that the engagement percentage is calculated relative to the total number of people who got results, rather than the total number of queries. That means that the decrease in zero results rate actually indirectly affects the clickthrough rate. After visualising it (see picture below), I realised that a lower clickthrough rate can therefore coincide with more people clicking on results, if the corresponding decrease in the zero results rate outweighs it! Counterintuitive.

Example:
TFIDF: 72% of queries get results, 42% of those click through -> 30% of total queries result in clickthroughs (0.72 * 0.42)
BM25: 89% of queries get results, 37% of those click through -> 33% of total queries result in clickthroughs (0.89 * 0.37)

This means, in our test, more BM25 users actually clicked through than TFIDF users because more of them got results, even though the relative clickthrough rate of the BM25 users was lower.
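
(Spelling out the arithmetic, with the illustrative percentages from the example above rather than exact test figures.)

# Absolute clickthrough = (share of queries with results) * (CTR among those).
tfidf_abs = 0.72 * 0.42
bm25_abs = 0.89 * 0.37
print(f"{tfidf_abs:.2f} {bm25_abs:.2f}")  # 0.30 0.33 -> BM25 wins on absolute clicks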

Putting my product hat back on, it's hard to interpret whether that means users are happier or not. There are two hypotheses:

  • It's irrelevant that the absolute clickthrough rate is higher because users will click on random results if we give them some.
  • The absolute clickthrough rate being higher is more important because more users are satisfied with the results they were given.

In summary, I think we need to use our user engagement metric (clickthrough/dwell) rather than just clickthrough to figure out whether BM25 is better for these wikis. If the engagement metric is the same or higher, it should mean more people are satisfied as a result of the BM25 switchover. If it's lower, then people are clicking the random stuff we're serving them.

IMG_20161117_154203.jpg (3×4 px, 2 MB)

@TJones, I just added the breakdown by wiki: https://wikimedia-research.github.io/Discovery-Search-2ndTest-BM25_jazhth/
Looks like zhwiki has the largest discrepancies between the control and test groups on all metrics... Maybe that has something to do with the tokenizer? At least judging from this example, the Chinese tokenizer performs poorly...

zh: 历史上的今天 (today in history)
comments: looks like a simple unigram model, but the per-character translations seem sane-ish.
tokens: [
  "上", (on)
  "今", (this)
  "历", (calendar)
  "史", (history)
  "天", (day)
  "的" (of)
]

@chelsyx thanks!

Yes, the results do look dependent on tokenizer behavior.
Chinese is the sole language in this test where we do not have a custom analysis chain; we emit only unigrams.
For ja and th we have custom tokenizers, so that could be a probable cause of the difference.

I was not aware that CTR was computed only over queries that returned at least one result; it means that the volume of clicks is higher with BM25. Why do we compute CTR like that? Well, I suppose that if we computed CTR based on the number of queries, it would then show that BM25 has a higher CTR, which wouldn't necessarily be a good indication...

I tend to agree with Dan that more data is needed but I have the feeling that the massive drop in ZRR cannot be without major drawbacks...

Thanks for the by-wiki breakdown, @chelsyx. It doesn't really change things, but it does make me feel better to know that it isn't the case that Japanese and Thai actually did okay but Chinese was so bad that it brought the whole average down.

I added the dwell time per visited page and the proportion of visits with scroll: https://wikimedia-research.github.io/Discovery-Search-2ndTest-BM25_jazhth/#dwell_time_per_visit_page

Still working on the query reformulation...

Thanks for the update, Chelsy!

I'm not sure what to make of the query reformulation numbers. The decrease in reformulations and the slightly better survival curve indicate that people like the results they are getting on some dimension. The decreased engagement could be from showing more crappy results, which are ignored. As long as we can rank results well, returning more crappy results (i.e., lowering the ZRR suspiciously much) isn't terrible, though it isn't great.

Hopefully a new Chinese analyzer will help make things unambiguously less bad. :)

Hey @TJones, @mpopov and @dcausse - would you have time to take a quick run through this new report? Thanks!

@chelsyx thanks!

I've reviewed the analysis and it looks great.
I have only one nitpick, but please ignore it for this analysis. The control group is sometimes on the left and sometimes on the right; I was a bit confused the first time but quickly realized that the colors are consistent: control is always red.

I would add to the conclusion that this test revealed how badly we perform on spaceless languages: some CirrusSearch components depend on the presence of spaces.
Secondly, I'd add that we agree the current behavior (forcing a proximity match on every query) is far from ideal, but given the outcome of the test we decided not to move forward without prior work on this class of languages: better tokenization, plus tracking down and fixing all the components that make the bold assumption that a space is present in order to activate/deactivate features.

Thanks!

Sorry for not reviewing sooner, Chelsy. Overall, everything looks great!

I started reading on my phone, and it looks good on mobile (not that that’s a huge concern, but it’s very nice).

I agree with David about keeping control and test aligned with left and right. The color does solve the problem once you figure it out.

Is it worth noting in the background section that for ZRR smaller is usually (though not always) better, which is then expanded on a bit in the ZRR section's comment about it coming at the expense of relevance?

For PaulScore—since that's not a well-known metric—is it worth adding a bit of explanation about why F=0.9 is more likely to be not significant? In particular, that the smaller the value, the more weight is put on only the first or first few results, with lower-ranking results counting for almost nothing. F=0.9 takes into account the broadest range of result rankings, and thus is less likely to change as dramatically.
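
(A toy sketch of that discounting idea, assuming a click at 1-based rank k contributes on the order of F**k; this shows the shape of the weighting, not the exact PaulScore implementation.)

# Rank weights for two values of the scoring factor F (assumed F**rank weighting).
def rank_weights(F, max_rank=5):
    return [round(F ** k, 3) for k in range(1, max_rank + 1)]

print(rank_weights(0.1))  # [0.1, 0.01, 0.001, 0.0, 0.0]    -> only the top result matters
print(rank_weights(0.9))  # [0.9, 0.81, 0.729, 0.656, 0.59] -> many ranks carry weight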

I’d add an additional comment in the Conclusion explaining why low ZRR is bad, something like this, maybe?

We emit only unigrams, so any page that randomly has all the same characters as the query in it will be returned. This can greatly decrease ZRR without any increase in search result quality or relevance. In English this would be roughly similar to returning any page that has all the same letters as the query.
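
(A toy illustration of that unigram effect, with made-up strings.)

# With only unigrams, a page "matches" as soon as it contains the query's
# characters somewhere, regardless of word boundaries or meaning.
query = set("历史上的今天")             # characters of "today in history"
page = set("上海的今天是历史的一部分")   # a sentence about Shanghai that happens to contain them all
print(query <= page)                    # True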

Thanks for sticking with us through this complicated A/B test!

Reviewed. First draft looks good! PR submitted on GH.

@chelsyx: Good catch on the edit distance and Chinese. I believe @mpopov had to play around with the parameters a bit to find good ones for English query reformulation detection, so the English params may or may not make sense for other languages in general, and certainly not for Chinese and Japanese.

Yup! For future reference, so it's on the record: I picked out about a dozen different search sessions, looked at the dendrogram of the resulting clustering, and picked the threshold that worked best, such that actually similar searches ended up together without pulling in too many searches that are numerically close but not actually similar.

@chelsyx: It looks good to me; very nice work, congrats! (Thanks for regenerating the graphs and the new conclusion.)

@chelsyx: What David said. Looks good! The new graphs are great.

@chelsyx: Looks great! The only minor change I'd suggest is to raise the percentages on Figs 16 & 17 a little bit so they're not right on top of the bars, and then it's done! :D

@chelsyx Thanks! Please also upload the report to Wikimedia Commons; it's important our reports go there so that they're easy to access and refer to from our sites. Let me know if you need assistance with that and I'd be happy to provide it. :-)