
From Zero To Hero 2: Electric Boogaloo
Closed, ResolvedPublic6 Story Points

Description

Background

In T128118, we performed an analysis of how query features affect the outcome of the query (zero results vs some results) using the variable importance feature of random forest classification. The report inspired Trey to look into the problem of question marks in greater detail, and has led to us stripping question marks from queries (see T133711 for more details). It will be interesting to see which features float up to the top now that we have eliminated the question mark as a major influencer on zero results rate.

Objective

In this task, you will perform an analysis of search queries to determine which features affect the likelihood of getting zero results. You are welcome to use random forests, logistic regression, and/or any other methodology to answer the question.
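
For illustration, a minimal sketch of this kind of analysis using scikit-learn on synthetic data; the feature names and the fake outcome are hypothetical stand-ins, not the real query features:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real search logs (column names are illustrative).
rng = np.random.default_rng(0)
n = 10_000
queries = pd.DataFrame({
    "has_quotes": rng.integers(0, 2, n),
    "has_question_mark": rng.integers(0, 2, n),
    "n_terms": rng.integers(1, 21, n),
    "n_chars": rng.integers(1, 41, n),
})
# Fake outcome: longer and quoted queries are more likely to get zero results.
p_zero = 0.05 + 0.02 * queries["n_terms"] + 0.1 * queries["has_quotes"]
queries["zero_results"] = rng.random(n) < p_zero

X = queries.drop(columns="zero_results")
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X, queries["zero_results"])

# Variable importance: which features drive the zero-results outcome.
importance = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importance)
```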

Optional

We can also import TSS2 data into Hive to join with the search logs (see P4095 for more details), which would let you investigate the relationship between query features and clickthrough and answer questions like "when users perform advanced searches and get results, do they click more often than users who perform simple searches?" Ask Mikhail to help you with getting the data if you choose to do this.
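
For illustration, a minimal sketch of the comparison this join would enable, on toy data with hypothetical column names (`is_advanced`: the query used special syntax; `clicked`: the searcher clicked a result):

```python
import pandas as pd

# Toy stand-in for the joined search-log + TSS2 table.
joined = pd.DataFrame({
    "is_advanced": [True, True, False, False, False, True],
    "clicked":     [True, False, True, False, True, True],
})

# Clickthrough rate for simple vs. advanced searches that returned results.
ctr = joined.groupby("is_advanced")["clicked"].mean()
print(ctr)
```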

Tips & Links

Event Timeline

mpopov created this task.Oct 3 2016, 6:05 PM
Restricted Application added a subscriber: Aklapper.Oct 3 2016, 6:05 PM
debt added a comment.Oct 3 2016, 10:49 PM

Let's chat about this during tomorrow's sprint planning meeting! :)

Is there also a task summary available that would allow non-team members to get a very basic idea what this task is about? :)

The background section seems to be a reasonable task summary? Although as a team member I already have a good idea what this is about...

debt added a comment.Oct 4 2016, 7:23 PM

This ticket is the third in a series of 'onboarding' type of tasks for @chelsyx to complete to get her up to speed on things at the Foundation and within the Discovery team.

debt triaged this task as Normal priority.Oct 4 2016, 8:25 PM

A user has a 1 in 66 chance of being selected for search satisfaction tracking according to our TestSearchSatisfaction2 schema (#15700292).

Minor comment, not a big deal, but typically sampling for the TestSearchSatisfaction2 schema is 1:200. When we run A/B tests this sampling rate is increased, and other factors are adjusted so that users in the A/B test have the subTest field set, while users without the subTest field set remain sampled at approximately 1:200.
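
For illustration, rate-based sampling of this kind amounts to something like the following sketch (the general idea only, not the actual EventLogging implementation):

```python
import random

def in_sample(rate: int = 200) -> bool:
    """Return True for roughly 1 in `rate` sessions."""
    return random.randrange(rate) == 0

# Roughly 0.5% of sessions should be selected at the default 1:200 rate.
sampled = sum(in_sample() for _ in range(100_000))
print(f"{sampled} of 100000 sessions sampled")
```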

debt added a comment.Nov 2 2016, 5:25 PM

Hi @chelsyx - about the 1 in 66 users being selected, when it should have been 1 in 200... does that mean we'll have to revisit this analysis?

Also, the scatter plot images for the variable importance information are a bit hard to read even when viewed at full size. Can they be re-arranged so the labels are a bit more separated in the image?

@debt No, we don't have to revisit the analysis. Thanks @EBernhardson ! :)
I will try to fix the labels too!

@debt Do I need to create a pdf of this report and upload it to commons?

debt added a comment.Nov 4 2016, 1:57 PM

Hi @chelsyx - yes, we'll need to get it into a pdf format as well. But, let's give everyone a chance to take a look at the second draft first.

TJones added a subscriber: TJones.Nov 8 2016, 5:06 PM

This looks great! I sent some minor suggestions to @chelsyx by email.

One interesting thing I noticed is that increased query length—measured in characters or terms—stands out as being fairly predictive of poor performance, which makes sense.

Since we don't have a small, fixed user base, we can't easily educate everyone who uses CirrusSearch about the things they can do to make their searches better, so we try to do those things for them. We now ignore "normal" question marks instead of treating them as wildcards, and we are talking about automatically removing quotes and re-searching when quoted queries get no results.

A couple of other ideas that spring to mind:

  • We probably only want to take a search with quotes and re-do it without quotes when it gets zero results, but we could offer a "would you like to search without quotes" link when there are few results (< 3 has been our standard cutoff); see the sketch after this list.
  • For long queries, we could try to offset the long-query problem by re-issuing the query using Common Terms or something similar, at least for "simple" queries (i.e., with no special syntax).
  • Alternatively, for long, simple queries that get few results (say, < 3 again) we could have a short, carefully worded suggestion on how to pull out important terms, maybe with a link to a little lesson about what words are useful to search and which are not, which could in turn link to full search docs. (Translations are an issue here, too, I know.)
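
A rough sketch of the first idea, with hypothetical `run_search` and `offer_unquoted_link` placeholders standing in for the real search plumbing (these are not CirrusSearch APIs):

```python
FEW_RESULTS_CUTOFF = 3  # the "< 3" standard cutoff mentioned above

def run_search(query: str) -> list:
    """Placeholder for issuing a search query; returns a list of results."""
    return []

def offer_unquoted_link(query: str) -> None:
    """Placeholder for showing a 'search without quotes' link to the user."""
    print(f'Would you like to search for {query} without quotes?')

def search_with_quote_fallback(query: str) -> list:
    results = run_search(query)
    unquoted = query.replace('"', "")
    if '"' in query and not results:
        # Zero results for a quoted query: silently retry without the quotes.
        return run_search(unquoted)
    if '"' in query and len(results) < FEW_RESULTS_CUTOFF:
        # Few results: keep them, but offer an unquoted search as a link.
        offer_unquoted_link(unquoted)
    return results
```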

Do people read docs? I don't know. But it seems that we have two basic options to improve search results: improve the search engine (which we try to do all the time) and improve the queries (by teaching the searchers). If we can detect features of searches that are likely to cause poor results, we should try to automatically fix them when we can (e.g., removing quotes) and try to teach people how to generate better queries when automatic fixes are too complex (e.g., when the query is very long).

</soapbox>

Very instructive, thanks!

Now I'm very curious about the number of terms/chars features you added in the report. Would it be possible (not in this report) to have fine-grained data about this? I think this would help to design what Trey suggests.

Minor comment on the report:

  • Maybe we should indicate the scope of the data? Was it from all wikis/languages or a particular wiki?

Thank you!

Third draft is up: https://wikimedia-research.github.io/Discovery-Search-QueryFeatures-201610/

@mpopov @debt, if you all approve, I will start making a PDF version of this report.

@TJones and @dcausse, thanks for your comments! The report has been modified! :)

@dcausse, here are partial breakdowns by number of features, number of terms, and number of characters:

| Query Type | Number of Features per Query | Queries with some results | Queries with zero results | Queries |
|------------|------------------------------|---------------------------|---------------------------|---------|
| full_text | 1 | 86.8% | 13.2% | 731.09K |
| full_text | 2 | 76.1% | 23.9% | 2.53K |
| full_text | 3 | 75.2% | 24.8% | 500 |
| full_text | 4 | 54.5% | 45.5% | 11 |
| full_text | 5 | 10.0% | 90.0% | 10 |

| Query Type | Number of Terms per Query | Queries with some results | Queries with zero results | Queries |
|------------|---------------------------|---------------------------|---------------------------|---------|
| full_text | 1 | 76.9% | 23.1% | 196.67K |
| full_text | 2 | 92.6% | 7.4% | 261.71K |
| full_text | 3 | 91.7% | 8.3% | 131.87K |
| full_text | 4 | 89.6% | 10.4% | 65.91K |
| full_text | 5 | 86.6% | 13.4% | 34.61K |
| full_text | 6 | 83.2% | 16.8% | 18.05K |
| full_text | 7 | 79.7% | 20.3% | 10.28K |
| full_text | 8 | 76.2% | 23.8% | 5.9K |
| full_text | 9 | 71.7% | 28.3% | 3.66K |
| full_text | 10 | 63.8% | 36.2% | 1.85K |
| full_text | 11 | 60.4% | 39.6% | 1.08K |
| full_text | 12 | 61.2% | 38.8% | 683 |
| full_text | 13 | 57.3% | 42.7% | 445 |
| full_text | 14 | 54.8% | 45.2% | 336 |
| full_text | 15 | 47.5% | 52.5% | 200 |
| full_text | 16 | 49.7% | 50.3% | 151 |
| full_text | 17 | 39.0% | 61.0% | 118 |
| full_text | 18 | 35.3% | 64.7% | 102 |
| full_text | 19 | 40.0% | 60.0% | 70 |
| full_text | 20 | 42.3% | 57.7% | 52 |

| Query Type | Number of Characters per Query | Queries with some results | Queries with zero results | Queries |
|------------|--------------------------------|---------------------------|---------------------------|---------|
| full_text | 0 | 100.0% | 0.0% | 5 |
| full_text | 1 | 86.9% | 13.1% | 711 |
| full_text | 2 | 92.5% | 7.5% | 5.06K |
| full_text | 3 | 85.8% | 14.2% | 11.16K |
| full_text | 4 | 84.3% | 15.7% | 18.07K |
| full_text | 5 | 87.3% | 12.7% | 22.13K |
| full_text | 6 | 86.5% | 13.5% | 26.27K |
| full_text | 7 | 85.1% | 14.9% | 29.44K |
| full_text | 8 | 82.6% | 17.4% | 30.63K |
| full_text | 9 | 82.8% | 17.2% | 33.09K |
| full_text | 10 | 84.6% | 15.4% | 35.89K |
| full_text | 11 | 87.5% | 12.5% | 38.99K |
| full_text | 12 | 89.4% | 10.6% | 40.45K |
| full_text | 13 | 90.1% | 9.9% | 40.03K |
| full_text | 14 | 90.3% | 9.7% | 38.83K |
| full_text | 15 | 90.3% | 9.7% | 35.91K |
| full_text | 16 | 90.3% | 9.7% | 32.89K |
| full_text | 17 | 90.1% | 9.9% | 30.11K |
| full_text | 18 | 90.3% | 9.7% | 27.22K |
| full_text | 19 | 90.0% | 10.0% | 24.02K |
| full_text | 20 | 89.6% | 10.4% | 21.41K |
| full_text | 21 | 89.4% | 10.6% | 19.7K |
| full_text | 22 | 88.9% | 11.1% | 17.52K |
| full_text | 23 | 88.1% | 11.9% | 15.68K |
| full_text | 24 | 88.5% | 11.5% | 14.01K |
| full_text | 25 | 87.7% | 12.3% | 12.65K |
| full_text | 26 | 86.3% | 13.7% | 11.29K |
| full_text | 27 | 86.5% | 13.5% | 10.12K |
| full_text | 28 | 86.1% | 13.9% | 9.16K |
| full_text | 29 | 85.7% | 14.3% | 8.14K |
| full_text | 31 | 83.5% | 16.5% | 6.53K |
| full_text | 32 | 100.0% | 0.0% | 4.81K |
| full_text | 33 | 100.0% | 0.0% | 4.22K |
| full_text | 34 | 100.0% | 0.0% | 3.85K |
| full_text | 35 | 100.0% | 0.0% | 3.48K |
| full_text | 36 | 100.0% | 0.0% | 2.99K |
| full_text | 37 | 100.0% | 0.0% | 2.73K |
| full_text | 38 | 100.0% | 0.0% | 2.36K |
| full_text | 39 | 100.0% | 0.0% | 2.04K |
| full_text | 40 | 100.0% | 0.0% | 1.98K |
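
For reference, a breakdown like the above can be computed with a simple groupby; a minimal pandas sketch on toy data, with illustrative column names rather than the real search-log schema:

```python
import pandas as pd

# Toy stand-in for the search logs.
searches = pd.DataFrame({
    "n_terms": [1, 2, 2, 3, 1, 2],
    "zero_results": [True, False, False, True, False, False],
})

# Zero-results rate and query count per number of terms.
breakdown = searches.groupby("n_terms")["zero_results"].agg(
    zero_results_rate="mean", queries="size"
)
breakdown["some_results_rate"] = 1 - breakdown["zero_results_rate"]
print(breakdown)
```
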
debt added a comment.Nov 9 2016, 3:55 PM

Hi @dcausse and @TJones - does the above detailed data help enough or would you like more information?

Thanks for the quick work, @chelsyx, we'll take a look at version 3!

TJones added a comment (edited).Nov 9 2016, 5:01 PM

Hi @chelsyx! @dcausse and @EBernhardson and I have been talking about this and have a few questions:

  • We wanted to verify that this is across all Wikipedias (all projects?) and not just English or just English Wikipedia.
    • Assuming this is multi-lingual data, in the future, would it be possible to break it down by language of the project for some list of languages (top 10, plus a few "interesting" non-Latin ones)? (We're not trying to add more work for this report.)
  • In the list by "Number of Terms per Query", how did you tokenize terms? We're wondering if Chinese, Japanese, and other spaceless languages have multi-word queries that are being counted as one term.
  • Under "Number of Characters per Query" what's going on when you get to 32+? They all have 0% ZRR. Is that right?
  • Under "Number of features"...
    • Trey wants to know if it is possible to not count "is_simple" as a feature? It's really "has no other features", right?
    • David is concerned about how "has_non-ASCII" interacts with non-Latin languages (Russian, Chinese) and languages that have lots of diacritics (French, German). Not sure what to do about it, though.

Hi @TJones, @dcausse and @EBernhardson !

The full tables are too long, so I put them here: https://github.com/wikimedia-research/Discovery-Search-QueryFeatures-201610/blob/master/EDA_nterm_nchar_nfeat.md
Links to individual tables:
Breakdown by Number of Features per Query (Not counting "is simple")
Breakdown by Number of Terms per Query
Breakdown by Number of Characters per Query

To answer your questions:

  • We wanted to verify that this is across all Wikipedias (all projects?) and not just English or just English Wikipedia.

Yes. Here are the ZRR by languages and projects in this sample: https://github.com/wikimedia-research/Discovery-Search-QueryFeatures-201610/blob/master/EDA_nterm_nchar_nfeat.md#zrr-by-languages-and-projects

  • Assuming this is multi-lingual data, in the future, would it be possible to break it down by language of the project for some list of languages (top 10, plus a few "interesting" non-Latin ones)? (We're not trying to add more work for this report.)

I tried adding language as a training feature to the model. To my surprise, it didn't improve the model's performance while hugely increasing the computation time, so I excluded it from the final model. Perhaps some preprocessing of the language variable would help (e.g., as you suggested, picking the top 10, plus a few "interesting" non-Latin ones).
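
For illustration, a minimal sketch of that preprocessing idea: collapse the language column to the top N languages plus "other" before one-hot encoding, so the model sees a few indicator columns instead of hundreds of levels (column names are illustrative):

```python
import pandas as pd

TOP_N = 10  # e.g. top 10, plus a few hand-picked "interesting" non-Latin ones

# Toy stand-in for the per-query language column.
queries = pd.DataFrame({"language": ["en", "de", "zh", "en", "fr", "ru", "en"]})
top_languages = queries["language"].value_counts().nlargest(TOP_N).index
queries["language_group"] = queries["language"].where(
    queries["language"].isin(top_languages), "other"
)

# One-hot encode the collapsed language groups as model features.
X_lang = pd.get_dummies(queries["language_group"], prefix="lang")
print(X_lang.head())
```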

  • In the list by "Number of Terms per Query", how did you tokenize terms? We're wondering if Chinese, Japanese, and other spaceless languages have multi-word queries that are being counted as one term.

All queries were split on spaces to count the number of terms. And yeah... that's not correct for Chinese and other spaceless languages, so many multi-word queries were counted as one term: https://github.com/wikimedia-research/Discovery-Search-QueryFeatures-201610/blob/master/EDA_nterm_nchar_nfeat.md#proportion-of-searches-by-number-of-terms-per-query-in-chinese-and-english
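
A quick illustration of the issue:

```python
# Splitting on spaces works for English but counts a multi-word
# Chinese query as a single term.
english = "machine learning algorithms"
chinese = "机器学习算法"  # roughly "machine learning algorithms", no spaces

print(len(english.split()))  # 3 terms
print(len(chinese.split()))  # 1 "term", though it contains several words
```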

  • Under "Number of Characters per Query" what's going on when you get to 32+? They all have 0% ZRR. Is that right?

No... I'm so sorry I made some mistakes yesterday. This is the correct one: https://github.com/wikimedia-research/Discovery-Search-QueryFeatures-201610/blob/master/EDA_nterm_nchar_nfeat.md#breakdown-by-number-of-characters-per-query

  • Under "Number of features"...
    • Trey wants to know if it is possible to not count "is_simple" as a feature? It's really "has no other features", right?

Yes. I updated the table: https://github.com/wikimedia-research/Discovery-Search-QueryFeatures-201610/blob/master/EDA_nterm_nchar_nfeat.md#breakdown-by-number-of-features-per-query-not-counting-is-simple

  • David is concerned about how "has_non-ASCII" interacts with non-Latin languages (Russian, Chinese) and languages that have lots of diacritics (French, German). Not sure what to do about it, though.

Queries having non-ASCII by Languages: https://github.com/wikimedia-research/Discovery-Search-QueryFeatures-201610/blob/master/EDA_nterm_nchar_nfeat.md#queries-having-non-ascii-by-languages

Please let me know if you have any other questions! :)

@chelsyx , thanks so much for all this—and for pulling it together so quickly! I'm glad to see our hypotheses mostly being borne out in the data.

The spike in 2-word queries over 1-word, and the higher ZRR for 1-word queries are a bit unexpected, and thus more interesting.

The space-based skew of Chinese vs English is what we expected, but still nice to see.

David suggested possibly setting up some sort of service that would let you tokenize queries (at least as well as Elasticsearch does), which could give better word counts for some spaceless languages. It wasn't clear how easy it would be, but it's definitely something we could look at if there's interest.
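
For illustration, such a service could wrap Elasticsearch's _analyze API; a minimal sketch, where the endpoint is real but the host, port, and analyzer choice are assumptions (e.g. "smartcn" requires the analysis-smartcn plugin):

```python
import json
import urllib.request

def tokenize(text: str, analyzer: str = "smartcn") -> list:
    """Tokenize `text` via a local Elasticsearch node's _analyze API."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    req = urllib.request.Request(
        "http://localhost:9200/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return [t["token"] for t in json.loads(resp.read())["tokens"]]

# e.g. tokenize("机器学习算法") would return word-level tokens, giving a more
# realistic term count for spaceless languages than splitting on spaces.
```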

And thanks for putting together the non-ASCII table based on our not at all specific concerns. That'll give us something to chew on for a while, too.

Very much appreciate all of this!

You are very welcome @TJones ! :)

@chelsyx: Third draft looks great! Good job!

Thanks @mpopov !

@debt, here is the pdf version of the report (attached). I will upload it to commons if it looks good to you. :)

Looks great, @chelsyx - upload away!