
Relax 'AND' operator with the common term query
Closed, ResolvedPublic

Description

When the query contains a lot of words (e.g. questions), the default AND operator is not appropriate because a single missing stopword can hide a good result. We could use the minimum_should_match attribute, which lets us require only a minimal number of terms to match (e.g. 90% of the query terms should match).

There's also another interesting query which does the stopword stripping automagically: the common terms query [1].
In a few words, this query is able to detect stopwords by analyzing word frequencies at query time, so the query:

What's the connection between power laws and zipf distribution

will be split into 2 clauses:

  • connection power laws zipf distribution
  • what's the between and

And we can control the boolean operator of these clauses independently, e.g. OR for high-freq words and AND for low-freq words. Or even more complex stuff like "3<80%" [2]: if there are more than 3 words, only 80% of them are required.

Here's a more readable blog post[3] about Common Terms. And, for reference, ES has stop word lists[4] for >30 languages.
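
For illustration, here's a sketch of what such a query might look like via the Python client (the enwiki index and text field names are placeholders, and the values are just the examples above, not tuned):

```
from elasticsearch import Elasticsearch  # elasticsearch-py, pre-8.x body= style

es = Elasticsearch()

# Terms whose document frequency reaches cutoff_frequency form the "high
# frequency" (stopword-like) group; the rest are "low frequency". The
# boolean operator for each group can be controlled independently.
body = {
    "query": {
        "common": {
            "text": {  # placeholder field name
                "query": "What's the connection between power laws and zipf distribution",
                "cutoff_frequency": 0.001,   # relative document frequency
                "low_freq_operator": "and",  # rare words: all required
                "high_freq_operator": "or",  # stopword-like words: optional
            }
        }
    }
}
results = es.search(index="enwiki", body=body)  # placeholder index name
```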

[1] https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-common-terms-query.html
[2] https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-minimum-should-match.html
[3] https://www.elastic.co/blog/stop-stopping-stop-words-a-look-at-common-terms-query
[4] https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-stop-tokenfilter.html

Event Timeline

EBernhardson raised the priority of this task to Needs Triage.
EBernhardson updated the task description.
EBernhardson subscribed.
Restricted Application added a subscriber: Aklapper.
EBernhardson renamed this task from "Relaxing 'AND' operator with the common term query" to "Relax 'AND' operator with the common term query". (Sep 10 2015, 9:09 PM)
EBernhardson set Security to None.
Deskana renamed this task from "Relax 'AND' operator with the common term query" to "EPIC: Relax 'AND' operator with the common term query". (Sep 14 2015, 8:02 PM)
Deskana triaged this task as High priority.
Deskana added a project: Epic.
Deskana subscribed.

Increasing priority based on discussions held earlier today. This may be our best way to tackle the zero results problem that is our current goal.

Short summary of the intent of this: if you have enough words (for some definition of "enough"), then switching from ANDs to ORs allows you to find more relevant results by not requiring very common words like "the" or "a" to be in the result.

Deskana renamed this task from "EPIC: Relax 'AND' operator with the common term query" to "Relax 'AND' operator with the common term query". (Sep 14 2015, 8:20 PM)
Deskana removed a project: Epic.

Our experimental highlighter does not support CommonTermsQuery properly (T112746). I think I can work around this problem and fix the highlighter later.

@TJones I think I need your help :)

This common terms query is available on suggesty and can be activated by adding &cirrusUseCommonTermsQuery=yes to the query results page URL.
Currently the query is activated if there are at least 4 words in the query (see the default profile in https://gerrit.wikimedia.org/r/#/c/238767/2/profiles/CommonTermsQueryProfiles.php).

I tested two queries and it seems to work well:

In some cases it looks like relevance is a bit better (for question queries):

I'd also like to find sane default values for the profile (cutoff frequency, min should match, ...).
Any ideas on how we could evaluate this feature would be very welcome :)

This is a bit off-topic but somewhat related: the ? character has a special meaning in our syntax today, so if the query ends with a ? the common terms query won't be activated. I don't know what we should do, because queries that end with a ? are likely questions and not special syntax...

@dcausse, you've asked some tough questions!

I have one easy suggestion: cutoff_freq is much much too high at 1% (0.01). Given the Zipfian distribution, you'll get a handful (<10) words that are 1% or more of your corpus. For example, check out the list of words from Project Gutenberg, sorted by freq, with counts per billion.

Only 8 words are 1% or more. Only 94 words hit the 0.1% threshold (0.001), which is what is used in the ES documentation. So, 0.001 might be a reasonable default. And the cutoff for the strict setting should be higher than (or equal to), not lower than, the default. (Unless "0.01" is just a typo for "0.001" all around.)

I'd take 0.001 as the default, keep 0.001 for strict, and go as low as 0.000[567] for aggressive_recall, depending on how aggressive we are feeling.
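
To spell out what a fractional cutoff means, here's my reading of the docs as a sketch (not the actual Lucene code):

```
def is_high_freq(doc_freq: int, doc_count: int, cutoff: float = 0.001) -> bool:
    """With a fractional cutoff_frequency, a term counts as high-frequency
    (stopword-like) when its relative document frequency reaches the
    cutoff. Sketch of my reading of the docs, not the actual Lucene code."""
    return doc_freq / doc_count >= cutoff

# e.g. a word appearing in 1% of documents easily clears a 0.001 cutoff:
assert is_high_freq(10_000, 1_000_000)
```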

Unfortunately, question words may not be as common in articles as they are in questions, and more so depending on the language. But for simplicity's sake, we should stick to raw term frequency for now.

The min_query_terms settings seem reasonable.

I'd lower the high_freq_min_should_match settings, and maybe allow a single high-freq word to be missing, except maybe for strict:

  • default: '0<0 1<50%' (if I've done that right, that's 0 required for 1, and 50% required for 2 or more)
  • strict: '75%' or even '0<0 1<75%' (0 out of 1, or 75% out of 2 or more)
  • aggressive_recall: '0<0 2<25%' (0 out of 1-2, or 25% out of 3 or more)

For low_freq_min_should_match, I like your settings, though I think for aggressive_recall, it should be '2<66%', since that means "for three or more", if I've read the docs right.
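
To double-check how these specs behave, here's a little evaluator of the minimum_should_match mini-language as I read the docs (a hypothetical helper, assuming conditional specs are listed in ascending order of N, as in the examples):

```
def min_should_match(spec: str, optional_clauses: int) -> int:
    """How many optional clauses must match, per my reading of the
    minimum_should_match mini-language: integers, percentages (rounded
    down), negatives (how many may be missing), and space-separated
    'N<spec' conditionals, each valid only for counts greater than N."""
    def simple(s: str, n: int) -> int:
        if s.endswith("%"):
            pct = int(s[:-1])
            matched = n * abs(pct) // 100          # percentages round down
            return matched if pct >= 0 else n - matched
        val = int(s)
        return val if val >= 0 else n + val        # negative = max missing
    parts = spec.split()
    if len(parts) == 1 and "<" not in parts[0]:
        return min(simple(parts[0], optional_clauses), optional_clauses)
    required = optional_clauses                    # count <= smallest N: all required
    for part in parts:
        threshold, sub = part.split("<", 1)
        if optional_clauses > int(threshold):
            required = min(simple(sub, optional_clauses), optional_clauses)
    return required

assert min_should_match("0<0 1<50%", 1) == 0     # default: 0 required of 1
assert min_should_match("0<0 1<50%", 4) == 2     # default: 50% of 4
assert min_should_match("0<0 2<25%", 2) == 0     # aggressive_recall: 0 of 1-2
assert min_should_match("2<-25% 9<-3", 10) == 7  # docs example: 3 may be missing
```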

I've been a bit conservative here. I'm not against being more aggressive about labeling terms as stop words and requiring fewer of them (possibly none) to be present. They really do contribute so little. I think the low_freq_min_should_match settings are where the good stuff is going to happen. But I'm also happy with baby steps.

Is ? a problem? I searched for Who's Afraid of Virginia Woolf? and Who Framed Roger Rabbit? without any trouble. Help:Searching gives an example using a question, how do clocks work?, so I think we're okay there. (Is ? used other than in regexes?)

Evaluation is going to be hard. Zero Results Rate is easy to measure, but is probably too crude of a measure, as the snail slime example illustrates.

We could rig up a harness that would allow us to automatically detect differences in results from various settings. It won't tell us whether the results are better, but will let us tell whether the settings make any significant difference (because if only 0.05% of queries get a different set of results, it's probably not enough).

As discussed before, we could look at differences only in top 5 results, top 10, or whatever. We could also note re-ordering vs new entries. Depending on the effect size and sample size, we could manually look at results with new results (maybe or maybe not for results that are just re-ordered) to judge the value of the changes. For a 1K sample of queries, if 10% of results are affected, that's 100 queries to examine. (Or we could of course just randomly sample 100 queries to investigate, regardless of effect size.) At 2-3 minutes each (which is slow, I think), that's half a day to review them and see if the results are good or not. It'd be tedious, but doable, and would give us a rough idea of whether the results were obviously better or just low-precision junk.
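
A sketch of the core of such a harness (a hypothetical helper that just compares ranked title lists from two configurations and separates re-ordering from genuinely new entries):

```
def diff_results(baseline: list[str], candidate: list[str], top_n: int = 10) -> dict:
    """Compare two ranked result lists for the same query, looking only at
    the top N, and distinguish pure re-ordering from new entries."""
    a, b = baseline[:top_n], candidate[:top_n]
    new_entries = [title for title in b if title not in a]
    return {
        "changed": a != b,
        "reordered_only": a != b and not new_entries,
        "new_entries": new_entries,
    }

# Run this over a 1K query sample per settings profile, then count how many
# queries changed at all before spending any human time judging the results.
```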

Of course, an A/B test with the user satisfaction metric would be the best thing, but I don't know if that's going to happen in time.

Let me know if you want me to dig into anything else!

In other cases the results are maybe worse:

  • without how to learn piano
    • Jazz piano is the first result, but Piano pedagogy is 2nd
  • with how to learn piano
    • Piano is first (better than Jazz piano) but Piano pedagogy is far behind...

Thanks Trey!
I'll look into this tomorrow.

In other cases the results are maybe worse:

I definitely think we should retest with cutoff freq at 0.1%, not 1%. Definitely good to find examples that get worse, though!

@dcausse, you've asked some tough questions!

I have one easy suggestion: cutoff_freq is much much too high at 1% (0.01). Given the Zipfian distribution, you'll get a handful (<10) words that are 1% or more of your corpus. For example, check out the list of words from Project Gutenberg, sorted by freq, with counts per billion.

Sure, this setting is totally wrong.

Only 8 words are 1% or more. Only 94 words hit the 0.1% threshold (0.001), which is what is used in the ES documentation. So, 0.001 might be a reasonable default. And the cutoff for the strict setting should be higher than (or equal to), not lower than, the default. (Unless "0.01" is just a typo for "0.001" all around.)

I'd take 0.001 as the default, keep 0.001 for strict, and go as low as 0.000[567] for aggressive_recall, depending on how aggressive we are feeling.

Let's use 0.0006 for now.

Unfortunately, question words may not be as common in articles as they are in questions, and more so depending on the language. But for simplicity's sake, we should stick to raw term frequency for now.

Yes, otherwise we'd have to include lexical resources, and I don't know where we'd put them currently.

The min_query_terms settings seem reasonable.

I'd lower the high_freq_min_should_match settings, and maybe allow a single high-freq word to be missing, except maybe for strict:

  • default: '0<0 1<50%' (if I've done that right, that's 0 required for 1, and 50% required for 2 or more)
  • strict: '75%' or even '0<0 1<75%' (0 out of 1, or 75% out of 2 or more)
  • aggressive_recall: '0<0 2<25%' (0 out of 1-2, or 25% out of 3 or more)

I think the syntax does not support 0<0 1<50%; I'll use 50% and see if it works as expected.

For low_freq_min_should_match, I like your settings, though I think for aggressive_recall, it should be '2<66%', since that means "for three or more", if I've read the docs right.

I've been a bit conservative here. I'm not against being more aggressive about labeling terms as stop words and requiring fewer of them (possibly none) to be present. They really do contribute so little. I think the low_freq_min_should_match settings are where the good stuff is going to happen. But I'm also happy with baby steps.

Is ? a problem? I searched for Who's Afraid of Virginia Woolf? and Who Framed Roger Rabbit? without any trouble. Help:Searching gives an example using a question, how do clocks work?, so I think we're okay there. (Is ? used other than in regexes?)

Look at the result differences:

This is because when ? is in a query term it triggers a wildcard query. I'm not sure how to deal with that without breaking the existing syntax...
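
Purely as an illustration of one possible escape hatch (a hypothetical heuristic, not something we do today): treat a single query-final ? as punctuation rather than a wildcard, and leave ? alone everywhere else.

```
import re

def strip_trailing_question_mark(query: str) -> str:
    """Hypothetical heuristic: a '?' at the very end of the query is almost
    certainly punctuation (a question), so drop it before wildcard parsing;
    a '?' inside a word (e.g. colo?r) stays a wildcard."""
    return re.sub(r"\?+\s*$", "", query)

assert strip_trailing_question_mark("how do clocks work?") == "how do clocks work"
assert strip_trailing_question_mark("colo?r") == "colo?r"
```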

Evaluation is going to be hard. Zero Results Rate is easy to measure, but is probably too crude of a measure, as the snail slime example illustrates.

We could rig up a harness that would allow us to automatically detect differences in results from various settings. It won't tell us whether the results are better, but will let us tell whether the settings make any significant difference (because if only 0.05% of queries get a different set of results, it's probably not enough).

As discussed before, we could look at differences only in top 5 results, top 10, or whatever. We could also note re-ordering vs new entries. Depending on the effect size and sample size, we could manually look at results with new results (maybe or maybe not for results that are just re-ordered) to judge the value of the changes. For a 1K sample of queries, if 10% of results are affected, that's 100 queries to examine. (Or we could of course just randomly sample 100 queries to investigate, regardless of effect size.) At 2-3 minutes each (which is slow, I think), that's half a day to review them and see if the results are good or not. It'd be tedious, but doable, and would give us a rough idea of whether the results were obviously better or just low-precision junk.

Time has come to build more tools for our relevancy lab :)

Of course, an A/B test with the user satisfaction metric would be the best thing, but I don't know if that's going to happen in time.

Let me know if you want me to dig into anything else!

Another thing that is broken with the current implementation:
We have 2 fields:

  • plain: which uses a basic word tokenizer
  • analyzed: which is analyzed with language-specific features, producing stems and excluding stopwords

I included both fields in the common terms query but I think I should exclude the analyzed field. This also explains why some queries return no results today when a single stopword is missing.
I think the query format we use today will generate something like this for the query what is a stopwords:

(plain:what OR analyzed:what) AND (plain:is OR analyzed:is) AND (plain:a OR analyzed:a) AND (plain:stopwords OR analyzed:stopword)

and, since the analyzed field's analyzer drops the stopwords, this becomes:

(plain:what) AND (plain:is) AND (plain:a) AND (plain:stopwords OR analyzed:stopword)

Making all stopwords required.
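
One way out, sketched under my assumptions (this is not the actual CirrusSearch query builder): keep the common terms query on the plain field only, and give the analyzed field its own clause, since its analyzer already strips the stopwords.

```
search_terms = "what is a stopwords"

# Sketch only: stopword detection via common terms on the plain field; the
# analyzed (stemmed, stopword-free) field gets an ordinary match clause.
query = {
    "bool": {
        "should": [
            {"common": {"plain": {
                "query": search_terms,
                "cutoff_frequency": 0.001,
                "low_freq_operator": "and",
            }}},
            {"match": {"analyzed": {
                "query": search_terms,
                "operator": "and",  # stopwords already removed at analysis time
            }}},
        ],
        "minimum_should_match": 1,
    }
}
```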

Change 238767 had a related patch set uploaded (by DCausse):
WIP: Add support for CommonTermsQuery

https://gerrit.wikimedia.org/r/238767

I think the syntax does not support 0<0 1<50%; I'll use 50% and see if it works as expected.

It seems to be supported, per the current documentation for Minimum Should Match (and it's been supported since at least 1.4):

Multiple combinations (e.g. 2<-25% 9<-3): multiple conditional specifications can be separated by spaces, each one only being valid for numbers greater than the one before it.

Look at the result differences:

This is because when ? is in a query term it triggers a wildcard query. I'm not sure how to deal with that without breaking the existing syntax...

Aha! I finally figured out what's happening. I had conflated wildcards and regexes and was expecting this to behave like a regex ? (which it does not) rather than a wildcard ?.

The Help:Searching docs are wrong then, in that they don't mention ? as a wildcard, and suggest it as an option for searching for a question!

Time has come to build more tools for our relevancy lab :)

You are never going to hear me disagree with that!

Another thing that is broken with the current implementation:
We have 2 fields:

  • plain: which uses a basic word tokenizer
  • analyzed: which is analyzed with language-specific features, producing stems and excluding stopwords

I included both fields in the common terms query but I think I should exclude the analyzed field. This also explains why some queries return no results today when a single stopword is missing.
I think the query format we use today will generate something like this for the query what is a stopwords:

(plain:what OR analyzed:what) AND (plain:is OR analyzed:is) AND (plain:a OR analyzed:a) AND (plain:stopwords OR analyzed:stopword)

and, since the analyzed field's analyzer drops the stopwords, this becomes:

(plain:what) AND (plain:is) AND (plain:a) AND (plain:stopwords OR analyzed:stopword)

Making all stopwords required.

That's just weird, and I don't understand. Isn't Common Terms more sophisticated than just AND-ing and OR-ing, since it allows minimum_should_match? Do we use the analyzed field now? Searching on wanted or wants in enwiki does not give results for want; it seems like the docs are out of date here, too, since stemming: doesn't work, either.

I think the syntax does not support 0<0 1<50%; I'll use 50% and see if it works as expected.

It seems to here in the current documentation for Minimum Should Match (and it's been supported since at least 1.4):

Multiple combinations (e.g. 2<-25% 9<-3): multiple conditional specifications can be separated by spaces, each one only being valid for numbers greater than the one before it.

Thanks, I overlooked this section; I'll try it.

Look at the result differences:

This is because when ? is in a query term it triggers a wildcard query. I'm not sure how to deal with that without breaking the existing syntax...

Aha! I finally figured out what's happening. I had conflated wildcards and regexes and was expecting this to behave like a regex ? (which it does not) rather than a wildcard ?.

The Help:Searching docs are wrong then, in that they don't mention ? as a wildcard, and suggest it as an option for searching for a question!

Yes, the doc is either outdated or wrong :/

Making all stopwords required.

That's just weird, and I don't understand. Isn't Common Terms more sophisticated than just AND-ing and OR-ing, since it allows minimum_should_match? Do we use the analyzed field now? Searching on wanted or wants in enwiki does not give results for want; it seems like the docs are out of date here, too, since stemming: doesn't work, either.

Yes, we are using stemming today, but with a default AND and apparently this weird behavior of all stopwords being required. A simple query wants should also match want; I'll try to find an example.

I updated suggesty with all your suggestions and made a special clause for the stemmed field with a default AND. The results are a bit worse; the zipf query still returns results, but far fewer :(. Maybe I should experiment with min should match on the stemmed field...

Thanks!

The results are a bit worse...

So glad to be of help! Just kidding. Dang.

This is why we need more relevance lab and a bit less intuition.

Your suggestions are all good :)
I just forgot to mention that it's because my previous implementation was totally wrong, and the interesting facts about kennedy assassinations query was returning too many results. Now, after adding a default AND, some interesting results are filtered out by these not-so-common question words...

Change 238767 merged by jenkins-bot:
Add support for CommonTermsQuery

https://gerrit.wikimedia.org/r/238767