
Relax 'AND' operator with the common term query
Closed, ResolvedPublic

Description

When the query contains a lot of words (e.g. questions), the default AND operator is not appropriate because a single missing stopword can hide a good result. We could use the minimum_should_match attribute, which lets us require only a minimal number of terms to match (e.g. 90% of the query terms should match).

There's also another interesting query which does the stopword stripping automagically: the common terms query [1].
In a few words, this query is able to detect stopwords by analyzing word frequencies at query time, so the query:

What's the connection between power laws and zipf distribution

will be split into 2 clauses:

  • connection power laws zipf distribution
  • what's the between and

And we can control the boolean operator of these clauses independently, e.g. OR for high-freq words and AND for low-freq words. Or even more complex stuff like "3<80%" [2]: if there are more than 3 words, only 80% of them are required.

Here's a more readable blog post[3] about Common Terms. And, for reference, ES has stop word lists[4] for >30 languages.
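
For illustration, here's a sketch of what such a query might look like via the Python client (the enwiki index and text field names are placeholders, and the values are just the examples above, not tuned):

```
from elasticsearch import Elasticsearch  # elasticsearch-py, pre-8.x body= style

es = Elasticsearch()

# Terms whose document frequency reaches cutoff_frequency form the "high
# frequency" (stopword-like) group; the rest are "low frequency". The
# boolean operator for each group can be controlled independently.
body = {
    "query": {
        "common": {
            "text": {  # placeholder field name
                "query": "What's the connection between power laws and zipf distribution",
                "cutoff_frequency": 0.001,   # relative document frequency
                "low_freq_operator": "and",  # rare words: all required
                "high_freq_operator": "or",  # stopword-like words: optional
            }
        }
    }
}
results = es.search(index="enwiki", body=body)  # placeholder index name
```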

[1] https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-common-terms-query.html
[2] https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-minimum-should-match.html
[3] https://www.elastic.co/blog/stop-stopping-stop-words-a-look-at-common-terms-query
[4] https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-stop-tokenfilter.html

Event Timeline

EBernhardson raised the priority of this task to Needs Triage.
EBernhardson updated the task description.
EBernhardson subscribed.
Restricted Application added a subscriber: Aklapper.
EBernhardson renamed this task from "Relaxing 'AND' operator with the common term query" to "Relax 'AND' operator with the common term query". (Sep 10 2015, 9:09 PM)
EBernhardson set Security to None.
Deskana renamed this task from "Relax 'AND' operator with the common term query" to "EPIC: Relax 'AND' operator with the common term query". (Sep 14 2015, 8:02 PM)
Deskana triaged this task as High priority.
Deskana added a project: Epic.
Deskana subscribed.

Increasing priority based on discussions held earlier today. This may be our best way to tackle the zero results problem that is our current goal.

Short summary of the intent of this: if you have enough words (for some definition of "enough"), then switching from ANDs to ORs allows you to find more relevant results by not requiring very common words like "the" or "a" to be in the result.

Deskana renamed this task from "EPIC: Relax 'AND' operator with the common term query" to "Relax 'AND' operator with the common term query". (Sep 14 2015, 8:20 PM)
Deskana removed a project: Epic.

Our experimental highlighter does not support CommonTermsQuery properly (T112746). I think I can work around this problem and fix the highlighter later.

@TJones I think I need your help :)

This common terms query is available on suggesty and can be activated by adding &cirrusUseCommonTermsQuery=yes to the query results page URL.
Currently the query is activated if there are at least 4 words in the query (see the default profile in https://gerrit.wikimedia.org/r/#/c/238767/2/profiles/CommonTermsQueryProfiles.php).

I tested two queries and it seems to work well:

In some cases it looks like relevance is a bit better (for question queries):

I'd also like to find sane default values for the profile (cutoff frequency, min should match, ...).
Any ideas on how we could evaluate this feature would be very welcome :)

This is a bit off-topic but somewhat related: the ? character has a special meaning in our syntax today, so if the query ends with a ? the common terms query won't be activated. I don't know what we should do, because queries that end with a ? are likely questions and not special syntax...

@dcausse, you've asked some tough questions!

I have one easy suggestion: cutoff_freq is much much too high at 1% (0.01). Given the Zipfian distribution, you'll get a handful (<10) words that are 1% or more of your corpus. For example, check out the list of words from Project Gutenberg, sorted by freq, with counts per billion.

Only 8 words are 1% or more. Only 94 words hit the 0.1% threshold (0.001), which is what is used in the ES documentation. So, 0.001 might be a reasonable default. And the cutoff for the strict setting should be higher than (or equal to), not lower than, the default. (Unless "0.01" is just a typo for "0.001" all around.)

I'd take 0.001 as the default, keep 0.001 for strict, and go as low as 0.000[567] for aggressive_recall, depending on how aggressive we are feeling.
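
To spell out what a fractional cutoff means, here's my reading of the docs as a sketch (not the actual Lucene code):

```
def is_high_freq(doc_freq: int, doc_count: int, cutoff: float = 0.001) -> bool:
    """With a fractional cutoff_frequency, a term counts as high-frequency
    (stopword-like) when its relative document frequency reaches the
    cutoff. Sketch of my reading of the docs, not the actual Lucene code."""
    return doc_freq / doc_count >= cutoff

# e.g. a word appearing in 1% of documents easily clears a 0.001 cutoff:
assert is_high_freq(10_000, 1_000_000)
```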

Unfortunately, question words may not be as common in articles as they are in questions, and more so depending on the language. But for simplicity's sake, we should stick to raw term frequency for now.

The min_query_terms settings seem reasonable.

I'd lower the high_freq_min_should_match settings, and maybe allow a single high-freq word to be missing, except maybe for strict:

  • default: '0<0 1<50%' (if I've done that right, that's 0 required for 1, and 50% required for 2 or more)
  • strict: '75%' or even '0<0 1<75%' (0 out of 1, or 75% out of 2 or more)
  • aggressive_recall: '0<0 2<25%' (0 out of 1-2, or 25% out of 3 or more)

For low_freq_min_should_match, I like your settings, though I think for aggressive_recall, it should be '2<66%', since that means "for three or more", if I've read the docs right.
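
To double-check how these specs behave, here's a little evaluator of the minimum_should_match mini-language as I read the docs (a hypothetical helper, assuming conditional specs are listed in ascending order of N, as in the examples):

```
def min_should_match(spec: str, optional_clauses: int) -> int:
    """How many optional clauses must match, per my reading of the
    minimum_should_match mini-language: integers, percentages (rounded
    down), negatives (how many may be missing), and space-separated
    'N<spec' conditionals, each valid only for counts greater than N."""
    def simple(s: str, n: int) -> int:
        if s.endswith("%"):
            pct = int(s[:-1])
            matched = n * abs(pct) // 100          # percentages round down
            return matched if pct >= 0 else n - matched
        val = int(s)
        return val if val >= 0 else n + val        # negative = max missing
    parts = spec.split()
    if len(parts) == 1 and "<" not in parts[0]:
        return min(simple(parts[0], optional_clauses), optional_clauses)
    required = optional_clauses                    # count <= smallest N: all required
    for part in parts:
        threshold, sub = part.split("<", 1)
        if optional_clauses > int(threshold):
            required = min(simple(sub, optional_clauses), optional_clauses)
    return required

assert min_should_match("0<0 1<50%", 1) == 0     # default: 0 required of 1
assert min_should_match("0<0 1<50%", 4) == 2     # default: 50% of 4
assert min_should_match("0<0 2<25%", 2) == 0     # aggressive_recall: 0 of 1-2
assert min_should_match("2<-25% 9<-3", 10) == 7  # docs example: 3 may be missing
```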

I've been a bit conservative here. I'm not against being more aggressive about labeling terms as stop words and requiring fewer of them (possibly none) to be present. They really do contribute so little. I think the low_freq_min_should_match settings are where the good stuff is going to happen. But I'm also happy with baby steps.

Is ? a problem? I searched for Who's Afraid of Virginia Woolf? and Who Framed Roger Rabbit? without any trouble. Help:Searching gives an example using a question, how do clocks work?, so I think we're okay there. (Is ? used other than in regexes?)

Evaluation is going to be hard. Zero Results Rate is easy to measure, but is probably too crude of a measure, as the snail slime example illustrates.

We could rig up a harness that would allow us to automatically detect differences in results from various settings. It won't tell us whether the results are better, but will let us tell whether the settings make any significant difference (because if only 0.05% of queries get a different set of results, it's probably not enough).

As discussed before, we could look at differences only in top 5 results, top 10, or whatever. We could also note re-ordering vs new entries. Depending on the effect size and sample size, we could manually look at results with new results (maybe or maybe not for results that are just re-ordered) to judge the value of the changes. For a 1K sample of queries, if 10% of results are affected, that's 100 queries to examine. (Or we could of course just randomly sample 100 queries to investigate, regardless of effect size.) At 2-3 minutes each (which is slow, I think), that's half a day to review them and see if the results are good or not. It'd be tedious, but doable, and would give us a rough idea of whether the results were obviously better or just low-precision junk.
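
A sketch of the core of such a harness (a hypothetical helper that just compares ranked title lists from two configurations and separates re-ordering from genuinely new entries):

```
def diff_results(baseline: list[str], candidate: list[str], top_n: int = 10) -> dict:
    """Compare two ranked result lists for the same query, looking only at
    the top N, and distinguish pure re-ordering from new entries."""
    a, b = baseline[:top_n], candidate[:top_n]
    new_entries = [title for title in b if title not in a]
    return {
        "changed": a != b,
        "reordered_only": a != b and not new_entries,
        "new_entries": new_entries,
    }

# Run this over a 1K query sample per settings profile, then count how many
# queries changed at all before spending any human time judging the results.
```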

Of course, an A/B test with the user satisfaction metric would be the best thing, but I don't know if that's going to happen in time.

Let me know if you want me to dig into anything else!

In other cases the results are maybe worse:

  • without how to learn piano
    • Jazz piano is the first result, but Piano pedagogy is 2nd
  • with how to learn piano
    • Piano is first (better than Jazz piano) but Piano pedagogy is far behind...

Thanks Trey!
I'll look into this tomorrow.

In other cases the results are maybe worse:

I definitely think we should retest with cutoff freq at 0.1%, not 1%. Definitely good to find examples that get worse, though!

@dcausse, you've asked some tough questions!

I have one easy suggestion: cutoff_freq is much much too high at 1% (0.01). Given the Zipfian distribution, you'll get a handful (<10) words that are 1% or more of your corpus. For example, check out the list of words from Project Gutenberg, sorted by freq, with counts per billion.

Sure, this setting is totally wrong.

Only 8 words are 1% or more. Only 94 words hit the 0.1% threshold (0.001), which is what is used in the ES documentation. So, 0.001 might be a reasonable default. And the cutoff for the strict setting should be higher than (or equal to), not lower than, the default. (Unless "0.01" is just a typo for "0.001" all around.)

I'd take 0.001 as the default, keep 0.001 for strict, and go as low as 0.000[567] for aggressive_recall, depending on how aggressive we are feeling.

Let's use 0.0006 for now.

Unfortunately, question words may not be as common in articles as they are in questions, and more so depending on the language. But for simplicity's sake, we should stick to raw term frequency for now.

Yes, otherwise we'd have to include lexical resources, and I don't know where we'd put them currently.

The min_query_terms settings seem reasonable.

I'd lower the high_freq_min_should_match settings, and maybe allow a single high-freq word to be missing, except maybe for strict:

  • default: '0<0 1<50%' (if I've done that right, that's 0 required for 1, and 50% required for 2 or more)
  • strict: '75%' or even '0<0 1<75%' (0 out of 1, or 75% out of 2 or more)
  • aggressive_recall: '0<0 2<25%' (0 out of 1-2, or 25% out of 3 or more)

I think the syntax does not support 0<0 1<50%; I'll use 50% and see if it works as expected.

For low_freq_min_should_match, I like your settings, though I think for aggressive_recall, it should be '2<66%', since that means "for three or more", if I've read the docs right.

I've been a bit conservative here. I'm not against being more aggressive about labeling terms as stop words and requiring fewer of them (possibly none) to be present. They really do contribute so little. I think the low_freq_min_should_match settings are where the good stuff is going to happen. But I'm also happy with baby steps.

Is ? a problem? I searched for Who's Afraid of Virginia Woolf? and Who Framed Roger Rabbit? without any trouble. Help:Searching gives an example using a question, how do clocks work?, so I think we're okay there. (Is ? used other than in regexes?)

Look at the result differences:

This is because when ? is in a query term it triggers a wildcard query. I'm not sure how to deal with that without breaking the existing syntax...
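
Purely as an illustration of one possible escape hatch (a hypothetical heuristic, not something we do today): treat a single query-final ? as punctuation rather than a wildcard, and leave ? alone everywhere else.

```
import re

def strip_trailing_question_mark(query: str) -> str:
    """Hypothetical heuristic: a '?' at the very end of the query is almost
    certainly punctuation (a question), so drop it before wildcard parsing;
    a '?' inside a word (e.g. colo?r) stays a wildcard."""
    return re.sub(r"\?+\s*$", "", query)

assert strip_trailing_question_mark("how do clocks work?") == "how do clocks work"
assert strip_trailing_question_mark("colo?r") == "colo?r"
```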

Evaluation is going to be hard. Zero Results Rate is easy to measure, but is probably too crude of a measure, as the snail slime example illustrates.

We could rig up a harness that would allow us to automatically detect differences in results from various settings. It won't tell us whether the results are better, but will let us tell whether the settings make any significant difference (because if only 0.05% of queries get a different set of results, it's probably not enough).

As discussed before, we could look at differences only in top 5 results, top 10, or whatever. We could also note re-ordering vs new entries. Depending on the effect size and sample size, we could manually look at results with new results (maybe or maybe not for results that are just re-ordered) to judge the value of the changes. For a 1K sample of queries, if 10% of results are affected, that's 100 queries to examine. (Or we could of course just randomly sample 100 queries to investigate, regardless of effect size.) At 2-3 minutes each (which is slow, I think), that's half a day to review them and see if the results are good or not. It'd be tedious, but doable, and would give us a rough idea of whether the results were obviously better or just low-precision junk.

Time has come to build more tools for our relevancy lab :)

Of course, an A/B test with the user satisfaction metric would be the best thing, but I don't know if that's going to happen in time.

Let me know if you want me to dig into anything else!

Another thing that is broken with the current implementation:
We have 2 fields:

  • plain: which uses a basic word tokenizer
  • analyzed: which is analyzed with language-specific features, producing stems and excluding stopwords

I included both fields in the common terms query but I think I should exclude the analyzed field. This also explains why some queries return no results today when a single stopword is missing.
I think the query format we use today will generate something like this for the query what is a stopwords:

(plain:what OR analyzed:what) AND (plain:is OR analyzed:is) AND (plain:a OR analyzed:a) AND (plain:stopwords OR analyzed:stopword)

and, since the analyzed field's analyzer drops the stopwords, this becomes:

(plain:what) AND (plain:is) AND (plain:a) AND (plain:stopwords OR analyzed:stopword)

Making all stopwords required.
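
One way out, sketched under my assumptions (this is not the actual CirrusSearch query builder): keep the common terms query on the plain field only, and give the analyzed field its own clause, since its analyzer already strips the stopwords.

```
search_terms = "what is a stopwords"

# Sketch only: stopword detection via common terms on the plain field; the
# analyzed (stemmed, stopword-free) field gets an ordinary match clause.
query = {
    "bool": {
        "should": [
            {"common": {"plain": {
                "query": search_terms,
                "cutoff_frequency": 0.001,
                "low_freq_operator": "and",
            }}},
            {"match": {"analyzed": {
                "query": search_terms,
                "operator": "and",  # stopwords already removed at analysis time
            }}},
        ],
        "minimum_should_match": 1,
    }
}
```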

Change 238767 had a related patch set uploaded (by DCausse):
WIP: Add support for CommonTermsQuery

https://gerrit.wikimedia.org/r/238767

I think the syntax does not support 0<0 1<50%; I'll use 50% and see if it works as expected.

It seems to be supported, per the current documentation for Minimum Should Match (and it's been supported since at least 1.4):

Multiple combinations (e.g. 2<-25% 9<-3): multiple conditional specifications can be separated by spaces, each one only being valid for numbers greater than the one before it.

Look at the result differences:

This is because when ? is in a query term it triggers a wildcard query. I'm not sure how to deal with that without breaking the existing syntax...

Aha! I finally figured out what's happening. I had conflated wildcards and regexes and was expecting this to behave like a regex ? (which it does not) rather than a wildcard ?.

The Help:Searching docs are wrong then, in that they don't mention ? as a wildcard, and suggest it as an option for searching for a question!

Time has come to build more tools for our relevancy lab :)

You are never going to hear me disagree with that!

Another thing that is broken with the current implementation:
We have 2 fields:

  • plain: which uses a basic word tokenizer
  • analyzed: which is analyzed with language-specific features, producing stems and excluding stopwords

I included both fields in the common terms query but I think I should exclude the analyzed field. This also explains why some queries return no results today when a single stopword is missing.
I think the query format we use today will generate something like this for the query what is a stopwords:

(plain:what OR analyzed:what) AND (plain:is OR analyzed:is) AND (plain:a OR analyzed:a) AND (plain:stopwords OR analyzed:stopword)

and, since the analyzed field's analyzer drops the stopwords, this becomes:

(plain:what) AND (plain:is) AND (plain:a) AND (plain:stopwords OR analyzed:stopword)

Making all stopwords required.

That's just weird, and I don't understand. Isn't Common Terms more sophisticated than just AND-ing and OR-ing, since it allows minimum_should_match? Do we use the analyzed field now? Searching on wanted or wants in enwiki does not give results for want; it seems like the docs are out of date here, too, since stemming: doesn't work, either.

I think the syntax does not support 0<0 1<50%; I'll use 50% and see if it works as expected.

It seems to here in the current documentation for Minimum Should Match (and it's been supported since at least 1.4):

Multiple combinations (e.g. 2<-25% 9<-3): multiple conditional specifications can be separated by spaces, each one only being valid for numbers greater than the one before it.

Thanks, I overlooked this section; I'll try it.

Look at the result differences:

This is because when ? is in a query term it triggers a wildcard query. I'm not sure how to deal with that without breaking the existing syntax...

Aha! I finally figured out what's happening. I had conflated wildcards and regexes and was expecting this to behave like a regex ? (which it does not) rather than a wildcard ?.

The Help:Searching docs are wrong then, in that they don't mention ? as a wildcard, and suggest it as an option for searching for a question!

Yes, the doc is either outdated or wrong :/

Making all stopwords required.

That's just weird, and I don't understand. Isn't Common Terms more sophisticated than just AND-ing and OR-ing, since it allows minimum_should_match? Do we use the analyzed field now? Searching on wanted or wants in enwiki does not give results for want; it seems like the docs are out of date here, too, since stemming: doesn't work, either.

Yes, we are using stemming today, but with a default AND and apparently this weird behavior of all stopwords being required. A simple query wants should also match want; I'll try to find an example.

I updated suggesty with all your suggestions and made a special clause for the stemmed field with a default AND. The results are a bit worse; the zipf query still returns results, but far fewer :(. Maybe I should experiment with min should match on the stemmed field...

Thanks!

The results are a bit worse...

So glad to be of help! Just kidding. Dang.

This is why we need more relevance lab and a bit less intuition.

Your suggestions are all good :)
I just forgot to mention that it's because my previous implementation was totally wrong, and the interesting facts about kennedy assassinations query was returning too many results. Now, after adding a default AND, some interesting results are filtered out by these not-so-common question words...

Change 238767 merged by jenkins-bot:
Add support for CommonTermsQuery

https://gerrit.wikimedia.org/r/238767