
Build pipeline to transform elastic explains into feature vectors and a tf graph
Closed, ResolvedPublic

Description

The primary goal of this pipeline is to take an elasticsearch query, run it with explain enabled, and collect the hits for some number of query strings. It should be able to extract all the scoring components from the explain into a single dict per hit, and to convert an explain into a tensorflow graph that will, when given the input feature vectors, score the hits.
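
To make the extraction half concrete, here is a minimal sketch (not the actual relevanceForge implementation) of flattening an explain tree into one dict per hit. An elasticsearch explain is a nested object with `value`, `description`, and `details` keys, and the leaf values are the scoring components we want as features; the index name and variable names below are illustrative only.

```
def explain_to_features(explain, prefix="", features=None):
    """Recursively collect the leaf values of an elasticsearch explain into {name: value}."""
    if features is None:
        features = {}
    name = f"{prefix}/{explain['description']}" if prefix else explain["description"]
    details = explain.get("details", [])
    if not details:
        # Leaf node: a raw scoring component, e.g. a constant_score boost
        features[name] = explain["value"]
    for child in details:
        explain_to_features(child, name, features)
    return features

# resp = es.search(index="wbitems", body=query, explain=True)   # hypothetical index name
# vectors = [explain_to_features(hit["_explanation"]) for hit in resp["hits"]["hits"]]
```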

At a high level the idea is to be able to run the scoring equation outside elasticsearch with all the components of the equation pre-calculated. Variables in the equation can then be tuned and the results re-run without having to query elasticsearch directly. In testing, some simple queries can run at 1M hits/sec on modest hardware, which opens up more tuning possibilities.
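
As a toy illustration of that idea (not the pipeline's actual graph construction; it uses the TF 1.x graph API and made-up shapes), the pre-computed components become inputs while the weights being tuned become variables, so the whole result set can be re-scored for each candidate set of weights without touching elasticsearch:

```
import tensorflow as tf

# Pre-computed scoring components, one row per hit; 3 columns is illustrative only.
features = tf.placeholder(tf.float32, shape=[None, 3], name="feature_vectors")
# The knobs being tuned, e.g. the constant_score boosts, starting from current values.
weights = tf.Variable([2.0, 1.6, 1.1], dtype=tf.float32, name="boosts")
# A simple stand-in for the real scoring equation: a weighted sum of the components.
scores = tf.reduce_sum(features * weights, axis=1)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Re-score all hits for the current weights, entirely outside elasticsearch.
    hit_scores = sess.run(scores, feed_dict={features: [[1.0, 0.0, 1.0]]})
```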

Event Timeline

Change 472067 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[wikimedia/discovery/relevanceForge@master] Implement elasticsearch explain parser

https://gerrit.wikimedia.org/r/472067

I've run this a few times with a few different settings. All inputs start with the set of all logged English Wikidata item autocompletes for October. Some configurations:

  • 5% of item searches, 50 hits per query
  • 20% of item searches, 50 hits per query
  • 20% of item searches, 250 hits per query

So far all three of these return very similar results. Optimization consistently starts with expected satisfaction at 6.7 characters typed, and all of them manage to reduce that to 6.1-6.2 characters typed. This suggests that the same approach might be applicable to languages with 5% or less of the traffic we see in English.

Next up I need to build proper test/train splits into the evaluation so we can generate usable numbers.
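
A rough sketch of that split, assuming it happens on whole query strings rather than individual hits so the tuned weights are evaluated against searches the optimizer never saw (the names here are placeholders, not relevanceForge code):

```
import random

def train_test_split_queries(queries, test_fraction=0.2, seed=0):
    """Hold out whole query strings (and all of their hits) for evaluation."""
    rng = random.Random(seed)
    shuffled = list(queries)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

# train_queries, test_queries = train_test_split_queries(all_query_strings)  # hypothetical input
```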

These are not validated, but to give an idea of what is being tuned and what values are being chosen:

| variable | initial value | tuned value |
| --- | --- | --- |
| constant_score/labels.en.near_match | 2.0 | 2.2028 |
| constant_score/labels.en.near_match_folded | 1.6 | 0.9184 |
| constant_score/labels.en.prefix | 1.1 | 1.4492 |
| constant_score/labels_all.near_match_folded | 0.001 | 0.0016* |
| query_weight | 1.0 | 0.2543 |
| rescore_query_weight | 1.0 | 0.3741 |
| rescore/0/incoming_links | 0.6 | 1.0043 |
| rescore/1/sitelink_count | 0.4 | 0.2793 |
| rescore/2/P31=Q4167410 | -1.0 | -0.1023 |
| rescore/3/P31=Q13442814 | -0.5 | -0.4977 |
| rescore/4/P31=Q18918145 | -0.5 | -0.8572 |
  • (*) The maximum value we will try to tune with is 2*initial, so this didn't try to expand beyond 0.002 but might improve with a wider bound.
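
For illustration only, one way such a per-variable search range could be expressed (the variable names come from the table above; the actual optimizer configuration may look different):

```
# Explore each weight between 0 and 2x its initial value; sorted() keeps the
# pair ordered when the initial value is negative.
initial = {
    "constant_score/labels_all.near_match_folded": 0.001,
    "rescore/2/P31=Q4167410": -1.0,
}
bounds = {name: tuple(sorted((0.0, 2 * value))) for name, value in initial.items()}
# {'constant_score/labels_all.near_match_folded': (0.0, 0.002),
#  'rescore/2/P31=Q4167410': (-2.0, 0.0)}
```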

Very interesting stuff. Thanks for sharing the numbers. A few things come to mind.

First is that a reduction from 6.7 to 6.1 characters typed is about a 9% decrease, which is quite respectable! It's hard to compute a minimum practical score, but it's clearly got to be >2, and it seems probable that it's >4, so this is a good step in a good direction.

Since it's cheap to run, have you tried some ridiculous or semi-ridiculous starting values, like making all the initial values 2x or even 10x? It's possible that the initial values (which were hand-tuned?) are fairly close to a local optimum, but maybe letting things run a little wild could find something new and nifty. If the parameter space is smooth, you'll end up right back here, but if it's particularly lumpy in any direction you could find something surprising. (Having a train/test split also makes that less scary, since you can validate that you haven't wildly overfit your data.)

Minor tangent: should we tell the Wikidata UI folks to make the suggester's "more" button bigger? We have pretty good empirical proof that people have trouble clicking on it.

Major tangent on non-alphabetic inputs

[Sorry.. I started thinking about this and got carried away.]

I'm very curious what the optimization does for Chinese or other writing systems with a very large number of distinct characters. Hmm, actually, Chinese and Korean input methods may have a big effect on this. I think Korean is a little more standard, in that there are 24 base characters (jamo) that combine into the 11K syllabic blocks, and those 24 characters are on the keyboard (Chinese input methods vary a lot).

So, there are just three characters in “얼비스” (“Elvis”), but they are typed as seven characters: ㅇ ㅓ ㄹ ㅂ ㅣ ㅅ ㅡ. The input software groups them into syllables on the fly based on where the consonants (C) and vowels (V) are, and it will sometimes guess that CVCC is a single syllable until it sees CVCCV, at which point the last C has to start the next syllable. As you type, then, your successive inputs are as follows:

  • 얿 (← a wrong guess about where the consonant goes)
  • 얼비
  • 얼빗 (← another wrong guess about where the consonant goes)
  • 얼비스

Your successive inputs are not this:

  • 얼비
  • 얼비스

For Korean, you can algorithmically decompose syllable blocks into individual characters and then algorithmically recompose prefixes into syllable blocks to recreate the 7-step input above. The differences in input there are like the US vs French keyboard—the same characters are on the keyboard in different places. But I don't know what to think about Chinese (oh, and Japanese, too!)—there's a lot going on there and you can't necessarily reconstruct the input prefix sequence from the final input string.
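
The decomposition half really is just arithmetic on the Unicode code points; here is a minimal sketch. Rebuilding the full keystroke-by-keystroke prefix sequence additionally needs the compound-final rules (e.g. ㄹ+ㅂ → ㄼ) and the re-splitting behavior described above, which are left out here.

```
# Precomposed Hangul syllables start at U+AC00 and enumerate
# 19 lead consonants x 21 vowels x 28 tails (including "no tail").
CHO = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")
JUNG = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")
JONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")

def decompose(syllable):
    """Split a precomposed syllable into its lead/vowel/tail jamo."""
    idx = ord(syllable) - 0xAC00
    lead, rest = divmod(idx, 21 * 28)
    vowel, tail = divmod(rest, 28)
    return [CHO[lead], JUNG[vowel]] + ([JONG[tail]] if tail else [])

def compose(lead, vowel, tail=""):
    """Recombine lead/vowel/tail jamo into a single precomposed syllable."""
    code = 0xAC00 + (CHO.index(lead) * 21 + JUNG.index(vowel)) * 28 + JONG.index(tail)
    return chr(code)

print([decompose(s) for s in "얼비스"])
# [['ㅇ', 'ㅓ', 'ㄹ'], ['ㅂ', 'ㅣ'], ['ㅅ', 'ㅡ']]
```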

Dang, language is messy.

Since we don't have per-language settings anyway (at least for now), I think we can start by tuning the weights for English and hoping it works for other languages too. Of course, that may be wrong - and we may want to verify it - and if it does turn out to be wrong, we may want to implement per-language weights.

OTOH, there's no hard link between language and input - i.e. you can search for a Korean label (presumably using a Korean input method) while specifying the language as English (though this would be relatively rare) - so I think our best hope is still that we can improve all languages with a single profile. The most interesting parts are the relative weights of the query-dependent and query-independent scores and the relative scoring of the various sub-components, so hopefully language differences would not change them too much.

Change 472077 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[wikimedia/discovery/relevanceForge@master] [WIP] Cli and optimizers for es explains

https://gerrit.wikimedia.org/r/472077

Change 472067 merged by jenkins-bot:
[wikimedia/discovery/relevanceForge@master] Implement elasticsearch explain parser

https://gerrit.wikimedia.org/r/472067

There wasn't really a perfect ticket for this report, but the analysis of this first tuning run on item search in English is uploaded to phab: F27316679

Not sure it would be worth figuring out, but I realized while testing the deployed AB code that this model doesn't take into account how likely the user is to even see a set of suggestions. It assumes users see the results for each and every letter they type, while in the real world some users type the entire label before search comes back with a result. To check this, I compared the number of characters actually typed vs the expected number in the eventlogging:

| actual mean | expected mean |
| --- | --- |
| 7.3 | 6.5 |

This suggests there is certainly something the model does not account for that leads users to type more characters than expected. It's entirely possible to model probabilistically which of the typed prefixes actually had suggestions rendered on the user's screen, but that would require updating our eventlogging. Overall, I still think that while the model used for the tuning metric doesn't exactly match user behavior, it is a useful metric to improve upon.
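
For clarity, this is roughly the assumption baked into the metric; the function and field names below are hypothetical stand-ins, not our eventlogging schema or the actual metric code.

```
def expected_characters_typed(label, clicked_item, suggestions_for):
    """Prefix length at which the eventually-clicked item first shows up,
    assuming the user sees fresh suggestions after every single keystroke."""
    for n in range(1, len(label) + 1):
        if clicked_item in suggestions_for(label[:n]):
            return n
    return len(label)  # under the model, the user typed the whole label
```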

Change 472077 merged by jenkins-bot:
[wikimedia/discovery/relevanceForge@master] Cli and optimizers for es explains

https://gerrit.wikimedia.org/r/472077

Overall, I still think that while the model used for the tuning metric doesn't exactly match user behavior, it is a useful metric to improve upon.

I agree. It's even possible that better results, and the best result being higher-ranked, could lead to changes in user behavior over time. If you know you usually have to type 6-8 characters before anything good comes up, that's what you are going to do without looking at the suggestions. Habits die hard.

If you know you usually have to type 6-8 characters before anything good comes up, that's what you are going to do without looking at the suggestions.

Also, a lot of people copy-paste stuff from other places. Not sure how this is counted, but it happens a lot, especially with power-editors.

Change 486159 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[wikimedia/discovery/relevanceForge@master] Add jupyter notebooks for wbsearchentities analysis

https://gerrit.wikimedia.org/r/486159

Change 486159 merged by jenkins-bot:
[wikimedia/discovery/relevanceForge@master] Add jupyter notebooks for wbsearchentities analysis

https://gerrit.wikimedia.org/r/486159