Tue, Oct 16
We have a week's worth of autocomplete data for wikidata, so I took a stab at this. It's only in a Python notebook on SWAP, but I will hopefully clean it up into something more durable. Currently it generates daily extracts from merged cirrus + eventlogging logs and writes them to ebernhardson.wikibasecompletionclicks in hive. I've then run the click/skip counts over them and generated the following examination probabilities:
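I don't have the notebook handy here, but a minimal sketch of how examination probabilities can be estimated from click/skip counts (the function names and the cascade assumption are mine, not necessarily what the notebook does):

```python
from collections import Counter

def examination_probabilities(sessions):
    """Estimate P(examined) per result position from click/skip logs.

    sessions: iterable of (num_displayed, clicked_pos) pairs, one per
    displayed result set; clicked_pos is 0-based, or None for a skip.
    Cascade assumption: the user examined every position up to the click.
    """
    examined = Counter()
    displayed = Counter()
    for num_displayed, clicked_pos in sessions:
        for pos in range(num_displayed):
            displayed[pos] += 1
            if clicked_pos is not None and pos <= clicked_pos:
                examined[pos] += 1
    # P(examined at pos) = times examined / times displayed
    return {pos: examined[pos] / displayed[pos] for pos in displayed}
```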
Mon, Oct 15
Fri, Oct 12
The attached patch doesn't really address the goal of this task, which seems to be identifying what happened in the Sept 13 deploy to change the behaviour of a previously working query, but it does simplify the query used, so we might stop seeing this error.
Thu, Oct 11
Proposed deployment process:
The problems outlined in the description are detailed in docs/multi_cluster.txt in the patch. As far as I'm aware, this patch provides all of the machinery necessary in CirrusSearch to deploy multi-cluster support.
Wed, Oct 10
In a quick test:
Tue, Oct 9
Fri, Oct 5
This will allow indexing the data; using that data for something depends on the use case. The most direct method is to implement a full-text search keyword via the CirrusSearchAddQueryFeatures hook.
Thu, Oct 4
Relatedly, these should move to the standardized nodejs infra: T206229
Wed, Oct 3
Per the tags from ReleaseTaggerBot this should have been deployed with wmf.22, and we are now on wmf.23 with .24 rolling out. So this should be deployed.
Tue, Oct 2
We started calling Revision::getFirstRevision (via WikiPage::getOldestRevision) in https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/CirrusSearch/+/433986/3/includes/Updater.php as part of T195071. Nothing on the CirrusSearch side should be new to this code path since that patch was deployed in late May.
The mjolnir tox jobs look to be running on m4executor; not sure what changed, but this doesn't seem to be a problem anymore.
Mon, Oct 1
The search config needs to know the name of the cluster to connect to. Currently it only has the local wiki's config, not the config of the wiki being searched.
Work has started in T205111 to collect wikidata autocomplete click data and use the click data to perform offline evaluation of a proposed autocomplete ranker. The ability to evaluate the relative quality of multiple rankers is an essential first step to being able to tune the ranker.
I think for the purposes of this ticket we can call it complete. There isn't a whole lot that can be done about the network issues from the mediawiki side besides the already merged patch to fail gracefully. From the elasticsearch side the oversized impact (dropping requests for ~2 minutes) of this is expected to be mitigated by ongoing work to reduce the number of shards per cluster.
After re-reviewing my code this morning, I found a bug where all of the scores were off by one (so the first position scored 1/2 instead of 1/1). Re-running gives much higher numbers, but in relative terms things are pretty similar.
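For reference, the corrected scoring is just standard reciprocal rank with 1-based ranks (a sketch, not the actual notebook code):

```python
def reciprocal_rank(results, clicked):
    """1 / rank of the clicked item, with 1-based ranks, or 0.0 if absent.

    The off-by-one bug amounted to using index + 2 as the rank, so the
    top result scored 1/2 instead of 1/1.
    """
    for index, item in enumerate(results):
        if item == clicked:
            return 1.0 / (index + 1)  # 0-based index -> 1-based rank
    return 0.0

def mean_reciprocal_rank(judged):
    """judged: iterable of (results, clicked) pairs."""
    scores = [reciprocal_rank(results, clicked) for results, clicked in judged]
    return sum(scores) / len(scores)
```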
Fri, Sep 28
Thinking about this more, MPC might be considered the optimal ordering if we don't have the ability to use additional context per query. If my line of thinking is correct, MPC should achieve the maximum possible MRR on this click dataset if we have to return the same result set for the same prefix every time. MPC could be significantly improved upon if short prefixes could vary their results based on some sort of context clues.
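MPC (Most Popular Completion) is simple enough to sketch; for a fixed per-prefix ordering, sorting by descending historical click count maximizes MRR on the same log, since 1/rank is decreasing in rank (names here are illustrative):

```python
from collections import Counter, defaultdict

def build_mpc(click_log):
    """click_log: iterable of (prefix, selected_completion) pairs.

    Returns prefix -> completions ordered by historical click count,
    i.e. the Most Popular Completion ranking for each prefix.
    """
    counts = defaultdict(Counter)
    for prefix, completion in click_log:
        counts[prefix][completion] += 1
    return {prefix: [c for c, _ in cnt.most_common()]
            for prefix, cnt in counts.items()}
```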
I have a working example of this now, will need some cleanup and test cases written before uploading to gerrit.
Thu, Sep 27
See the sort parameter to the search api, added in the last few months: https://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&format=json&list=search&srsearch=qqq&srinterwiki=1&srsort=relevance
insource:/\~\~\~\~/ works as expected
Another useful reference, this follows the development of autocomplete from MPC to ~2016: https://www.slideshare.net/YichenFeng1/tutorial-on-query-autocompletion
Wed, Sep 26
Tue, Sep 25
If the master does not receive acknowledgement from at least discovery.zen.minimum_master_nodes nodes within a certain time (controlled by the discovery.zen.commit_timeout setting and defaults to 30 seconds) the cluster state change is rejected
The part about minimum_master_nodes is slightly confusing. Our discovery.zen.minimum_master_nodes is set to 2, but it isn't clear whether 2 master-capable nodes must ack the state update, or whether any 2 nodes could. Additionally, the logstash messages only mention a timeout, not that the update was rejected.
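For reference, the two zen discovery settings involved live in elasticsearch.yml (the values shown are our current setting and the documented default mentioned above):

```yaml
# elasticsearch.yml (zen discovery)
discovery.zen.minimum_master_nodes: 2  # our current value
discovery.zen.commit_timeout: 30s      # default; how long the master waits for acks
```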
Mon, Sep 24
In addition to this, see T205348 for updating eventlogging to collect enough information to calculate examination probabilities.
We need to update wikidata and search satisfaction logging to do one of the following:
- Record the searchToken for each result set displayed
- Record the list of pages for each result set displayed. This would probably necessitate recording an event for each displayed search.
- Keep an in-memory history of autocomplete results. When an item is selected, look back through the history for all the possible prefixes and record which prefixes displayed the finally selected item, and at what position.
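The third option could be sketched like this (in Python for illustration; the real implementation would live in the browser-side JS, and all names here are hypothetical):

```python
class AutocompleteHistory:
    """In-memory history of displayed autocomplete result sets."""

    def __init__(self):
        self._shown = {}  # prefix -> result titles in display order

    def record(self, prefix, results):
        """Remember the results displayed for a typed prefix."""
        self._shown[prefix] = list(results)

    def positions_of(self, selected):
        """For every prefix whose results contained the selected item,
        return the 0-based position it was displayed at."""
        return {prefix: results.index(selected)
                for prefix, results in self._shown.items()
                if selected in results}
```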
Other metrics that I've seen used for autocomplete in my review:
I spent most of Friday digging through "User model based metrics for offline query suggestion evaluation". The two metrics proposed, eSaved and pSaved, are not too dissimilar from MRR. The simpler of the two is pSaved: the metric is the sum of P(S_ij = 1) across i and j, where i represents the number of letters typed and j represents the position of the suggestion. P(S_ij = 1) is defined relatively simply: the user is satisfied if the correct result is provided and the user examines that result. The paper gives a relatively simple algorithm for looking over your interaction logs and calculating the probability of examination. eSaved is a modification of pSaved that further accounts for how much typing is saved.
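Glossing over details of the paper's user model, pSaved might be sketched roughly like this (names are mine; this naive version ignores that the satisfaction events S_ij are mutually exclusive in the paper's formulation):

```python
def p_saved(prefix_results, final_query, examine_prob):
    """Simplified pSaved: sum over (i, j) of P(S_ij = 1).

    prefix_results: list of suggestion lists, one per prefix length i.
    examine_prob: dict of 0-based position j -> P(user examines position j),
    estimated from interaction logs. S_ij = 1 when the final query appears
    at position j after i letters typed and the user examines it.
    """
    total = 0.0
    for results in prefix_results:           # i: letters typed so far
        for j, suggestion in enumerate(results):
            if suggestion == final_query:    # correct result shown at (i, j)
                total += examine_prob.get(j, 0.0)
    return total
```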
Fri, Sep 21
This is a nice survey of the query autocomplete literature (circa 2016): https://staff.fnwi.uva.nl/m.derijke/wp-content/papercite-data/pdf/cai-survey-2016.pdf
Thu, Sep 20
Potentially the reason this dragged on longer than it should have is timeouts updating the cluster state: https://logstash.wikimedia.org/goto/1f67b952e7da4dec76fc66addb6b901b
Is this from the task reindexing the same cluster that is being restarted (codfw, iirc)? I don't think elasticsearch tasks or scrolls are able to move between hosts; they will fail if one of the participating nodes restarts. The error message also suggests we don't detect that case; there is probably no particular error handling on the Cirrus side of the reindexing code to detect such errors.
This is now blocked until roughly December, waiting for data to populate the indices before it can be used.