Thu, Feb 22
I suppose the :9109 in the instance names is a bit annoying, in that it makes the list of instances much longer. Really, though, there are too many instances to list and it needs to be further filtered (top-N?) anyway.
I put together a very basic attempt at a first dashboard: https://grafana.wikimedia.org/dashboard/db/elasticsearch-per-node-percentiles?orgId=1
The overall numbers look sane and roughly what is expected.
Ran a quick test. To get the highlighter to return results I needed to add a highlight_query which referenced the text, and I needed to add the .prefix subfields to the per-field highlighter configuration's matched_fields. Looks fixable, but I will need to poke around a bit to see how this should be implemented.
Looking at the query we build and send to elasticsearch, it seems we might need to provide an appropriate query to the highlighter? Some testing should be able to tell.
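Based on the earlier test, the shape of the fix is probably something along these lines (a sketch only; the field names and the inner query are illustrative, not our actual mapping):

```python
# Sketch of an Elasticsearch highlight configuration (expressed as a Python
# dict) that pairs a highlight_query with matched_fields, as the quick test
# suggested is needed. Field names here are illustrative.
def build_highlight(query_text):
    return {
        "highlight": {
            "fields": {
                "suggest": {
                    # matched_fields requires the fast vector highlighter,
                    # which in turn needs term vectors on the fields
                    "type": "fvh",
                    "matched_fields": ["suggest", "suggest.prefix"],
                    # Without an explicit highlight_query the highlighter only
                    # sees the top-level query, which may not reference the
                    # text being highlighted
                    "highlight_query": {
                        "match": {"suggest.prefix": query_text}
                    },
                }
            }
        }
    }
```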
Wed, Feb 21
After some testing, my theory is the performance on eqiad is due to our ltr model cache size. It's currently set to the default of 10MB. If I run the ltr model in a series of requests at only 1 concurrent query, I get poor performance. Ramping up to 5 concurrent queries, per-query performance is still poor but improving. At 15 concurrent queries things generally improve, with nothing over 500ms in the last few dozen of the 1k. My interpretation is that the more likely the model is to be in the cache, the faster the request, and the model is being evicted very quickly. Unfortunately the cache size is not dynamically updatable; we have to do a rolling restart to update it. I've filed a bug upstream for this. Additionally, I don't have stronger proof the cache is undersized, because we only report the number of items and size in bytes for cache stats, nothing about loads or invalidations from which trends over time could indicate churn. I've filed an upstream bug about that as well.
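If memory serves, the relevant knob in the o19s ltr plugin is `ltr.caches.max_mem` in elasticsearch.yml (worth double-checking against the plugin docs before rolling this out), which would make the bump look something like:

```yaml
# elasticsearch.yml -- not a dynamic setting, so a rolling restart is required
ltr.caches.max_mem: 100mb   # plugin default is 10mb
```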
I've also now created a limited feature set with just what the mrmr model needs and uploaded it to the prod clusters. Query time on the codfw cluster looks really good, around 125ms. Query time on the eqiad cluster, on the other hand, is hanging out around 750ms, which is unacceptable. Going to have to dig into why the queries take significantly different amounts of time on the different clusters. I'm suspicious it's not actually feature collection, but maybe model caching? This is because if I adjust the query to only have 6 results, so only 6 possible items could have been provided to feature collection across 7 shards, I still see ~750ms. This suggests some sort of constant cost rather than a per-item cost like feature collection.
After evaluating, I found the previous iteration was missing all the features related to the plain fields due to a bug. Re-collected the data and came up with 0.8627. Unfortunately there is something odd going on at evaluation time. I uploaded the 0.8628 model as [[ https://en.wikipedia.org/wiki/Special:Search?search=kennedy&fulltext=1&cirrusMLRModel=20180118-query_explorer_v2-enwiki-v1 | 20180118-query_explorer_v2-enwiki-v1 ]] and it takes on average around 2s to return results. Taking the query being issued here and running it against codfw comes back at 300ms. 300ms is still probably too expensive, but 2s is egregious. The eqiad cluster load doesn't seem high enough (~25% cpu, minimal io) to make the queries 6x slower. Unfortunately it looks like the ltr plugin deployed in prod doesn't work with the elasticsearch profile api and throws an exception.
Tue, Feb 20
I have a test model up that currently pushes Chris to the top: https://en.wikipedia.org/wiki/Special:Search?search=Ingo+Heinrich&fulltext=1&cirrusMLRModel=20180118-query_explorer_v2-enwiki-v1
Thu, Feb 15
I can add a phrase match, but I'm not sure a phrase match will be enough to push this single occurrence all the way to the top. For this specific page a redirect containing the name might do the trick, but that's not as generalizable. Longer term I think named entity extraction has some potential here. A phrase match plus a named entity match might be enough, but that needs to be evaluated.
@Vvekbv I think the first steps for this project would be to review the paper and review the aggregation performed in the associated patch, to verify that the input vectors are generated correctly according to the paper. Once we think the data collection is correct, I can run it against some time period and put a sample of data (probably saved as a scipy.sparse matrix, for loading into python) on https://analytics.wikimedia.org/datasets/discovery/ to start working with.
Tue, Feb 13
One of the things that surprised me when attending the GSOC mentor summit in 2016 was that the scope of our tasks was much more limited than some of the other orgs'. For example, this task is I think fairly comparable to Learning to Rank Clickstream Mining at https://trac.xapian.org/wiki/GSoCProjectIdeas but is a bit more focused on a specific idea. I think we will have to be careful to only accept a student with enough background/experience to have a chance at finishing, but I don't think that needs to prevent students from applying.
Built out feature collection for these; the dataset sizes are a bit larger than expected. The xgboost dataset for enwiki, maintaining 35M observations, is roughly 50GB. Interestingly, the lightgbm dataset for the same data is only 4GB. Similarly, while xgboost needed 10 executors with 16G of memory each to run training, lightgbm looks like it will only require 1 executor with 10G of memory. Might be worth evaluating whether the ltr plugin can correctly evaluate lightgbm trees (if we convert them to xgboost format).
Mon, Feb 12
From the application side it seems like I could perhaps put the temp files in a better place. One option would be the PWD of the application, which looks to be something like:
Thu, Feb 8
Attached patch is a small proof of concept, but to go from there to a full evaluation of NCM vs DBN is going to be a large amount of work I can't find time for in the immediate future. This would be a great project for someone with interest in applying NN to practical problems.
Wed, Feb 7
- NCM: Neural Click Model
- DBN: Dynamic Bayesian Network, the algorithm we currently use to generate labels for MLR: http://olivier.chapelle.cc/pub/DBN_www2009.pdf
Next up: Train with more hyperopt iterations (300? 500?) to see if continued search is beneficial.
Next up: Experiment with variations in the number of trees for small wikis; does the current setting of 500 help?
Tue, Feb 6
First up: Train on the same data with the same hyperparameter space multiple (3) times to get an idea of the expected variance
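A sketch of that variance estimate, where train_and_eval is a stand-in for the real training pipeline and returns a single metric such as NDCG@10:

```python
import statistics

def estimate_variance(train_and_eval, n_runs=3):
    """Run the same training procedure n_runs times and summarize the
    spread of the resulting metric, to know how much of a later delta
    is just noise."""
    scores = [train_and_eval() for _ in range(n_runs)]
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "scores": scores,
    }
```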
Wed, Jan 31
Jan 23 2018
Nothing automatic as far as I'm aware, either. Cirrus integration tests use the api to create all their pages.
Jan 19 2018
If you are looking for search quality, typically what would be done is:
Jan 18 2018
For the moment, patch switches wiktionary on enwiki to use the title filter. We can ponder a bit on how to improve the relevance of the wiktionary search, or how to filter results that happen to be there but aren't particularly good.
I realized the two posts above had serious leakage between the test and train sets, so here we go again. This time we split the training data into 3 folds; against each fold we train a model with labels on a 0-9 scale and a model with labels on a 0-3 scale, then we evaluate both models against both scales.
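The leakage fix boils down to splitting by query rather than by row, so every row for a given query lands in the same fold. A stdlib sketch of that assignment (names are illustrative, not the actual pipeline's):

```python
import hashlib

def fold_for_query(query, n_folds=3):
    """Deterministically assign a query (and therefore all of its rows)
    to a fold, so the same query can never appear in both the train and
    test sides of a split."""
    digest = hashlib.md5(query.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_folds

def split_rows(rows, n_folds=3):
    """rows: iterable of (query, features, label) tuples."""
    folds = [[] for _ in range(n_folds)]
    for row in rows:
        folds[fold_for_query(row[0], n_folds)].append(row)
    return folds
```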
Jan 17 2018
Jan 11 2018
I'd have to pull in someone with more of a stats background to be sure, but I think that cutting off a survey once it hits a specific number of responses can end it a bit too early. An example I've seen used:
T183941 is probably relevant, but T184767 less so. The problem with T184767 is that we want to spread the surveys out. If we ask for 100 impressions per week we don't want to get 100 impressions all in the first hour and then end the survey, which is what T184767 seems to be implying. For 100/week we want around 14/day, spread across the days/hours of operation, to get all the different kinds of users.
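The spreading amounts to picking a per-impression sampling probability from expected traffic, rather than stopping at a hard cap (a sketch; the function name and traffic numbers are made up):

```python
def sampling_rate(target_per_week, expected_pageviews_per_week):
    """Probability of showing the survey on any single pageview, so the
    weekly target is hit in expectation while impressions stay spread
    across the whole week instead of clustering in the first hour."""
    if expected_pageviews_per_week <= 0:
        return 0.0
    return min(1.0, target_per_week / expected_pageviews_per_week)
```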
Ahh, I was only thinking of JVM dependencies and didn't remember the new python dep as well. Makes sense that that is taking a while to compile; I wish they would ship binary wheels like the other packages.
Another important aspect, at least for how search is using micro-surveys, is per-page sampling rates. Or more specifically targeting # of impressions per week on a per-page basis, rather than site wide.
Jan 9 2018
Jan 3 2018
Jan 2 2018
Closer reading of the report:
the model is very accurate with at least 40 yes/no/unsure/dismiss responses and the most accurate with at least 70 responses
Sending the basic info semi-synchronously (from DeferredUpdates, which will run in the same process as the edit but after closing the connection to the user, so as not to make save timing worse) should be ok. Actually generating a "basic" set instead of the full thing might be more difficult than necessary, though; I would be tempted to add a call to Updater::updateFromTitle(...) and let it do the full thing. Since article creates should be relatively rare (compared to the total edit rate), I don't think the extra computation expense outweighs the maintenance cost of keeping an extra bit of code to generate partial updates, including getting the labels from wikidata, without also calculating the rest of it.
@mpopov I wasn't quite sure from https://wikimedia-research.github.io/Discovery-Search-Adhoc-RelevanceSurveys/#responses_required , is 40 to 70 responses the number of impressions (yes+no+dismiss+timeout), the number of clicks (yes+no+dismiss), or the number of yes+no? I think it was yes+no+dismiss, but it might have been yes+no+dismiss+timeout?
We may still need to look into the special case of newly created pages being indexed from the web request, rather than being punted into the job queue. cirrusSearchLinksUpdatePrioritized, which performs the actual generation of a document and the write to elasticsearch, looks to have a p99 that regularly varies from 30 to 60 seconds. This is on top of however long it takes for the refreshLinksPrioritized job, which is another 20s - 2 minutes at p99. For some fraction of requests, even when the queue is healthy, there will be a couple minutes between the edit being performed and the two necessary jobs making it through the job queue and being turned into a write in elasticsearch.
Dec 21 2017
I don't know that it would be particularly useful here, but I have a WIP patch to expose our language detection via the mediawiki API that could potentially be useful for external language detection. For accessing in bulk from hadoop (like a UDF would), hitting a mediawiki api is probably undesirable.
This can currently be done using the insource regex search. We might need to think about how to reword the "exactly this text" field in the new search interface since, as you've noted, it's not strictly that text. What it does is skip things such as stemming, which converts cats into cat, or simplifying the character set (depending on the language used), but it still tokenizes into words and drops non-words such as # and -.
Dec 20 2017
Took some measurements of refresh rate averaged over 5 minutes pre- and post-deployment. Overall it's perhaps a 15% increase in refreshes/minute across the cluster. Disk IO graphs don't show anything particularly interesting. There will certainly be more merge volume as well, but elasticsearch should be able to bundle up the merges enough that these tiny merges are irrelevant compared to the major merges that happen on many-GB segments.
Dec 19 2017
It seems there are a couple options here, my thoughts:
Probably the biggest change I'm aware of is elasticsearch removing index types. We will need to come up with different solutions for what we have previously used those index types for. This can be done prior to moving to elastic 6 though.
Unfortunately all of the elasticsearch-specific metrics are not exposed over jmx. We can get generic JVM info that way, but for the specialized stats we have to query the elasticsearch APIs.
@Gehel It looks like we need to release stretch packages for all our custom elastic stuff (kibana, logstash, es, plugins?). My intuition is that since this is all JVM it should "just work" on stretch, is it possible for apt.wikimedia.org to build the appropriate package files without rebuilding all the debs? An alternate solution I've seen in the wild is to not use the distribution name in the apt line, but that always seemed a bit of a hack.
Dec 18 2017
Thanks! I'll try this out this week and see how things go.
chromedriver has to be started to listen on :4444. There might be some magic way to get nodejs to spawn this, but I'm not sure how. I generally start it myself using:
Unrelated to dashboards, but for prometheus. We will likely need a fork (or an additional custom collector) to collect extra metrics that are only reported by our cluster. Specifically we collect per-node latency percentiles into a custom api endpoint on the elasticsearch servers. This isn't even in diamond yet as we only recently upgraded the plugin version on the cluster to expose these metrics.
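Whatever form the collector takes, the core of it is just re-emitting that custom endpoint's JSON in Prometheus exposition format. A rough stdlib sketch of the translation step (the stats shape and metric name here are illustrative, not the endpoint's actual schema):

```python
def to_prometheus(node, percentiles):
    """Render per-node latency percentiles (a dict like {"p50": 12.3,
    "p99": 210.0}) as Prometheus exposition-format lines."""
    lines = [
        "# HELP elasticsearch_latency_ms Per-node request latency percentiles",
        "# TYPE elasticsearch_latency_ms gauge",
    ]
    for name, value in sorted(percentiles.items()):
        # "p50" -> quantile label "0.50", "p999" -> "0.999"
        quantile = name.lstrip("p")
        lines.append(
            'elasticsearch_latency_ms{node="%s",quantile="0.%s"} %s'
            % (node, quantile, value)
        )
    return "\n".join(lines)
```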
Load testing and percentiles could certainly go away. Cluster recovery might be useful at some point in the future, but it's hard to say. All of that data is found in other dashboards, though, just not broken out by server and in a single board. It certainly won't be immediately useful, and if the data is there it can be recreated as necessary.
Dec 15 2017
Overall looks pretty close. Some parts are perhaps underdefined, imo. For example, when you type into the main search bar on Special:Search you get autocomplete results; does that count as starting your search with autocomplete? Those are disambiguated in the event logging with the 'inputLocation' field. I might be tempted to throw out the autocomplete on the main search bar, since it's completing the query to be submitted instead of completing a page title to go to.
Dec 14 2017
The two attached patches are not complete solutions, which would still require rethinking how we store redirects, but they should at least paper over the problem from the user's perspective.
Dec 13 2017
Dec 12 2017
Dec 8 2017
For the moment, you can use this to get exact matches: insource:/umfaßt/
Dec 7 2017
I suppose, for a little more background, here is what I think is happening:
If this is two years old then likely things are working now. We use the hhvm debugger in wmf prod (mwrepl calls it) and break points and such seem to work.
Dec 5 2017
We really need a centralized place to store all these queries and expected results with different parameters. The key to making effective search is to have a set of queries and known good results, and then be able to evaluate changes to the system in how it affects all of those queries.