Try defining the configuration as follows. Likely we need to improve the documentation in this area.
Fri, Apr 19
Only seeing timeouts against php-1.33.0-wmf.25, nothing against 1.34-wmf.1 yet. Should let this run for a week before declaring victory on the timeouts though.
Thu, Apr 18
A few variations that might be useful to test (using gor middleware to modify the queries; a rough sketch of such a middleware follows the list). These would mostly inform our options for reducing server load if necessary for incident response:
- Reduce the LTR rescore window
- Remove the LTR rescore entirely
- Reduce the popularity rescore window
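A minimal sketch of the kind of middleware this implies. The payload framing follows GoReplay's documented middleware protocol (hex-encoded messages on stdin/stdout); the JSON field names (rescore / window_size) are the standard Elasticsearch rescore syntax and may not match the CirrusSearch payload exactly:

```
#!/usr/bin/env python3
import json
import sys

def shrink_rescore_window(http: bytes) -> bytes:
    head, sep, body = http.partition(b'\r\n\r\n')
    if not sep:
        return http
    try:
        query = json.loads(body)
    except ValueError:
        return http  # not a JSON search body; pass through untouched
    rescores = query.get('rescore')
    if not rescores:
        return http
    if isinstance(rescores, dict):
        rescores = [rescores]
    for rescore in rescores:
        if 'window_size' in rescore:
            # halve the window as a load-reduction experiment
            rescore['window_size'] = max(1, rescore['window_size'] // 2)
    new_body = json.dumps(query).encode()
    # keep Content-Length consistent with the rewritten body
    headers = [h for h in head.split(b'\r\n')
               if not h.lower().startswith(b'content-length:')]
    headers.append(b'Content-Length: ' + str(len(new_body)).encode())
    return b'\r\n'.join(headers) + b'\r\n\r\n' + new_body

for line in sys.stdin:
    payload = bytes.fromhex(line.strip())
    header, _, http = payload.partition(b'\n')
    if header.startswith(b'1'):  # type 1 marks an intercepted request
        http = shrink_rescore_window(http)
    sys.stdout.write((header + b'\n' + http).hex() + '\n')
    sys.stdout.flush()
```

Removing the LTR rescore entirely would be the same shape of change, dropping the rescore clause instead of shrinking its window.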
Wed, Apr 17
Sorting functionality deployed and appears to be returning correct results now. Some followup will be needed for UI elements, probably in T197525
Example query now returns appropriate results. It seems the processes involved here are all working as intended, so I'm calling this complete.
Tue, Apr 16
While not fully documented, the results of previous load testing rounds and the methodology used are described here:
Looks like because insource uses a different connection than the standard one (a mitigation for the cluster overloads over the weekend), the attempt to source the last sent request from the connection fails. Will need to get the right connection object into ElasticsearchIntermediary::multiFailure()
Mon, Apr 15
Looked into a few angles but nothing conclusive:
Sun, Apr 14
I don't know that it's necessarily related, but I noticed that full text qps is up in the last month. Over the last year we've been pretty consistent between 400-500 qps, but since late March we've been at 550-650 or so.
Patch does not fix the overall problem; it fixes the per-node percentiles data collection, which usually helps in tracking down these kinds of problems.
A previous time this happened we added some new metrics endpoints inside elasticsearch and started logging them to prometheus, collecting per-node latency metrics based on stats buckets we provide at query time. Unfortunately the prometheus graphs seem empty. We should also see how to get these back; they would potentially help.
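For reference, tagging a query with a stats bucket and reading the per-group counters back looks roughly like the sketch below, using the stock Elasticsearch stats-group mechanism via elasticsearch-py. The index and group names are made up, and our actual setup layered custom per-node endpoints on top of this:

```
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Tag the search so its latency accumulates under the 'full_text' group.
es.search(index='enwiki_content',
          body={'query': {'match_all': {}}, 'stats': ['full_text']})

# Per-group totals are then available from the index stats API.
stats = es.indices.stats(index='enwiki_content', groups='full_text')
group = stats['_all']['total']['search']['groups']['full_text']
print(group['query_total'], group['query_time_in_millis'])
```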
Fri, Apr 12
As a very rough comparison, I pulled sum(irate(elasticsearch_indices_search_query_total[5m])) from prometheus, which gives the total shard queries executed per second across the cluster. We vary between about 12k and 21k shard queries per second, or roughly 720k to 1.26M per minute. This at least puts the volume of requests discussed here in the plausible range.
Thu, Apr 11
Should we roll back the addition to the shared build until this can be resolved?
Related workaround: https://gerrit.wikimedia.org/r/#/c/mediawiki/vendor/+/503068/
Not expected, although it's hard to say what the error is. GeoData error handling needs to be updated to log whatever response it didn't like
Wed, Apr 10
Tue, Apr 9
Created a horrible first draft that lists most of the properties and provides a short description for the ones used across most wikis. We should figure out how we want to format this before going much further:
This is an intentional feature added by the people behind AdvancedSearch. The high-level goal there is for the URL to represent what is being searched. In particular, if a user has a set of namespaces saved as their default search namespaces, their search URLs would otherwise not be shareable. The exact implementation details are debatable, but the overall goal is reasonable. See T217445 for more details; discussion of the feature should likely also happen there.
Mon, Apr 8
- Can WikibaseCirrusSearch easily be updated to support the above queries?
Actually, there might be a minimum delay hardcoded into the deleteByQuery code. Will write up something to scale the delay up over time, from perhaps 100ms to 5s; a rough sketch follows.
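A minimal sketch of that kind of growing delay. The 100ms floor and 5s ceiling come from the note above; the doubling factor, attempt cap, and the operation hook are assumptions:

```
import time

def growing_delays(initial=0.1, ceiling=5.0, factor=2.0):
    """Yield retry delays that grow from `initial` toward `ceiling` seconds."""
    delay = initial
    while True:
        yield delay
        delay = min(delay * factor, ceiling)

def run_with_backoff(operation, max_attempts=8):
    """Retry `operation` (a callable returning True on success) with growing delays."""
    delays = growing_delays()
    for _ in range(max_attempts):
        if operation():
            return True
        time.sleep(next(delays))
    return False
```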
Some new 1200s timeouts from jobs in Task.php came up today: https://logstash.wikimedia.org/goto/314bfac86ad15374a5cd8223f8867cbd
If being out of sync isn't a big deal, it seems the simplest and most direct way to resolve this is to set conflicts=proceed and let the jobs continue deleting instead of failing the delete-by-query.
Fri, Apr 5
After further review the inf loss wasn't actually a problem; that was just hyperopt reporting before any training runs had completed.
Doesn't seem to be needed anymore, feel free to start moving this to a more production configuration.
Thu, Apr 4
Synthetic benchmarks of runtime performance of CNN training in images/sec between CPU and WX9100. This essentially confirms what we already know: even a GPU that is not top of the line is an order of magnitude faster than training on CPU. Distributed training isn't a linear speedup, so it would likely take a significant portion of the hadoop cluster to achieve the same runtime performance as a single GPU. It's good to get verification that the GPU is mostly working in this configuration. Note also that the current case can only fit a single GPU, but ideally future hardware would be purchased with the ability to fit at least 2 cards, or possibly 4, in a single server.
That means no, miopen-opencl functionality is not supported within TF.
Something that would also need to be investigated: queries only return documents that have been refreshed (on 5s intervals). I suspect that documents that have been written to elasticsearch but not yet refreshed would not be deleted by a delete-by-query in that timespan. At a high level, read-your-writes is not guaranteed, or even expected, in elasticsearch, as it is eventually consistent.
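To make the concern concrete, a hedged sketch with elasticsearch-py; the index name, document, and query are all made up:

```
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Freshly indexed document; not visible to searches until the next refresh.
es.index(index='ttm-test', body={'wiki': 'testwiki', 'text': 'stale?'})

# A delete-by-query issued right away only matches refreshed documents,
# so the document above can survive this call:
es.delete_by_query(index='ttm-test',
                   body={'query': {'term': {'wiki': 'testwiki'}}})

# Forcing a refresh first would make it visible, at the cost of refresh load:
es.indices.refresh(index='ttm-test')
```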
While related to search, someone from mobile frontend will likely need to take a look at this. Will leave it to them to triage the priority.
As insource is special functionality of CirrusSearch, and the search UI is all in core, this is unfortunately not a 5-line patch. Not sure of the best route to an implementation. There is also a design question about clutter in the UI that I'm not an expert on.
With the changes in packages, trying to run any model now returns:
Still not seeing any job runner timeouts that are obviously related to this since the last one at 2019-03-25T22:52:37. Should still probably leave this task around for a little while to check into this a few more times.
Wed, Apr 3
Not timeouts, but seeing a few delete-by-query failures in the logs now. These appear to be due to version conflicts; we can set conflicts=proceed to at least let the delete-by-query complete rather than abort mid-delete. This might require some input from @Nikerabbit and @abi_ regarding what is appropriate here. Basically what is happening is that during the delete-by-query operation some document that was supposed to be deleted was updated. What should happen in that case?
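For reference, the proposed mitigation looks roughly like this with elasticsearch-py; the index name and query are placeholders:

```
from elasticsearch import Elasticsearch

es = Elasticsearch()

# With conflicts='proceed' a version conflict is counted and skipped
# instead of aborting the whole delete-by-query (the default behaviour).
resp = es.delete_by_query(
    index='ttmserver',  # placeholder index name
    body={'query': {'term': {'wiki': 'examplewiki'}}},
    conflicts='proceed',
)
print(resp['deleted'], resp['version_conflicts'])
```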
All results for the demo query now show one highlighted thing, so that is progress. Some items still don't show a snippet, but clicking through to the result page I'm not sure what could have been displayed in the snippet anyway. The initial goal seems to be complete.