According to our graphs, we are seeing latency spikes across all percentiles for queries from MediaWiki to Elasticsearch. Determine what is causing this and come up with ideas to fix it.
- Mentioned In
  - T124626: Elasticsearch health and capacity planning FY2016-17
- Mentioned Here
  - T124258: Perform A/B test to determine if using opening_text instead of text as the field to perform more_like_this queries is better or not
  - T124216: Cache morelike API query results
  - T124102: GeoData extension should use ElasticsearchIntermediary::start and finish so its requests are logged both in hadoop and in our stats collection
With last week's train deploy, @dcausse added new stats that break down these latencies per query type, along with an optimization to the 'more like' queries which looks to have reduced the size of the spikes (but hasn't completely removed them). The breakdowns are visible in a new set of graphs on grafana.
One thing this shows is that we are spending more time on "unknown" queries than on any other type, including searches. My intuition is that these are all related to indexing, but we are deploying a patch in today's SWAT which should get rid of "unknown" and label all possible query types.
I don't know if it's the root cause, but we have been serving many more more_like queries in the past month than before (we think; we didn't previously record this information). T124216 tracks caching those results for 24 hours, which I've estimated would be an 80% reduction in more_like traffic.
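The caching idea in T124216 amounts to a keyed store with a 24-hour TTL. A minimal in-process sketch (the real implementation would presumably sit in a shared cache such as memcached; the class and key names here are hypothetical):

```python
import time


class MoreLikeCache:
    """Minimal TTL cache for more_like query results (sketch only).

    Production would use a shared cache backend rather than a
    per-process dict, but the get/set-with-expiry logic is the same.
    """

    def __init__(self, ttl_seconds=24 * 3600):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expiry_timestamp, result)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expiry, result = entry
        if time.time() >= expiry:
            # Entry has aged past the TTL; drop it and report a miss.
            del self._store[key]
            return None
        return result

    def set(self, key, result):
        self._store[key] = (time.time() + self.ttl, result)
```

With an estimated 80% of morelike traffic repeating within a day, most requests would be served from this layer without touching Elasticsearch at all.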
Also related is T124258 (perform an A/B test to determine whether using opening_text instead of text as the field for more_like_this queries is better), where we will attempt to determine whether that change helps with this.
Latencies were looking bad again this weekend; deployed the patch to prod and everything immediately looked happier again. Will review after the weekend is over whether this has solved our problem.