According to our graphs[1] we are seeing latency spikes across all percentiles for queries from mediawiki to elasticsearch. Determine what is causing this and come up with ideas to fix it.
[1] https://grafana.wikimedia.org/dashboard/db/elasticsearch
With last week's train deploy, @dcausse added new stats that break down these latencies per query type, along with an optimization to the 'more like' queries which looks to have reduced the size of the spikes (but hasn't completely removed them). The breakdowns are visible in a new set of graphs[1] on grafana.
One thing this shows is that we are spending more time on "unknown" queries than on any other type, including searches. My intuition is that these are all related to indexing, but we are deploying a patch[2] in today's SWAT which should get rid of "unknown" and label all possible query types.
[1] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles
[2] https://gerrit.wikimedia.org/r/#/c/265146/
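The idea behind the patch above can be sketched as follows. This is an illustrative Python sketch, not the actual CirrusSearch (PHP) implementation; the type names and the stat-key function are assumptions for illustration only:

```python
# Hypothetical sketch: every request maps to a concrete query type for the
# per-type latency stats, so nothing lands in the "unknown" bucket unless it
# is genuinely unrecognized. Type names here are made up for illustration.

KNOWN_QUERY_TYPES = {
    "full_text", "prefix", "more_like", "get", "near_match",
}

def stat_key(query_type):
    """Return the per-type stat key suffix, falling back to 'unknown'
    only for genuinely unrecognized query types."""
    return query_type if query_type in KNOWN_QUERY_TYPES else "unknown"
```

With a labeling step like this in place, the "unknown" series on the graphs should only reflect truly unclassified traffic rather than being the largest bucket.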
I've also noticed that GeoData does not go through our standard data collection pipeline; I've filed T124102 to get that brought into the fold.
Moving this to "In progress" since @EBernhardson says this is under active investigation. Feel free to move it back if that's not correct.
I don't know if it's the root cause, but we have been serving many more more_like queries in the past month than before (we think; we didn't previously record this information). T124216 tracks caching those results for 24 hours, which I've estimated would be an 80% reduction in more_like traffic.
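The 24-hour caching proposed in T124216 could be sketched roughly as below. This is a minimal in-memory Python sketch with made-up names, assuming results are keyed by page; the production version would presumably use the shared object cache rather than a process-local dict:

```python
import time

class MoreLikeCache:
    """Illustrative 24-hour TTL cache for more_like results.

    A sketch of the T124216 idea, not the production code: serve a cached
    result set for a page if one exists and is fresh, otherwise miss.
    """

    TTL = 24 * 3600  # seconds; the 24-hour window proposed in the task

    def __init__(self, clock=time.time):
        self._clock = clock  # injectable clock to make expiry testable
        self._store = {}

    def get(self, page_id):
        entry = self._store.get(page_id)
        if entry is None:
            return None
        expires_at, results = entry
        if self._clock() >= expires_at:
            # Stale entry: drop it and report a miss.
            del self._store[page_id]
            return None
        return results

    def put(self, page_id, results):
        self._store[page_id] = (self._clock() + self.TTL, results)
```

If most more_like traffic repeats within a day, a hit rate around 80% would match the traffic-reduction estimate above.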
Also related is T124258 (Perform A/B test to determine if using opening_text instead of text as the field to perform more_like_this queries is better or not), where we will attempt to determine whether that change could help with this.
Change 265667 had a related patch set uploaded (by EBernhardson):
Allow redirecting more like this to a different cluster
Change 265667 merged by jenkins-bot:
Allow redirecting more like this to a different cluster
Change 265932 had a related patch set uploaded (by EBernhardson):
Allow redirecting more like this to a different cluster
Change 265932 merged by jenkins-bot:
Allow redirecting more like this to a different cluster
Latencies were looking bad again this weekend; I deployed the patch to prod and everything immediately looked happier again. I'll review after the weekend is over whether this has solved our problem.
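The "redirect more like this to a different cluster" patches above amount to routing by query type. A minimal sketch of that routing logic, with cluster names and config keys that are assumptions for illustration (not the actual wmf-config values):

```python
# Hypothetical sketch of per-query-type cluster routing: expensive
# more_like queries can be sent to a separate Elasticsearch cluster so
# their latency spikes don't affect regular search traffic.

CLUSTER_CONFIG = {
    "default": "cluster-a",               # illustrative cluster names
    "overrides": {"more_like": "cluster-b"},
}

def cluster_for(query_type, config=CLUSTER_CONFIG):
    """Pick the cluster for a query type, falling back to the default."""
    return config["overrides"].get(query_type, config["default"])
```

Isolating the heavy query class on its own cluster is consistent with the observation above that latencies recovered immediately once the patch was deployed.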
Change 266558 had a related patch set uploaded (by EBernhardson):
Allow redirecting more like this to a different cluster
Change 266558 merged by jenkins-bot:
Allow redirecting more like this to a different cluster