
Determine root cause of weekend latency spikes in elasticsearch cluster
Closed, Resolved · Public

Description

According to our graphs[1] we are seeing latency spikes across all percentiles for queries from mediawiki to elasticsearch. Determine what is causing this and come up with ideas to fix it.

[1] https://grafana.wikimedia.org/dashboard/db/elasticsearch

Event Timeline

EBernhardson updated the task description. (Show Details)
EBernhardson raised the priority of this task from to Needs Triage.
EBernhardson assigned this task to dcausse.
EBernhardson added subscribers: EBernhardson, Joe.
Restricted Application added subscribers: StudiesWorld, Aklapper. · Jan 19 2016, 10:00 PM

With last week's train deploy @dcausse added some new stats which break down this latency per query type, along with an optimization to the 'more like' queries which appears to have had a positive effect on the size of the spikes (but hasn't completely removed them). The breakdowns are visible in a new set of graphs[1] on grafana.

One thing this shows is that we are spending more time on "unknown" queries than on any other type, including searches. My intuition is that these are all related to indexing, but we are deploying a patch[2] in today's SWAT which should get rid of "unknown" and label all possible query types.

[1] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles
[2] https://gerrit.wikimedia.org/r/#/c/265146/
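For anyone following along, the per-query-type breakdown boils down to timing every request we send to elasticsearch and reporting it under a label, with anything unlabeled falling into the "unknown" bucket. A rough Python sketch of that shape (the real code lives in the PHP CirrusSearch extension; the statsd host, prefix, and metric names below are invented for illustration):

```
import time

import statsd  # assumes the standalone 'statsd' client package

# Invented statsd endpoint and prefix, for illustration only.
metrics = statsd.StatsClient('statsd.example.org', 8125, prefix='elasticsearch')

def timed_search(es_client, query_type, **search_kwargs):
    """Run a search and record its latency under a per-query-type bucket.

    Anything that reaches elasticsearch without a query_type ends up in the
    'unknown' bucket, which is the catch-all discussed above.
    """
    start = time.time()
    try:
        return es_client.search(**search_kwargs)
    finally:
        elapsed_ms = (time.time() - start) * 1000
        metrics.timing('requestTime.%s' % (query_type or 'unknown'), elapsed_ms)
```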

I've also noticed that GeoData does not go through our standard data collection pipeline; I've filed T124102 to get that brought into the fold.

Deskana triaged this task as High priority. Jan 20 2016, 6:35 PM
Deskana added a subscriber: Deskana.

Moving this to "In progress" since @EBernhardson says this is under active investigation. Feel free to move it back if that's not correct.

I don't know if it's a root cause, but we have been serving many more more_like queries in the past month than we did previously (we think; we didn't used to record this information). T124216 tracks caching those results for 24 hours, which I've estimated would be roughly an 80% reduction in more_like traffic.
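As a rough illustration of the T124216 idea (not the actual implementation, which would live in CirrusSearch; the cache interface and key scheme here are assumptions), the caching amounts to keying the more_like result on the source page and reusing it for 24 hours:

```
import hashlib
import json

MORELIKE_TTL = 24 * 60 * 60  # seconds: the proposed 24 hour cache lifetime

def cached_morelike(cache, es_client, index, source_title, query_builder):
    """Return more_like results for source_title, reusing a cached copy if any.

    `cache` is assumed to be a memcached-style client exposing get(key) and
    set(key, value, ttl); `query_builder` builds the more_like_this body.
    """
    key = 'morelike:' + hashlib.sha1(
        ('%s:%s' % (index, source_title)).encode('utf-8')).hexdigest()
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    result = es_client.search(index=index, body=query_builder(source_title))
    cache.set(key, json.dumps(result), MORELIKE_TTL)
    return result
```

If the 80% estimate holds, most of that traffic would never reach the cluster at all.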

Also related is T124258: Perform A/B test to determine if using opening_text instead of text as the field to perform more_like_this queries is better or not, where we will attempt to determine whether that change also helps with this latency issue.
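For reference, the variable in that A/B test is just which field the more_like_this query matches against, roughly along these lines (tuning values are guesses rather than the production configuration, and depending on the elasticsearch version the text parameter is spelled `like` or `like_text`):

```
def build_morelike_query(like_text, use_opening_text=False):
    """Build a more_like_this body against either 'text' or 'opening_text'."""
    field = 'opening_text' if use_opening_text else 'text'
    return {
        'query': {
            'more_like_this': {
                'fields': [field],
                'like': like_text,    # 'like_text' on older elasticsearch
                'min_term_freq': 2,   # assumed tuning values, not prod config
                'max_query_terms': 25,
            }
        }
    }
```

(This builder is also the sort of thing that would be passed as `query_builder` in the caching sketch above.)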

Change 265667 had a related patch set uploaded (by EBernhardson):
Allow redirecting more like this to a different cluster

https://gerrit.wikimedia.org/r/265667
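To spell out what the patch does conceptually: when a separate cluster is configured for it, more_like traffic is sent there instead of the main cluster, so the relatively expensive more_like queries can be isolated from regular search traffic. A minimal Python sketch of that routing (cluster names and endpoints are invented; the actual change is to CirrusSearch configuration):

```
from elasticsearch import Elasticsearch

# Invented endpoints; in production this would be driven by wiki configuration.
CLUSTERS = {
    'default': Elasticsearch(['http://search-main.example.org:9200']),
    # When present, more_like queries are redirected here instead.
    'morelike': Elasticsearch(['http://search-secondary.example.org:9200']),
}

def client_for(query_type):
    """Pick the cluster a given query type should be sent to."""
    if query_type == 'more_like' and 'morelike' in CLUSTERS:
        return CLUSTERS['morelike']
    return CLUSTERS['default']
```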

Change 265667 merged by jenkins-bot:
Allow redirecting more like this to a different cluster

https://gerrit.wikimedia.org/r/265667

Change 265932 had a related patch set uploaded (by EBernhardson):
Allow redirecting more like this to a different cluster

https://gerrit.wikimedia.org/r/265932

Change 265932 merged by jenkins-bot:
Allow redirecting more like this to a different cluster

https://gerrit.wikimedia.org/r/265932

Latencies were looking bad again this weekend, so I deployed the patch to prod and everything immediately looked happier again. Will review after the weekend is over whether this has solved our problem.

Change 266558 had a related patch set uploaded (by EBernhardson):
Allow redirecting more like this to a different cluster

https://gerrit.wikimedia.org/r/266558

Change 266558 merged by jenkins-bot:
Allow redirecting more like this to a different cluster

https://gerrit.wikimedia.org/r/266558

Deskana closed this task as Resolved. Feb 17 2016, 5:18 PM