Elasticsearch nodes overloading in eqiad
Closed, ResolvedPublic


Around 16:00 UTC Saturday, and again around 5:00 UTC Sunday, multiple nodes in the eqiad search cluster were pegging CPU at 100%. elastic1027 was involved both times, but it is not yet certain whether the node itself was the problem.

The first time around, restarting Elasticsearch on elastic1027 resolved the issue, although it took about 20 minutes after the restart for things to clear up.

The second time around, I shifted all more_like and regex traffic to codfw. That didn't seem to be enough to lighten the load, so I also shifted all enwiki traffic except comp_suggest and prefix to codfw. Things look to be under control now.

Related incident report:

Event Timeline

A previous time this happened we added some new metrics endpoints inside Elasticsearch and started logging them to Prometheus, to collect per-node latency metrics based on the stats buckets we provide at query time. Unfortunately the Prometheus graphs seem empty. We should also look into getting these back; they would potentially help.
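For reference, the stats buckets mentioned above are assigned per query via the standard top-level `stats` field of an Elasticsearch search body; Elasticsearch then tracks per-group query counts and latencies, which an exporter can publish to Prometheus. A minimal sketch of tagging a query this way (the index, field, and group names here are illustrative, not the actual CirrusSearch values):

```python
def tagged_query(text, stats_group):
    """Build a search body whose latency is counted under a named
    stats group (visible via the index stats API)."""
    return {
        "query": {"match": {"title": text}},
        # Every query carrying this tag is aggregated into the
        # "full_text" stats bucket on each node that serves it.
        "stats": [stats_group],
    }

body = tagged_query("example search", "full_text")
```

A client would POST this body to `/<index>/_search`; the per-group counters then appear under `indices.search.groups` in the node/index stats.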

Change 503718 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[operations/puppet@production] Repair elasticsearch per-node latency reporting

The patch does not fix the overall problem; it fixes the per-node percentiles data collection, which usually helps in tracking down these kinds of problems.

We will also need to put together an incident report on Monday.

I don't know that it's necessarily related, but I noticed that full-text QPS is up over the last month. Over the last year we've been fairly consistent at 400-500 QPS, but since late March we've been at roughly 550-650.

Change 503718 merged by Gehel:
[operations/puppet@production] Repair elasticsearch per-node latency reporting

Looked into a few angles but nothing conclusive:

  • Looked at overall query rates for our primary query splits (full_text, comp_suggest, more_like). The only interesting piece is that our full_text query rate dropped back to 400-500 QPS (from 550-650) after switching traffic from eqiad to codfw. This made me suspect high levels of bot traffic, but further investigation didn't find anything interesting. The traffic may be spread over a large number of IPs rather than concentrated on a small number of individual addresses.
    • Random theory: some automated process is querying enwiki and tuned its concurrency while we were running against the codfw cluster. Switching back to the eqiad cluster reduced the latency from internal round trips and thereby increased the query rate. No proof.
  • Collected all IP addresses issuing > 200 queries per minute at any time in April and graphed their query rates over the incident time ranges. No obvious correlations.
    • TODO: generate some stacked graphs; so far I have only looked for outliers in terms of total queries per minute.
  • Pulled Elasticsearch index recovery data (from the /_recovery API endpoint of the eqiad psi cluster) into pandas and looked over shard movements during the incidents; again nothing conclusive or too far out of the ordinary.
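The last step above involves flattening the nested /_recovery response into a tabular form. A minimal sketch of that flattening, assuming the documented public shape of the /_recovery response (the actual analysis code isn't attached to this task, and the sample node names below are illustrative):

```python
import pandas as pd

def recovery_to_frame(recovery_json):
    """Flatten an Elasticsearch /_recovery response into one row per
    shard recovery, keeping the fields useful for spotting shard
    movement (source node, target node, recovery type, duration)."""
    rows = []
    for index, data in recovery_json.items():
        for shard in data["shards"]:
            rows.append({
                "index": index,
                "shard": shard["id"],
                "type": shard["type"],      # e.g. PEER (moved between nodes)
                "stage": shard["stage"],    # e.g. DONE, INDEX, TRANSLOG
                "source": shard.get("source", {}).get("name"),
                "target": shard.get("target", {}).get("name"),
                "total_time_ms": shard["total_time_in_millis"],
            })
    return pd.DataFrame(rows)

# Illustrative sample in the shape /_recovery returns.
sample = {
    "enwiki_content": {
        "shards": [{
            "id": 0, "type": "PEER", "stage": "DONE",
            "source": {"name": "elastic1027"},
            "target": {"name": "elastic1030"},
            "total_time_in_millis": 120000,
        }],
    },
}
frame = recovery_to_frame(sample)
```

From here, filtering on recovery start/stop times and grouping by target node makes it straightforward to line shard movements up against the incident windows.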

Change 504375 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[operations/mediawiki-config@master] Return CirrusSearch to standard execution on eqiad cluster

Change 504375 merged by jenkins-bot:
[operations/mediawiki-config@master] Return CirrusSearch to standard execution on eqiad cluster

debt claimed this task.