
Resolve elasticsearch latency alerts
Closed, Resolved · Public

Description

Starting last week (TODO: When?) elasticsearch started alerting on latency, with one or two nodes at 100% CPU. As we had just started a full-cluster reindex we assumed it was related and would resolve itself once the reindex completed. The reindex has completed, but the problem remains.

Dig into the problem and find a root cause / solution.

Event Timeline

Various things attempted this morning:

  • A single node, 1024, had been at 100% CPU for about 4 hours. After investigating and finding nothing interesting I decided to restart the instance at 15:56 UTC. This did resolve that particular node, but almost immediately 1030 spiked to 100% CPU usage. Looking through jstack there was a suspicious thread running org.elasticsearch.action.termvectors.TransportShardMultiTermsVectorAction.shardOperation, which is very expensive (because we don't store term vectors, as we don't expect to need them), but I only saw the one. AFAIK we don't use term vectors; what could be calling this?
  • We unfortunately don't have detailed enough per-node per-index logging to determine exactly what traffic shifted to 1030, but at the time I was recording output of the Elasticsearch tasks API every minute. Looking at tasks started in a particular minute and assigned to the two nodes in question, we get the following (note that the results from the tasks API are almost certainly incomplete for tasks that start and finish between calls); a sketch of the bucketing is included after this list. The third level here is the index if it could be determined, otherwise the task action. This is also limited to items that were started at least 10 times during the minute. We can clearly see enwiki_(content|titlesuggest) shift from 1024 to 1030. This unfortunately doesn't tell us much, as requests to enwiki should be spread reasonably well across the cluster. I could see some of the other indices hot-spotting on a single node or two, but for enwiki that doesn't make sense.
datetime.datetime(2018, 4, 2, 15, 54): {'elastic1024': {u'enwiki_content': 47,
                                                        u'enwiki_titlesuggest': 45,
                                                        u'indices:data/read/msearch': 129},
                                        'elastic1030': {}},
datetime.datetime(2018, 4, 2, 15, 55): {'elastic1024': {u'dewiki_titlesuggest': 13,
                                                        u'enwiki_content': 48,
                                                        u'enwiki_titlesuggest': 47,
                                                        u'indices:data/read/msearch': 130},
                                        'elastic1030': {}},
datetime.datetime(2018, 4, 2, 15, 56): {'elastic1024': {},
                                        'elastic1030': {u'enwiki_content': 35,
                                                        u'indices:data/read/msearch': 20}},
datetime.datetime(2018, 4, 2, 15, 57): {'elastic1024': {u'indices:data/read/msearch': 19},
                                        'elastic1030': {u'enwiki_content': 38,
                                                        u'enwiki_titlesuggest': 17,
                                                        u'indices:data/read/msearch': 58}},
datetime.datetime(2018, 4, 2, 15, 58): {'elastic1024': {u'enwiki_content': 13,
                                                        u'indices:data/read/msearch': 33},
                                        'elastic1030': {u'enwiki_content': 38,
                                                        u'enwiki_titlesuggest': 27,
                                                        u'indices:data/read/msearch': 70}},
datetime.datetime(2018, 4, 2, 15, 59): {'elastic1024': {u'indices:data/read/msearch': 12},
                                        'elastic1030': {u'enwiki_content': 46,
                                                        u'enwiki_titlesuggest': 33,
                                                        u'indices:data/read/msearch': 81}},
  • Killed the transfer from hadoop -> elasticsearch eqiad at 16:58 UTC on the chance it was creating excess load, but it doesn't appear to have made any difference.
  • Shifted all enwiki search traffic to the codfw cluster at 17:36 UTC. Cluster-wide p95 dropped from 1500ms to 100ms almost immediately. Shifted only enwiki traffic, rather than all traffic, to split the load and not cause the same issue in codfw. It may be worthwhile to shift all traffic to codfw and see if the problem repeats. This is of course not a long-term solution, but it gives users calling search a significantly better experience.
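
For reference, a minimal sketch of the per-minute bucketing described above, assuming the standard Elasticsearch _tasks API with detailed output; the endpoint URL, node list, and the index extraction from the task description are simplifications and not the exact script used:

import datetime
import re
import time
from collections import Counter, defaultdict

import requests

# Assumptions: a local cluster endpoint and the two nodes discussed above.
TASKS_URL = 'http://localhost:9200/_tasks?detailed'
NODES = {'elastic1024', 'elastic1030'}


def classify(task):
    """Bucket by index when the task description names one, else by action."""
    m = re.search(r'indices\[([^\],]+)', task.get('description', ''))
    return m.group(1) if m else task['action']


buckets = defaultdict(Counter)  # (start minute, node name) -> index/action counts
seen = set()
while True:
    for node in requests.get(TASKS_URL).json()['nodes'].values():
        if node['name'] not in NODES:
            continue
        for task_id, task in node['tasks'].items():
            if task_id in seen:
                continue
            seen.add(task_id)
            minute = datetime.datetime.utcfromtimestamp(
                task['start_time_in_millis'] / 1000.0).replace(second=0, microsecond=0)
            buckets[(minute, node['name'])][classify(task)] += 1
    # Tasks that start and finish between polls are never observed, so counts
    # are necessarily incomplete (matching the caveat above).
    time.sleep(60)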

For reference, we have seen this symptom before: nodes hitting 100% CPU and staying there. That time it was caused by the analysis chain for zh turning all punctuation into commas and then running phrase queries over a bunch of punctuation. This allowed individual queries to run past their timeouts, with some search tasks reported by the tasks API running for 20+ minutes. Over the two hours that I collected output from the tasks API this time around, no search reported running for more than 1 minute, so I believe this is something completely different with a similar symptom.

Change 423532 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[operations/mediawiki-config@master] Shift all search traffic to codfw

https://gerrit.wikimedia.org/r/423532

Change 423532 merged by jenkins-bot:
[operations/mediawiki-config@master] Shift all search traffic to codfw

https://gerrit.wikimedia.org/r/423532

Mentioned in SAL (#wikimedia-operations) [2018-04-02T23:44:07Z] <ladsgroup@tin> Synchronized wmf-config/InitialiseSettings.php: [[gerrit:423532|Shift all search traffic to codfw (T191236)]] (duration: 00m 59s)

Mentioned in SAL (#wikimedia-operations) [2018-04-02T23:45:36Z] <ladsgroup@tin> Synchronized tests/cirrusTest.php: [[gerrit:423532|Shift all search traffic to codfw, part II (T191236)]] (duration: 00m 58s)

Change 427278 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[operations/mediawiki-config@master] Shift search traffic back to eqiad

https://gerrit.wikimedia.org/r/427278

Change 427278 merged by jenkins-bot:
[operations/mediawiki-config@master] Shift search traffic back to eqiad

https://gerrit.wikimedia.org/r/427278

Mentioned in SAL (#wikimedia-operations) [2018-04-17T23:17:35Z] <ebernhardson@tin> Synchronized wmf-config/InitialiseSettings.php: T191236: Shift search traffic back to eqiad (duration: 01m 17s)

Ran for two weeks on codfw without a single alert from that cluster. Surprisingly, the eqiad cluster *still* managed to have a few latency alerts without any load on it (although this could be completely unrelated, and with low load a few long queries could spike the percentiles). I've shifted traffic back to eqiad now to see if the cluster has resolved the issue itself, perhaps by rebalancing shards around the cluster, or if we need to look into it more.

It's perhaps worth noting that the problem servers fall fairly cleanly into the group of "older" servers that are due to be replaced next year. Rather than digging into this too deeply we could try to reduce load on the cluster (longer "more like" caches, smaller full-text rescore windows) if latency issues persist, until the nodes are replaced.

High-level theory: We are over-consuming some resource on the machines. That is basically IO (network, disk), CPU, or memory. IO was a problem in the past, but doesn't look like a problem this time around. So I grabbed Intel's Performance Counter Monitor (PCM) and used it to look at some top-level CPU/IO stats, looking for differences between the older servers performing poorly and the newer servers that are doing well.

Test setup:

  • Eqiad cluster idle except for indexing traffic
  • Run the query normalization routine from MjoLniR to generate load from 50 Hadoop executors
  • Routine limited to enwiki_content index
  • Choose two machines serving replicas of the same shard (elastic1017 and elastic1043) from the two separate groups of servers.
  • Monitor memory usage with pcm-memory.x and IPC with pcm-core.x

Results: P7005
Summary:

  • 1043, a well behaving server:
    • shows memory usage heavily favoring numa node 0, but balancing well across available memory channels
    • reports an IPC of ~1.4
  • 1017, a misbehaving server:
    • shows memory usage also heavily favoring numa node 0, but heavily focused on a single memory channel. Each channel should have ~12GB/s of bandwidth available, and 1017 is regularly hitting 8-9GB/s on that one channel
    • reports an IPC varying from 0.7 to 1.0

Theory:

  • elastic1017 (verified on a few other servers as well) is putting too much pressure on a single memory channel
  • This is leading to reduced IPC as the cores wait around for memory requests

Caveats:

  • 1017 has an Ivy Bridge processor, 1043 uses Haswell. IPC differences could be due to the architecture change, or even to the JVM generating different assembly. A 30-100% per-cycle performance improvement is relatively unheard of for a single generational update these days, though.

Proposed Solution:

  • The first obvious solution is to better spread memory access between the channels, but I have no idea how to manage that
  • The next choice is to better spread memory access between the two numa nodes. This can be accomplished by starting elasticsearch with numactl --interleave=all -- /usr/bin/elasticsearch ...

Limited test:

  • Restarted elasticsearch on 1017 with numactl (via hack of /usr/lib/systemd/system/elasticsearch.service)
  • Used the cluster reroute API to put an enwiki_content shard back on the machine (a sketch of the call is included after this list). The cluster also decided to put 2x enwiki_general shards there
  • Observed p95 latency drop from ~250ms to ~150ms
  • While I brought load back with an enwiki_content shard, elasticsearch was certainly not serving the same shards it was before
  • pcm-memory.x reports memory bandwidth balanced evenly between the two numa nodes. The first channel of each is still accessed more heavily than the others
  • pcm-core.x reports IPC climbed to 1.0-1.1
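
The reroute itself was done by hand; below is a minimal sketch of the kind of call, assuming the standard cluster reroute API. The shard number and source node are placeholders, not the values actually used.

import requests

# Assumption: local cluster endpoint; shard number and source node are
# placeholders rather than the values used during the test.
resp = requests.post('http://localhost:9200/_cluster/reroute', json={
    'commands': [{
        'move': {
            'index': 'enwiki_content',
            'shard': 0,                   # placeholder
            'from_node': 'elastic1043',   # placeholder
            'to_node': 'elastic1017',
        },
    }],
})
resp.raise_for_status()
print(resp.json()['acknowledged'])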

Extended test:

  • Going to restart 1018, 1019, 1020, 1022 and 1023 with interleaved numa and monitor for a few days. This gives us 6 machines with numa interleave, and 7 machines in the default configuration to compare. I think the per-node latencies are probably the best metric to judge the difference on.

Ori had suggested looking at the memory installed, which brought up a pertinent point with respect to memory bandwidth:

  • The older servers have 3 memory channels and 4 sticks per socket. This means the first channel has 32GB of memory, and the other two channels have 16GB each.
  • The newer servers have 4 memory channels and 4 sticks per socket for an even balance.

I don't think this is a problem in itself, but it does mean one memory channel should be roughly twice as busy as the others on these machines. We are actually seeing up to 5x the bandwidth required of this first channel, though.

The cluster just alerted on latency again. elastic1027 and 1025 have pushed server load above the # of cores and are now showing about 2x the latency of the other servers. The initial spike started at about 21:40 UTC; I was able to start taking measurements at 21:50 UTC after it alerted. Latency was still abnormally high at that point, resolving around 22:15.

Logged into 1027 and pulled 60s samples of CPU and memory usage to see if they line up with the previous conclusions.

1027 is reporting an *amazingly* low IPC. Memory bandwidth is quite lop-sided.

| time  | IPC  | socket 1 channel 1 bandwidth | total memory bandwidth |
| ----- | ---- | ---------------------------- | ---------------------- |
| 21:50 | 0.37 | 8.9GB/s                      | 13GB/s                 |
| 21:51 | 0.43 | 8.9GB/s                      | 13GB/s                 |
| 21:52 | 0.44 | 8.5GB/s                      | 12.7GB/s               |
| 21:53 | 0.43 | 8.7GB/s                      | 13GB/s                 |

Overall this seems to line up with the previous measurements taken on a busy but not overloaded machine. It's worth noting that this alert only involved old machines that have not been switched to interleaved numa, but it's still probably a bit early to call that a win.
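
A rough back-of-the-envelope check tying this to the channel-population point above; the capacity-proportional split, and treating the ~13GB/s column as the busy socket's total, are my assumptions:

# Capacity-proportional expectation on the older machines: one channel holds
# 32GB (two sticks), the other two hold 16GB each, so the big channel "should"
# carry about half the socket's traffic, i.e. ~2x each of the other channels.
capacity_gb = [32.0, 16.0, 16.0]
expected_share = capacity_gb[0] / sum(capacity_gb)            # 0.5

# Observed in the 21:50 sample: ~8.9GB/s on that channel out of ~13GB/s,
# leaving roughly 2GB/s for each of the other two channels.
first, total = 8.9, 13.0
observed_share = first / total                                # ~0.68
vs_others = first / ((total - first) / 2)                     # ~4.3x, vs ~2x expected

print(expected_share, observed_share, vs_others)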

Change 428372 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[operations/puppet@production] Enable NUMA awareness in elasticsearch JVM

https://gerrit.wikimedia.org/r/428372

Reviewing per-node latency graphs since rolling out the numactl --interleave=all approach, it looks like a success. Pretty much all of the minor latency spikes above the cluster baseline are coming from the older systems that have not had interleave enabled.

While reviewing numa interleave and the JVM I came across a JVM flag, -XX:+UseNUMA, which makes the JVM numa-aware so that memory access does not all land on a single numa node. I suggest rolling that out first to the remaining older nodes, and using that information to decide whether we should also interleave or whether numa awareness is sufficient.

Change 428372 merged by Gehel:
[operations/puppet@production] Enable NUMA awareness in elasticsearch JVM

https://gerrit.wikimedia.org/r/428372

Restarted elastic1024-1031 with -XX:+UseNUMA. Inspecting the state of the JVM memory maps, it looks like this causes the JVM to allocate three separate memory regions from the kernel instead of the single allocation it used before. This allocation is split between a shared heap that is interleaved between numa nodes, and two allocations that are each pinned to a specific node. The main benefit of numa awareness over a brute-force interleave would be better memory locality, so I took a look (again with Intel PCM):

| numa awareness | % local dram requests |
| -------------- | --------------------- |
| default        | 57%                   |
| interleave     | 50%                   |
| UseNUMA        | 82%                   |
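
For reference, a rough sketch of the kind of numa_maps inspection described above; the 4KiB page-size assumption and running this against a live JVM pid are mine, not part of the measurements:

import re
import sys
from collections import Counter, defaultdict

# Summarize how a JVM's memory is spread across numa nodes, per memory policy,
# using /proc/<pid>/numa_maps (per-node page counts appear as N0=<pages>, ...).
# Assumes 4KiB pages for the GB figure; pass the elasticsearch JVM pid.
pid = sys.argv[1]
pages_by_policy = defaultdict(Counter)

with open('/proc/%s/numa_maps' % pid) as fh:
    for line in fh:
        fields = line.split()
        policy = fields[1]  # e.g. default, interleave:0-1, bind:0
        for field in fields[2:]:
            m = re.match(r'N(\d+)=(\d+)$', field)
            if m:
                pages_by_policy[policy]['node' + m.group(1)] += int(m.group(2))

for policy, nodes in sorted(pages_by_policy.items()):
    total = sum(nodes.values())
    shares = ', '.join('%s=%.0f%%' % (node, 100.0 * count / total)
                       for node, count in sorted(nodes.items()))
    print('%-16s %7.2f GB  (%s)' % (policy, total * 4096.0 / 2**30, shares))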

This looks to have mostly resolved the latency issues, although looking at who is now poking their head up in the latency groups I noticed elastic1048-51 are showing similar problems, but to a lesser extent. It turns out these machines have 4x 32GB of RAM instead of the 8x 16GB in the other servers. This means each socket only has 2 memory channels populated instead of the 4 we would expect, although these channels run at a higher speed than those in 1017-31. Currently on 1049, which is reporting higher latency, I'm seeing 12GB/s for one numa zone (6GB/s per channel) and 700MB/s on the other. In general I think in the future we should ensure we match the # of ram sticks to the # of available channels where possible; 2x the memory bandwidth wouldn't hurt.

It's perhaps also worth considering that GC will likely behave a little differently with numa awareness enabled. At a minimum I've seen that GC will not compact across numa regions, which means those parts of the heap are working with some fraction of the heap instead of the whole thing.

Latency numbers for p95 and lower all look great, probably the most stable they've been since bringing load back to eqiad. p99 is still a bit spiky; some quick looks through the graphs suggest a correlation between high io-wait and p99, but I think investigating that will need to be prioritized separately from this ticket. This most likely looks to be resolved, although a full cluster restart should follow to bring things into a consistent state. Currently -XX:+UseNUMA has been deployed to all the machines, but only 1024-31 have been restarted in this configuration. The cluster restart will coincide with some plugin updates we have in the pipeline as well.

The cluster restart is stopped at the moment, waiting to validate the impact on Kafka of freezing writes.

debt subscribed.

Update: the primary fix was rolled out manually to the ~15 servers with the problem, so the ticket itself is resolved. The full cluster restart to bring them all into the same state is not finished yet, though. Marking this as resolved.

Mentioned in SAL (#wikimedia-operations) [2018-04-30T13:12:59Z] <gehel> restarting elasticsearch codfw rolling restart for plugin update and NUMA config - T191543 / T191236

Mentioned in SAL (#wikimedia-operations) [2018-05-02T07:36:05Z] <gehel> elasticsearch eqiad rolling restart for plugin update and NUMA config - T191543 / T191236

Mentioned in SAL (#wikimedia-operations) [2018-05-03T09:09:42Z] <gehel> rolling restart of elasticsearch completed - T191543 / T191236