
logging-hd nodes evict-rejoin troubles
Open, Needs Triage, Public

Description

Logstash in codfw has been acting up lately. The initial working theory was that the logstash-ml indexes are huge and we've hit the disk watermarks, preventing shards from being allocated to the hdd nodes. To fix this, the logstash-ml partition is now truncated after 70d, which brought us back under the watermarks.
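For context, a minimal sketch of how the allocator's disk watermark thresholds behave. The percentages are the OpenSearch defaults (cluster.routing.allocation.disk.watermark.*); the used_pct figure is a made-up example, not a measurement from these hosts:

```shell
# Sketch of the disk-based shard allocation thresholds. The percentages are
# the OpenSearch defaults; used_pct is a hypothetical usage figure for one
# hdd node, not a real reading from this cluster.
used_pct=88
low=85 high=90 flood=95

if [ "$used_pct" -ge "$flood" ]; then
  echo "flood_stage: indexes with shards on this node are forced read-only"
elif [ "$used_pct" -ge "$high" ]; then
  echo "high: shards are relocated off this node"
elif [ "$used_pct" -ge "$low" ]; then
  echo "low: no new shards are allocated to this node"
else
  echo "ok: shards may be allocated here"
fi
```

Truncating logstash-ml at 70d brings used_pct back under the low watermark, so shards can be allocated to the hdd nodes again.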

While we were cleaning up a subset of these large indexes, one or two hdd nodes would be slow to respond to the coordinating node. When too many heartbeats are missed, the coordinator evicts the node from the cluster. The unhealthy node attempts to rejoin and, still unable to respond to heartbeats in a timely manner, gets evicted repeatedly. If this evict-rejoin loop occurs on two nodes, the cluster goes red (at least one index has an unassigned primary shard) and the cluster tries to restore the expected replication state by copying shards to the remaining nodes.

The signs that a node is in this evict-rejoin loop are a cluster state change (yellow, red), a sharp drop in network utilization, and a sharp increase in socket errors (tcp/attemptfails). The coordinating node also produces logs indicating the node is being evicted and repeatedly attempting to rejoin. Restarting OpenSearch on the thrashing node(s) seems to stabilize it long enough for the cluster to either recover or for another node to start the evict-rejoin loop.
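A hedged sketch of spotting the loop from the coordinator's logs: OpenSearch cluster coordination logs node-left / node-join events, so counting those in a recent window flags a thrashing member. The log line below is a fabricated stand-in in that shape, not a capture from these hosts:

```shell
# Hypothetical sample in the shape of an OpenSearch coordination log line,
# standing in for real journal output.
sample_log='[...] node-left[{logstash-hd1}], reason: followers check retry count exceeded'

# Extract the coordination event type; a high count of these over a short
# window suggests a node stuck in the evict-rejoin loop.
echo "$sample_log" | grep -oE 'node-(left|join)'

# On the live coordinating node, something like (unit name mirrors the
# restart command in Treatment below):
#   sudo journalctl -u "opensearch_2@production_elk7_$(cat /etc/wikimedia-cluster).service" \
#     --since '1 hour ago' | grep -cE 'node-(left|join)'
```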

I suspect disk latency is contributing to hosts entering this evict-rejoin loop. Relocating a shard requires one node to send a copy of its shard data to another node over the network. Shards can relocate concurrently, and if more than one shard on a single node is being transferred in or out at the same time, the random disk I/O may be saturated, blocking other requests such as heartbeats. recovery_max_bytes_per_sec was increased to 800mb in September to take advantage of the 10Gb nics, but has been reduced to 200mb while working on this instability.
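The throttle change above can be applied through the cluster settings API. This is a sketch: the setting names are the standard OpenSearch ones, but the localhost endpoint and the concurrent-recoveries value are assumptions, not values confirmed in this task:

```shell
# Cap recovery/relocation load (assumes the REST API on localhost:9200).
# indices.recovery.max_bytes_per_sec limits per-node recovery throughput;
# cluster.routing.allocation.node_concurrent_recoveries limits how many
# shards a node transfers at once (the value 2 here is an assumption).
curl -s -X PUT 'http://localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{
    "persistent": {
      "indices.recovery.max_bytes_per_sec": "200mb",
      "cluster.routing.allocation.node_concurrent_recoveries": 2
    }
  }'
```

Limiting concurrent recoveries in addition to throughput may matter here, since the suspected failure mode is several simultaneous transfers saturating one node's random disk I/O.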

What's surprising is that the presence of a thrashing node seems to halt ingest entirely, despite the target indexes being hosted on a separate class of nodes. This is a weakness that needs more investigation.

Outstanding questions:

  • Can OpenSearch continue to ingest in a red cluster state when the shards backing the write indexes are healthy?
  • Can the evict-rejoin loop be prevented or short-circuited without manual intervention?

Treatment:

  • Identify the thrashing node via logs or metrics and run sudo systemctl restart opensearch_2@production_elk7_$(cat /etc/wikimedia-cluster).service on that host.
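After the restart, recovery progress can be watched with the _cat APIs while the cluster copies shards back to the expected replication state (the localhost endpoint is an assumption; requires a live cluster):

```shell
# Watch cluster health and active shard recoveries after restarting the
# thrashing node (assumes the REST API is reachable on localhost:9200).
watch -n 10 "
  curl -s 'http://localhost:9200/_cat/health?v'
  curl -s 'http://localhost:9200/_cat/recovery?v&active_only=true'
"
```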

Related Objects

  • Open, assigned to colewhite
  • Open, unassigned