Logstash elasticsearch cluster filled up, dropping logstash events
Closed, Resolved · Public

Description

We were dropping events for almost 5 hours because the Elasticsearch data nodes (logstash1004-6) had disks more than 85% full, which is the 'low watermark' for Elasticsearch: above it, a node stops accepting new shards.

I've increased the low watermark to 90% and deleted a bunch of indices for older data, so we are out of the woods for now. Not a long term fix, however.
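Both of those are dynamic REST calls on a 1.x cluster; roughly like the following sketch (the index date is a placeholder, not one of the indices actually deleted):

# raise the low disk watermark from the 85% default (transient; reverts on a full cluster restart)
$ curl -XPUT 'localhost:9200/_cluster/settings' -d '{
    "transient" : {
        "cluster.routing.allocation.disk.watermark.low" : "90%"
    }
}'

# delete an old daily index to reclaim disk space (placeholder date)
$ curl -XDELETE 'localhost:9200/logstash-2015.09.20'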

Event Timeline

yuvipanda raised the priority of this task to Needs Triage.
yuvipanda updated the task description.
yuvipanda added a subscriber: yuvipanda.

The low disk watermark is the point at which it will not allocate the next day's index. I'm not exactly sure what the high watermark will do in this cluster. What it normally does is try to move shards off of the machine to other machines in the cluster, but because the indices here are set up to store all indices on all nodes, there is nowhere to move the shard to. It might just drop a replica? I don't see anything in the network graphs that indicates it regularly has to restore a replica, so I doubt it.

The machines have 2.7TB of disk for elasticsearch.

95% gives 135GB free per node. This is about 35GB larger than the largest shards and should mean that regular compaction could occur. Because of the way this cluster is used, only one 100GB compaction should happen per day per node. If the one month old index is deleted before this compaction occurs, the space would always be available.

90% gives 270GB free per node. With most days under 100GB this is reasonable, but it could trigger the high watermark during daily compaction. I'm guessing the right amount for the low watermark is going to be the high watermark + 200GB + a fudge factor.
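The per-day sizes behind these numbers can be pulled from the cat API; a quick check along these lines (Elasticsearch 1.x column names) shows document count and on-disk size for every daily index:

$ curl 'localhost:9200/_cat/indices/logstash-*?v&h=index,pri,rep,docs.count,store.size,pri.store.size'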

Another option would be moving the runJobs channel out of logstash. I could be reading this wrong, but it looks like runJobs accounts for about half of the events in logstash.

I just checked the disk utilization on logstash100[4-6] and found that they all have at least 400G of free space. The largest daily index we have right now is 127G, so it looks like we have plenty of headroom for a full-copy compaction as long as we tune the watermarks appropriately.
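(The same check can be made from the cluster side; the cat allocation API lists shard count and disk usage per data node, something like:)

$ curl 'localhost:9200/_cat/allocation?v'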

If we get enough daily traffic that we are hitting 90-95% utilization frequently, I would suggest that the first tuning we do is selectively reducing the replica count from 2 to 1. Right now each index is present on each data host in the cluster. If we dropped that back so the data for a given index is present on only 2/3 of the cluster, we would still be fairly robust against individual node failure and would regain quite a bit of disk on each node.

I would suggest that this decrease in redundancy be phased in slowly as space is needed. For the first pass we could drop the replica count on days N-22 to N-30, and when we outgrew that, start dropping it sooner and sooner until we only kept a full copy for days N and N-1. That would give us quite a bit of headroom on the current hardware and should easily carry us forward until we can budget for adding nodes/disk to the cluster in the next fiscal year.
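Dropping a replica on an old index is one dynamic settings call per index, roughly like this sketch (placeholder date; Elasticsearch removes the extra copy on its own once the setting is lowered):

$ curl -XPUT 'localhost:9200/logstash-2015.09.28/_settings' -d '{
    "index" : { "number_of_replicas" : 1 }
}'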

bd808 triaged this task as Medium priority. Oct 20 2015, 3:42 PM

Happening again now on logstash1004:

$ curl -XPOST 'localhost:9200/_cluster/reroute?pretty' -d '{
    "commands" : [ {
          "allocate" : {
              "index" : "logstash-2015.11.02", "shard" : 0, "node" : "logstash1004"
          }
        }
    ]
}'
{
  "error" : "RemoteTransportException[[logstash1005][inet[/10.64.16.185:9300]][cluster:admin/reroute]]; nested: ElasticsearchIllegalArgumentException[[allocate] allocation of [logstash-2015.11.02][0] on node [logstash1004][dzZrNJHNTvGUwKLIPEialA][logstash1004][inet[/10.64.0.162:9300]]{master=true} is not allowed, reason: [YES(shard is not allocated to same node or host)][YES(node passes include/exclude/require filters)][YES(primary is already active)][YES(below shard recovery limit of [4])][YES(allocation disabling is ignored)][YES(allocation disabling is ignored)][YES(no allocation awareness enabled)][YES(total shard limit disabled: [-1] <= 0)][YES(target node version [1.7.1] is same or newer than source node version [1.7.1])][NO(more than allowed [90.0%] used disk on node, free: [8.639487251588491%])][YES(shard not primary or relocation disabled)]]; ",
  "status" : 400
}

Our logstash-2015.10.14 index is massively oversized compared to the others due to some very verbose logging that was enabled for that day, which may be contributing to this.

I did two things to recover from this:

  1. Deleted the logstash-2015.10.03 index about 8 hours earlier than it normally would have been deleted by a cron job, to free disk space
  2. Reduced the replica count from 2 to 1 for logstash-2015.10.04 through logstash-2015.10.12

Before:

logstash1004:~ bd808$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev             10M     0   10M   0% /dev
tmpfs            26G  819M   25G   4% /run
/dev/md0        235G  2.3G  221G   2% /
tmpfs            63G     0   63G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs            63G     0   63G   0% /sys/fs/cgroup
/dev/md1        2.8T  2.5T  240G  92% /var/lib/elasticsearch

After:

logstash1004:~ bd808$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev             10M     0   10M   0% /dev
tmpfs            26G  827M   25G   4% /run
/dev/md0        235G  2.3G  221G   2% /
tmpfs            63G     0   63G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs            63G     0   63G   0% /sys/fs/cgroup
/dev/md1        2.8T  2.3T  478G  83% /var/lib/elasticsearch

Utilization went down from 92% to 83% (~238G freed). I have created T117438 and an associated Gerrit patch to add a daily cron job that drops the replica count of the NOW-21days index just after GMT rollover each day.
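The job amounts to something like this sketch (GNU date assumed; not the actual script from the patch):

# run shortly after 00:00 GMT each day
$ INDEX="logstash-$(date -u -d '21 days ago' +%Y.%m.%d)"
$ curl -XPUT "localhost:9200/${INDEX}/_settings" -d '{"index": {"number_of_replicas": 1}}'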

Sent an email to ops-l and engineering-l asking if anyone would miss runJobs info level logs (job X ended with status Y). This seems to be 50% of our current log volume and also very low signal in the current state since the detailed job information is not structured in any way that is useful for searching/filtering.

Consensus from the email thread is that the runJobs completion events are mostly unused in the ELK stack and having the verbose logs on fluorine is sufficient.

Change 251317 had a related patch set uploaded (by BryanDavis):
logstash: Exclude runJobs info events from logstash

https://gerrit.wikimedia.org/r/251317

Change 251317 merged by jenkins-bot:
logstash: Exclude runJobs info events from logstash

https://gerrit.wikimedia.org/r/251317

At 2015-11-06T00:34Z the config change to exclude runJobs log events with severity lower than warning was deployed to the production cluster. Prior to the config change, we were recording 3M-4M runJobs events per hour in Logstash. After the change we are recording 100-200 runJobs events per hour. Our log event volume is down nearly 66% from the prior nominal rate. Over the next 30 days we should see a fairly dramatic rise in disk space available on the ELK cluster.
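The drop can be spot-checked with a count query against the current daily index; the channel and @timestamp field names here are assumptions about our log schema:

# count runJobs events recorded during one hour of the current day
$ curl 'localhost:9200/logstash-2015.11.06/_count?pretty' -d '{
    "query" : {
        "bool" : {
            "must" : [
                { "match" : { "channel" : "runJobs" } },
                { "range" : { "@timestamp" : { "gte" : "2015-11-06T01:00:00Z", "lt" : "2015-11-06T02:00:00Z" } } }
            ]
        }
    }
}'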

Screen Shot 2015-11-06 at 10.01.26.png (45 KB)