
production-logstash elastic cluster is in yellow state
Closed, Resolved · Public

Description

I noticed some alerts in icinga today about shards not being allocated on some logstash100x nodes; this is the status:

elukey@logstash1008:~$ curl -s http://localhost:9200/_cluster/health?pretty
{
  "cluster_name" : "production-logstash-eqiad",
  
  "status" : "yellow",    <======================================

  "timed_out" : false,
  "number_of_nodes" : 6,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 483,
  "active_shards" : 1103,
  "relocating_shards" : 0,
  "initializing_shards" : 0,

  "unassigned_shards" : 18,      <================================

  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 98.39429081177519
}

Some shards are not assigned:

elukey@logstash1008:~$ curl -s 'http://localhost:9200/_cat/shards' | grep -v STARTED | sort
logstash-2020.04.13           0 r UNASSIGNED
logstash-2020.04.14           0 r UNASSIGNED
logstash-2020.04.14           0 r UNASSIGNED
logstash-deploy-2020.04.12    0 r UNASSIGNED
logstash-deploy-2020.04.13    0 r UNASSIGNED
logstash-deploy-2020.04.14    0 r UNASSIGNED
logstash-deploy-2020.04.14    0 r UNASSIGNED
logstash-mediawiki-2020.04.12 0 r UNASSIGNED
logstash-mediawiki-2020.04.13 0 r UNASSIGNED
logstash-mediawiki-2020.04.14 0 r UNASSIGNED
logstash-mediawiki-2020.04.14 0 r UNASSIGNED
logstash-restbase-2020.04.13  0 r UNASSIGNED
logstash-restbase-2020.04.14  0 r UNASSIGNED
logstash-restbase-2020.04.14  0 r UNASSIGNED
logstash-syslog-2020.04.12    0 r UNASSIGNED
logstash-syslog-2020.04.13    0 r UNASSIGNED
logstash-syslog-2020.04.14    0 r UNASSIGNED
logstash-syslog-2020.04.14    0 r UNASSIGNED

With _cluster/allocation/explain?pretty I can see the following two reasons (a per-shard query sketch follows the list):

  • "explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[logstash-2020.04.14][0]
  • "explanation" : "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=0.91], using more disk space than the maximum allowed [91.0%], actual free: [8.086810601726357%]"

This is the cluster allocation status:

elukey@logstash1008:~$ curl -s -k 'http://localhost:9200/_cat/allocation?v'
shards disk.indices disk.used disk.avail disk.total disk.percent host         ip           node
   377        7.5tb    13.2tb      1.1tb     14.3tb           92 10.64.0.181  10.64.0.181  logstash1010-production-logstash-eqiad
   358        6.5tb    12.2tb        2tb     14.3tb           85 10.64.48.177 10.64.48.177 logstash1012-production-logstash-eqiad
   368        7.4tb    13.2tb      1.1tb     14.3tb           91 10.64.16.30  10.64.16.30  logstash1011-production-logstash-eqiad
    18                                                                                     UNASSIGNED

From _cluster/settings this is the current config:

"disk" : {
  "watermark" : {
    "low" : "0.91",
    "high" : "0.93"

Options

  • drop replica count to 0 for "old" indices (e.g. older than 30d) thus keeping only one copy (per site)
  • drop old indices completely (e.g. older than 80d)
  • trim retention of Kafka topics: the logstash hosts also host kafka-logging
  • bump elasticsearch watermarks

Event Timeline

elukey triaged this task as High priority. Apr 14 2020, 6:40 AM
elukey created this task.
Restricted Application added a subscriber: Aklapper. Apr 14 2020, 6:40 AM
elukey updated the task description. Apr 14 2020, 6:55 AM

Is there any form of index cleanup? It seems that we have been indexing since January; maybe we could clean up some old indices to regain space?

Thanks @elukey for looking into this! Yes, there's elasticsearch curator cleaning up old indices with a 90d retention (e.g. on logstash1008). It seems that with the daily growth we're indeed short on space on the logstash 5 cluster at the moment.

Options I think are viable for now:

  1. drop replica count to 0 for "old" indices (e.g. older than 30d) thus keeping only one copy (per site)
  2. drop old indices completely (e.g. older than 80d)
  3. trim retention of Kafka topics: the logstash hosts also host kafka-logging

+1 for option 1 as a non-destructive first step to regain space
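
For the record, outside of the curator/puppet machinery this is a one-liner per index; a minimal sketch against a hypothetical old index name (this is essentially what the curator "replicas" action does for indices matching its age filter):

$ curl -s -XPUT 'http://localhost:9200/logstash-2020.01.15/_settings' -d '
{ "index": { "number_of_replicas": 0 } }'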

Change 588740 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] logstash: reduce replica count to 0 after 80 days

https://gerrit.wikimedia.org/r/588740

Change 588740 merged by Herron:
[operations/puppet@production] logstash: reduce replica count to 0 after 80 days

https://gerrit.wikimedia.org/r/588740

Question to understand the next steps: should https://gerrit.wikimedia.org/r/588740 reduce the space used, or does it need a manual clean up? I still see the cluster in a yellow state :(

The change did reduce space, but not enough it seems! I'll keep digging.

Some shards did get assigned thanks to the increased available space (see below), although not all. For now, to get the cluster to allocate all shards, I'm thinking of pushing the zero-replica threshold from 80 to 50 days (i.e. one copy per site).

$ curl -s 'http://localhost:9200/_cat/shards' | grep -v STARTED | sort
logstash-2020.04.14           0 r UNASSIGNED                                
logstash-2020.04.15           0 r UNASSIGNED                                
logstash-deploy-2020.04.12    0 r UNASSIGNED                                
logstash-deploy-2020.04.14    0 r UNASSIGNED                                
logstash-deploy-2020.04.15    0 r UNASSIGNED                                
logstash-mediawiki-2020.04.14 0 r UNASSIGNED                                
logstash-mediawiki-2020.04.15 0 r UNASSIGNED                                
logstash-restbase-2020.04.14  0 r UNASSIGNED                                
logstash-restbase-2020.04.15  0 r UNASSIGNED                                
logstash-syslog-2020.04.12    0 r UNASSIGNED                                
logstash-syslog-2020.04.14    0 r UNASSIGNED                                
logstash-syslog-2020.04.15    0 r UNASSIGNED

Change 588958 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] logstash: temporarily keep one copy after 50 days

https://gerrit.wikimedia.org/r/588958

Change 588958 merged by Filippo Giunchedi:
[operations/puppet@production] logstash: temporarily keep one copy after 50 days

https://gerrit.wikimedia.org/r/588958

Mentioned in SAL (#wikimedia-operations) [2020-04-15T09:45:22Z] <godog> force-run curator from logstash1008 - T250133
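
(For reference, the force-run above boils down to invoking curator by hand with its config and action file; the paths below are assumptions, not necessarily the ones used on logstash1008:)

$ sudo curator --config /etc/curator/config.yaml /etc/curator/actions.yaml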

fgiunchedi updated the task description. Apr 15 2020, 9:48 AM
fgiunchedi added a comment (edited). Apr 15 2020, 1:40 PM

Curator is still running; as of 2020-04-15 12:13:38,172 it is at: INFO forceMerging index logstash-mediawiki-2020.04.13 to 1 segments per shard. Please wait...

At any rate, the latest change helped a bit, but not on logstash1010 (yet?). Next up IMHO is to trim kafka retention a little, and/or reduce elasticsearch retention:

===== NODE GROUP =====
(1) logstash1011.eqiad.wmnet
----- OUTPUT of 'df -h /srv' -----                                                                        
Filesystem      Size  Used Avail Use% Mounted on                                                          
/dev/md1         15T   13T  2.1T  86% /srv                                                                
===== NODE GROUP =====                                                                                    
(1) logstash1012.eqiad.wmnet                                                                              
----- OUTPUT of 'df -h /srv' -----                                                                        
Filesystem      Size  Used Avail Use% Mounted on                                                          
/dev/md1         15T   11T  4.1T  72% /srv  
===== NODE GROUP =====
(1) logstash1010.eqiad.wmnet
----- OUTPUT of 'df -h /srv' -----
Filesystem      Size  Used Avail Use% Mounted on
/dev/md1         15T   14T  1.2T  92% /srv

Change 589043 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] logstash: reduce kafka-logging log retention to 5 days

https://gerrit.wikimedia.org/r/589043

Change 589043 merged by Herron:
[operations/puppet@production] logstash: reduce kafka-logging log retention to 5 days

https://gerrit.wikimedia.org/r/589043

elukey added a comment (edited). Apr 16 2020, 5:50 AM

Cluster still yellow, 15 unassigned shards :(

It seems that 1010 holds way more shards than the other two:

elukey@logstash1008:~$ curl -s 'http://localhost:9200/_cat/shards' | awk '{print $8}' | sort | uniq -c
     15
    367 logstash1010-production-logstash-eqiad
    320 logstash1011-production-logstash-eqiad
    214 logstash1012-production-logstash-eqiad

The primaries/replicas breakdown is imbalanced too:

168 p logstash1010-production-logstash-eqiad
239 p logstash1011-production-logstash-eqiad
 76 p logstash1012-production-logstash-eqiad
 15 r
199 r logstash1010-production-logstash-eqiad
 81 r logstash1011-production-logstash-eqiad
138 r logstash1012-production-logstash-eqiad

Maybe it is just a matter of rebalancing the shard allocation? I have no idea how/if it can be done, but if so it should solve the problem quickly. logstash1012 should be able to take most of the load of 1010, in theory.
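
(Manual rebalancing is possible via the reroute API, should we want to go that way; a minimal sketch moving a single shard, with index and node names as placeholders:)

$ curl -s -XPOST 'http://localhost:9200/_cluster/reroute' -d '
{
  "commands": [
    { "move": {
        "index": "logstash-2020.04.16", "shard": 0,
        "from_node": "logstash1010-production-logstash-eqiad",
        "to_node": "logstash1012-production-logstash-eqiad" } }
  ]
}'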

Or another thought: maybe the cluster does balance itself, but it is currently not doing so because the watermarks are being hit. What if we lifted the low/high disk-usage thresholds temporarily, to allow the 15 shards to be allocated, and then saw how the cluster behaves?
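
(Lifting the thresholds temporarily would look something like the following, using a transient setting so it resets on a full cluster restart; the values are just an example, not a recommendation:)

$ curl -s -XPUT 'http://localhost:9200/_cluster/settings' -d '
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "0.93",
    "cluster.routing.allocation.disk.watermark.high": "0.95"
  }
}'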

The explain API currently reports the following:

  • logstash1012-production-logstash-eqiad: the shard cannot be allocated to the same node on which a copy of the shard already exists [[logstash-2020.04.16][0]
  • logstash1011-production-logstash-eqiad: the shard cannot be allocated to the same node on which a copy of the shard already exists [[logstash-2020.04.16][0]
  • logstash1010-production-logstash-eqiad: the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=0.91]

I don't think it is just a matter of rebalancing the shards: all three hosts have the same specs and will eventually fill up again until they can't allocate shards.

I'm quite sure the culprit is that kafka retention doesn't seem to be working at all (neither with the old settings nor with the new ones we changed yesterday), for example:

logstash1010:# ls -la /srv/kafka/data/udp_localhost-info-0/00000000005357257013.log
-rw-r--r-- 1 kafka kafka 1073741163 Jan  1 05:10 /srv/kafka/data/udp_localhost-info-0/00000000005357257013.log

fgiunchedi added a comment (edited). Apr 16 2020, 9:08 AM

And indeed the last mention of e.g. udp_localhost-info in the log-cleaner logs is from Feb 28th 2019 (!!!)

root@cumin1001:~# cumin 'O:logstash::elasticsearch and logstash1*' 'grep localhost-info /var/log/kafka/log-cleaner.log' 
9 hosts will be targeted:
logstash[1007-1012,1023-1025].eqiad.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====                                                                                                                                                                                               
(1) logstash1010.eqiad.wmnet                                                                                                                                                                                         
----- OUTPUT of 'grep localhost-i.../log-cleaner.log' -----                                                                                                                                                          
[2019-02-27 20:42:44,829] INFO The cleaning for partition udp_localhost-info-1 is aborted and paused (kafka.log.LogCleaner)                                                                                          
[2019-02-27 20:42:44,864] INFO Compaction for partition udp_localhost-info-1 is resumed (kafka.log.LogCleaner)                                                                                                       
[2019-02-27 20:42:44,952] INFO The cleaning for partition udp_localhost-info-0 is aborted and paused (kafka.log.LogCleaner)
[2019-02-27 20:42:45,028] INFO Compaction for partition udp_localhost-info-0 is resumed (kafka.log.LogCleaner)
[2019-02-27 20:42:45,328] INFO The cleaning for partition udp_localhost-info-2 is aborted and paused (kafka.log.LogCleaner)
[2019-02-27 20:42:45,354] INFO Compaction for partition udp_localhost-info-2 is resumed (kafka.log.LogCleaner)
===== NODE GROUP =====                                                                                                                                                                                               
(1) logstash1011.eqiad.wmnet                                                                                                                                                                                         
----- OUTPUT of 'grep localhost-i.../log-cleaner.log' -----                                                                                                                                                          
[2019-02-28 16:35:53,893] INFO The cleaning for partition udp_localhost-info-0 is aborted and paused (kafka.log.LogCleaner)                                                                                          
[2019-02-28 16:35:53,956] INFO Compaction for partition udp_localhost-info-0 is resumed (kafka.log.LogCleaner)                                                                                                       
[2019-02-28 16:35:54,070] INFO The cleaning for partition udp_localhost-info-1 is aborted and paused (kafka.log.LogCleaner)
[2019-02-28 16:35:54,103] INFO Compaction for partition udp_localhost-info-1 is resumed (kafka.log.LogCleaner)
[2019-02-28 16:35:55,183] INFO The cleaning for partition udp_localhost-info-2 is aborted and paused (kafka.log.LogCleaner)
[2019-02-28 16:35:55,196] INFO Compaction for partition udp_localhost-info-2 is resumed (kafka.log.LogCleaner)
===== NODE GROUP =====                                                                                                                                                                                               
(6) logstash[1007-1009,1023-1025].eqiad.wmnet                                                                                                                                                                        
----- OUTPUT of 'grep localhost-i.../log-cleaner.log' -----                                                                                                                                                          
grep: /var/log/kafka/log-cleaner.log: No such file or directory                                                                                                                                                      
===== NODE GROUP =====                                                                                                                                                                                               
(1) logstash1012.eqiad.wmnet                                                                                                                                                                                         
----- OUTPUT of 'grep localhost-i.../log-cleaner.log' -----                                                                                                                                                          
[2019-02-28 19:00:42,434] INFO The cleaning for partition udp_localhost-info-2 is aborted and paused (kafka.log.LogCleaner)                                                                                          
[2019-02-28 19:00:42,462] INFO Compaction for partition udp_localhost-info-2 is resumed (kafka.log.LogCleaner)                                                                                                       
[2019-02-28 19:00:42,667] INFO The cleaning for partition udp_localhost-info-1 is aborted and paused (kafka.log.LogCleaner)
[2019-02-28 19:00:42,695] INFO Compaction for partition udp_localhost-info-1 is resumed (kafka.log.LogCleaner)
[2019-02-28 19:00:43,205] INFO The cleaning for partition udp_localhost-info-0 is aborted and paused (kafka.log.LogCleaner)
[2019-02-28 19:00:43,261] INFO Compaction for partition udp_localhost-info-0 is resumed (kafka.log.LogCleaner)
[2019-02-28 19:01:21,828] INFO The cleaning for partition udp_localhost-info-1 is aborted and paused (kafka.log.LogCleaner)
[2019-02-28 19:01:21,860] INFO Compaction for partition udp_localhost-info-1 is resumed (kafka.log.LogCleaner)

So it appears to me that kafka's log cleaner process isn't going over any topic other than the consumer offsets. The root cause and/or a fix are still unclear to me, although the timeline coincides with T213898: Replace and expand Elasticsearch/Kafka storage in eqiad and upgrade the cluster from Debian jessie to stretch.

Mentioned in SAL (#wikimedia-operations) [2020-04-16T16:45:15Z] <herron> kafka-logging eqiad set retention.bytes=500000000000 on topic udp_localhost-info T250133
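
(For reference, a per-topic override like the one in the SAL entry above can be applied with the stock Kafka tooling; a sketch, with the ZooKeeper connection string as a placeholder:)

$ kafka-configs.sh --zookeeper ZK_PLACEHOLDER:2181 \
    --alter --entity-type topics --entity-name udp_localhost-info \
    --add-config retention.bytes=500000000000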

Progress!

Disk utilization on logstash1010 before adjusting udp_localhost-info retention.bytes:
/dev/md1 15T 14T 1.2T 92% /srv

Disk utilization on logstash1010 after adjusting udp_localhost-info retention.bytes:
/dev/md1 15T 12T 2.5T 83% /srv

Mentioned in SAL (#wikimedia-operations) [2020-04-16T16:51:23Z] <herron> kafka-logging eqiad set retention.bytes=500000000000 on topic udp_localhost-warning T250133

herron added a subscriber: Ottomata. Edited Apr 16 2020, 6:25 PM

Reducing kafka retention.bytes to 500G on udp_localhost-info and udp_localhost-warning did the trick in terms of freeing up space, and the eqiad ELK5 ES cluster is green once again.

The cause of the log cleanup not happening on some topics looks to be a puzzling case of timestamps ~1y in the future in the affected topics, which @Ottomata tracked down.
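
(One way to eyeball per-record timestamps in a suspect segment is the stock dump tool; a sketch, reusing the segment path from the earlier listing:)

$ kafka-run-class.sh kafka.tools.DumpLogSegments \
    --files /srv/kafka/data/udp_localhost-info-0/00000000005357257013.log \
    --print-data-log | head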

A couple things we can look into further to help prevent this from happening again:

  • Set a log.message.timestamp.difference.max.ms value to reject logs with significant timestamp skew (andrew has an open Q to the kafka user mailing list re: what happens if this is a negative value as would be expected with future timestamps)
  • Adjust the rsyslog udp localhost compat template to use timestampgenerated instead of timestampreported when outputting the timestamp field to kafka.
  • Set a sensible topic retention size across the board (it's set to 500G only for the problem topics currently)
  • Other ideas?

Set a log.message.timestamp.difference.max.ms value

Setting this to the same value as log.retention.ms might make sense.
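
A sketch of what that could look like in the broker config (values are hypothetical, here one week in ms; the per-topic equivalent is message.timestamp.difference.max.ms):

log.retention.ms=604800000
log.message.timestamp.difference.max.ms=604800000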

It looks like this has been considered before:
https://issues.apache.org/jira/browse/KAFKA-4340

But it was reverted due to issues with log compaction (which we don't use):
https://issues.apache.org/jira/browse/KAFKA-5344

ES cluster green! Great work everybody :)

fgiunchedi moved this task from Inbox to In progress on the observability board. Apr 27 2020, 12:22 PM

Set a log.message.timestamp.difference.max.ms value to reject logs with significant timestamp skew (andrew has an open Q to the kafka user mailing list re: what happens if this is a negative value as would be expected with future timestamps)

Got an answer:

If the difference is larger than log.message.timestamp.difference.max.ms the write will be rejected. This timestamp difference works both ways.

herron closed this task as Resolved. Jun 8 2020, 3:01 PM
herron claimed this task.