
logstash - cron failing to optimize indices
Closed, ResolvedPublic

Description

I see some logs from crons on the logstash servers indicating that index optimization is failing (see example below).

A correction was already made after the Elasticsearch 2.3 upgrade.

Looking at the crons on logstash1001, I see multiple crons that look very similar. This makes me think the cron was renamed in Puppet without the previous cron being cleaned up.

gehel@logstash1001:~$ sudo crontab -l | grep optimize
# Puppet Name: apifeatureusage_optimize_index
40 1 * * * /usr/local/bin/logstash_optimize_index.sh http://10.2.2.30:9200 "apifeatureusage-$(date -d '-1day' +\%Y.\%m.\%d)"
# Puppet Name: logstash_optimize_index
10 1 * * * /usr/local/bin/logstash_optimize_index.sh http://127.0.0.1:9200
# Puppet Name: logstash_optimize_index_apifeatureusage
40 1 * * * /usr/local/bin/logstash_optimize_index.sh 10.2.2.30:9200 "apifeatureusage-$(date -d '-1days' +\%Y.\%m.\%d)"
# Puppet Name: logstash_optimize_index_logstash
20 1 * * * /usr/local/bin/logstash_optimize_index.sh 127.0.0.1:9200 "logstash-$(date -d '-1days' +\%Y.\%m.\%d)"

Those crons seem to be cluster-wide (index optimization is done for the whole cluster, not per node), yet they are deployed on all nodes. This should not cause any issue (multiple calls to optimize will block, waiting for the previous one to complete), but it is probably not necessary.
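
For context, the script boils down to a couple of curl calls against the Elasticsearch REST API. The following is only a minimal sketch, not the actual logstash_optimize_index.sh from puppet: the endpoints are the standard Elasticsearch 2.x synced-flush and optimize APIs, and everything else (argument handling, defaults) is an assumption based on the cron entries and error messages in this task:

#!/bin/bash
# Hedged sketch of an index optimize script for Elasticsearch 2.x (not the puppet original).
# $1: cluster URL (e.g. http://127.0.0.1:9200)
# $2: index name, defaulting to yesterday's logstash index (assumption).
ES="$1"
INDEX="${2:-logstash-$(date -d '-1day' +%Y.%m.%d)}"

# Flush and mark shard copies as synced so replicas recover quickly after restarts.
curl -s -f -XPOST "${ES}/${INDEX}/_flush/synced" > /dev/null \
  || echo "Failed to force a synced flush for ${ES}/${INDEX}"

# Merge the (now read-only) daily index down to a single segment per shard.
curl -s -f -XPOST "${ES}/${INDEX}/_optimize?max_num_segments=1" > /dev/null \
  || echo "Failed to optimize ${ES}/${INDEX}"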

Extracts of emails received from cron:

Subject: Cron <root@logstash1001> /usr/local/bin/logstash_optimize_index.sh 127.0.0.1:9200 "logstash-$(date -d '-1days' +%Y.%m.%d)"
To: root@logstash1001.eqiad.wmnet


Failed to optimize 127.0.0.1:9200/logstash-2016.07.19

{
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "failed" : 0
  },
  "indices" : {
    "logstash-2016.07.19" : {
      "shards" : {
        "0" : [ {
          "routing" : {
            "state" : "STARTED",
            "primary" : true,
            "node" : "4UXNV6w_TPySDt73KkT-mQ"
          },
          "num_committed_segments" : 44,
          "num_search_segments" : 44,
          "segments" : {
[...]
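
The _shards summary above reports success, yet shard 0 still shows 44 committed segments, which is presumably why the script flags the optimize as failed (how exactly it decides is in the script itself; this is an assumption). The segment count can also be checked by hand, for example:

# Count segments per shard copy for the index that failed to optimize.
curl -s 'http://127.0.0.1:9200/_cat/segments/logstash-2016.07.19?v&h=index,shard,prirep,segment,docs.count,size'
# After a successful optimize with max_num_segments=1, each shard copy should list a single segment.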
From: Cron Daemon <root@logstash1002.eqiad.wmnet>
Date: Thu, Jul 21, 2016 at 9:58 AM
Subject: Cron <root@logstash1002> /usr/local/bin/logstash_optimize_index.sh http://127.0.0.1:9200
To: root@logstash1002.eqiad.wmnet


Failed to force a synced flush for http://127.0.0.1:9200/logstash-2016.07.20

HTTP/1.1 409 Conflict
Content-Type: application/json; charset=UTF-8
Content-Length: 379

{"_shards":{"total":3,"successful":2,"failed":1},"logstash-2016.07.20":{"total":3,"successful":2,"failed":1,"failures":[{"shard":0,"reason":"no commit id from pre-sync flush","routing":{"state":"STARTED","primary":false,"node":"lYF1WLrRRgmRjTGUViqzZA","relocating_node":null,"shard":0,"index":"logstash-2016.07.20","version":4,"allocation_id":{"id":"27FdfhXBR42ftwXUI0NuKQ"}}}]}}

Event Timeline

Restricted Application added a subscriber: Aklapper.
Gehel triaged this task as Medium priority.Jul 21 2016, 8:54 AM

Having the crons on all role::logstash nodes was intentional because, as you say, multiple invocations of the script are idempotent. There is no hard guarantee that any given node will be up at any given time, so this provides redundancy.

There does, however, seem to be a general problem with old cron entries that were never cleaned up.

OK, redundancy makes sense. I can delete all Puppet-managed crons and re-run Puppet, which should clean up the old leftovers.
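
For the record, a sketch of that cleanup, assuming the root crontab on these hosts only contains Puppet-managed entries (as in the listing above); any unmanaged entries would need to be preserved instead:

# Drop the root crontab entirely, then let Puppet recreate only the crons
# that are still declared in the current manifests.
sudo crontab -r -u root
sudo puppet agent --test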

Mentioned in SAL [2016-07-21T17:07:20Z] <gehel> cleaning leftover crons on logstash* servers - T140973

Old crons have been cleaned up. Let's wait a bit and see whether any other errors show up before closing this.

Gehel claimed this task.

No further errors seen in logs.