
scap on beta fails canary check: KeyError: 'aggregations'
Closed, Resolved · Public

Description

Spotted at https://integration.wikimedia.org/ci/job/beta-scap-eqiad/

09:07:29 09:07:29 Executing check 'Check endpoints for deployment-mediawiki-07.deployment-prep.eqiad.wmflabs'
09:07:30 09:07:30 Check 'Check endpoints for deployment-mediawiki-07.deployment-prep.eqiad.wmflabs' failed: /wiki/{title} (Main Page) is CRITICAL: Test Main Page returned the unexpected status 302 (expecting: 200); /wiki/{title} (Special Version) is CRITICAL: Test Special Version returned the unexpected status 302 (expecting: 200)
09:07:30 
09:07:30 09:07:30 Finished Canary Endpoint Check Complete (duration: 00m 00s)
09:07:30 09:07:30 Waiting for canary traffic...
09:07:49 09:07:49 Executing check 'Logstash Error rate for deployment-mediawiki-07.deployment-prep.eqiad.wmflabs'
09:07:49 09:07:49 Check 'Logstash Error rate for deployment-mediawiki-07.deployment-prep.eqiad.wmflabs' failed: Traceback (most recent call last):
09:07:49   File "/usr/local/bin/logstash_checker.py", line 332, in <module>
09:07:49     main()
09:07:49   File "/usr/local/bin/logstash_checker.py", line 328, in main
09:07:49     sys.exit(checker.run())
09:07:49   File "/usr/local/bin/logstash_checker.py", line 233, in run
09:07:49     entries = r['aggregations']['2']['buckets']
09:07:49 KeyError: 'aggregations'
09:07:49 
09:07:49 09:07:49 Canary error check failed for 1 canaries, less than threshold to halt deployment (2/1), see https://logstash-beta.wmflabs.org/goto/ff3530979bc5a54b9779ac9cbd4fc819 for details. Continuing...
09:07:49 09:07:49 Finished sync-check-canaries (duration: 00m 42s)

https://integration.wikimedia.org/ci/job/beta-scap-eqiad/313583/console

The script lives in Puppet (modules/service/files/logstash_checker.py) and has not been touched in a few months.

Reproduction

With verbose mode (-v):

ssh deployment-deploy01.deployment-prep.eqiad.wmflabs  \
  /usr/local/bin/logstash_checker.py \
    -v --service-name mwdeploy --host deployment-mediawiki-07.deployment-prep.eqiad.wmflabs --logstash-host deployment-logstash03.deployment-prep.eqiad.wmflabs:9200 \
   --fail-threshold 10.0 --delay 5
DEBUG: logstash response {u'status': 400, u'error': {u'root_cause': [{u'reason': u'Trying to query 1247 shards, which is over the limit of 1000. This limit exists because querying many shards at the same time can make the job of the coordinating node very CPU and/or memory intensive. It is usually a better idea to have a smaller number of larger shards. Update [action.search.shard_count.limit] to a greater value if you really want to query that many shards at the same time.', u'type': u'illegal_argument_exception'}], u'type': u'illegal_argument_exception', u'reason': u'Trying to query 1247 shards, which is over the limit of 1000. This limit exists because querying many shards at the same time can make the job of the coordinating node very CPU and/or memory intensive. It is usually a better idea to have a smaller number of larger shards. Update [action.search.shard_count.limit] to a greater value if you really want to query that many shards at the same time.'}}

Event Timeline

LarsWirzenius edited projects, added Puppet; removed Scap.

This doesn't actually look like it's in Scap: the script is not part of Scap, and the place where it crashes is using JSON data it gets from logstash. As such, I think this needs to be fixed in logstash or in the logstash_checker.py script. I don't think there's anything that Scap can do to fix this. If I'm wrong, please explain, thanks.

hashar updated the task description.

I have reproduced it with:

ssh deployment-deploy01.deployment-prep.eqiad.wmflabs \
  /usr/local/bin/logstash_checker.py \
    -v --service-name mwdeploy --host deployment-mediawiki-07.deployment-prep.eqiad.wmflabs --logstash-host deployment-logstash03.deployment-prep.eqiad.wmflabs:9200 \
    --fail-threshold 10.0 --delay 5

Logstash yields an error:

status: 400
type: illegal_argument_exception

Trying to query 1247 shards, which is over the limit of 1000.
This limit exists because querying many shards at the same time can make the job of the coordinating node very CPU and/or memory intensive. It is usually a better idea to have a smaller number of larger shards.
Update [action.search.shard_count.limit] to a greater value if you really want to query that many shards at the same time.

The checker does not handle the error at all.
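
As an aside, the error message itself points at a possible workaround: raising action.search.shard_count.limit. If we wanted to go that route, a transient cluster-settings update along these lines should do it (untested sketch, and cleaning up old indices is probably the better fix):

# untested sketch: 2000 is an arbitrary value above the current 1247 shards
curl -XPUT 'http://deployment-logstash03.deployment-prep.eqiad.wmflabs:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{"transient": {"action.search.shard_count.limit": 2000}}'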

The query that triggers the shard_count limit error is:

{
    "size": 0,
    "aggs": {
        "2": {
            "date_histogram": {
                "interval": "10s",
                "field": "@timestamp"
            }
        }
    },
    "query": {
        "bool": {
            "filter": [
                {
                    "range": {
                        "@timestamp": {
                            "lte": "now",
                            "gte": "now-60m"
                        }
                    }
                },
                {
                    "query_string": {
                        "query": "host:deployment-mediawiki-07.deployment-prep.eqiad.wmflabs AND type:mediawiki AND channel:(exception OR error)"
                    }
                }
            ],
            "must_not": [
                {
                    "terms": {
                        "level": [
                            "DEBUG"
                        ]
                    }
                }
            ]
        }
    }
}

Note that I do not know what an ElasticSearch shard is or whether the query ends up hitting 1247 of them.
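
For what it's worth, a rough way to check is to ask Elasticsearch directly how many shards it has (run on the logstash host, and assuming the checker fans out over all logstash-* indices):

# total shards in the cluster
curl -s 'http://127.0.0.1:9200/_cat/shards' | wc -l
# shards behind the logstash-* indices only
curl -s 'http://127.0.0.1:9200/_cat/shards/logstash-*' | wc -l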

The whole system is a bit broken. There is a task T238707 to migrate from deployment-logstash02 (with almost empty /srv) to deployment-logstash03.

deployment-logstash03 has a full /srv partition due to a large number of indices (could that be related to the shard count limit?):

find /srv/elasticsearch/labs-logstash-eqiad/nodes/0 -name index|wc -l
1254

So, assuming each of those index directories corresponds to at least one shard, cleaning up the old indices there should make the query work again. I have no idea why the old indices are not dropped automatically, nor how to do it manually.
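
For the manual part, deleting the old daily indices over the REST API should work; a sketch (the example index name/date is made up, one would list first and double check):

# list logstash indices, oldest (by name) first
curl -s 'http://127.0.0.1:9200/_cat/indices/logstash-*?s=index' | head
# delete a given old index (example name, adjust the date)
curl -XDELETE 'http://127.0.0.1:9200/logstash-2019.09.16'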

Taking a quick look at deployment-logstash03, there is probably a full year of logs on this instance. I believe production uses curator to clean up indices older than 90 days. The curator files appear to be installed on deployment-logstash03; I'm not really sure what is invoking curator (logstash?), but it writes daily logs into /var/log/elasticsearch, so something is running it. Best guess is that the problem is somewhere in there.
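
To find out what is actually invoking curator, checking cron and its logs should be enough, e.g.:

# look for curator in the system crontabs
sudo grep -r curator /etc/crontab /etc/cron.d /var/spool/cron 2>/dev/null
# and see what it has been logging
ls -l /var/log/elasticsearch/curator*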

  • deleted logstash indices dated prior to May 2020. /srv now has ~50% available
  • closed/reopened red indices (roughly the commands sketched below). elasticsearch crashed (AlreadyClosedException). On coming back up all indices were green/yellow.
  • looks like logs are starting to flow back into deployment-logstash03
  • with the shard count reduced below 1k and logs now being ingested, logstash_checker.py now receives expected responses from elasticsearch
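
For the record, the red-index handling above amounts to roughly the following (approximate commands, not a verbatim transcript):

# find indices that are red
curl -s 'http://127.0.0.1:9200/_cat/indices?health=red'
# close and reopen one of them (example index name)
curl -XPOST 'http://127.0.0.1:9200/logstash-2020.04.01/_close'
curl -XPOST 'http://127.0.0.1:9200/logstash-2020.04.01/_open'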

While this now "works", deployment-logstash03 was in a sad state and I didn't really fix it, just got it going again. It will fail again in the same way in the future.

Thank you @EBernhardson for pointing at curator. From /var/log/syslog*, it seems to run daily at 00:42; for example:

Sep 16 00:42:01 deployment-logstash03 CRON[6935]: (root) CMD (/usr/bin/curator --config /etc/curator/config-logstash.yaml /etc/curator/cleanup_logstash.yaml > /dev/null)
Sep 16 00:42:01 deployment-logstash03 CRON[6934]: (root) CMD (/usr/bin/curator --config /etc/curator/config-apifeatureusage-deployment-elastic05.deployment-prep.eqiad.wmflabs.yaml /etc/curator/cleanup_apifeatureusage-deployment-elastic05.deployment-prep.eqiad.wmflabs.yaml > /dev/null)

The logstash cleanup job logs to /var/log/elasticsearch/curator-logstash.log, which is empty. It filters indices based on the name prefix logstash-.

The config has:

client:
  hosts:
    - 127.0.0.1
  port: 9200
  url_prefix:
  use_ssl: False
  timeout: 30
  master_only: False

We can list the indices with: curl 'http://127.0.0.1:9200/_cat/indices/?s=index'


I have tried to run the command manually and, sure enough, it fails:

sudo /usr/bin/curator --config /etc/curator/config-logstash.yaml /etc/curator/cleanup_logstash.yaml --dry-run
Traceback (most recent call last):
  File "/usr/bin/curator", line 11, in <module>
    load_entry_point('elasticsearch-curator==4.2.5', 'console_scripts', 'curator')()
  File "/usr/lib/python2.7/dist-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python2.7/dist-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python2.7/dist-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/curator/cli.py", line 166, in cli
    client = get_client(**client_args)
  File "/usr/lib/python2.7/dist-packages/curator/utils.py", line 603, in get_client
    'Error: {0}'.format(e)
elasticsearch.exceptions.ElasticsearchException: Unable to create client connection to Elasticsearch.  Error: Elasticsearch version 5.6.15 incompatible with this version of Curator (5.6.15)

This is sent to stderr, which I guess gets swallowed by cron (the crontab entries only redirect stdout to /dev/null).

Note that the script attempts to load the entry point elasticsearch-curator==4.2.5, which comes from the Debian package:

$ apt-cache policy elasticsearch-curator
elasticsearch-curator:
  Installed: 4.2.5-1
  Candidate: 4.2.5-1
  Version table:
 *** 4.2.5-1 500
        500 http://deb.debian.org/debian stretch/main amd64 Packages
        100 /var/lib/dpkg/status

Not sure why the error message states Curator is 5.6.15.

In production we seem to use a more recent package from thirdparty/elasticsearch-curator5:

modules/profile/manifests/elasticsearch.pp
apt::repository { 'wikimedia-curator':
    uri        => 'http://apt.wikimedia.org/wikimedia',
    dist       => "${::lsbdistcodename}-wikimedia",
    components => 'thirdparty/elasticsearch-curator5',
    before     => Class['::elasticsearch::curator'],
}

That is not deployed on deployment-logstash03. The host is running Stretch, hence we would get elasticsearch-curator 5.6.0 if we used the apt component:

$ curl -s https://apt.wikimedia.org/wikimedia/dists/stretch-wikimedia/thirdparty/elasticsearch-curator5/binary-amd64/Packages | grep -E '^(Version|Package)'

Package: elasticsearch-curator
Version: 5.6.0
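
Puppet (the apt::repository above) would normally be the way to get that onto the host; the manual equivalent on Stretch would be roughly the following (sketch, assuming the Wikimedia APT key is already present, as it normally is on WMCS instances):

echo 'deb http://apt.wikimedia.org/wikimedia stretch-wikimedia thirdparty/elasticsearch-curator5' \
  | sudo tee /etc/apt/sources.list.d/elasticsearch-curator5.list
sudo apt-get update
sudo apt-get install elasticsearch-curator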

I had a quick look at /var/lib/puppet/state/classes.txt; it has these profiles/roles:

profile::base
profile::base::certificates
profile::base::labs
profile::base::puppet
profile::ldap::client::labs
profile::openstack::base::clientpackages::vms
profile::openstack::base::observerenv
profile::openstack::eqiad1::clientpackages::vms
profile::openstack::eqiad1::cumin::target
profile::openstack::eqiad1::observerenv
profile::standard
profile::wmcs::instance
role::aptly::client
role::prometheus::node_exporter
role::wmcs::instance

And I don't see a logstash profile/role :-\

In conclusion:

  • curator stderr should be logged / alerting (see the cron sketch below)
  • one should apply profile::elasticsearch with all the appropriate config
  • possibly find out why we don't have a logstash-related role/profile applied. I have no idea how the instance got provisioned originally without those.
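
For the first point, even just keeping curator's stderr somewhere would have surfaced this much earlier; a minimal cron-level sketch (to be done properly in Puppet, and the log path is made up):

# /etc/cron.d/curator-logstash, mirroring the existing 00:42 entry but keeping stderr
42 0 * * * root /usr/bin/curator --config /etc/curator/config-logstash.yaml /etc/curator/cleanup_logstash.yaml >> /var/log/curator-logstash-cron.log 2>&1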

Since we moved to the new beta logging cluster, is this issue still occurring?

taavi added a subscriber: taavi.

This no longer seems to be happening.