Spotted at https://integration.wikimedia.org/ci/job/beta-scap-eqiad/
09:07:29 09:07:29 Executing check 'Check endpoints for deployment-mediawiki-07.deployment-prep.eqiad.wmflabs'
09:07:30 09:07:30 Check 'Check endpoints for deployment-mediawiki-07.deployment-prep.eqiad.wmflabs' failed: /wiki/{title} (Main Page) is CRITICAL: Test Main Page returned the unexpected status 302 (expecting: 200); /wiki/{title} (Special Version) is CRITICAL: Test Special Version returned the unexpected status 302 (expecting: 200)
09:07:30
09:07:30 09:07:30 Finished Canary Endpoint Check Complete (duration: 00m 00s)
09:07:30 09:07:30 Waiting for canary traffic...
09:07:49 09:07:49 Executing check 'Logstash Error rate for deployment-mediawiki-07.deployment-prep.eqiad.wmflabs'
09:07:49 09:07:49 Check 'Logstash Error rate for deployment-mediawiki-07.deployment-prep.eqiad.wmflabs' failed: Traceback (most recent call last):
09:07:49   File "/usr/local/bin/logstash_checker.py", line 332, in <module>
09:07:49     main()
09:07:49   File "/usr/local/bin/logstash_checker.py", line 328, in main
09:07:49     sys.exit(checker.run())
09:07:49   File "/usr/local/bin/logstash_checker.py", line 233, in run
09:07:49     entries = r['aggregations']['2']['buckets']
09:07:49 KeyError: 'aggregations'
09:07:49
09:07:49 09:07:49 Canary error check failed for 1 canaries, less than threshold to halt deployment (2/1), see https://logstash-beta.wmflabs.org/goto/ff3530979bc5a54b9779ac9cbd4fc819 for details. Continuing...
09:07:49 09:07:49 Finished sync-check-canaries (duration: 00m 42s)
https://integration.wikimedia.org/ci/job/beta-scap-eqiad/313583/console
The script is in Puppet: modules/service/files/logstash_checker.py and has not been touched in a few months.
Reproduction
With verbose mode (-v):
ssh deployment-deploy01.deployment-prep.eqiad.wmflabs \
    /usr/local/bin/logstash_checker.py \
    -v --service-name mwdeploy --host deployment-mediawiki-07.deployment-prep.eqiad.wmflabs --logstash-host deployment-logstash03.deployment-prep.eqiad.wmflabs:9200 \
    --fail-threshold 10.0 --delay 5
DEBUG: logstash response {u'status': 400, u'error': {u'root_cause': [{u'reason': u'Trying to query 1247 shards, which is over the limit of 1000. This limit exists because querying many shards at the same time can make the job of the coordinating node very CPU and/or memory intensive. It is usually a better idea to have a smaller number of larger shards. Update [action.search.shard_count.limit] to a greater value if you really want to query that many shards at the same time.', u'type': u'illegal_argument_exception'}], u'type': u'illegal_argument_exception', u'reason': u'Trying to query 1247 shards, which is over the limit of 1000. This limit exists because querying many shards at the same time can make the job of the coordinating node very CPU and/or memory intensive. It is usually a better idea to have a smaller number of larger shards. Update [action.search.shard_count.limit] to a greater value if you really want to query that many shards at the same time.'}}
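The KeyError in the traceback appears to come from indexing r['aggregations'] on a response that is actually an error document (status 400, no 'aggregations' key). As a minimal sketch of how the checker could surface the real cause instead of a traceback, something along these lines might work. The helper name extract_buckets and the RuntimeError wording are hypothetical and not part of logstash_checker.py; only the ['aggregations']['2']['buckets'] path and the shape of the error response are taken from the output above.

```python
def extract_buckets(response):
    """Return the aggregation buckets, or raise a descriptive error.

    Hypothetical helper: it only illustrates checking for an error
    document before indexing into the response.
    """
    if 'error' in response:
        error = response['error']
        # Error documents like the one above carry a top-level 'reason'.
        reason = error.get('reason', error) if isinstance(error, dict) else error
        raise RuntimeError('logstash query failed (status %s): %s'
                           % (response.get('status'), reason))
    try:
        # Same lookup as line 233 of the traceback.
        return response['aggregations']['2']['buckets']
    except KeyError as e:
        raise RuntimeError('unexpected logstash response, missing key %s' % e)


# Trimmed copy of the error response from the DEBUG output above.
sample = {
    'status': 400,
    'error': {
        'type': 'illegal_argument_exception',
        'reason': 'Trying to query 1247 shards, which is over the limit of 1000.',
    },
}

try:
    extract_buckets(sample)
except RuntimeError as err:
    print(err)
```

This would not fix the underlying shard count problem; it would only make the check report the reason returned by the logstash host rather than failing with KeyError: 'aggregations'.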