
scap on beta cluster does not run anymore due to logstash being down
Closed, Resolved, Public

Description

00:01:18.014 07:55:51 Waiting for canary traffic...
00:01:38.036 07:56:12 Executing check 'Logstash Error rate for deployment-mediawiki01.deployment-prep.eqiad.wmflabs'
00:01:48.141 07:56:22 Check 'Logstash Error rate for deployment-mediawiki01.deployment-prep.eqiad.wmflabs' failed: Traceback (most recent call last):
00:01:48.141   File "/usr/local/bin/logstash_checker.py", line 268, in <module>
00:01:48.141     main()
00:01:48.141   File "/usr/local/bin/logstash_checker.py", line 264, in main
00:01:48.141     sys.exit(checker.run())
00:01:48.141   File "/usr/local/bin/logstash_checker.py", line 166, in run
00:01:48.141     body=self._logstash_query()
00:01:48.141   File "/usr/local/bin/logstash_checker.py", line 85, in fetch_url
00:01:48.141     "downloading {}".format(url))
00:01:48.141 __main__.CheckServiceError: Timeout on connection while downloading deployment-logstash2.deployment-prep.eqiad.wmflabs:9200/logstash-*/_search
00:01:48.141
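The traceback shows that fetch_url in logstash_checker.py turns a connection timeout into a CheckServiceError, and that this stopped the whole scap run. The exact code of that script is not shown here. Below is a minimal, hypothetical sketch of such a helper, assuming it queries Elasticsearch over plain HTTP with urllib. The `opener` parameter and the 10 second default timeout are assumptions. The default matches the roughly 10 s gap in the log between "Executing check" (00:01:38) and the failure (00:01:48).

```python
import socket
import urllib.error
import urllib.request


class CheckServiceError(Exception):
    """Raised when a check cannot reach the service it queries."""


def fetch_url(url, body=None, timeout=10, opener=urllib.request.urlopen):
    """Fetch url (POSTing a JSON body if given) and return the decoded text.

    Hypothetical sketch, not the real logstash_checker.py code: timeouts and
    connection errors are re-raised as CheckServiceError so the caller can
    report a failed check instead of hanging the deployment.
    """
    data = body.encode("utf-8") if body is not None else None
    # urllib sends a POST when data is set; Elasticsearch accepts POST on _search.
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"})
    timeout_msg = "Timeout on connection while downloading {}".format(url)
    try:
        with opener(request, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except socket.timeout as exc:
        # Raised when the server accepts the connection but stops answering.
        raise CheckServiceError(timeout_msg) from exc
    except urllib.error.URLError as exc:
        # A timeout while connecting surfaces as URLError wrapping socket.timeout.
        if isinstance(exc.reason, socket.timeout):
            raise CheckServiceError(timeout_msg) from exc
        raise CheckServiceError(
            "Error downloading {}: {}".format(url, exc.reason)) from exc
```

The `opener` argument exists only so the sketch can be exercised without a live Elasticsearch. A real caller would pass a URL including the scheme, e.g. `fetch_url("http://deployment-logstash2.deployment-prep.eqiad.wmflabs:9200/logstash-*/_search", body=query_json)`, where `query_json` is a hypothetical query string.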

Event Timeline

Debian GNU/Linux 8 deployment-logstash2 ttyS0

deployment-logstash2 login: [2348521.020179] INFO: task jbd2/vda3-8:167 blocked for more than 120 seconds.
[2348521.027931]       Not tainted 3.16.0-4-amd64 #1
[2348521.028429] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2348521.029991] INFO: task jbd2/dm-0-8:311 blocked for more than 120 seconds.
[2348521.031129]       Not tainted 3.16.0-4-amd64 #1
[2348521.031619] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2348521.032620] INFO: task nscd:612 blocked for more than 120 seconds.
[2348521.033298]       Not tainted 3.16.0-4-amd64 #1
[2348521.033788] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2348521.035209] INFO: task nscd:613 blocked for more than 120 seconds.
[2348521.035865]       Not tainted 3.16.0-4-amd64 #1
[2348521.036377] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2348521.037296] INFO: task java:9783 blocked for more than 120 seconds.
[2348521.037950]       Not tainted 3.16.0-4-amd64 #1
[2348521.039307] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2348521.040611] INFO: task java:9823 blocked for more than 120 seconds.
[2348521.041290]       Not tainted 3.16.0-4-amd64 #1
[2348521.041775] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2348521.043008] INFO: task java:11191 blocked for more than 120 seconds.
[2348521.043684]       Not tainted 3.16.0-4-amd64 #1
[2348521.044197] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2348521.045073] INFO: task java:18437 blocked for more than 120 seconds.
[2348521.045743]       Not tainted 3.16.0-4-amd64 #1
[2348521.046447] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2348521.047419] INFO: task <tcp:18449 blocked for more than 120 seconds.
[2348521.048132]       Not tainted 3.16.0-4-amd64 #1
[2348521.048614] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2348521.049571] INFO: task <syslog:18450 blocked for more than 120 seconds.
[2348521.051502]       Not tainted 3.16.0-4-amd64 #1
[2348521.051995] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

nscd, syslog, tcp, jbd2/vda3, java and other tasks have all been blocked for more than 120 seconds, for reasons that are not yet clear.
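The list of blocked tasks comes from the kernel's hung-task messages on the serial console. Below is a small sketch that extracts them from pasted console text, assuming the message format shown in the log above ("INFO: task NAME:PID blocked for more than N seconds."). The function names are hypothetical.

```python
import re
from collections import Counter

# Matches kernel hung-task reports such as:
# [2348521.020179] INFO: task jbd2/vda3-8:167 blocked for more than 120 seconds.
# \S+ backtracks to the last ":" before the PID, so names containing "/",
# "-" or "<" (jbd2/dm-0-8, <tcp) are captured whole.
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<name>\S+):(?P<pid>\d+) blocked for more than \d+ seconds")


def blocked_tasks(console_text):
    """Return (name, pid) pairs for every hung-task report, in order."""
    return [(m.group("name"), int(m.group("pid")))
            for m in HUNG_TASK_RE.finditer(console_text)]


def blocked_task_names(console_text):
    """Count hung-task reports per command name, ignoring PIDs."""
    return Counter(name for name, _ in blocked_tasks(console_text))
```

Run over the full console output above, this sketch finds ten blocked tasks. Four are java, two are nscd, and there is one each of jbd2/vda3-8, jbd2/dm-0-8, <tcp and <syslog.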

Mentioned in SAL [2016-08-26T08:07:00Z] <hashar> rebooting deployment-logstash02 via Horizon. Kernel hang apparently T143982

Mentioned in SAL [2016-08-26T08:10:28Z] <hashar> deployment-logstash2 is back after a hard reboot. T143982

Mentioned in SAL [2016-08-26T08:10:58Z] <hashar> beta-scap-eqiad job is back in operation. Was blocked on logstash not being reachable. T143982

hashar claimed this task.
hashar triaged this task as Unbreak Now! priority.