
lvs2013 ManagementSSHDown
Closed, Resolved (Public)

Description

Common information

  • alertname: ManagementSSHDown
  • instance: lvs2013.mgmt:22
  • job: probes/mgmt
  • module: ssh_banner
  • prometheus: ops
  • rack: C2
  • severity: task
  • site: codfw
  • source: prometheus
  • team: dcops

Firing alerts


  • dashboard: TODO
  • description: The management interface at lvs2013.mgmt:22 has been unresponsive for multiple hours.
  • runbook: https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
  • summary: Unresponsive management for lvs2013.mgmt:22
  • alertname: ManagementSSHDown
  • instance: lvs2013.mgmt:22
  • job: probes/mgmt
  • module: ssh_banner
  • prometheus: ops
  • rack: C2
  • severity: task
  • site: codfw
  • source: prometheus
  • team: dcops
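
For readers unfamiliar with this kind of probe, here is a minimal sketch of an SSH banner check. It assumes the `ssh_banner` module behaves like the similarly named example module in the Prometheus blackbox exporter configuration (a TCP connect followed by a check that the server greets with an SSH identification string). The function names `looks_like_ssh_banner` and `probe_ssh_banner` are hypothetical and are not taken from the alerting setup in this task.

```python
import socket


def looks_like_ssh_banner(data: bytes) -> bool:
    """Return True if the first bytes look like an SSH identification string."""
    # Simplification: a server may send other lines before the version
    # string; this sketch only accepts a greeting that starts with "SSH-".
    return data.startswith(b"SSH-")


def probe_ssh_banner(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Connect, read the first bytes the server sends, and check the banner."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            return looks_like_ssh_banner(sock.recv(256))
    except OSError:
        # Covers refused connections, name resolution failures and timeouts.
        return False
```

Under these assumptions, `probe_ssh_banner("lvs2013.mgmt", 22)` would return `False` while the management interface is unresponsive.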

Event Timeline

I merged all the duplicates; there were timeouts from phalerts talking to the Phabricator API:

Jul 03 11:51:11 alert1001 phalerts[29762]: 2023-07-03 11:51:11,937 INFO: Looking for tasks with title='ManagementSSHDown' in ['PHID-PROJ-heihjeaiasruuvneirzh']
Jul 03 11:51:12 alert1001 phalerts[29762]: 2023-07-03 11:51:12,099 INFO: Creating a task with title='ManagementSSHDown' in ['PHID-PROJ-heihjeaiasruuvneirzh']
Jul 03 11:51:16 alert1001 phalerts[29762]: 2023-07-03 11:51:16,789 INFO: 2620:0:861:101:10:64:0:82 - - [03/Jul/2023 11:51:16] "GET /metrics HTTP/1.1" 200 -
Jul 03 11:51:17 alert1001 phalerts[29762]: 2023-07-03 11:51:17,108 ERROR: Exception on /alerts [POST]
Jul 03 11:51:17 alert1001 phalerts[29762]: Traceback (most recent call last):
Jul 03 11:51:17 alert1001 phalerts[29762]:   File "/usr/lib/python3/dist-packages/flask/app.py", line 2292, in wsgi_app
Jul 03 11:51:17 alert1001 phalerts[29762]:     response = self.full_dispatch_request()
Jul 03 11:51:17 alert1001 phalerts[29762]:   File "/usr/lib/python3/dist-packages/flask/app.py", line 1815, in full_dispatch_request
Jul 03 11:51:17 alert1001 phalerts[29762]:     rv = self.handle_user_exception(e)
Jul 03 11:51:17 alert1001 phalerts[29762]:   File "/usr/lib/python3/dist-packages/flask/app.py", line 1718, in handle_user_exception
Jul 03 11:51:17 alert1001 phalerts[29762]:     reraise(exc_type, exc_value, tb)
Jul 03 11:51:17 alert1001 phalerts[29762]:   File "/usr/lib/python3/dist-packages/flask/_compat.py", line 35, in reraise
Jul 03 11:51:17 alert1001 phalerts[29762]:     raise value
Jul 03 11:51:17 alert1001 phalerts[29762]:   File "/usr/lib/python3/dist-packages/flask/app.py", line 1813, in full_dispatch_request
Jul 03 11:51:17 alert1001 phalerts[29762]:     rv = self.dispatch_request()
Jul 03 11:51:17 alert1001 phalerts[29762]:   File "/usr/lib/python3/dist-packages/flask/app.py", line 1799, in dispatch_request
Jul 03 11:51:17 alert1001 phalerts[29762]:     return self.view_functions[rule.endpoint](**req.view_args)
Jul 03 11:51:17 alert1001 phalerts[29762]:   File "<decorator-gen-1>", line 2, in alerts
Jul 03 11:51:17 alert1001 phalerts[29762]:   File "/usr/lib/python3/dist-packages/prometheus_client/context_managers.py", line 66, in wrapped
Jul 03 11:51:17 alert1001 phalerts[29762]:     return func(*args, **kwargs)
Jul 03 11:51:17 alert1001 phalerts[29762]:   File "/usr/bin/phalerts", line 244, in alerts
Jul 03 11:51:17 alert1001 phalerts[29762]:     request.args.getlist("phid"))
Jul 03 11:51:17 alert1001 phalerts[29762]:   File "/usr/bin/phalerts", line 199, in process_task
Jul 03 11:51:17 alert1001 phalerts[29762]:     create_task(title, description, phids)
Jul 03 11:51:17 alert1001 phalerts[29762]:   File "/usr/bin/phalerts", line 112, in create_task
Jul 03 11:51:17 alert1001 phalerts[29762]:     result = phab_request(phab.maniphest.edit, transactions=transactions)
Jul 03 11:51:17 alert1001 phalerts[29762]:   File "/usr/bin/phalerts", line 69, in phab_request
Jul 03 11:51:17 alert1001 phalerts[29762]:     result = api_func(**kwargs)
Jul 03 11:51:17 alert1001 phalerts[29762]:   File "/usr/lib/python3/dist-packages/phabricator/__init__.py", line 248, in __call__
Jul 03 11:51:17 alert1001 phalerts[29762]:     return self._request(**kwargs)
Jul 03 11:51:17 alert1001 phalerts[29762]:   File "/usr/lib/python3/dist-packages/phabricator/__init__.py", line 309, in _request
Jul 03 11:51:17 alert1001 phalerts[29762]:     response = conn.getresponse()
Jul 03 11:51:17 alert1001 phalerts[29762]:   File "/usr/lib/python3.7/http/client.py", line 1352, in getresponse
Jul 03 11:51:17 alert1001 phalerts[29762]:     response.begin()
Jul 03 11:51:17 alert1001 phalerts[29762]:   File "/usr/lib/python3.7/http/client.py", line 310, in begin
Jul 03 11:51:17 alert1001 phalerts[29762]:     version, status, reason = self._read_status()
Jul 03 11:51:17 alert1001 phalerts[29762]:   File "/usr/lib/python3.7/http/client.py", line 271, in _read_status
Jul 03 11:51:17 alert1001 phalerts[29762]:     line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
Jul 03 11:51:17 alert1001 phalerts[29762]:   File "/usr/lib/python3.7/socket.py", line 589, in readinto
Jul 03 11:51:17 alert1001 phalerts[29762]:     return self._sock.recv_into(b)
Jul 03 11:51:17 alert1001 phalerts[29762]:   File "/usr/lib/python3.7/ssl.py", line 1052, in recv_into
Jul 03 11:51:17 alert1001 phalerts[29762]:     return self.read(nbytes, buffer)
Jul 03 11:51:17 alert1001 phalerts[29762]:   File "/usr/lib/python3.7/ssl.py", line 911, in read
Jul 03 11:51:17 alert1001 phalerts[29762]:     return self._sslobj.read(len, buffer)
Jul 03 11:51:17 alert1001 phalerts[29762]: socket.timeout: The read operation timed out
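
The traceback shows the read timing out after the create request was sent, so the task may already exist on the server when the client gives up. One possible way to avoid duplicates in that situation is to search again before every retry. The sketch below is not the phalerts implementation; `create_task_once`, `find_task` and `create_task` are hypothetical names, and the two callables stand in for whatever lookup and creation calls the real service uses.

```python
import socket
from typing import Callable, Optional


def create_task_once(
    title: str,
    find_task: Callable[[str], Optional[str]],
    create_task: Callable[[str], str],
    attempts: int = 3,
) -> Optional[str]:
    """Create a task unless one with the same title already exists.

    After a read timeout the outcome of the create call is unknown, so each
    attempt looks for an existing task before creating a new one.
    """
    for _ in range(attempts):
        existing = find_task(title)
        if existing is not None:
            return existing
        try:
            return create_task(title)
        except socket.timeout:
            # The request may have succeeded server-side: search again
            # on the next iteration instead of blindly re-creating.
            continue
    # Final lookup in case the last timed-out attempt did create the task.
    return find_task(title)
```

This only helps if the lookup can see a task created moments earlier; any indexing delay on the server side would need separate handling.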

Vgutierrez subscribed.

An unresponsive management interface makes puppet extremely slow at loading facts (about 16 minutes in this run):

Jul  3 15:41:54 lvs2013 puppet-agent[1417206]: Loading facts
Jul  3 15:58:08 lvs2013 puppet-agent[1417206]: Caching catalog for lvs2013.codfw.wmnet
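
The delay can be read directly off the two timestamps. As a small sketch, assuming both lines come from the same day and follow the syslog layout shown above, the elapsed time can be computed like this (`elapsed_between` is a hypothetical helper name):

```python
from datetime import datetime, timedelta


def elapsed_between(first_line: str, second_line: str) -> timedelta:
    """Return the time between two syslog lines logged on the same day."""
    def time_of(line: str) -> datetime:
        # Fields are: month, day, HH:MM:SS, host, ...
        return datetime.strptime(line.split()[2], "%H:%M:%S")

    return time_of(second_line) - time_of(first_line)


start = "Jul  3 15:41:54 lvs2013 puppet-agent[1417206]: Loading facts"
end = "Jul  3 15:58:08 lvs2013 puppet-agent[1417206]: Caching catalog for lvs2013.codfw.wmnet"
print(elapsed_between(start, end))  # 0:16:14
```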

ssingh renamed this task from ManagementSSHDown to lvs2013 ManagementSSHDown. Jul 5 2023, 1:32 PM

I found the iDRAC light blinking rapidly in amber, and Quick Sync is not responding. I tried rebooting just the iDRAC, but that hasn't helped. The next troubleshooting step is to reboot the server.

@ssingh or @Vgutierrez, can one of you help me get this depooled briefly so I can reboot? I am on site for the next 3 hours from this post. After that, I will be here at roughly the same time tomorrow, if that works better.

Mentioned in SAL (#wikimedia-operations) [2023-07-05T13:41:13Z] <sukhe> disable puppet and stop pybal on lvs2013: T340960

Icinga downtime and Alertmanager silence (ID=f6099155-97b3-49c3-9c11-36962a3c834b) set by vgutierrez@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their services with reason: mgmt interface issues

lvs2013.codfw.wmnet

ssingh claimed this task.

Thanks to @Jhancock.wm for the quick resolution of this issue!