
ManagementSSHDown
Closed, ResolvedPublic

Description

Common information

  • alertname: ManagementSSHDown
  • instance: labstore1004.mgmt:22
  • job: probes/mgmt
  • module: ssh_banner
  • prometheus: ops
  • rack: C2
  • severity: task
  • site: eqiad
  • source: prometheus
  • team: dcops
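The labels above describe an `ssh_banner` probe against `labstore1004.mgmt:22`. As a rough manual equivalent, here is a minimal sketch that opens a TCP connection to the management port and looks for an SSH identification line. It assumes the probe only needs a banner starting with `SSH-`. The function names `parse_banner` and `read_ssh_banner` are hypothetical and not part of any Wikimedia tooling, and this is not how the Prometheus prober itself is implemented.

```python
import socket
from typing import Optional


def parse_banner(data: bytes) -> Optional[str]:
    """Return the first line starting with 'SSH-' in data, or None."""
    for raw in data.splitlines():
        line = raw.decode("ascii", errors="replace").strip()
        if line.startswith("SSH-"):
            return line
    return None


def read_ssh_banner(host: str, port: int = 22, timeout: float = 5.0) -> Optional[str]:
    """Connect to host:port and return its SSH banner, or None on any failure.

    Assumption: a single recv() is enough to capture the banner; a real
    prober would keep reading until a full line arrives or the timeout hits.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            data = sock.recv(256)
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return None
    return parse_banner(data)
```

From a host with access to the management network, `read_ssh_banner("labstore1004.mgmt")` returning `None` would be consistent with the alert still firing.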

Firing alerts


  • dashboard: TODO
  • description: The management interface at labstore1004.mgmt:22 has been unresponsive for multiple hours.
  • runbook: https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
  • summary: Unresponsive management for labstore1004.mgmt:22

Event Timeline

msw-e1-eqiad has a few bad ports. Moved management to different ports:
an-presto1006
elastic1089
elastic1090

Thank you @Jclark-ctr! FTR, the task will be updated automatically once the alerts recover, i.e. leaving only the hosts still alerting.

mw1376 moved to a different port on the msw.

I am only having issues with one host, labstore1004, which will need to be rebooted.

@Andrew we will need to perform a flea power drain on the server.

We don't have a great way to safely downtime this box at the moment. We're in the process of moving load off of it entirely but that won't be complete until January at the earliest. Can we live with this in its precarious state and then just decom in a month or two?

@Andrew we are only without management on this server at this time.

phaultfinder updated the task description.
phaultfinder updated the task description.

@Jclark-ctr could you take a look at db1121's mgmt cable?

Mentioned in SAL (#wikimedia-operations) [2023-03-22T11:20:32Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool needs to be rebooted T323961', diff saved to https://phabricator.wikimedia.org/P45910 and previous config saved to /var/cache/conftool/dbconfig/20230322-112031-root.json

Mentioned in SAL (#wikimedia-operations) [2023-03-22T11:30:26Z] <marostegui> Poweroff db1121 (lag will show on wikireplicas for s4 section) T323961

The cable was replaced yesterday with no luck. Today, performed a flea power drain on db1121.

@Andrew any update on being able to reboot labstore1004?

@Jclark-ctr it'll be another week or two before we have workloads moved off of this.

OK -- I'm not ready to get rid of the data on this server but it is fine to reboot it now. Thanks for waiting!

Papaul subscribed.

Rebooting the server fixed the issue. We can now resolve this.