
ManagementSSHDown
Closed, Resolved · Public

Description

Common information

  • alertname: ManagementSSHDown
  • job: probes/mgmt
  • module: ssh_banner
  • prometheus: ops
  • severity: task
  • site: eqsin
  • source: prometheus
  • team: dcops

Firing alerts


  • dashboard: TODO
  • description: The management interface at cr3-eqsin.mgmt:22 has been unresponsive for multiple hours.
  • runbook: https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
  • summary: Unresponsive management for cr3-eqsin.mgmt:22
  • alertname: ManagementSSHDown
  • instance: cr3-eqsin.mgmt:22
  • job: probes/mgmt
  • module: ssh_banner
  • prometheus: ops
  • rack: 603
  • severity: task
  • site: eqsin
  • source: prometheus
  • team: dcops

  • dashboard: TODO
  • description: The management interface at asw1-eqsin.mgmt:22 has been unresponsive for multiple hours.
  • runbook: https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
  • summary: Unresponsive management for asw1-eqsin.mgmt:22
  • alertname: ManagementSSHDown
  • instance: asw1-eqsin.mgmt:22
  • job: probes/mgmt
  • module: ssh_banner
  • prometheus: ops
  • rack: 604
  • severity: task
  • site: eqsin
  • source: prometheus
  • team: dcops

  • dashboard: TODO
  • description: The management interface at cr2-eqsin.mgmt:22 has been unresponsive for multiple hours.
  • runbook: https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
  • summary: Unresponsive management for cr2-eqsin.mgmt:22
  • alertname: ManagementSSHDown
  • instance: cr2-eqsin.mgmt:22
  • job: probes/mgmt
  • module: ssh_banner
  • prometheus: ops
  • rack: 604
  • severity: task
  • site: eqsin
  • source: prometheus
  • team: dcops
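For reference, the job (probes/mgmt) and module (ssh_banner) labels above indicate that the check fetches the SSH banner over TCP/22 rather than pinging the host. A rough way to reproduce the probe by hand, assuming a blackbox-exporter-style prober on its default port 9115 (the host, port and exact target form below are illustrative, taken from the instance label):

# Ask the prober directly (9115 is the blackbox exporter default; adjust host/port for the local setup):
curl -s 'http://localhost:9115/probe?module=ssh_banner&target=cr3-eqsin.mgmt:22' | grep -E '^probe_(success|duration_seconds)'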

Event Timeline

RobH added subscribers: wiki_willy, RobH.

Ok, I went away for the weekend and came back to hundreds of notifications from the SSH down tasks.

These seem to be false positives and fire too often; who can we chat with about raising the thresholds on these alerts?

Checking the list:

PING cp5017.mgmt.eqsin.wmnet (10.132.128.3) 56(84) bytes of data.
64 bytes from cp5017.mgmt.eqsin.wmnet (10.132.128.3): icmp_seq=1 ttl=60 time=245 ms

PING cp5019.mgmt.eqsin.wmnet (10.132.128.9) 56(84) bytes of data.
64 bytes from wmf125019.mgmt.eqsin.wmnet (10.132.128.9): icmp_seq=1 ttl=60 time=240 ms

PING cp5021.mgmt.eqsin.wmnet (10.132.128.17) 56(84) bytes of data.
64 bytes from wmf125021.mgmt.eqsin.wmnet (10.132.128.17): icmp_seq=1 ttl=60 time=243 ms

PING cp5023.mgmt.eqsin.wmnet (10.132.128.19) 56(84) bytes of data.
64 bytes from cp5023.mgmt.eqsin.wmnet (10.132.128.19): icmp_seq=1 ttl=60 time=229 ms

PING cp5025.mgmt.eqsin.wmnet (10.132.128.21) 56(84) bytes of data.
64 bytes from wmf125025.mgmt.eqsin.wmnet (10.132.128.21): icmp_seq=1 ttl=60 time=229 ms

PING cp5027.mgmt.eqsin.wmnet (10.132.128.23) 56(84) bytes of data.
64 bytes from cp5027.mgmt.eqsin.wmnet (10.132.128.23): icmp_seq=1 ttl=60 time=229 ms

PING cp5029.mgmt.eqsin.wmnet (10.132.128.14) 56(84) bytes of data.
64 bytes from cp5029.mgmt.eqsin.wmnet (10.132.128.14): icmp_seq=1 ttl=60 time=249 ms

PING cp5031.mgmt.eqsin.wmnet (10.132.128.16) 56(84) bytes of data.
64 bytes from wmf125031.mgmt.eqsin.wmnet (10.132.128.16): icmp_seq=1 ttl=60 time=233 ms

PING dns5003.mgmt.eqsin.wmnet (10.132.128.32) 56(84) bytes of data.
64 bytes from wmf125040.mgmt.eqsin.wmnet (10.132.128.32): icmp_seq=1 ttl=60 time=236 ms

robh@cumin1001:~$ ping ganeti5005.mgmt.eqsin.wmnet
PING ganeti5005.mgmt.eqsin.wmnet (10.132.128.29) 56(84) bytes of data.

PING ganeti5007.mgmt.eqsin.wmnet (10.132.128.31) 56(84) bytes of data.
64 bytes from wmf125039.mgmt.eqsin.wmnet (10.132.128.31): icmp_seq=1 ttl=60 time=236 ms

PING lvs5005.mgmt.eqsin.wmnet (10.132.128.27) 56(84) bytes of data.
64 bytes from lvs5005.mgmt.eqsin.wmnet (10.132.128.27): icmp_seq=1 ttl=60 time=235 ms

So every single host checked is online, yet Icinga itself still shows items as down. All of the SSH down tasks I've checked over the last few days had already resolved by the time I got to them; these notices are false positives or transient network issues that resolve themselves.
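Worth noting: the pings above exercise ICMP, while the alert probes the SSH banner on TCP/22, so a host can answer ping while the check still fails (for instance if the probe traffic is filtered). A quick, hedged loop over the hosts from the list above that tests what the probe actually checks (run from the host the probe originates from):

# Check the SSH banner on TCP/22 instead of ICMP; an empty result means the probe would fail too
for h in cp5017 cp5019 cp5021 cp5023 cp5025 cp5027 cp5029 cp5031 dns5003; do
  banner=$(nc -w 5 "$h.mgmt.eqsin.wmnet" 22 </dev/null | head -1)
  echo "$h: ${banner:-no banner within 5s}"
done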

@wiki_willy: How can we get these alert tasks fixed so they're useful? I spent a few hours on this on Friday and another hour checking items this morning, and they are all fine. These SSH down mgmt tasks seem to be firing far too often to be useful. Thoughts on how to fix this, or who to chat with?

Now that this is assigned to you, you'll likely notice 50+ false or warning events (not hard downs, as far as I can tell) on it every day or so ;D

wiki_willy added a subscriber: fgiunchedi.

@RobH - can you work with @fgiunchedi on this? This ties back to T310266, when the alert was first rolled out. But if you're able to ssh in and it continues to alert, I'm thinking there may be a connection issue from the host that the check is ssh'ing in from.

Thanks,
Willy
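One hedged way to see what the check itself observes, rather than relying on ad-hoc pings, is to query the ops Prometheus for the probe metric; the URL below assumes a default local Prometheus on port 9090, and the actual internal endpoint may differ:

# List mgmt probes that Prometheus currently sees as failing in eqsin:
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=probe_success{job="probes/mgmt",site="eqsin"} == 0' | jq '.data.result'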


Apologies for the alert spam -- definitely not intended. I believe these are related to the commissioning of the Prometheus hosts in PoPs (T309979: Upgrade Prometheus VMs in PoPs to Bullseye) and ACLs not having been updated to include the new IPs. I'll follow up on T309979 with next actions; sorry again for the false positives!
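Since the root cause was ACLs that did not yet include the new Prometheus VM IPs, a hedged way to confirm the fix is to run the banner check from one of the new Prometheus hosts rather than from cumin; the hostname below is illustrative, and the short mgmt name may need the full site domain appended:

# From the new eqsin Prometheus VM (hostname illustrative), confirm the mgmt ACL now lets the probe through:
ssh prometheus5001.eqsin.wmnet 'nc -w 5 cr3-eqsin.mgmt 22 </dev/null | head -1'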

ACLs updated, and I'm optimistically resolving this task (and the related mgmt-in-PoPs tasks).

Thanks @fgiunchedi !
