ManagementSSHDown
Closed, Resolved, Public

Description

Common information

  • alertname: ManagementSSHDown
  • instance: cp2035.mgmt:22
  • job: probes/mgmt
  • module: ssh_banner
  • prometheus: ops
  • rack: C2
  • severity: task
  • site: codfw
  • source: prometheus
  • team: dcops
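
For readers unfamiliar with the labels above, the following is a minimal sketch of what a banner-style check against `cp2035.mgmt:22` amounts to: open a TCP connection and see whether the peer sends an SSH identification string. It is not the actual `ssh_banner` probe definition, which this task does not show. The function names and the 5-second timeout are assumptions made for illustration, and checking only the first bytes received is a simplification.

```python
import socket


def is_ssh_banner(data: bytes) -> bool:
    """Return True if the bytes start like an SSH identification string ("SSH-...")."""
    return data.startswith(b"SSH-")


def probe_ssh_banner(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Connect to host:port and report whether the first bytes look like an SSH banner.

    Hypothetical helper: the timeout and the 256-byte read size are assumptions.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            return is_ssh_banner(sock.recv(256))
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Calling `probe_ssh_banner("cp2035.mgmt")` from a host that can reach the management network would return False in the state described in this task, where the mgmt port has no connection.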

Event Timeline

Server found with a rapidly blinking amber iDRAC light and no connection on the mgmt port. According to the blink code, the system needs a reboot.
I attempted to reboot just the iDRAC, but that did not fix the issue.

@jcrespo can I get your help depooling this server when you are free? I've tried to reboot the iDRAC and it's not taking. I believe we need to reboot the whole server to fix this issue. The blink code description reads: "Blinking amber: indicates that iDRAC Quick Sync 2 hardware is not responding correctly." I tried a new cable, but the problem persists. Thanks!

Hi, let me know how I can help, but if I understand correctly, cp2035 is a Traffic host, so it would be better to contact either @Vgutierrez in Europe time or @BBlack in US time (it has nothing to do with my team, data persistence). I can depool it in an emergency following the written procedure, but under normal circumstances it is better that they handle it. 0:-)

@jcrespo thank you for the insight!

@BBlack could you assist me with this? When would be a good time? I know we're about to go into a holiday weekend, but the server itself is not impacted, just the iDRAC.

Mentioned in SAL (#wikimedia-operations) [2023-05-31T15:45:48Z] <vgutierrez> cp2035 depooled as puppet is unable to run due to ipmi issues - T337247

cp2035 is currently depooled. Please feel free to work on it as soon as you are able, @Jhancock.wm. Thanks!

Server remained reachable by SSH for 2 days. Resolving.