SSH on cp5012.mgmt is flapping (CRITICAL)
Closed, ResolvedPublic

Description

SSH on cp5012.mgmt has been hitting a socket timeout for more than a week. I promised Willy I would file a task but forgot, so here's my attempt at fixing that!

#wikimedia-operations.log:13:10 <+icinga-wm> PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook

This is just on cp5012 in eqsin as far as I can tell. No other issues on the host, at least not on the non-mgmt side.

Event Timeline

We have had the "mgmt flapping"-issue in other DCs. In codfw a bunch of them were fixed after Papaul did firmware upgrades on the DRACs.

So I would suggest checking whether you can get a firmware upgrade done in eqsin.

Thanks @Dzahn! I will search for those tasks.

So if the iDRAC is accessible, the firmware update isn't OS impacting. However, I cannot log in to this iDRAC interface via HTTPS or SSH, so it appears the host will have to be fully power drained to attempt to fix this issue. Is there anything preventing me from depooling this via command line and rebooting the host as needed?

@ssingh: ok for me to depool and reboot this host as needed via depool command in os?

I am happy to take care of depooling this host next week, @RobH and will update this task when ready. Thank you for the help!

If it is just 'depool' from command line + stop puppet + icinga maint mode, I can handle it, so you don't need to take it down in advance of the work. Just lemme know!

Mentioned in SAL (#wikimedia-operations) [2022-06-27T19:50:34Z] <sukhe@cumin2002> START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on cp5012.eqsin.wmnet with reason: depooled: flapping mgmt interface: T311264

Mentioned in SAL (#wikimedia-operations) [2022-06-27T19:50:40Z] <sukhe@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on cp5012.eqsin.wmnet with reason: depooled: flapping mgmt interface: T311264

Sorry, I should have followed up today! The server is depooled and ready for you to proceed.

For posterity, the steps to depool cp hosts (from puppetmaster):

sudo confctl select 'name=cp5012.eqsin.wmnet,service=ats-be' set/pooled=no
sudo confctl select 'name=cp5012.eqsin.wmnet,service=varnish-fe' set/pooled=no
sudo confctl select 'name=cp5012.eqsin.wmnet,service=ats-tls' set/pooled=no
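
The three commands above differ only in the service name, so they can be generated with a loop. This is a minimal sketch, assuming the same three services apply to the host in question. The function name confctl_pool_cmds is hypothetical, and it only prints the commands rather than running them, so they can be reviewed first. The "yes" value for repooling is an assumption and is not shown in this task.

```shell
#!/bin/sh
# Sketch: print the confctl commands that depool (or repool) a cp host.
# confctl_pool_cmds is a hypothetical helper name; the service list is taken
# from the three commands in this task. Nothing is executed here.
confctl_pool_cmds() {
    # $1 = host FQDN, $2 = pooled state ("no" depools; "yes" is assumed to repool)
    for service in ats-be varnish-fe ats-tls; do
        printf "sudo confctl select 'name=%s,service=%s' set/pooled=%s\n" \
            "$1" "$service" "${2:-no}"
    done
}

# Print the commands for review; run them by hand (or pipe to sh) once checked.
confctl_pool_cmds cp5012.eqsin.wmnet no
```

Printing instead of executing keeps the helper safe to run anywhere, including on a machine without confctl installed.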

Mentioned in SAL (#wikimedia-sre) [2022-06-27T19:53:45Z] <robh> cp5012 shutting down and removing power via T311264

Hello, in case it's helpful, I fixed one of these the other day in eqiad by using ipmitool mc reset cold to do a cold reset of the BMC.
T311042: aqs1008.mgmt interface SSH check flapping
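
For reference, the cold reset mentioned above could be wrapped as follows. This is a sketch, assuming ipmitool is installed and is run from the host's own OS through the local IPMI interface (the comment above does not say where the command was run). The function name bmc_cold_reset and the DRY_RUN variable are hypothetical.

```shell
#!/bin/sh
# Sketch: cold-reset the BMC with ipmitool.
# bmc_cold_reset and DRY_RUN are hypothetical names used for illustration.
# With DRY_RUN unset or set to 1, the command is only printed.
bmc_cold_reset() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "sudo ipmitool mc reset cold"
    else
        sudo ipmitool mc reset cold
    fi
}

# Default is a dry run; use DRY_RUN=0 bmc_cold_reset to actually reset the BMC.
bmc_cold_reset
```

A cold reset restarts the management controller itself, so the mgmt interface is expected to be unreachable for a short while afterwards.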

I'll have to give that a try next time, thanks!

Updated iDRAC from 2.50.x to 2.81.81.81, A00.

@ssingh, this should clear up our errors. I can log in to the iDRAC and the system is powered back up.

Once this is back online and in service, feel free to resolve this task!

I happened upon this tracking ticket, where many similar SSH-related mgmt checks are mentioned: T304289: Management interface SSH icinga alerts
Mentioning it here for cross-referencing purposes.

Thanks for all the help @RobH! Marking this as resolved as the host is now pooled.