hw troubleshooting: remote IPMI not working for ms-be105[7-8].eqiad.wmnet
Closed, Resolved | Public | Request

Authored By

	MatthewVernon
	Jun 13 2022, 9:48 AM

Description

- Provide FQDN of system.
- If other than a hard drive issue, please depool the machine (and confirm that it’s been depooled) for us to work on it. If not, please provide time frame for us to take the machine down.
- ~~Put system into a failed state in Netbox.~~ (systems still in swift rings)
- Provide urgency of request, along with justification (redundancy, dependencies, etc)
- Describe issue and/or attach hardware failure log. (Refer to https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook if you need help)
- Assign correct project tag and appropriate owner (based on above). Also, please ensure the service owners of the host(s) are added as subscribers to provide any additional input.

Two Swift backends, ms-be1057.eqiad.wmnet and ms-be1058.eqiad.wmnet, have misbehaving iLOs: no remote ipmitool command works, e.g.:

mvernon@cumin2002:~$ sudo ipmitool -I lanplus -H "ms-be1057.mgmt.eqiad.wmnet" -U root -E chassis power status
Unable to read password from environment
Password: 
Error: Unable to establish IPMI v2 / RMCP+ session
mvernon@cumin2002:~$ sudo ipmitool -I lanplus -H "ms-be1058.mgmt.eqiad.wmnet" -U root -E chassis power status
Unable to read password from environment
Password: 
Error: Unable to establish IPMI v2 / RMCP+ session
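
To repeat this check across both hosts, a minimal sketch follows. It assumes the management name is the host FQDN with `mgmt` inserted after the short hostname (as in the two names above), and that the password is exported as `IPMI_PASSWORD`, the variable `-E` reads. sudo usually resets the environment, so the variable is passed through explicitly with `--preserve-env`; this needs a reasonably recent sudo and a sudoers policy that permits it. The function names `mgmt_fqdn` and `check_remote_ipmi` are hypothetical.

```shell
# Derive the management-interface name from a host FQDN.
# Assumption: mgmt names insert "mgmt" after the short hostname,
# e.g. ms-be1057.eqiad.wmnet -> ms-be1057.mgmt.eqiad.wmnet.
mgmt_fqdn() {
    local fqdn="$1"
    printf '%s.mgmt.%s\n' "${fqdn%%.*}" "${fqdn#*.}"
}

# Query chassis power status over remote IPMI for each host given.
# Expects IPMI_PASSWORD to be exported in the caller's environment.
check_remote_ipmi() {
    local host
    for host in "$@"; do
        printf '%s: ' "$host"
        sudo --preserve-env=IPMI_PASSWORD \
            ipmitool -I lanplus -H "$(mgmt_fqdn "$host")" -U root -E \
            chassis power status || echo "remote IPMI failed"
    done
}
```

Usage would be something like `check_remote_ipmi ms-be1057.eqiad.wmnet ms-be1058.eqiad.wmnet` from a cumin host.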

This means that, for example, the reimage cookbook cannot operate, so these nodes are blocking T279637 (hence the urgency). I have followed the troubleshooting instructions and, other than remote IPMI, everything looks as expected:

mvernon@ms-be1057:~$ sudo ipmi-chassis --get-chassis-status
System Power                        : on
Power overload                      : false
Interlock                           : inactive
Power fault                         : false
Power control fault                 : false
Power restore policy                : Restore
Last Power Event                    : unknown
Chassis intrusion                   : inactive
Front panel lockout                 : inactive
Drive Fault                         : false
Cooling/fan fault                   : false
Chassis Identify state              : off
Power off button                    : enabled
Reset button                        : enabled
Diagnostic Interrupt button         : enabled
Standby button                      : enabled
Power off button disable            : unallowed
Reset button disable                : unallowed
Diagnostic interrupt button disable : unallowed
Standby button disable              : unallowed

mvernon@ms-be1057:~$ sudo ipmi-config --section=Lan_Channel --key-pair="Lan_Channel:Volatile_Channel_Privilege_Limit=Administrator" --key-pair="Lan_Channel:Non_Volatile_Channel_Privilege_Limit=Administrator" --diff
mvernon@ms-be1057:~$ sudo ipmi-config --section=Lan_Channel --key-pair="Lan_Channel:Volatile_Access_Mode=Always_Available" --key-pair="Lan_Channel:Non_Volatile_Access_Mode=Always_Available" --diff
mvernon@ms-be1057:~$ sudo facter -p ipmi_chassis.boot_flags.device
NO-OVERRIDE

[likewise ms-be1058]
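
The two `--diff` checks above print nothing, which is read here as "the stored values match". A minimal sketch that makes that reading explicit is below. It treats any output at all (a reported mismatch or an error message) as a failure, and it only says something about the section that was queried, here `Lan_Channel`. The helper names `diff_verdict` and `check_lan_channel` are hypothetical.

```shell
# Turn captured --diff output into a verdict: empty means no mismatch reported.
diff_verdict() {
    if [ -z "$1" ]; then
        echo "matches"
    else
        echo "differs or errored"
    fi
}

# Run one of the checks above for a given key suffix and expected value.
# stderr is captured too, so an error is not mistaken for a clean diff.
check_lan_channel() {
    local key="$1" value="$2" out
    out=$(sudo ipmi-config --section=Lan_Channel \
        --key-pair="Lan_Channel:Volatile_${key}=${value}" \
        --key-pair="Lan_Channel:Non_Volatile_${key}=${value}" \
        --diff 2>&1)
    printf '%s: %s\n' "$key" "$(diff_verdict "$out")"
}
```

For the two checks in this task, the calls would be `check_lan_channel Channel_Privilege_Limit Administrator` and `check_lan_channel Access_Mode Always_Available`.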

I have tried resetting the iLO via `reset /map1` from the management interface; this hasn't fixed it. The only other thing suggested on the troubleshooting list is a cold reset (shut down, unplug power cables), which might be worth a shot?

The complication is that both nodes are live swift backends - so they contribute to the 3x replication of data. So one or other can get shut down while you work on it, but I don't want to leave either down for a long time if possible, and particularly we can't have more than one off at once. Sorry!

Happy to co-ordinate a good time for me to shut one or other host off - either here or IRC.

Related Objects

Status	Subtype	Assigned	Task
			Restricted Task
Resolved		MatthewVernon	T306098 Cloud VPS "swift" project Stretch deprecation
Resolved		MatthewVernon	T317616 Revisit CDN<-->Swift communication
Resolved		MatthewVernon	T279637 Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options
Resolved	Request	MatthewVernon	T310478 hw troubleshooting: remote IPMI not working for ms-be105[7-8].eqiad.wmnet

Event Timeline

MatthewVernon triaged this task as High priority. Jun 13 2022, 9:48 AM

MatthewVernon created this task.

MatthewVernon added a parent task: T279637: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options.

Maintenance_bot added a project: SRE. Jun 13 2022, 10:29 AM

This turned out to be an incorrect config section - I've updated https://wikitech.wikimedia.org/wiki/Management_Interfaces#Is_remote_IPMI_enabled? to note this, and how to fix it on HP kit.

MatthewVernon added a project: SRE-swift-storage. Jun 13 2022, 12:26 PM
