mw2220 - broken IPMI / mgmt
Closed, Resolved (Public)

Description

mw2220.codfw.wmnet has broken IPMI / mgmt / DRAC.

It showed up when the reimaging script failed because remote IPMI failed.

I started the debugging steps from https://wikitech.wikimedia.org/wiki/Management_Interfaces#Does_IPMI_work_locally?

and it fails right away locally, not with the "typical" error but with an unusual one:

[mw2220:~] $ sudo ipmi-chassis --get-chassis-status
ipmi_cmd_get_chassis_status: bad completion code

The other test commands also fail in unusual ways:

[mw2220:~] $ sudo ipmi-config --section=Lan_Channel --key-pair="Lan_Channel:Volatile_Access_Mode=Always_Available" --key-pair="Lan_Channel:Non_Volatile_Access_Mode=Always_Available" --diff
Unable to get Number of Users

And finally I also can't ssh to mgmt to reset DRAC:

ssh root@mw2220.mgmt.codfw.wmnet

channel 0: open failed: connect failed: Connection timed out
stdio forwarding failed
ssh_exchange_identification: Connection closed by remote host
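
For repeating the local checks above in one go, here is a minimal Python sketch. The two command lines are copied from this task; everything else (the names `LOCAL_CHECKS`, `run_command`, `run_local_checks`, `failed_checks`, the 30 second timeout, and the assumption that a failing check exits non-zero) is hypothetical and not part of the documented debugging procedure.

```python
import subprocess

# Local checks copied from this task. The list name is illustrative only.
LOCAL_CHECKS = [
    ["sudo", "ipmi-chassis", "--get-chassis-status"],
    [
        "sudo", "ipmi-config", "--section=Lan_Channel",
        "--key-pair=Lan_Channel:Volatile_Access_Mode=Always_Available",
        "--key-pair=Lan_Channel:Non_Volatile_Access_Mode=Always_Available",
        "--diff",
    ],
]


def run_command(argv, timeout=30):
    """Run one command and return (exit_code, combined stdout/stderr)."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
        return proc.returncode, (proc.stdout + proc.stderr).strip()
    except FileNotFoundError:
        # The tool is not installed on this host.
        return 127, f"{argv[0]}: command not found"
    except subprocess.TimeoutExpired:
        return 124, f"timed out after {timeout}s"


def run_local_checks(checks=LOCAL_CHECKS, runner=run_command):
    """Run every check; return a list of (argv, exit_code, output) tuples."""
    return [(argv, *runner(argv)) for argv in checks]


def failed_checks(results):
    """Keep only results whose exit code is non-zero.

    Assumption: the tools signal failure through their exit code.
    """
    return [result for result in results if result[1] != 0]
```

Calling `failed_checks(run_local_checks())` on the affected host would return the checks that did not exit cleanly, together with their output, which can then be pasted into the task. The `runner` parameter exists only so the logic can be exercised without touching real hardware.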

Could you please check on it locally?

Event Timeline

@Dzahn if the server is still in service, can it please be depooled so I can work on it tomorrow while on site?

Thanks

Mentioned in SAL (#wikimedia-operations) [2021-02-08T23:52:46Z] <dzahn@cumin1001> START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on mw2220.codfw.wmnet with reason: T273803

Mentioned in SAL (#wikimedia-operations) [2021-02-08T23:52:51Z] <dzahn@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on mw2220.codfw.wmnet with reason: T273803

@Papaul Server is set to pooled=inactive and downtime for 2 days. Go ahead and thank you!

Thank you, I will work on it tomorrow.

@Dzahn
Drained power and upgraded the iDRAC firmware from 2.30.30.30 to 2.63.

All looks good now.

Thank you @Papaul. Reimaging it now and things look normal.