mw2220 - broken IPMI / mgmt
Closed, Resolved (Public)

Description

mw2220.codfw.wmnet has broken IPMI / mgmt / DRAC.

It showed up when the reimaging script failed with a remote IPMI failure.

I started the debugging steps from https://wikitech.wikimedia.org/wiki/Management_Interfaces#Does_IPMI_work_locally?

and it fails right away locally, not with the "typical" error but with an unusual one:

[mw2220:~] $ sudo ipmi-chassis --get-chassis-status
ipmi_cmd_get_chassis_status: bad completion code

The other test commands also fail in unusual ways:

[mw2220:~] $ sudo ipmi-config --section=Lan_Channel --key-pair="Lan_Channel:Volatile_Access_Mode=Always_Available" --key-pair="Lan_Channel:Non_Volatile_Access_Mode=Always_Available" --diff
Unable to get Number of Users
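
To keep a record of which local check failed and how, the commands above could be wrapped in a small helper. This is only a sketch and not part of the Wikitech procedure: `run_check` is a hypothetical name, and it assumes each command exits non-zero when it fails.

```shell
#!/bin/bash
# Hypothetical helper (not from the Wikitech page): run one command,
# print PASS or FAIL with its exit status and captured output,
# and return the command's exit status to the caller.
run_check() {
    local label="$1"
    shift
    local output status
    # Capture stdout and stderr together so the error text is kept.
    output="$("$@" 2>&1)"
    status=$?
    if [ "$status" -eq 0 ]; then
        echo "PASS: $label"
    else
        echo "FAIL: $label (exit $status): $output"
    fi
    return "$status"
}

# Example, using the command from this task (run on the affected host):
#   run_check "chassis status" sudo ipmi-chassis --get-chassis-status
```

The example call is left as a comment because it needs root and a local BMC.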

Finally, I also can't ssh to mgmt to reset the DRAC:

ssh root@mw2220.mgmt.codfw.wmnet

channel 0: open failed: connect failed: Connection timed out
stdio forwarding failed
ssh_exchange_identification: Connection closed by remote host
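
The timeout above can be told apart from an SSH-level problem by testing whether the management interface accepts TCP connections at all. The following is a rough sketch under a few assumptions: `port_open` is a hypothetical helper, it relies on bash's `/dev/tcp` redirection and the coreutils `timeout` command, and it has to run from a host that can reach the mgmt network directly.

```shell
#!/bin/bash
# Hypothetical helper: return 0 if a TCP connection to host:port opens
# within the given number of seconds (default 5), non-zero otherwise.
port_open() {
    local host="$1" port="$2" seconds="${3:-5}"
    # In the child shell, $0 is the host and $1 is the port.
    timeout "$seconds" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Example, using the hostname from this task:
#   port_open mw2220.mgmt.codfw.wmnet 22 && echo reachable || echo unreachable
```

An "unreachable" result here would be consistent with the DRAC itself being unresponsive rather than with an SSH configuration issue.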

Could you please check on it locally?

Event Timeline

@Dzahn if the server is still in service, can it please be depooled so I can work on it tomorrow while on site?

Thanks

Mentioned in SAL (#wikimedia-operations) [2021-02-08T23:52:46Z] <dzahn@cumin1001> START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on mw2220.codfw.wmnet with reason: T273803

Mentioned in SAL (#wikimedia-operations) [2021-02-08T23:52:51Z] <dzahn@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on mw2220.codfw.wmnet with reason: T273803

@Papaul Server is set to pooled=inactive and downtime for 2 days. Go ahead and thank you!

Thank you, I will work on it tomorrow.

@Dzahn
Drained power and upgraded the iDRAC firmware from 2.30.30.30 to 2.63.

All looks good now

Thank you @Papaul. Reimaging it now and things look normal.