
mw2220 - broken IPMI / mgmt
Closed, Resolved · Public

Description

mw2220.codfw.wmnet has broken IPMI / mgmt / DRAC.

It showed up when the reimaging script failed because remote IPMI failed.

I started the debugging steps from https://wikitech.wikimedia.org/wiki/Management_Interfaces#Does_IPMI_work_locally?

The check fails right away locally, and not with the "typical" error but with an unusual one:

[mw2220:~] $ sudo ipmi-chassis --get-chassis-status
ipmi_cmd_get_chassis_status: bad completion code

The other test commands also fail in unusual ways:

[mw2220:~] $ sudo ipmi-config --section=Lan_Channel --key-pair="Lan_Channel:Volatile_Access_Mode=Always_Available" --key-pair="Lan_Channel:Non_Volatile_Access_Mode=Always_Available" --diff
Unable to get Number of Users

Finally, I also can't ssh to mgmt to reset the DRAC:

ssh root@mw2220.mgmt.codfw.wmnet

channel 0: open failed: connect failed: Connection timed out
stdio forwarding failed
ssh_exchange_identification: Connection closed by remote host
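
The order of checks above (local IPMI first, then mgmt SSH, then escalation) can be written down as a rough sketch. This is not an existing tool: `local_ipmi_ok` and `next_step` are hypothetical names, and the sketch assumes `ipmi-chassis` exits non-zero when it prints an error such as the one above. Only the command itself is copied from this task.

```python
import subprocess

# Command copied verbatim from the task description.
LOCAL_CHECK = ["sudo", "ipmi-chassis", "--get-chassis-status"]


def local_ipmi_ok(run=subprocess.run):
    """Return (ok, output) for the local IPMI check.

    Assumption: a non-zero exit code means the check failed.
    """
    try:
        proc = run(LOCAL_CHECK, capture_output=True, text=True, timeout=60)
    except (OSError, subprocess.TimeoutExpired) as exc:
        # The tool is missing or hung; treat that as a failed check.
        return False, str(exc)
    return proc.returncode == 0, (proc.stdout + proc.stderr).strip()


def next_step(local_ok, mgmt_ssh_ok):
    """Hypothetical helper mirroring the order of checks in this task."""
    if local_ok:
        return "local IPMI works; continue with the remaining wikitech steps"
    if mgmt_ssh_ok:
        return "local IPMI fails; reset the DRAC over mgmt SSH"
    return "local IPMI fails and mgmt is unreachable; ask DC-Ops to check on site"
```

For this host both checks failed, so `next_step(False, False)` lands on the last branch, which is the request that follows.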

Could you please check on it locally?

Event Timeline

Dzahn created this task. Wed, Feb 3, 6:38 PM
Restricted Application added a subscriber: Aklapper. Wed, Feb 3, 6:38 PM
wiki_willy assigned this task to Papaul. Fri, Feb 5, 10:27 PM
wiki_willy added a project: DC-Ops.

@Dzahn if the server is still in service, can it please be depooled so I can work on it tomorrow while on site?

Thanks

Mentioned in SAL (#wikimedia-operations) [2021-02-08T23:52:46Z] <dzahn@cumin1001> START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on mw2220.codfw.wmnet with reason: T273803

Mentioned in SAL (#wikimedia-operations) [2021-02-08T23:52:51Z] <dzahn@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on mw2220.codfw.wmnet with reason: T273803

Dzahn added a comment. Mon, Feb 8, 11:54 PM

@Papaul Server is set to pooled=inactive and downtime for 2 days. Go ahead and thank you!

Thank you, I will work on it tomorrow.

Papaul closed this task as Resolved. Tue, Feb 9, 4:05 PM

@Dzahn
Drained power and upgraded the iDRAC firmware from 2.30.30.30 to 2.63.

All looks good now.

Dzahn added a comment. Tue, Feb 9, 5:04 PM

Thank you @Papaul. Reimaging it now and things look normal.