
db2087 internal IPMI error
Closed, Resolved, Public

Description

The db2087 IPMI check has been failing since 2020-06-30 with the following error:

db2087 IPMI Sensor Status	UNKNOWN ipmi_sdr_cache_open: /root/.freeipmi/sdr-cache/sdr-cache-db2087.localhost: internal IPMI error

Local and remote IPMI access both fail with ipmi_cmd_get_chassis_status: bad completion code / Error: Unable to establish IPMI v2 / RMCP+ session. SSH connection is unavailable.
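For anyone triaging similar alerts, here is a minimal Python sketch that splits a check line like the one quoted above into host, service, state and message. The line layout is an assumption based on this single example, and `parse_check_line`, `failing_call` and `reason` are hypothetical names, not part of any monitoring tool.

```python
import re
from typing import Dict, Optional

# OK/WARNING/CRITICAL/UNKNOWN are the standard Nagios/Icinga plugin states.
# The "host service STATE output" layout is an assumption taken from the
# one line quoted in this task.
CHECK_LINE = re.compile(
    r"^(?P<host>\S+)\s+(?P<service>.+?)\s+"
    r"(?P<state>OK|WARNING|CRITICAL|UNKNOWN)\s+(?P<output>.*)$"
)


def parse_check_line(line: str) -> Optional[Dict[str, str]]:
    """Return the fields of a check line, or None if it does not match."""
    match = CHECK_LINE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # The plugin output here is "call: path: reason"; keep first and last part.
    parts = fields["output"].split(": ")
    fields["failing_call"] = parts[0]
    fields["reason"] = parts[-1]
    return fields


line = (
    "db2087 IPMI Sensor Status\tUNKNOWN ipmi_sdr_cache_open: "
    "/root/.freeipmi/sdr-cache/sdr-cache-db2087.localhost: internal IPMI error"
)
parsed = parse_check_line(line)
print(parsed["state"], "|", parsed["failing_call"], "|", parsed["reason"])
```

The UNKNOWN state matters here: the plugin could not read the sensors at all, which is different from a sensor reporting a bad value.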

While this is not an immediate concern, it would block a restart/reimage of the server without having someone onsite (which I am guessing is a possibility with upcoming vacation times).

@Marostegui told me he tried to debug it remotely (and I double-checked), but it appears to require a local power drain/reset or some other on-site check.

Event Timeline

Restricted Application added a subscriber: Aklapper.
wiki_willy added a project: DC-Ops.

@jcrespo - just a heads up, we won't have anyone onsite in the next couple weeks, so this may need to wait until August or we could utilize remote hands beforehand. Thanks, Willy

No urgency then. This has been ongoing for a few days; I checked and there is no planned maintenance and no user impact, but it would be nice to have it done by September, when the switchover is planned.

I will be on site tomorrow (Monday). Please depool the server. Thanks

Mentioned in SAL (#wikimedia-operations) [2020-07-27T04:58:34Z] <marostegui> Stop MySQL on db2087 for on-site maintenance T258587

Mentioned in SAL (#wikimedia-operations) [2020-07-27T05:00:59Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db2087:3316, db2087:3317 for on-site maintenance T258587', diff saved to https://phabricator.wikimedia.org/P12042 and previous config saved to /var/cache/conftool/dbconfig/20200727-050058-marostegui.json

Thank you @Papaul - server depooled and powered off.
Once you are done, please power it back up.

Thanks!

Before
BIOS Version: 2.4.3
Firmware Version: 2.40.40.40
Lifecycle Controller Firmware: 2.40.40.40

After
BIOS Version: 2.11.0
Firmware Version: 2.70.70.70
Lifecycle Controller Firmware: 2.70.70.70

This is complete.
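As a quick sanity check on the versions listed above, the following sketch compares them numerically, component by component. It is only an illustration: `parse_version` is a hypothetical helper, and it assumes the versions are plain dot-separated integers, which holds for the values in this task.

```python
from typing import Dict, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '2.11.0' into (2, 11, 0) so comparison is numeric, not textual."""
    return tuple(int(part) for part in text.split("."))


# Values copied from the before/after listing in this task.
before: Dict[str, str] = {
    "BIOS Version": "2.4.3",
    "Firmware Version": "2.40.40.40",
    "Lifecycle Controller Firmware": "2.40.40.40",
}
after: Dict[str, str] = {
    "BIOS Version": "2.11.0",
    "Firmware Version": "2.70.70.70",
    "Lifecycle Controller Firmware": "2.70.70.70",
}

upgraded = {
    name: parse_version(before[name]) < parse_version(after[name])
    for name in before
}

for name, ok in upgraded.items():
    print(f"{name}: {before[name]} -> {after[name]} (newer: {ok})")
```

The tuple comparison is deliberate: as plain strings, "2.4.3" sorts after "2.11.0", so a textual comparison would wrongly report the BIOS as downgraded.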

Mentioned in SAL (#wikimedia-operations) [2020-07-27T16:33:11Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Repool db2087:3316, db2087:3317 after on-site maintenance T258587', diff saved to https://phabricator.wikimedia.org/P12063 and previous config saved to /var/cache/conftool/dbconfig/20200727-163311-marostegui.json
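For the record, the length of the depool window can be derived from the two dbctl SAL entries in this timeline. This is a rough sketch: `sal_time` is a hypothetical helper, the regular expression assumes the bracketed UTC timestamp format seen in these entries, and the result measures config commit to config commit, not actual hardware downtime.

```python
import re
from datetime import datetime, timedelta

# Matches the bracketed UTC timestamp used in the SAL entries of this task.
SAL_TIMESTAMP = re.compile(r"\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})Z\]")


def sal_time(entry: str) -> datetime:
    """Extract the timestamp of a SAL entry as a naive UTC datetime."""
    match = SAL_TIMESTAMP.search(entry)
    if match is None:
        raise ValueError("no SAL timestamp found")
    return datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")


depool_entry = (
    "Mentioned in SAL (#wikimedia-operations) [2020-07-27T05:00:59Z] "
    "<marostegui@cumin1001> dbctl commit (dc=all): "
    "'Depool db2087:3316, db2087:3317 for on-site maintenance T258587'"
)
repool_entry = (
    "Mentioned in SAL (#wikimedia-operations) [2020-07-27T16:33:11Z] "
    "<marostegui@cumin1001> dbctl commit (dc=all): "
    "'Repool db2087:3316, db2087:3317 after on-site maintenance T258587'"
)

window = sal_time(repool_entry) - sal_time(depool_entry)
print(f"db2087 was depooled for {window}")
```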