es1019: reseat IPMI
Closed, Resolved, Public

Description

Looks like es1019 has an issue with the IPMI.
Per icinga:

ipmi_sdr_cache_open: /root/.freeipmi/sdr-cache/sdr-cache-es1019.localhost: internal IPMI error

I wanted to soft-restart the IPMI, but I cannot access it; it looks down.

ssh es1019.mgmt.eqiad.wmnet -lroot
channel 0: open failed: connect failed: Connection timed out


root@cumin1001:~# ping es1019.mgmt.eqiad.wmnet
PING es1019.mgmt.eqiad.wmnet (10.65.4.44) 56(84) bytes of data.
^C
--- es1019.mgmt.eqiad.wmnet ping statistics ---
128 packets transmitted, 0 received, 100% packet loss, time 130024ms
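
The two manual checks above (ssh and ping against the management address) can be approximated with a small script. This is only a minimal sketch, assuming the management interface normally accepts TCP connections on port 22; the function name `mgmt_port_open` and the default timeout are hypothetical and not part of any existing tooling.

```python
import socket


def mgmt_port_open(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper: port 22 is an assumption based on the ssh attempt
    quoted in the task description.
    """
    try:
        # create_connection resolves the name and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Timeouts, refused connections and DNS failures are all OSError
        # subclasses, so they are all reported as "not reachable".
        return False
```

With the interface in the state shown above, `mgmt_port_open("es1019.mgmt.eqiad.wmnet")` would be expected to return `False`, since the ssh connection attempt timed out.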

I am not sure whether the on-site IPMI reseat needs the host to go fully down. If so, please coordinate with me, as we'd need to depool it first.

This is not the first occurrence on this host:
T201132
T187530
T213422
T155691

Event Timeline

Marostegui created this task.
Marostegui moved this task from Triage to Blocked external/Not db team on the DBA board.

@jcrespo yes, this seems to be an issue and when it's down I will see if there are any f/w updates for it. Is this something you need right away? I can put it on the schedule for Thursday but it will not be until my afternoon.

@Cmjohnson - Thursday sounds good. I can leave the host depooled, downtimed and off, so you can tackle it in your afternoon.
Just leave it powered on once you are done and I will take care of it on Friday when I get online.

Thank you!

yes, this seems to be an issue

I hope you understood this was a rant directed towards the machine/vendor
only and for background info. I don't think we will get rid of it until we
renew the hw, but ofc no problem on trying an upgrade.

Mentioned in SAL (#wikimedia-operations) [2020-02-06T13:51:58Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool es1019 for onsite maintenance T243963', diff saved to https://phabricator.wikimedia.org/P10321 and previous config saved to /var/cache/conftool/dbconfig/20200206-135157-marostegui.json

Change 570643 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] es1019: Disable notifications

https://gerrit.wikimedia.org/r/570643

Change 570643 merged by Marostegui:
[operations/puppet@production] es1019: Disable notifications

https://gerrit.wikimedia.org/r/570643

Mentioned in SAL (#wikimedia-operations) [2020-02-06T13:55:11Z] <marostegui> Stop MySQL on es1019, upgrade and poweroff for on-site maintenance - T243963

@Cmjohnson es1019 is off.
Once you are done, just start it back up and I will take it from there.

Thank you!

From what I can see this maintenance did not happen yesterday in the end - as the host is still off and its IPMI is unreachable. And as the IPMI is involved, I cannot power this host back on for the weekend.
Even if we don't troubleshoot the IPMI today, can someone power it back on today so we don't go thru the weekend without it - and also provide me a new date for the IPMI reseat?

Thanks

Performed flea power drain. Powered on host.

Thanks - IPMI is back. I will take it from here.
Thank you!

Mentioned in SAL (#wikimedia-operations) [2020-02-07T17:19:04Z] <marostegui> Start MySQL on es1019 after onsite maintenance T243963

Mentioned in SAL (#wikimedia-operations) [2020-02-07T17:25:42Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Slowly repool es1019 after on-site maintenance T243963', diff saved to https://phabricator.wikimedia.org/P10349 and previous config saved to /var/cache/conftool/dbconfig/20200207-172541-marostegui.json

John found this: https://www.dell.com/support/article/es/es/esbsdt1/sln316859/idrac7-idrac8-idrac-unresponsive-or-sluggish-performance?lang=en which is an update from May 2019, so maybe we should try it.
Should we schedule a maintenance window to get it updated?

Mentioned in SAL (#wikimedia-operations) [2020-02-07T17:38:50Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Slowly repool es1019 after on-site maintenance T243963', diff saved to https://phabricator.wikimedia.org/P10350 and previous config saved to /var/cache/conftool/dbconfig/20200207-173850-marostegui.json

@Marostegui We can upgrade the f/w. That can be anytime, please pick a convenient date for you.

Can we do it tomorrow at the most convenient time for you?

@Cmjohnson I can have the host depooled and off tomorrow in the UTC morning so you can do it whenever you can tomorrow, and once done, just power it back on. Would that work?

@Cmjohnson - looks like it's too late to do this one today, since they need 24hrs to depool. Chatted with @Marostegui briefly, so just let them know when's a good date for you to upgrade the firmware, and they can prep for it then. Thanks, Willy

Any ETA?
Thanks!

Mentioned in SAL (#wikimedia-operations) [2020-02-25T12:46:51Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool es1019 for on-site maintenance - T243963', diff saved to https://phabricator.wikimedia.org/P10512 and previous config saved to /var/cache/conftool/dbconfig/20200225-124650-marostegui.json

I attempted to update the iDRAC f/w for es1019, but the update failed several times because the package signature could not be verified. The update was downloaded directly from Dell's portal and I did not see any way of verifying the signature. I removed it and downloaded it again, and it still failed. I also attempted to load it directly from a USB stick, but the file was not seen by the Lifecycle Controller and that failed too.

JID_826621862819 Firmware Update: iDRAC Failed

Start Time: Not Applicable
Expiration Time: Not Applicable
Message: RED007: Unable to verify Update Package signature.
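
When comparing failed update attempts, a pasted job report like the one above can be turned into structured fields. The following is a minimal sketch that only assumes the layout shown here (a `JID_...` line followed by `Key: value` lines); `parse_job_report` and the field names it produces are hypothetical and not part of any Dell tool.

```python
import re

# Matches the first line of the report: a job ID followed by a summary.
JOB_ID_RE = re.compile(r"^(JID_\d+)\s+(.*)$")


def parse_job_report(text: str) -> dict:
    """Parse a pasted job report into a flat dict of fields."""
    report = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = JOB_ID_RE.match(line)
        if match:
            report["job_id"] = match.group(1)
            report["summary"] = match.group(2)
            continue
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if sep:
            report[key.strip().lower().replace(" ", "_")] = value.strip()
    # Split "RED007: Unable to ..." into a message ID and its text.
    msg_id, sep, msg_text = report.get("message", "").partition(":")
    if sep:
        report["message_id"] = msg_id.strip()
        report["message_text"] = msg_text.strip()
    return report
```

Applied to the report above, this yields the job ID `JID_826621862819` and the message ID `RED007`, which makes it easier to check whether repeated attempts fail with the same message.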

Thanks Chris for tackling this.
Let's not spend more time on this host, it has a big history of failing idrac :(
So let's just make sure it is available and if it fails in a few months again, let's reseat it again.
This host is meant to be refreshed next FY, so hopefully we will be able to get rid of it in a few months' time.

Thanks!

Will close this, then, once the host is fully back into production.

The reseat was completed but the iDRAC f/w update failed. Resolving the task; we will do flea power drains if and when the iDRAC freezes again.

Mentioned in SAL (#wikimedia-operations) [2020-02-25T16:02:15Z] <jynus@cumin1001> dbctl commit (dc=all): 'repool es1019 with low load after maintenance T243963', diff saved to https://phabricator.wikimedia.org/P10516 and previous config saved to /var/cache/conftool/dbconfig/20200225-160215-jynus.json

Mentioned in SAL (#wikimedia-operations) [2020-02-25T17:21:34Z] <jynus@cumin1001> dbctl commit (dc=all): 'increase es1019 load to 50% T243963', diff saved to https://phabricator.wikimedia.org/P10519 and previous config saved to /var/cache/conftool/dbconfig/20200225-172133-jynus.json

es1019 is just pending the last config push back to normal traffic weights (and reducing the master's).

Mentioned in SAL (#wikimedia-operations) [2020-02-26T06:16:41Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Restore es1017 (master) original weight (0) T243963', diff saved to https://phabricator.wikimedia.org/P10529 and previous config saved to /var/cache/conftool/dbconfig/20200226-061640-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-02-26T06:17:11Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Fully repool es1019 after on-site maintenance T243963', diff saved to https://phabricator.wikimedia.org/P10530 and previous config saved to /var/cache/conftool/dbconfig/20200226-061710-marostegui.json
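
The repool above was done in stages (low load, then 50%, then full). As an illustration only, the weight to apply at each stage could be computed as below. This is a sketch under stated assumptions: the function name `staged_weights`, the 10% first step and the example full weight of 100 are hypothetical. Only the 50% and full steps appear in the log entries, and none of this reflects how dbctl itself works.

```python
from typing import List, Sequence


def staged_weights(full_weight: int,
                   fractions: Sequence[float] = (0.1, 0.5, 1.0)) -> List[int]:
    """Return the integer weight to apply at each repool stage.

    fractions are the shares of the full weight to apply, in order.
    The default first step (10%) is an assumption for illustration.
    """
    if full_weight < 0:
        raise ValueError("full_weight must not be negative")
    steps = []
    for fraction in fractions:
        if not 0 < fraction <= 1:
            raise ValueError("each fraction must be in (0, 1]")
        steps.append(round(full_weight * fraction))
    return steps
```

For a hypothetical full weight of 100, `staged_weights(100)` gives `[10, 50, 100]`; each value would be committed as a separate step, leaving time between steps to watch the host.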

Done!
Thank you for handling this