es1019 IPMI and its management interface are unresponsive (again)
Closed, Resolved, Public

Description

This is the fifth time this has happened. Previous occurrences: T120689, T155691, T187530, T201132

At least, this is now detected automatically:

es1019	IPMI Sensor Status:	UNKNOWN	2019-01-10 12:42:40	31d 12h 55m 2s	3/3	ipmi_sdr_cache_open: /root/.freeipmi/sdr-cache/sdr-cache-es1019.localhost: internal IPMI error

Both SSH and IPMI interfaces are down.

Event Timeline

jcrespo triaged this task as Medium priority.

I will first try remote debugging techniques myself.

@jcrespo you could first try any of the known fixes listed at https://wikitech.wikimedia.org/wiki/Management_Interfaces (aliased from IPMI), and of course feel free to expand that page if this turns out to be something different.

jcrespo added a subscriber: Cmjohnson.

@Volans I have no ssh, https or ipmi access, so there is nothing I can do about it. This needs a power drain.

@Cmjohnson Please ping us in advance, or let's schedule the maintenance well ahead of time, as the host is still in production (and it may take some hours to switch production traffic away).

Change 483415 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Depool es1019 for maintenance

https://gerrit.wikimedia.org/r/483415

Change 483415 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Depool es1019 for maintenance

https://gerrit.wikimedia.org/r/483415

I need to power this off and unplug it for 10-20 secs. LMK if I can do that today

@Cmjohnson Sorry, I cannot today, for both organizational reasons (@ at meeting today) and technical ones (I cannot depool today without being too disruptive, due to traffic). Let's try Tuesday, if you are OK with that?

Mentioned in SAL (#wikimedia-operations) [2019-01-15T16:00:58Z] <jynus> stop es1019 for hw maintenance T213422

Waiting for Chris to be available before fully shutting it down (otherwise I wouldn't be able to bring it back up).

@Cmjohnson Please check whether there is an update to the mgmt firmware. This has happened 5 times already, and an update might avoid us pinging you so often.

es1019 is back up and mgmt is working. I am not starting MySQL yet, though, until Chris confirms everything is done and OK (and he is understandably busy right now).

I think we are good to go

While the server was down, I updated the BIOS, RAID firmware and hardware firmware to the latest versions.

Change 484673 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Repool es1019 with low load after maintenance

https://gerrit.wikimedia.org/r/484673

Change 484673 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Repool es1019 with low load after maintenance

https://gerrit.wikimedia.org/r/484673

jcrespo reassigned this task from jcrespo to Cmjohnson.

es1019 is back into service.

This has happened again. I guess a cold reset is needed?

09:52:20 <+icinga-wm> PROBLEM - Host es1019.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
root@cumin1001:~# ping -c10 es1019.mgmt.eqiad.wmnet
PING es1019.mgmt.eqiad.wmnet (10.65.4.44) 56(84) bytes of data.

--- es1019.mgmt.eqiad.wmnet ping statistics ---
10 packets transmitted, 0 received, 100% packet loss, time 9203ms
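
For anyone wanting to script the check above, here is a minimal sketch that reads the packet-loss figure out of a ping summary like the one pasted. The function names are hypothetical, and the regular expression assumes the iputils-style summary line shown in this comment; other ping implementations may format it differently.

```python
import re


def packet_loss_percent(ping_output: str) -> float:
    """Return the packet-loss percentage found in a ping summary line."""
    match = re.search(r"(\d+(?:\.\d+)?)% packet loss", ping_output)
    if match is None:
        raise ValueError("no packet-loss summary found in ping output")
    return float(match.group(1))


def mgmt_is_down(ping_output: str) -> bool:
    """Treat 100% packet loss as 'management interface unreachable'."""
    return packet_loss_percent(ping_output) == 100.0


# Sample text copied from the output quoted in this task.
sample = """PING es1019.mgmt.eqiad.wmnet (10.65.4.44) 56(84) bytes of data.

--- es1019.mgmt.eqiad.wmnet ping statistics ---
10 packets transmitted, 0 received, 100% packet loss, time 9203ms
"""

if __name__ == "__main__":
    print(mgmt_is_down(sample))
```

This only classifies text that has already been captured; it does not send any packets itself.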

Locally, it doesn't work either:

root@es1019:~# ipmi-chassis --get-chassis-status
ipmi_cmd_get_chassis_status: bad completion code
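
The local check can be wrapped in a similar way. This is a rough sketch only: the helper names are made up, the error fragments are simply the two strings quoted in this task (not an exhaustive list), and treating any non-zero exit code as a failure is an assumption about how `ipmi-chassis` behaves.

```python
import subprocess

# Error fragments taken from the output quoted in this task; not exhaustive.
KNOWN_ERROR_FRAGMENTS = ("bad completion code", "internal IPMI error")


def looks_unresponsive(returncode: int, output: str) -> bool:
    """Classify the result of a local IPMI query as failed or not.

    Assumption: a non-zero exit code, or any known error fragment in the
    output, means the BMC did not answer properly.
    """
    if returncode != 0:
        return True
    return any(fragment in output for fragment in KNOWN_ERROR_FRAGMENTS)


def check_local_bmc() -> bool:
    """Run the same command as above; needs root and FreeIPMI installed."""
    result = subprocess.run(
        ["ipmi-chassis", "--get-chassis-status"],
        capture_output=True,
        text=True,
        timeout=60,  # raises subprocess.TimeoutExpired if the BMC hangs
    )
    return looks_unresponsive(result.returncode, result.stdout + result.stderr)
```

`check_local_bmc()` is deliberately not called here, since it only makes sense on the affected host itself.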

I guess it needs another power drain, as was done in T213422#4869949

Change 513281 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Depool es1019 for maintenance

https://gerrit.wikimedia.org/r/513281

Change 513281 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Depool es1019 for maintenance

https://gerrit.wikimedia.org/r/513281

@Cmjohnson Was this done yesterday? I am in no rush on this, but if it wasn't, I would need you to switch the host on (I cannot access the management interface) so that it is up during the weekend.

@jcrespo the server is back on and I am able to reach the mgmt interface.

Confirmed, I can access mgmt interface as well. I am going to enable puppet and start MySQL.

Mentioned in SAL (#wikimedia-operations) [2019-05-31T18:44:07Z] <marostegui> Start MySQL on es1019 - T213422

Change 513676 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Slowly repool es1019

https://gerrit.wikimedia.org/r/513676

Change 513676 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Slowly repool es1019

https://gerrit.wikimedia.org/r/513676

Mentioned in SAL (#wikimedia-operations) [2019-06-03T04:56:47Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Slowly repool es1019 T213422 (duration: 00m 51s)

Mentioned in SAL (#wikimedia-operations) [2019-06-03T05:15:21Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: More traffic to es1019 T213422 (duration: 00m 46s)

Change 513946 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Fully repool es1019

https://gerrit.wikimedia.org/r/513946

Change 513946 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Fully repool es1019

https://gerrit.wikimedia.org/r/513946

Mentioned in SAL (#wikimedia-operations) [2019-06-03T05:50:29Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Fully repool es1019 T213422 (duration: 00m 46s)

Host fully repooled with its original weight.
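
The staged repool above (low load first, then more traffic, then the original weight) can be sketched as a simple ramp. The function name, the fractions and the example weight of 100 are all illustrative assumptions; the real weights live in db-eqiad.php and are not shown in this task.

```python
from typing import List, Sequence


def repool_steps(full_weight: int,
                 fractions: Sequence[float] = (0.25, 0.5, 1.0)) -> List[int]:
    """Return increasing weights that end at full_weight.

    Each fraction is applied to the full weight, rounded, and kept only if
    it is higher than the previous step, so small weights do not repeat.
    """
    if full_weight <= 0:
        raise ValueError("full_weight must be positive")
    steps: List[int] = []
    for fraction in fractions:
        weight = max(1, round(full_weight * fraction))
        if not steps or weight > steps[-1]:
            steps.append(weight)
    # Always finish on the original weight, whatever the fractions were.
    if not steps or steps[-1] != full_weight:
        steps.append(full_weight)
    return steps


if __name__ == "__main__":
    # Hypothetical original weight of 100.
    for weight in repool_steps(100):
        print(f"set es1019 weight to {weight}, sync, then watch load")
```

Each step would still be its own config change and sync, as in the SAL entries above; the sketch only computes the sequence of weights.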