es1019 IPMI and its management interface are unresponsive (again)
Closed, Resolved · Public

Description

This has now happened for the fifth time. Previous occurrences: T120689, T155691, T187530, T201132

At least, this is now detected automatically:

es1019	IPMI Sensor Status:	UNKNOWN	2019-01-10 12:42:40	31d 12h 55m 2s	3/3	ipmi_sdr_cache_open: /root/.freeipmi/sdr-cache/sdr-cache-es1019.localhost: internal IPMI error

Both SSH and IPMI interfaces are down.

Details

Related Gerrit Patches:
[operations/mediawiki-config@master] db-eqiad.php: Fully repool es1019
[operations/mediawiki-config@master] db-eqiad.php: Slowly repool es1019
[operations/mediawiki-config@master] mariadb: Depool es1019 for maintenance
[operations/mediawiki-config@master] mariadb: Repool es1019 with low load after maintenance
[operations/mediawiki-config@master] mariadb: Depool es1019 for maintenance

Event Timeline

jcrespo created this task. Jan 10 2019, 12:50 PM
Restricted Application added a project: Operations. Jan 10 2019, 12:50 PM
jcrespo claimed this task. Jan 10 2019, 12:52 PM
jcrespo triaged this task as Medium priority.

I will first try remote debugging techniques myself.

Volans added a subscriber: Volans. Jan 10 2019, 1:13 PM

@jcrespo you could first try any of the known fixes listed at https://wikitech.wikimedia.org/wiki/Management_Interfaces (aliased from IPMI), and of course feel free to expand that page if this turns out to be something different.

That was the plan :-)

jcrespo reassigned this task from jcrespo to Cmjohnson. Jan 10 2019, 1:36 PM
jcrespo added a subscriber: Cmjohnson.

@Volans I have no ssh, https or ipmi access, so there is nothing I can do about it. This needs a power drain.

Please ping us in advance, @Cmjohnson, or let's schedule the maintenance well ahead of time, as the host is still in production (and it may take some hours to switch production traffic away).

Change 483415 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Depool es1019 for maintenance

https://gerrit.wikimedia.org/r/483415

Change 483415 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Depool es1019 for maintenance

https://gerrit.wikimedia.org/r/483415

I need to power this off and unplug it for 10-20 secs. LMK if I can do that today.

@Cmjohnson Sorry, cannot today for both organizational reasons (@ at meeting today) and technical ones (cannot depool today due to traffic without being too disruptive). Let's try Tuesday if you are ok with that?

@jcrespo Sure...Tuesday works

Mentioned in SAL (#wikimedia-operations) [2019-01-15T16:00:58Z] <jynus> stop es1019 for hw maintenance T213422

Waiting for Chris to be available to fully shut it down (as otherwise I wouldn't be able to bring it back up).

@Cmjohnson Please see if there is any update to the mgmt firmware. This has already happened 5 times, and an update might avoid having to ping you so often.

es1019 is back up and mgmt is working. Not starting MySQL, though, until Chris confirms everything is done and OK (and he is understandably busy right now).

I think we are good to go

jcrespo claimed this task. Jan 16 2019, 11:57 AM

Taking care of it.

While the server was down, I updated the BIOS, RAID firmware, and hardware firmware to the latest versions.

Change 484673 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Repool es1019 with low load after maintenance

https://gerrit.wikimedia.org/r/484673

Change 484673 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Repool es1019 with low load after maintenance

https://gerrit.wikimedia.org/r/484673

jcrespo closed this task as Resolved. Jan 16 2019, 3:11 PM
jcrespo reassigned this task from jcrespo to Cmjohnson.

es1019 is back in service.

Marostegui reopened this task as Open. May 12 2019, 8:30 AM

This has happened again. I guess a cold reset is needed?

09:52:20 <+icinga-wm> PROBLEM - Host es1019.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
root@cumin1001:~# ping -c10 es1019.mgmt.eqiad.wmnet
PING es1019.mgmt.eqiad.wmnet (10.65.4.44) 56(84) bytes of data.

--- es1019.mgmt.eqiad.wmnet ping statistics ---
10 packets transmitted, 0 received, 100% packet loss, time 9203ms

Locally, it doesn't work either:

root@es1019:~# ipmi-chassis --get-chassis-status
ipmi_cmd_get_chassis_status: bad completion code

I guess it needs another power drain as done here T213422#4869949
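
For reference, the reachability check shown in the ping transcript above could be scripted roughly as follows. This is only a sketch and not the Icinga check that raised the alert: the function names `parse_packet_loss` and `mgmt_is_down` are hypothetical, and it assumes the Linux `ping` summary format seen in the transcript.

```python
import re
import subprocess

# Matches the summary line printed by Linux ping, for example:
# "10 packets transmitted, 0 received, 100% packet loss, time 9203ms"
_SUMMARY_RE = re.compile(
    r"(\d+) packets transmitted, (\d+) received.*?(\d+(?:\.\d+)?)% packet loss"
)


def parse_packet_loss(ping_output: str) -> float:
    """Return the packet loss percentage found in ping's summary line."""
    match = _SUMMARY_RE.search(ping_output)
    if match is None:
        raise ValueError("no ping summary line found")
    return float(match.group(3))


def mgmt_is_down(host: str, count: int = 10) -> bool:
    """Ping the management interface and report whether every packet was lost.

    Hypothetical helper: treats a missing summary line as "down" as well.
    """
    result = subprocess.run(
        ["ping", f"-c{count}", host],
        capture_output=True,
        text=True,
        check=False,  # ping exits non-zero on packet loss; that is expected here
    )
    try:
        return parse_packet_loss(result.stdout) >= 100.0
    except ValueError:
        return True
```

Calling `mgmt_is_down("es1019.mgmt.eqiad.wmnet")` would repeat the manual check from the transcript. It only covers network reachability, so the local `ipmi-chassis --get-chassis-status` test is still needed to confirm that the controller itself is stuck.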

Change 513281 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Depool es1019 for maintenance

https://gerrit.wikimedia.org/r/513281

Change 513281 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Depool es1019 for maintenance

https://gerrit.wikimedia.org/r/513281

Mentioned in SAL (#wikimedia-operations) [2019-05-30T15:36:57Z] <jynus> stop es1019 for maintenance T213422

@Cmjohnson Was this done yesterday? I don't have any rush on this, but if it wasn't, I would need you to switch it on (I cannot access the management interface) so it is up during the weekend.

Cmjohnson closed this task as Resolved. May 31 2019, 6:40 PM

@jcrespo the server is back on and I am able to reach the mgmt interface.

Confirmed, I can access mgmt interface as well. I am going to enable puppet and start MySQL.

Mentioned in SAL (#wikimedia-operations) [2019-05-31T18:44:07Z] <marostegui> Start MySQL on es1019 - T213422

Change 513676 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Slowly repool es1019

https://gerrit.wikimedia.org/r/513676

Change 513676 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Slowly repool es1019

https://gerrit.wikimedia.org/r/513676

Mentioned in SAL (#wikimedia-operations) [2019-06-03T04:56:47Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Slowly repool es1019 T213422 (duration: 00m 51s)

Mentioned in SAL (#wikimedia-operations) [2019-06-03T05:15:21Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: More traffic to es1019 T213422 (duration: 00m 46s)

Change 513946 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Fully repool es1019

https://gerrit.wikimedia.org/r/513946

Change 513946 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Fully repool es1019

https://gerrit.wikimedia.org/r/513946

Mentioned in SAL (#wikimedia-operations) [2019-06-03T05:50:29Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Fully repool es1019 T213422 (duration: 00m 46s)

Host fully repooled with its original weight.
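
The staged repool above (slowly repool, more traffic, fully repool) can be illustrated with a small sketch that computes the intermediate load-balancer weights. The function name `repool_weights` and the default fractions are assumptions for illustration only; the actual values used in db-eqiad.php are not recorded in this task.

```python
from typing import List, Sequence


def repool_weights(
    original_weight: int,
    fractions: Sequence[float] = (0.25, 0.5, 1.0),  # assumed steps, not from the task
) -> List[int]:
    """Return the weight to deploy at each repool step, ending at the original weight."""
    if original_weight <= 0:
        raise ValueError("original_weight must be positive")
    if not fractions or fractions[-1] != 1.0:
        raise ValueError("the last step must restore the full weight")
    weights = []
    for fraction in fractions:
        if not 0 < fraction <= 1:
            raise ValueError("each fraction must be in (0, 1]")
        # Never go below 1 so the host receives at least some traffic.
        weights.append(max(1, round(original_weight * fraction)))
    return weights
```

Each returned value would correspond to one config change and sync, with time in between to watch the host warm up before sending it more traffic.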