es1019 IPMI and its management interface are unresponsive (again)
Closed, Resolved · Public

Assigned To

Authored By

	jcrespo
	Jan 10 2019, 12:50 PM

Description

This is the fifth time this has happened. Previous occurrences: T120689, T155691, T187530, T201132.

At least, this is now detected automatically:

es1019	IPMI Sensor Status:	UNKNOWN	2019-01-10 12:42:40	31d 12h 55m 2s	3/3	ipmi_sdr_cache_open: /root/.freeipmi/sdr-cache/sdr-cache-es1019.localhost: internal IPMI error

Both SSH and IPMI interfaces are down.
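For anyone wanting to triage such alerts in bulk, here is a minimal sketch that splits a status row like the one above into fields and flags the "internal IPMI error" case. The field names (`host`, `service`, `status`, and so on) and the helper names are assumptions made for illustration; only the tab-separated layout and the error text come from the pasted row.

```python
from typing import NamedTuple


class StatusRow(NamedTuple):
    # Field names are illustrative guesses at what each column means.
    host: str
    service: str
    status: str
    last_check: str
    duration: str
    attempt: str
    info: str


def parse_status_row(line: str) -> StatusRow:
    """Split one tab-separated status row into its seven fields."""
    parts = [p.strip() for p in line.rstrip("\n").split("\t", 6)]
    if len(parts) != 7:
        raise ValueError(f"expected 7 tab-separated fields, got {len(parts)}")
    # Drop the trailing colon from the service label ("IPMI Sensor Status:").
    parts[1] = parts[1].rstrip(":")
    return StatusRow(*parts)


def is_internal_ipmi_error(row: StatusRow) -> bool:
    """True when the check is UNKNOWN and reports an internal IPMI error."""
    return row.status == "UNKNOWN" and "internal IPMI error" in row.info
```

A row that matches only tells you the check could not talk to the BMC; it does not by itself confirm that the management interface is down.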

Details

Subject	Repo	Branch	Lines +/-
db-eqiad.php: Fully repool es1019	operations/mediawiki-config	master	+1 -1
db-eqiad.php: Slowly repool es1019	operations/mediawiki-config	master	+1 -1
mariadb: Depool es1019 for maintenance	operations/mediawiki-config	master	+3 -3
mariadb: Repool es1019 with low load after maintenance	operations/mediawiki-config	master	+2 -2
mariadb: Depool es1019 for maintenance	operations/mediawiki-config	master	+1 -1


Related Objects

Status	Assigned	Task
Resolved	faidon	T125205 Monitor hardware thermal issues
Resolved	faidon	T167121 Several hosts return "internal IPMI error" in the check_ipmi_temp check
Resolved	Volans	T193155 IPMI Audit 2018-04
Resolved	• Cmjohnson	T213422 es1019 IPMI and its management interface are unresponsive (again)

Event Timeline

jcrespo created this task. Jan 10 2019, 12:50 PM

Restricted Application added a project: SRE. Jan 10 2019, 12:50 PM

I will first try remote debugging techniques myself.

jcrespo added parent tasks: T167121: Several hosts return "internal IPMI error" in the check_ipmi_temp check, T193155: IPMI Audit 2018-04. Jan 10 2019, 12:53 PM

@jcrespo you could first try any of the known fixes listed at https://wikitech.wikimedia.org/wiki/Management_Interfaces (aliased from IPMI), and of course feel free to expand that page if this turns out to be something different.

That was the plan :-)

@Volans I have no ssh, https or ipmi access, so there is nothing I can do about it. This needs a power drain.

@Cmjohnson Please ping us in advance, or let's schedule the maintenance well ahead of time, as the host is still in production (and it may take some hours to switch production traffic away).

Change 483415 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Depool es1019 for maintenance

https://gerrit.wikimedia.org/r/483415

gerritbot added a project: Patch-For-Review. Jan 10 2019, 2:04 PM

Change 483415 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Depool es1019 for maintenance

https://gerrit.wikimedia.org/r/483415

I need to power this off and unplug it for 10-20 seconds. Let me know if I can do that today.

@Cmjohnson Sorry, cannot today for both organizational reasons (@ at meeting today) and technical ones (cannot depool today due to traffic without being too disruptive). Let's try Tuesday if you are ok with that?

@jcrespo Sure...Tuesday works

Mentioned in SAL (#wikimedia-operations) [2019-01-15T13:53:08Z] <marostegui> Downtime db1115 and es1019 for 4 hours - T196726 T213422

Mentioned in SAL (#wikimedia-operations) [2019-01-15T16:00:58Z] <jynus> stop es1019 for hw maintenance T213422

Waiting for Chris to be available before shutting it down fully (otherwise I wouldn't be able to bring it back up).

@Cmjohnson Please check whether there is an update to the mgmt firmware. This has already happened 5 times, and an update might save us from pinging you so often.

es1019 is back up and mgmt is working. Not starting MySQL though, until Chris confirms everything is done and OK (and he is understandably busy right now).

In T213422#4882368, @jcrespo wrote:

es1019 is back up and mgmt is working. Not starting MySQL though, until Chris confirms everything is done and OK (and he is understandably busy right now).

I think we are good to go

Taking care of it.

While the server was down I updated the BIOS, RAID firmware and hardware firmware to the latest versions.

Change 484673 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Repool es1019 with low load after maintenance

https://gerrit.wikimedia.org/r/484673

Change 484673 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Repool es1019 with low load after maintenance

https://gerrit.wikimedia.org/r/484673

es1019 is back in service.

This has happened again. I guess a cold reset is needed?

09:52:20 <+icinga-wm> PROBLEM - Host es1019.mgmt is DOWN: PING CRITICAL - Packet loss = 100%

root@cumin1001:~# ping -c10 es1019.mgmt.eqiad.wmnet
PING es1019.mgmt.eqiad.wmnet (10.65.4.44) 56(84) bytes of data.

--- es1019.mgmt.eqiad.wmnet ping statistics ---
10 packets transmitted, 0 received, 100% packet loss, time 9203ms

Locally, it doesn't work either:

root@es1019:~# ipmi-chassis --get-chassis-status
ipmi_cmd_get_chassis_status: bad completion code

I guess it needs another power drain as done here T213422#4869949
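The two symptoms above (100% packet loss to the mgmt address, and a "bad completion code" from the local `ipmi-chassis` call) could be combined into a rough triage helper. This is only a sketch under assumptions: the function names are hypothetical, the parsing targets the summary line format shown in the ping output above, and "both signals present means a power drain is likely needed" is the heuristic guessed at in this thread, not a documented rule.

```python
import re

# Matches the ping summary line, e.g.
# "10 packets transmitted, 0 received, 100% packet loss, time 9203ms"
_PING_SUMMARY = re.compile(
    r"(\d+) packets transmitted, (\d+) received, .*?(\d+(?:\.\d+)?)% packet loss"
)


def packet_loss_percent(ping_output: str) -> float:
    """Extract the packet loss percentage from captured ping output."""
    match = _PING_SUMMARY.search(ping_output)
    if match is None:
        raise ValueError("no ping summary line found")
    return float(match.group(3))


def bmc_looks_wedged(ping_output: str, ipmi_output: str) -> bool:
    """Heuristic: mgmt unreachable over the network AND local IPMI failing."""
    unreachable = packet_loss_percent(ping_output) == 100.0
    local_failure = "bad completion code" in ipmi_output
    return unreachable and local_failure
```

If only one of the two signals is present (for example, mgmt does not answer pings but local IPMI works), the cause may be network-side rather than the BMC itself, so the helper deliberately returns `False` in that case.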

Maintenance_bot removed a project: Patch-For-Review. May 22 2019, 3:40 PM

Cmjohnson moved this task from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board. May 28 2019, 2:54 PM

Change 513281 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Depool es1019 for maintenance

https://gerrit.wikimedia.org/r/513281

gerritbot added a project: Patch-For-Review. May 30 2019, 1:14 PM

Change 513281 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Depool es1019 for maintenance

https://gerrit.wikimedia.org/r/513281

Maintenance_bot removed a project: Patch-For-Review. May 30 2019, 3:11 PM

Mentioned in SAL (#wikimedia-operations) [2019-05-30T15:36:57Z] <jynus> stop es1019 for maintenance T213422

@Cmjohnson Was this done yesterday? I'm in no rush on this, but if it wasn't, I would need you to switch it on (I cannot access the management interface) so that it is up during the weekend.

@jcrespo the server is back on and I am able to reach the mgmt interface.

Confirmed, I can access mgmt interface as well. I am going to enable puppet and start MySQL.

Mentioned in SAL (#wikimedia-operations) [2019-05-31T18:44:07Z] <marostegui> Start MySQL on es1019 - T213422

Change 513676 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Slowly repool es1019

https://gerrit.wikimedia.org/r/513676

gerritbot added a project: Patch-For-Review. May 31 2019, 7:18 PM

Change 513676 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Slowly repool es1019

https://gerrit.wikimedia.org/r/513676

Mentioned in SAL (#wikimedia-operations) [2019-06-03T04:56:47Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Slowly repool es1019 T213422 (duration: 00m 51s)

Mentioned in SAL (#wikimedia-operations) [2019-06-03T05:15:21Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: More traffic to es1019 T213422 (duration: 00m 46s)

Change 513946 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Fully repool es1019

https://gerrit.wikimedia.org/r/513946

Change 513946 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Fully repool es1019

https://gerrit.wikimedia.org/r/513946

Mentioned in SAL (#wikimedia-operations) [2019-06-03T05:50:29Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Fully repool es1019 T213422 (duration: 00m 46s)

Host fully repooled with its original weight.
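The repool above was done in stages (slow repool, more traffic, full repool). A minimal sketch of how such a ramp could be computed follows. The fractions and the example weight are hypothetical, chosen only for illustration; the actual weights used in `db-eqiad.php` are not shown in this task.

```python
from typing import List, Sequence


def repool_steps(
    full_weight: int, fractions: Sequence[float] = (0.25, 0.5, 1.0)
) -> List[int]:
    """Return strictly increasing weights ending at full_weight.

    The default fractions are an assumption, not the values used in production.
    """
    if full_weight < 1:
        raise ValueError("full_weight must be a positive integer")
    steps: List[int] = []
    for fraction in fractions:
        if not 0 < fraction <= 1:
            raise ValueError("fractions must be in (0, 1]")
        weight = max(1, round(full_weight * fraction))
        # Skip steps that would not actually raise the weight.
        if not steps or weight > steps[-1]:
            steps.append(weight)
    # Always finish at the original weight.
    if not steps or steps[-1] != full_weight:
        steps.append(full_weight)
    return steps
```

For example, `repool_steps(100)` yields three stages, each of which would correspond to one config change and sync, with time in between to let the host warm up under partial load.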

Maintenance_bot removed a project: Patch-For-Review. Sep 23 2019, 2:11 PM

jcrespo mentioned this in T233698: es1019 IPMI and its management interface are unresponsive (again2). Sep 24 2019, 8:52 AM

jcrespo mentioned this in T243963: es1019: reseat IPMI. Jan 30 2020, 5:37 PM
