
cp1085 - IPMI not working
Closed, Resolved (Public)

Description

On cp1085, IPMI does not work (noticed because the puppet run sometimes seems to get stuck at the "Loading facts" step).

I tried the commands from https://wikitech.wikimedia.org/wiki/Management_Interfaces

Locally it fails with:

ipmi_cmd_get_chassis_status: driver timeout

Remotely it fails with:

Error: Unable to establish IPMI v2 / RMCP+ session

Trying to SSH to the mgmt interface failed with:

No more sessions are available for this type of connection!

In Icinga there is an existing comment saying this is "known", but there was no ticket for it yet.
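For anyone triaging a similar host, the three symptoms above can be checked together. This is only a minimal sketch: the helper names (`SYMPTOMS`, `failed_paths`, `all_paths_failed`) are hypothetical and not part of any existing tooling, and the only inputs it knows about are the three error strings quoted in this task. It does not run any IPMI or SSH command itself; it just matches text you have already captured.

```python
# Hypothetical helper: maps the three error strings quoted in this task
# to the access path on which each one was seen.
SYMPTOMS = {
    "ipmi_cmd_get_chassis_status: driver timeout": "local IPMI",
    "Unable to establish IPMI v2 / RMCP+ session": "remote IPMI",
    "No more sessions are available for this type of connection!": "mgmt SSH",
}


def failed_paths(outputs):
    """Return the sorted access paths whose known error string appears in any output."""
    found = set()
    for text in outputs:
        for needle, path in SYMPTOMS.items():
            if needle in text:
                found.add(path)
    return sorted(found)


def all_paths_failed(outputs):
    """True when every access path listed in SYMPTOMS shows its known error."""
    return len(failed_paths(outputs)) == len(SYMPTOMS)
```

Pass in the captured command output as a list of strings. On cp1085 all three paths failed at once, which is the situation `all_paths_failed` flags.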

Related Objects

Status: Resolved
Assigned: Jclark-ctr

Event Timeline

Dzahn renamed this task from "cp1085 - IPMI not working - puppet runs stuck" to "cp1085 - IPMI not working". Aug 29 2019, 9:03 AM
Dzahn added projects: SRE, Traffic, ops-eqiad.
Dzahn added subscribers: ema, Cmjohnson.

IPMI sensor status in Icinga is UNKNOWN

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=cp1085&service=IPMI+Sensor+Status

ACKED with link to this ticket

ema triaged this task as Medium priority. Aug 29 2019, 10:50 AM

Looks like the mgmt is locked out, so this server will require a hard reboot and flea power drain. Please let me know when it's safe to turn the server off for 5-10 mins.

Hi @Dzahn - just following up on this one, to see when the server can be taken down. Thanks, Willy

Thanks @wiki_willy. While I was the reporter, caching servers are ultimately handled by the traffic team, so I would like to at least cc: them to ask whether we can depool this "cache::text" server anytime for maintenance.

Dzahn added a parent task: Restricted Task. Sep 25 2019, 2:16 AM

@Papaul confirmed this looks like it needs someone onsite to drain the power. I asked @Vgutierrez about depooling this.

Could I raise the priority a bit due to the relation to T147074?

Dzahn raised the priority of this task from Medium to High. Sep 25 2019, 2:29 AM

@Dzahn - just wanted to confirm that this has been depooled. Thanks, Willy

No, it's not depooled. Let's wait a day please because traffic is mostly out today.

Ok @Dzahn - just let us know when it's ready to go. Thanks, Willy

We can depool it just before shutting it down. Just let us know when you want to do it.

wiki_willy added a subscriber: Jclark-ctr.

Hi @Jclark-ctr - can you hit up @Vgutierrez when you get in during the AM sometime this week to depool the host? You guys have overlap in the mornings, until about 10am ET. Thanks, Willy

Mentioned in SAL (#wikimedia-operations) [2019-10-09T12:28:23Z] <vgutierrez> depooling cp1085 for a power drain - T231525

Icinga downtime for 3:00:00 set by vgutierrez@cumin1001 on 1 host(s) and their services with reason: Power drain

cp1085.eqiad.wmnet

Issue solved after performing a power drain. Thanks @Jclark-ctr

Removed all power from the host, pulled both PSUs, and performed a power drain.

thanks!

mgmt password updated using cookbook.