
cp1085 - IPMI not working
Closed, Resolved · Public

Description

On cp1085, IPMI does not work. I noticed because the puppet run sometimes gets stuck at the "Loading facts" step, which may be related.

I tried the commands from https://wikitech.wikimedia.org/wiki/Management_Interfaces

Locally, they fail with:

ipmi_cmd_get_chassis_status: driver timeout

Remotely, they fail with:

Error: Unable to establish IPMI v2 / RMCP+ session

Trying to SSH to the mgmt interface failed with:

No more sessions are available for this type of connection!

In Icinga there is an existing comment saying this is "known", but there was no ticket yet.
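The three failure messages above can be told apart mechanically. The following is a minimal sketch, not an existing tool: the function names and the symptom labels are hypothetical, and the labels are only one reading of what each message suggests.

```python
# Hypothetical helper: map the error strings quoted in this task to a
# short symptom label. The labels are assumptions, not official diagnoses.
SYMPTOMS = {
    "driver timeout": "local IPMI driver timed out",
    "Unable to establish IPMI v2 / RMCP+ session": "remote IPMI session not established",
    "No more sessions are available": "mgmt SSH has no free sessions",
}


def classify(output: str) -> str:
    """Return the label of the first known error string found in output."""
    for needle, label in SYMPTOMS.items():
        if needle in output:
            return label
    return "unrecognized"


def all_checks_failed(outputs: list[str]) -> bool:
    """True when at least one check ran and every one hit a known error."""
    labels = [classify(text) for text in outputs]
    return bool(labels) and all(label != "unrecognized" for label in labels)
```

If all three checks (local, remote, and SSH) report a known error, as they did here, the management controller itself is the likely culprit rather than any single access path.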

Related Objects

Status | Assigned | Task
Resolved | Jclark-ctr |

Event Timeline

Dzahn created this task. · Aug 29 2019, 9:02 AM
Restricted Application added a subscriber: Aklapper. · Aug 29 2019, 9:02 AM
Dzahn renamed this task from cp1085 - IPMI not working - puppet runs stuck to cp1085 - IPMI not working. · Aug 29 2019, 9:03 AM
Dzahn added projects: Operations, Traffic, ops-eqiad.
Dzahn added subscribers: ema, Cmjohnson.

IPMI sensor status in Icinga is UNKNOWN

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=cp1085&service=IPMI+Sensor+Status

ACKED with link to this ticket

ema moved this task from Triage to Hardware on the Traffic board. · Aug 29 2019, 10:26 AM
ema triaged this task as Normal priority. · Aug 29 2019, 10:50 AM

It looks like the mgmt interface is locked out; this server will require a hard reboot and a flea power drain. Please let me know when it's safe to turn the server off for 5-10 minutes.

Cmjohnson moved this task from Backlog to Procurement on the ops-eqiad board. · Aug 29 2019, 4:35 PM
Cmjohnson moved this task from Procurement to Hardware Failure / Troubleshoot on the ops-eqiad board.

Hi @Dzahn - just following up on this one, to see when the server can be taken down. Thanks, Willy

Dzahn added a comment. · Sep 14 2019, 7:52 AM

Thanks @wiki_willy. While I was the reporter, caching servers are ultimately handled by the traffic team, so I would like to at least cc: them to ask whether we can depool this "cache::text" server anytime for maintenance.

Dzahn added a parent task: Restricted Task. · Wed, Sep 25, 2:16 AM

@Papaul confirmed this looks like it needs someone onsite to drain the power. I asked @Vgutierrez about depooling this.

Could I raise the priority a bit due to the relation to T147074?

Dzahn raised the priority of this task from Normal to High. · Wed, Sep 25, 2:29 AM

@Dzahn - just wanted to confirm that this has been depooled. Thanks, Willy

Dzahn added a comment. · Mon, Oct 7, 9:36 PM

No, it's not depooled. Let's wait a day please because traffic is mostly out today.

Ok @Dzahn - just let us know when it's ready to go. Thanks, Willy

We can depool it just before shutting it down; just let us know when you want to do it.

wiki_willy reassigned this task from Cmjohnson to Jclark-ctr. · Wed, Oct 9, 3:59 AM
wiki_willy added a subscriber: Jclark-ctr.

Hi @Jclark-ctr - can you hit up @Vgutierrez when you get in during the AM sometime this week to depool the host? You guys have overlap in the mornings, until about 10am ET. Thanks, Willy

Mentioned in SAL (#wikimedia-operations) [2019-10-09T12:28:23Z] <vgutierrez> depooling cp1085 for a power drain - T231525

Icinga downtime for 3:00:00 set by vgutierrez@cumin1001 on 1 host(s) and their services with reason: Power drain

cp1085.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2019-10-09T13:37:24Z] <vgutierrez> repooling cp1085 - T231525
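The two SAL timestamps bracket the time the host was out of rotation. A quick sketch of the arithmetic, assuming the Icinga downtime was set at roughly the same time as the depool (the task does not give its exact start); the variable names are illustrative only:

```python
from datetime import datetime, timedelta

# Timestamp format used in the SAL entries quoted in this task.
SAL_FORMAT = "%Y-%m-%dT%H:%M:%SZ"

depooled = datetime.strptime("2019-10-09T12:28:23Z", SAL_FORMAT)
repooled = datetime.strptime("2019-10-09T13:37:24Z", SAL_FORMAT)

# Downtime length taken from the Icinga entry (3:00:00).
icinga_downtime = timedelta(hours=3)

# Time the host spent depooled, and whether it fits inside the downtime.
window = repooled - depooled
fits = window <= icinga_downtime

print(f"depooled for {window}; within Icinga downtime: {fits}")
```

The host was depooled for a little over an hour, comfortably inside the 3-hour downtime.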

Vgutierrez closed this task as Resolved. · Wed, Oct 9, 1:38 PM

Issue solved after performing a power drain. Thanks @Jclark-ctr

Removed all power from the host, pulled both PSUs, and performed a power drain.

thanks!

mgmt password updated using cookbook.