
Power supply error on db1055
Closed, Resolved, Public

Description

As per icinga:

Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical]

Event Timeline

Change 397739 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1055

https://gerrit.wikimedia.org/r/397739

Marostegui triaged this task as Medium priority. Dec 12 2017, 6:46 AM
Marostegui moved this task from Triage to In progress on the DBA board.
Marostegui added a project: SRE.

Change 397739 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1055

https://gerrit.wikimedia.org/r/397739

Stashbot subscribed.

Mentioned in SAL (#wikimedia-operations) [2017-12-12T06:48:56Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1055 - T178359 T182653 (duration: 00m 56s)

Marostegui added a project: ops-eqiad.
Marostegui added a subscriber: Cmjohnson.

@Cmjohnson I have been unable to identify which of the PSUs is failing; the iDRAC console isn't recording which one it is (sometimes it does).
The only thing I am able to see is:

		HealthState = 25 (Critical failure)
		OperationalStatus[0] = 6 (Error)

Do they have an LED so you can physically identify which one is broken?

Thanks!

@Marostegui Replaced the PSU and both are now redundant

Date/Time: 12/12/2017 14:43:15
Source: system
Severity: Critical

Description: Power supply 2 is absent.

Record: 15
Date/Time: 12/12/2017 14:43:35
Source: system
Severity: Ok

Description: Power supply 2 is present.

Record: 16
Date/Time: 12/12/2017 14:43:55
Source: system
Severity: Ok

Description: The input power for power supply 2 has been restored.

Record: 17
Date/Time: 12/12/2017 14:44:01
Source: system
Severity: Ok

Description: The power supplies are redundant.
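
For anyone who wants to check a log like the one above programmatically, here is a minimal Python sketch. It assumes the "Key: value" layout seen in this paste, with every entry ending in a Description line, and it assumes the timestamps are month/day/year (both readings give the same date here). The names `SAMPLE_LOG`, `parse_events` and `elapsed_seconds` are made up for illustration and are not part of any Dell or Wikimedia tooling.

```python
from datetime import datetime

# The event log pasted above, copied verbatim (the first entry has no Record line).
SAMPLE_LOG = """\
Date/Time: 12/12/2017 14:43:15
Source: system
Severity: Critical

Description: Power supply 2 is absent.

Record: 15
Date/Time: 12/12/2017 14:43:35
Source: system
Severity: Ok

Description: Power supply 2 is present.

Record: 16
Date/Time: 12/12/2017 14:43:55
Source: system
Severity: Ok

Description: The input power for power supply 2 has been restored.

Record: 17
Date/Time: 12/12/2017 14:44:01
Source: system
Severity: Ok

Description: The power supplies are redundant.
"""


def parse_events(text):
    """Split the log into one dict per entry; an entry ends at its Description line."""
    events, current = [], {}
    for raw in text.splitlines():
        key, sep, value = raw.strip().partition(": ")
        if not sep:
            continue  # blank line or a line that is not "Key: value"
        current[key] = value
        if key == "Description":
            events.append(current)
            current = {}
    return events


def elapsed_seconds(events):
    """Seconds between the first and the last entry (assumes MM/DD/YYYY timestamps)."""
    fmt = "%m/%d/%Y %H:%M:%S"
    start = datetime.strptime(events[0]["Date/Time"], fmt)
    end = datetime.strptime(events[-1]["Date/Time"], fmt)
    return (end - start).total_seconds()


events = parse_events(SAMPLE_LOG)
for event in events:
    print(event["Date/Time"], event["Severity"], event["Description"])
print("seconds from PSU removal to redundancy:", elapsed_seconds(events))
```

On this log the swap shows up as four entries, going from "absent" to "redundant" in under a minute.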

That was fast! Thanks a lot

RECOVERY - IPMI Sensor Status on db1055 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
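
The alert in the description and the recovery line above share the same "Status: ..." shape, so both can be read with one small parser. This is only a sketch in Python; the output format is inferred from the two lines in this task, not from the check's documentation, and `parse_check` is a hypothetical helper name.

```python
import re

# Overall status, optionally followed by "[sensor = state, ...]" details.
STATUS_RE = re.compile(r"Status: (?P<status>\w+)(?: \[(?P<detail>[^\]]*)\])?")


def parse_check(line):
    """Return (overall_status, {sensor: state}) or None if the line does not match."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    sensors = {}
    detail = match.group("detail")
    if detail:
        for part in detail.split(", "):
            name, _, state = part.partition(" = ")
            sensors[name] = state
    return match.group("status"), sensors


alert = ("Sensor Type(s) Temperature, Power_Supply Status: Critical "
         "[PS Redundancy = Critical, Status = Critical]")
recovery = ("RECOVERY - IPMI Sensor Status on db1055 is OK: "
            "Sensor Type(s) Temperature, Power_Supply Status: OK")

print(parse_check(alert))
print(parse_check(recovery))
```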