
labsdb1009 broken PSU
Closed, Resolved (Public)

Description

As discussed on IRC, labsdb1009 has lost its PSU redundancy, probably because of the a1-pdu issues (T233248).

</system1/log1>hpiLO-> show record27

status=0
status_tag=COMMAND COMPLETED
Thu Sep 19 04:45:31 2019



/system1/log1/record27
  Targets
  Properties
    number=27
    severity=Caution
    date=09/18/2019
    time=18:02
    description=System Power Supply: Input Power Loss or Unplugged Power Cord, Verify Power Supply Input (Power Supply 1)
  Verbs
    cd version exit show


</system1/log1>hpiLO-> show record28

status=0
status_tag=COMMAND COMPLETED
Thu Sep 19 04:45:34 2019



/system1/log1/record28
  Targets
  Properties
    number=28
    severity=Caution
    date=09/18/2019
    time=18:02
    description=System Power Supplies Not Redundant
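
For anyone who wants to pull these events out of a pasted iLO session, here is a minimal sketch that turns one `show recordN` dump into a dictionary and a timestamp. The function names are hypothetical, and the field names (`severity`, `date`, `time`, `description`) and the date format are assumed from the output above only, not from any iLO specification.

```python
from datetime import datetime


def parse_ilo_record(text):
    """Collect every key=value line of a pasted `show recordN` dump."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if "=" in line:
            # Split on the first "=" only, so values may contain "=" themselves.
            key, _, value = line.partition("=")
            props[key] = value
    return props


def record_timestamp(props):
    """Combine the date and time fields (assumed MM/DD/YYYY and HH:MM)."""
    return datetime.strptime(
        f"{props['date']} {props['time']}", "%m/%d/%Y %H:%M"
    )
```

Feeding it the record27 text above would give a dictionary whose `severity` is `Caution` and whose `description` names Power Supply 1.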

Event Timeline

wiki_willy added a subtask: Unknown Object (Task). Sep 19 2019, 6:29 AM

This server is out of warranty and @RobH has created a procurement task.

Indeed Chris - thanks! I caught up with Willy earlier today and he pointed me to the procurement task.
I will leave this task open to handle the tech side of things related to this server.
Thank you again!

@Jclark-ctr - this arrived Thursday via https://www.fedex.com/en-us/home.html. Just a heads up, this will need to be replaced before the PDU upgrade next Tuesday, to retain redundant power on labsdb1009. Thanks, Willy

I believe we don't have to take the host down for the PSU replacement, do we?
However, I would like to depool it and stop MySQL beforehand, as a crash with MySQL running could cause data corruption (and it has 7TB of data). So please ping me before any work on the host.

Yup, it should be a hot swap. So @Jclark-ctr - please reach out to @Marostegui before replacing it. Thanks, Willy

@Marostegui Received the PSU. I would like to replace it on Monday.

Sounds good thanks - I will have this host ready.

Change 542778 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] dbproxy1011: Depool labsdb1009

https://gerrit.wikimedia.org/r/542778

Change 542778 merged by Marostegui:
[operations/puppet@production] dbproxy1011: Depool labsdb1009

https://gerrit.wikimedia.org/r/542778

Mentioned in SAL (#wikimedia-operations) [2019-10-14T04:56:53Z] <marostegui> Depool labsdb1009 for on-site maintenance - T233273

Marostegui closed subtask Unknown Object (Task) as Resolved. Oct 14 2019, 5:04 AM

Mentioned in SAL (#wikimedia-operations) [2019-10-14T07:16:34Z] <marostegui> Stop MySQL on labsdb1009 for on-site maintenance - T233273

@Jclark-ctr you can proceed and change the PSU now. MySQL has been stopped.

Thanks John!
The alert recovered:

Sensor Type(s) Temperature, Power_Supply Status: OK
Status: OK, last check: 2019-10-14 13:05:54, duration: 0d 0h 8m 15s, attempt: 1/3 (this service is currently in a period of scheduled downtime)

In the HW logs I see:

/system1/powersupply1
  Targets
  Properties
    ElementName=Power Supply
    OperationalStatus=Ok
    HealthState=Good, In Use
  Verbs
    cd version exit show


</system1>hpiLO-> show powersupply2

status=0
status_tag=COMMAND COMPLETED
Mon Oct 14 13:09:51 2019



/system1/powersupply2
  Targets
  Properties
    ElementName=Power Supply
    OperationalStatus=Ok
    HealthState=Good, In Use
  Verbs
    cd version exit show
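
The same check can be done on pasted output rather than by eye. Below is a minimal sketch, assuming only the `OperationalStatus` and `HealthState` values seen in the two dumps above; the function names are hypothetical, and the values iLO reports for a failed supply are not shown in this task, so anything other than `Ok` / `Good...` is simply treated as unhealthy.

```python
def psu_is_healthy(show_output):
    """Return True if one `show powersupplyN` dump reports Ok / Good."""
    props = dict(
        line.strip().split("=", 1)
        for line in show_output.splitlines()
        if "=" in line
    )
    return (
        props.get("OperationalStatus") == "Ok"
        and props.get("HealthState", "").startswith("Good")
    )


def psus_redundant(outputs):
    """Redundant here means at least two supplies, all of them healthy."""
    return len(outputs) >= 2 and all(psu_is_healthy(o) for o in outputs)
```

With the powersupply1 and powersupply2 dumps above as the two inputs, `psus_redundant` would return `True`.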

Mentioned in SAL (#wikimedia-operations) [2019-10-14T13:42:51Z] <marostegui> Repool labsdb1009 after PSU replacement - T233273