db1077 crashed
Closed, ResolvedPublic

Description

@Cmjohnson this host crashed with the following HW error:

/system1/log1/record14
  Targets
  Properties
    number=14
    severity=Caution
    date=06/10/2019
    time=05:23
    description=Smart Storage Battery has exceeded the maximum amount of devices supported (Battery 1, service information: 0x07). Action: 1. Remove additional devices. 2. Consult server troubleshooting guide. 3. Gather AHS log and contact Support

Event Timeline

Restricted Application added a project: Operations. · Jun 10 2019, 5:38 AM

This is s3's sanitarium master, so s3 on labs will lag until we fix this host.

Marostegui triaged this task as High priority. · Jun 10 2019, 5:41 AM

@Cmjohnson looks like we first have to upgrade all the firmware: https://support.hpe.com/hpsc/doc/public/display?docId=mmr_kc-0134828

Change 516084 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1077: Disable notifications

https://gerrit.wikimedia.org/r/516084

@Cmjohnson I will leave MySQL down so you can upgrade this host's firmware as soon as you can, without waiting for us to stop MySQL.

Change 516084 merged by Marostegui:
[operations/puppet@production] db1077: Disable notifications

https://gerrit.wikimedia.org/r/516084

Mentioned in SAL (#wikimedia-operations) [2019-06-10T16:24:56Z] <marostegui> Power reset db1077 from the idrac T225391

Cmjohnson reassigned this task from Cmjohnson to Marostegui. · Jun 10 2019, 6:48 PM

I applied the service pack and powered the host on. Reassigning to @Marostegui.

Thanks @Cmjohnson - I can see that in the logs:

/system1/log1/record15
  Targets
  Properties
    number=15
    severity=Informational
    date=06/10/2019
    time=16:34
    description=Firmware flashed (System BIOS - P89 v2.60 (05/21/2018))
  Verbs
    cd version exit show

/system1/log1/record16
  Targets
  Properties
    number=16
    severity=Informational
    date=06/10/2019
    time=16:38
    description=Firmware flashed (iLO 4 2.60)
  Verbs
    cd version exit show

Also everything looks good BBU-wise:

Smart Array P840 in Slot 1
   Controller Status: OK
   Cache Status: Not Configured
   Battery/Capacitor Status: OK

I am going to run a full upgrade and start MySQL. Once it has caught up, I will run a data check.

@Cmjohnson can you also check the power supply cables? One of them might be loose:

/system1/log1/record17
  Targets
  Properties
    number=17
    severity=Caution
    date=06/10/2019
    time=17:16
    description=System Power Supply: Input Power Loss or Unplugged Power Cord, Verify Power Supply Input (Power Supply 2)
  Verbs
    cd version exit show

/system1/log1/record18
  Targets
  Properties
    number=18
    severity=Critical
    date=06/10/2019
    time=17:16
    description=Server Critical Fault (Service Information: Input Power Loss, Power Supply,  Power Supply 1 (03h)  Power Supply 2 (03h))
  Verbs
    cd version exit show


/system1/log1/record19
  Targets
  Properties
    number=19
    severity=Caution
    date=06/10/2019
    time=17:16
    description=System Power Supplies Not Redundant
  Verbs
    cd version exit show
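
The records pasted in this task share a regular layout (a `/system1/log1/recordN` header followed by indented `key=value` properties), so the Caution and Critical entries can be picked out with a small script. This is only a minimal sketch: it assumes the log text has already been copied out of the iLO CLI into a string, and `parse_records` and `needs_attention` are hypothetical helper names, not part of any HP tooling.

```python
import re

# Header line of each record, e.g. "/system1/log1/record19"
RECORD_HEADER = re.compile(r"^/system1/log1/record(\d+)\s*$")
# Indented "key=value" property lines, e.g. "    severity=Caution"
PROPERTY = re.compile(r"^\s+(\w+)=(.*)$")


def parse_records(text):
    """Parse pasted iLO-style log text into a list of dicts, one per record."""
    records = []
    current = None
    for line in text.splitlines():
        header = RECORD_HEADER.match(line)
        if header:
            current = {"record": int(header.group(1))}
            records.append(current)
            continue
        prop = PROPERTY.match(line)
        if prop and current is not None:
            current[prop.group(1)] = prop.group(2).strip()
    return records


def needs_attention(records):
    """Keep only the records whose severity is Caution or Critical."""
    return [r for r in records if r.get("severity") in ("Caution", "Critical")]


# Two records copied from this task, used as sample input.
SAMPLE = """\
/system1/log1/record16
  Targets
  Properties
    number=16
    severity=Informational
    date=06/10/2019
    time=16:38
    description=Firmware flashed (iLO 4 2.60)
  Verbs
    cd version exit show

/system1/log1/record19
  Targets
  Properties
    number=19
    severity=Caution
    date=06/10/2019
    time=17:16
    description=System Power Supplies Not Redundant
  Verbs
    cd version exit show
"""

for rec in needs_attention(parse_records(SAMPLE)):
    print(rec["record"], rec["severity"], rec["date"], rec["time"], rec["description"])
```

Filtering on the `severity` property rather than on the description text keeps the sketch independent of the exact wording of each hardware message.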

MySQL started correctly, I have upgraded it and started replication as everything looked fine.
Once it is up to date, I will run some data checks.

I have started a compare for main tables on s3 wikis.

Cmjohnson closed this task as Resolved. · Jun 11 2019, 3:35 PM

@Marostegui that log entry may have been old. The server has both power supplies connected and does not report any current errors. Resolving the task.

@Cmjohnson thank you! The record is dated 10th June, but it might actually be related to your maintenance:

/system1/log1/record19
  Targets
  Properties
    number=19
    severity=Caution
    date=06/10/2019
    time=17:16
    description=System Power Supplies Not Redundant

Icinga also reports good power supplies, which matches the other sensors on the iLO:

/system1/powersupply1
  Targets
  Properties
    ElementName=Power Supply
    OperationalStatus=Ok
    HealthState=Good, In Use
  Verbs
    cd version exit show


</system1>hpiLO-> show powersupply2

status=0
status_tag=COMMAND COMPLETED
Tue Jun 11 15:48:53 2019



/system1/powersupply2
  Targets
  Properties
    ElementName=Power Supply
    OperationalStatus=Ok
    HealthState=Good, In Use
  Verbs
    cd version exit show

The host has finished the data checksum of the main tables for all 900 wikis, so I will repool it tomorrow.
Thank you!

Change 516589 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Slowly repool db1077

https://gerrit.wikimedia.org/r/516589

Change 516589 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Slowly repool db1077

https://gerrit.wikimedia.org/r/516589

db1077 has been fully repooled.

db1077 has had its BBU in charging status for around 30h now. I have taken a look at the HW logs and:

/system1/log1/record20
  Targets
  Properties
    number=20
    severity=Caution
    date=06/14/2019
    time=13:38
    description=Smart Storage Battery pre-failure (Battery 1). Action: 1. Consult server troubleshooting guide. 2 Gather AHS log and contact Support
  Verbs
    cd version exit show

Looks like it is not healthy after all. We might need to replace this host with db1107/db1108 once they are no longer in use. Or maybe swap this host with db1112 (MCR testing host), as the BBU isn't really important for db1112 anyway.

Alternatively, db1114 (test-s1) could take db1077's place, and db1077 could become test-s1?

Note that db1114 was a host we removed from production because it was unstable. I would vote for another one. Did you try depooling and forcing a learning cycle?

So maybe db1112 is the better option, then.
For now I am letting it finish charging (unfortunately, with HP I have not been able to find out what percentage it has charged so far). Let's give it another 24h before forcing a learning cycle.

And after the reboot, the battery fully failed (T226154):

Battery/Capacitor Count: 0