
db1092 crash
Closed, Resolved · Public

Description

db1092 crashed today with no response from serial console (hard lockup). I rebooted it at the time of the alert.

The crash was logged and the server was rebooted.

Details

Related Gerrit Patches:
operations/mediawiki-config : master : db-eqiad.php: Repool db1092
operations/mediawiki-config : master : Depool db1091 to apply blocking schema change
operations/mediawiki-config : master : db1092 crashed and was offline for a bit

Event Timeline

RobH created this task. · Nov 21 2016, 11:57 PM
Restricted Application added a subscriber: Aklapper. · Nov 21 2016, 11:57 PM

Change 322801 had a related patch set uploaded (by RobH):
db1092 crashed and was offline for a bit

https://gerrit.wikimedia.org/r/322801

RobH updated the task description.

Change 322801 merged by jenkins-bot:
db1092 crashed and was offline for a bit

https://gerrit.wikimedia.org/r/322801

Mentioned in SAL (#wikimedia-operations) [2016-11-22T00:03:02Z] <reedy@tin> Synchronized wmf-config/db-eqiad.php: Depool db1092 after crash T151272 (duration: 00m 59s)

Thanks to RobH for taking care of this. I am going to look into why it crashed.

Mentioned in SAL (#wikimedia-operations) [2016-11-22T07:23:07Z] <marostegui> Reboot db1092 for RAID controller upgrade - T151272

Error from yesterday

/system1/log1/record12
  Targets
  Properties
    number=12
    severity=Caution
    date=11/21/2016
    time=23:52
    description=Option ROM POST Error: 1719-Slot 1 Drive Array - A controller failure event occurred prior to this power-up. (Previous lock up code = 0x13) Action: Install the latest controller firmware. If the problem persists, replace the controller.

The controller firmware is not the latest version (see T141756):

root@db1092:~# hpssacli controller slot=1 show | grep -i firmware
   Firmware Version: 3.56

So I have upgraded it:

root@db1092:~# hpssacli controller slot=1 show | grep -i firmware
   Firmware Version: 4.02
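
For anyone who wants to script this check across several hosts, here is a minimal sketch that parses the "Firmware Version" line from output like the above and compares it with a required minimum. The function names and the MINIMUM value are illustrative assumptions, not part of any existing tooling, and the comparison assumes versions are plain dotted numbers.

```python
import re
from typing import Tuple


def parse_firmware_version(output: str) -> Tuple[int, ...]:
    """Extract the 'Firmware Version' value as a tuple of integers."""
    match = re.search(r"Firmware Version:\s*(\d+(?:\.\d+)*)", output)
    if match is None:
        raise ValueError("no 'Firmware Version' line found")
    return tuple(int(part) for part in match.group(1).split("."))


def needs_upgrade(output: str, minimum: Tuple[int, ...]) -> bool:
    """Return True when the reported version is older than the minimum."""
    return parse_firmware_version(output) < minimum


# Lines copied from the controller output in this task.
before = "   Firmware Version: 3.56\n"
after = "   Firmware Version: 4.02\n"

# Assumption: 4.02 is treated as the required minimum for illustration only.
MINIMUM = (4, 2)

print(needs_upgrade(before, MINIMUM))
print(needs_upgrade(after, MINIMUM))
```

Comparing integer tuples rather than strings avoids ordering mistakes such as "10.0" sorting before "4.02".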

The array looks fine though:

logicaldrive 1 (3.6 TB, RAID 1+0, OK)

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, Solid State SATA, 800 GB, OK)
      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, Solid State SATA, 800 GB, OK)
      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, Solid State SATA, 800 GB, OK)
      physicaldrive 1I:1:4 (port 1I:box 1:bay 4, Solid State SATA, 800 GB, OK)
      physicaldrive 1I:1:5 (port 1I:box 1:bay 5, Solid State SATA, 800 GB, OK)
      physicaldrive 1I:1:6 (port 1I:box 1:bay 6, Solid State SATA, 800 GB, OK)
      physicaldrive 1I:1:7 (port 1I:box 1:bay 7, Solid State SATA, 800 GB, OK)
      physicaldrive 1I:1:8 (port 1I:box 1:bay 8, Solid State SATA, 800 GB, OK)
      physicaldrive 2I:2:1 (port 2I:box 2:bay 1, Solid State SATA, 800 GB, OK)
      physicaldrive 2I:2:2 (port 2I:box 2:bay 2, Solid State SATA, 800 GB, OK)

I have started MySQL without replication and, as there were no errors, I have started the replication thread. The server will remain depooled until we are sure it works fine.
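
As a rough illustration of what "works fine" could mean before repooling, the sketch below evaluates one row of SHOW SLAVE STATUS output that has already been fetched into a dictionary. The helper name, the 60-second lag threshold and the sample rows are assumptions made for this example; this is not the check that was actually run on db1092.

```python
from typing import Mapping, Optional


def replication_healthy(
    status: Mapping[str, Optional[object]],
    max_lag_seconds: int = 60,  # assumption: illustrative threshold
) -> bool:
    """Decide whether a SHOW SLAVE STATUS row looks healthy."""
    io_ok = status.get("Slave_IO_Running") == "Yes"
    sql_ok = status.get("Slave_SQL_Running") == "Yes"
    no_error = not status.get("Last_Error")
    lag = status.get("Seconds_Behind_Master")
    # Seconds_Behind_Master is NULL (None) when replication is not running.
    lag_ok = lag is not None and int(lag) <= max_lag_seconds
    return io_ok and sql_ok and no_error and lag_ok


# Hypothetical rows, not taken from db1092.
caught_up = {
    "Slave_IO_Running": "Yes",
    "Slave_SQL_Running": "Yes",
    "Last_Error": "",
    "Seconds_Behind_Master": 0,
}
stopped = {
    "Slave_IO_Running": "No",
    "Slave_SQL_Running": "No",
    "Last_Error": "",
    "Seconds_Behind_Master": None,
}

print(replication_healthy(caught_up))
print(replication_healthy(stopped))
```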


This server has also had some issues with its power supplies which, according to the iLO, are fine now. For the record:

/system1/log1/record8
  Targets
  Properties
    number=8
    severity=Repaired
    date=10/31/2016
    time=15:52
    description=System Power Supply: Input Power Loss or Unplugged Power Cord, Verify Power Supply Input (Power Supply 1)
  Verbs
    cd version exit show


</system1/log1>hpiLO-> show record9

status=0
status_tag=COMMAND COMPLETED
Tue Nov 22 07:08:36 2016



/system1/log1/record9
  Targets
  Properties
    number=9
    severity=Repaired
    date=10/31/2016
    time=15:52
    description=System Power Supplies Not Redundant
  Verbs
    cd version exit show


</system1/log1>hpiLO-> show record10

status=0
status_tag=COMMAND COMPLETED
Tue Nov 22 07:08:38 2016



/system1/log1/record10
  Targets
  Properties
    number=10
    severity=Repaired
    date=11/02/2016
    time=17:31
    description=System Power Supply: Input Power Loss or Unplugged Power Cord, Verify Power Supply Input (Power Supply 1)
  Verbs
    cd version exit show


</system1/log1>hpiLO-> show record11

status=0
status_tag=COMMAND COMPLETED
Tue Nov 22 07:08:40 2016



/system1/log1/record11
  Targets
  Properties
    number=11
    severity=Repaired
    date=11/02/2016
    time=17:31
    description=System Power Supplies Not Redundant
  Verbs
    cd version exit show
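
If these iLO log dumps need to be compared over time, a small parser could turn them into records. The sketch below assumes the layout shown in this task (a /system1/log1/recordN path line followed by indented key=value pairs under "Properties"); the function name is hypothetical and the sample is a trimmed excerpt of the records pasted above.

```python
from typing import Dict, List


def parse_ilo_records(text: str) -> List[Dict[str, str]]:
    """Collect the key=value pairs listed under 'Properties' per record."""
    records: List[Dict[str, str]] = []
    in_properties = False
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("/system1/log1/record"):
            records.append({"path": line})
            in_properties = False
        elif line == "Properties" and records:
            in_properties = True
        elif in_properties and "=" in line:
            # Split on the first '=' only; descriptions may contain more.
            key, _, value = line.partition("=")
            records[-1][key.strip()] = value.strip()
        else:
            # Blank lines, 'Verbs' and CLI status lines end the section.
            in_properties = False
    return records


SAMPLE = """\
/system1/log1/record11
  Targets
  Properties
    number=11
    severity=Repaired
    date=11/02/2016
    time=17:31
    description=System Power Supplies Not Redundant
  Verbs
    cd version exit show

status=0
status_tag=COMMAND COMPLETED

/system1/log1/record12
  Targets
  Properties
    number=12
    severity=Caution
    date=11/21/2016
    time=23:52
    description=Option ROM POST Error: 1719-Slot 1 Drive Array - A controller failure event occurred prior to this power-up. (Previous lock up code = 0x13)
"""

records = parse_ilo_records(SAMPLE)
unresolved = [r for r in records if r["severity"] != "Repaired"]
for record in unresolved:
    print(record["number"], record["date"], record["description"])
```

Filtering on severity separates the already repaired power supply events from the controller failure that is the subject of this task.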

Change 322858 had a related patch set uploaded (by Jcrespo):
Depool db1091 to apply blocking schema change

https://gerrit.wikimedia.org/r/322858

Change 322858 merged by jenkins-bot:
Depool db1091 to apply blocking schema change

https://gerrit.wikimedia.org/r/322858

jcrespo moved this task from Triage to Next on the DBA board. · Nov 23 2016, 2:23 PM
Marostegui triaged this task as Medium priority. · Nov 25 2016, 11:57 AM

Change 323791 had a related patch set uploaded (by Marostegui):
db-eqiad.php: Repool db1092

https://gerrit.wikimedia.org/r/323791

Change 323791 merged by jenkins-bot:
db-eqiad.php: Repool db1092

https://gerrit.wikimedia.org/r/323791

Mentioned in SAL (#wikimedia-operations) [2016-11-28T07:38:26Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1092 - T151272 (duration: 00m 47s)

Marostegui closed this task as Resolved. · Nov 28 2016, 7:39 AM

I have repooled this server after a week of no issues.