db1092 crashed today with no response from serial console (hard lockup). I rebooted it at the time of the alert.
It was logged, and rebooted.
Change 322801 had a related patch set uploaded (by RobH):
db1092 crashed and was offline for a bit
Mentioned in SAL (#wikimedia-operations) [2016-11-22T00:03:02Z] <reedy@tin> Synchronized wmf-config/db-eqiad.php: Depool db1092 after crash T151272 (duration: 00m 59s)
Thanks to RobH for taking care of this. I am going to take a look to see if we can find out why it crashed.
Mentioned in SAL (#wikimedia-operations) [2016-11-22T07:23:07Z] <marostegui> Reboot db1092 for RAID controller upgrade - T151272
Error from yesterday:
/system1/log1/record12
  Targets
  Properties
    number=12
    severity=Caution
    date=11/21/2016
    time=23:52
    description=Option ROM POST Error: 1719-Slot 1 Drive Array - A controller failure event occurred prior to this power-up. (Previous lock up code = 0x13) Action: Install the latest controller firmware. If the problem persists, replace the controller.
The controller firmware isn't the latest version, as per T141756:
root@db1092:~# hpssacli controller slot=1 show | grep -i firmware
Firmware Version: 3.56
So I have upgraded it:
root@db1092:~# hpssacli controller slot=1 show | grep -i firmware
Firmware Version: 4.02
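For the record, the before/after firmware check above could be scripted. This is only a minimal sketch in Python: it assumes the output contains a `Firmware Version: X.YY` line as shown above, the function names `firmware_version` and `needs_upgrade` are hypothetical, and the `(4, 2)` minimum is simply the version installed in this task, not necessarily the newest release.

```python
import re


def firmware_version(show_output):
    """Return the firmware version found in the text as a tuple of ints, or None."""
    match = re.search(r"Firmware Version:\s*(\d+(?:\.\d+)*)", show_output)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def needs_upgrade(show_output, minimum=(4, 2)):
    """True if the version is missing or older than `minimum` (assumed target: 4.02)."""
    version = firmware_version(show_output)
    return version is None or version < minimum


if __name__ == "__main__":
    # Sample lines copied from the output in this task.
    print(needs_upgrade("Firmware Version: 3.56"))
    print(needs_upgrade("Firmware Version: 4.02"))
```

Comparing tuples of integers rather than raw strings avoids ordering mistakes once a version component reaches two digits.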
The array looks fine though:
logicaldrive 1 (3.6 TB, RAID 1+0, OK)
physicaldrive 1I:1:1 (port 1I:box 1:bay 1, Solid State SATA, 800 GB, OK)
physicaldrive 1I:1:2 (port 1I:box 1:bay 2, Solid State SATA, 800 GB, OK)
physicaldrive 1I:1:3 (port 1I:box 1:bay 3, Solid State SATA, 800 GB, OK)
physicaldrive 1I:1:4 (port 1I:box 1:bay 4, Solid State SATA, 800 GB, OK)
physicaldrive 1I:1:5 (port 1I:box 1:bay 5, Solid State SATA, 800 GB, OK)
physicaldrive 1I:1:6 (port 1I:box 1:bay 6, Solid State SATA, 800 GB, OK)
physicaldrive 1I:1:7 (port 1I:box 1:bay 7, Solid State SATA, 800 GB, OK)
physicaldrive 1I:1:8 (port 1I:box 1:bay 8, Solid State SATA, 800 GB, OK)
physicaldrive 2I:2:1 (port 2I:box 2:bay 1, Solid State SATA, 800 GB, OK)
physicaldrive 2I:2:2 (port 2I:box 2:bay 2, Solid State SATA, 800 GB, OK)
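The "looks fine" check on the listing above can also be done mechanically. A rough sketch, assuming each entry has the shape `logicaldrive|physicaldrive <id> (..., <status>)` with the status as the last comma-separated field, as in the paste above; `non_ok_devices` is a hypothetical helper name, and the `Failed` status in the demo is an illustrative value, not something seen on db1092.

```python
import re

# Matches entries such as "physicaldrive 1I:1:1 (port 1I:box 1:bay 1, ..., OK)".
DEVICE_RE = re.compile(r"(logicaldrive|physicaldrive)\s+(\S+)\s+\(([^)]*)\)")


def non_ok_devices(listing):
    """Return (kind, id, status) for every drive whose last field is not 'OK'."""
    problems = []
    for kind, ident, details in DEVICE_RE.findall(listing):
        status = details.split(",")[-1].strip()
        if status != "OK":
            problems.append((kind, ident, status))
    return problems


if __name__ == "__main__":
    healthy = (
        "logicaldrive 1 (3.6 TB, RAID 1+0, OK) "
        "physicaldrive 1I:1:1 (port 1I:box 1:bay 1, Solid State SATA, 800 GB, OK)"
    )
    # Hypothetical failing entry, for illustration only.
    broken = healthy + (
        " physicaldrive 1I:1:2 (port 1I:box 1:bay 2, Solid State SATA, 800 GB, Failed)"
    )
    print(non_ok_devices(healthy))
    print(non_ok_devices(broken))
```

Because it uses `findall` over the whole text, the helper works whether the entries are on separate lines or pasted as one line.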
I have started MySQL without replication and, as there were no errors, I have started the replication thread. The server will remain depooled until we are sure it works fine.
This server has had some other issues with the power supplies, which according to the iLO are fine now. This is for the record:
/system1/log1/record8
  Targets
  Properties
    number=8
    severity=Repaired
    date=10/31/2016
    time=15:52
    description=System Power Supply: Input Power Loss or Unplugged Power Cord, Verify Power Supply Input (Power Supply 1)
  Verbs
    cd version exit show

</system1/log1>hpiLO-> show record9

status=0
status_tag=COMMAND COMPLETED
Tue Nov 22 07:08:36 2016

/system1/log1/record9
  Targets
  Properties
    number=9
    severity=Repaired
    date=10/31/2016
    time=15:52
    description=System Power Supplies Not Redundant
  Verbs
    cd version exit show

</system1/log1>hpiLO-> show record10

status=0
status_tag=COMMAND COMPLETED
Tue Nov 22 07:08:38 2016

/system1/log1/record10
  Targets
  Properties
    number=10
    severity=Repaired
    date=11/02/2016
    time=17:31
    description=System Power Supply: Input Power Loss or Unplugged Power Cord, Verify Power Supply Input (Power Supply 1)
  Verbs
    cd version exit show

</system1/log1>hpiLO-> show record11

status=0
status_tag=COMMAND COMPLETED
Tue Nov 22 07:08:40 2016

/system1/log1/record11
  Targets
  Properties
    number=11
    severity=Repaired
    date=11/02/2016
    time=17:31
    description=System Power Supplies Not Redundant
  Verbs
    cd version exit show
Change 322858 had a related patch set uploaded (by Jcrespo):
Depool db1091 to apply blocking schema change
Change 323791 had a related patch set uploaded (by Marostegui):
db-eqiad.php: Repool db1092
Mentioned in SAL (#wikimedia-operations) [2016-11-28T07:38:26Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1092 - T151272 (duration: 00m 47s)