db1092 crashed today with no response from serial console (hard lockup). I rebooted it at the time of the alert.
It was logged, and rebooted.
Change 322801 had a related patch set uploaded (by RobH):
db1092 crashed and was offline for a bit
Mentioned in SAL (#wikimedia-operations) [2016-11-22T00:03:02Z] <reedy@tin> Synchronized wmf-config/db-eqiad.php: Depool db1092 after crash T151272 (duration: 00m 59s)
Thanks to RobH for taking care of this. I am going to have a look to see if we can find out why it crashed.
Mentioned in SAL (#wikimedia-operations) [2016-11-22T07:23:07Z] <marostegui> Reboot db1092 for RAID controller upgrade - T151272
Error from yesterday
/system1/log1/record12
Targets
Properties
number=12
severity=Caution
date=11/21/2016
time=23:52
description=Option ROM POST Error: 1719-Slot 1 Drive Array - A controller failure event occurred prior to this power-up. (Previous lock up code = 0x13) Action: Install the latest controller firmware. If the problem persists, replace the controller.
The controller firmware isn't the latest one (see T141756):
root@db1092:~# hpssacli controller slot=1 show | grep -i firmware
Firmware Version: 3.56
So I have upgraded it:
root@db1092:~# hpssacli controller slot=1 show | grep -i firmware
Firmware Version: 4.02
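The before/after firmware check above could be scripted. This is only a minimal sketch, assuming the `Firmware Version: X.YY` line format shown in the output above. The function names are hypothetical, and the 4.02 target is just the version this controller was upgraded to here, not an official recommendation:

```python
import re

def parse_firmware_version(output):
    """Return the firmware version as a tuple of ints, or None if no version line is found."""
    match = re.search(r"Firmware Version:\s*(\d+(?:\.\d+)*)", output, re.IGNORECASE)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))

def needs_upgrade(output, target="4.02"):
    """Return True if the version reported in `output` is older than `target`."""
    current = parse_firmware_version(output)
    if current is None:
        raise ValueError("no 'Firmware Version' line found in output")
    # Assumption: versions compare numerically component by component.
    wanted = tuple(int(part) for part in target.split("."))
    return current < wanted
```

The `output` argument is expected to be the text printed by the `hpssacli controller slot=1 show` command used above; how that text is collected (ssh, a local subprocess, etc.) is left out on purpose.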
The array looks fine though:
logicaldrive 1 (3.6 TB, RAID 1+0, OK)
physicaldrive 1I:1:1 (port 1I:box 1:bay 1, Solid State SATA, 800 GB, OK)
physicaldrive 1I:1:2 (port 1I:box 1:bay 2, Solid State SATA, 800 GB, OK)
physicaldrive 1I:1:3 (port 1I:box 1:bay 3, Solid State SATA, 800 GB, OK)
physicaldrive 1I:1:4 (port 1I:box 1:bay 4, Solid State SATA, 800 GB, OK)
physicaldrive 1I:1:5 (port 1I:box 1:bay 5, Solid State SATA, 800 GB, OK)
physicaldrive 1I:1:6 (port 1I:box 1:bay 6, Solid State SATA, 800 GB, OK)
physicaldrive 1I:1:7 (port 1I:box 1:bay 7, Solid State SATA, 800 GB, OK)
physicaldrive 1I:1:8 (port 1I:box 1:bay 8, Solid State SATA, 800 GB, OK)
physicaldrive 2I:2:1 (port 2I:box 2:bay 1, Solid State SATA, 800 GB, OK)
physicaldrive 2I:2:2 (port 2I:box 2:bay 2, Solid State SATA, 800 GB, OK)
I have started MySQL without replication and, as there were no errors, I have started the replication thread. The server will remain depooled until we are sure it works fine.
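A drive status listing like the one above can also be checked mechanically rather than by eye. This is a rough sketch, assuming each `logicaldrive`/`physicaldrive` line ends its parenthesised field list with the status, as in the output quoted here; `find_unhealthy` is a hypothetical helper, not an hpssacli feature:

```python
import re

# Matches e.g. "physicaldrive 1I:1:1 (port 1I:box 1:bay 1, Solid State SATA, 800 GB, OK)"
DRIVE_LINE = re.compile(r"^\s*(logicaldrive|physicaldrive)\s+(\S+)\s+\((.*)\)\s*$")

def find_unhealthy(listing):
    """Return (kind, drive_id, status) for every drive whose status is not OK."""
    unhealthy = []
    for line in listing.splitlines():
        match = DRIVE_LINE.match(line)
        if match is None:
            continue  # skip lines that are not drive entries
        kind, drive_id, fields = match.groups()
        # Assumption: the status is the last comma-separated field.
        status = fields.split(",")[-1].strip()
        if status != "OK":
            unhealthy.append((kind, drive_id, status))
    return unhealthy
```

An empty result means every listed drive reported `OK`; anything else is worth a closer look before repooling the host.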
This server has also had some issues with its power supplies, which according to the iLO are fine now. For the record:
/system1/log1/record8
Targets
Properties
number=8
severity=Repaired
date=10/31/2016
time=15:52
description=System Power Supply: Input Power Loss or Unplugged Power Cord, Verify Power Supply Input (Power Supply 1)
Verbs
cd version exit show
</system1/log1>hpiLO-> show record9
status=0
status_tag=COMMAND COMPLETED
Tue Nov 22 07:08:36 2016
/system1/log1/record9
Targets
Properties
number=9
severity=Repaired
date=10/31/2016
time=15:52
description=System Power Supplies Not Redundant
Verbs
cd version exit show
</system1/log1>hpiLO-> show record10
status=0
status_tag=COMMAND COMPLETED
Tue Nov 22 07:08:38 2016
/system1/log1/record10
Targets
Properties
number=10
severity=Repaired
date=11/02/2016
time=17:31
description=System Power Supply: Input Power Loss or Unplugged Power Cord, Verify Power Supply Input (Power Supply 1)
Verbs
cd version exit show
</system1/log1>hpiLO-> show record11
status=0
status_tag=COMMAND COMPLETED
Tue Nov 22 07:08:40 2016
/system1/log1/record11
Targets
Properties
number=11
severity=Repaired
date=11/02/2016
time=17:31
description=System Power Supplies Not Redundant
Verbs
cd version exit show
Change 322858 had a related patch set uploaded (by Jcrespo):
Depool db1091 to apply blocking schema change
Change 323791 had a related patch set uploaded (by Marostegui):
db-eqiad.php: Repool db1092
Mentioned in SAL (#wikimedia-operations) [2016-11-28T07:38:26Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1092 - T151272 (duration: 00m 47s)