
RAID crashed on db1078
Closed, ResolvedPublic

Description

This was not only a disk failure: it took the storage offline for some time, causing a database crash.

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (hpssacli) was detected on host db1078. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: Slot 1: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2 - Failed: 1I:1:4 - Controller: OK - Battery/Capacitor: OK
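The one-line summary above is what the event handler pastes into these tasks; a small parser makes the failed-drive list machine-readable. This is a hypothetical helper sketched for illustration (`parse_raid_summary` is not part of the Nagios/Icinga handler) and relies only on the `OK: ... - Failed: ... -` layout seen above:

```python
import re

def parse_raid_summary(line):
    """Split the handler's one-line summary into OK and Failed drive IDs.

    Hypothetical helper: depends only on the 'OK: ... - Failed: ... -'
    layout of the summary line pasted above.
    """
    def drives(label):
        m = re.search(label + r": ([\w:, ]+?) -", line)
        return [d.strip() for d in m.group(1).split(",")] if m else []
    return drives("OK"), drives("Failed")
```

On the summary line above this returns nine drives in the OK list and `['1I:1:4']` as failed.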

Smart Array P840 in Slot 1

   array A

      Logical Drive: 1
         Size: 3.6 TB
         Fault Tolerance: 1+0
         Heads: 255
         Sectors Per Track: 32
         Cylinders: 65535
         Strip Size: 256 KB
         Full Stripe Size: 1280 KB
         Status: Interim Recovery Mode
         MultiDomain Status: OK
         Caching:  Disabled
         Unique Identifier: 600508B1001CB4212BB3DB66161C1F69
         Disk Name: /dev/sda 
         Mount Points: / 37.3 GB Partition Number 2
         OS Status: LOCKED
         Logical Drive Label: 00892210PDNNF0ARH9O0QN7B82
         Mirror Group 1:
            physicaldrive 1I:1:1 (port 1I:box 1:bay 1, Solid State SATA, 800 GB, OK)
            physicaldrive 1I:1:2 (port 1I:box 1:bay 2, Solid State SATA, 800 GB, OK)
            physicaldrive 1I:1:3 (port 1I:box 1:bay 3, Solid State SATA, 800 GB, OK)
            physicaldrive 1I:1:4 (port 1I:box 1:bay 4, Solid State SATA, 800 GB, Failed)
            physicaldrive 1I:1:5 (port 1I:box 1:bay 5, Solid State SATA, 800 GB, OK)
         Mirror Group 2:
            physicaldrive 1I:1:6 (port 1I:box 1:bay 6, Solid State SATA, 800 GB, OK)
            physicaldrive 1I:1:7 (port 1I:box 1:bay 7, Solid State SATA, 800 GB, OK)
            physicaldrive 1I:1:8 (port 1I:box 1:bay 8, Solid State SATA, 800 GB, OK)
            physicaldrive 2I:2:1 (port 2I:box 2:bay 1, Solid State SATA, 800 GB, OK)
            physicaldrive 2I:2:2 (port 2I:box 2:bay 2, Solid State SATA, 800 GB, OK)
         Drive Type: Data
         LD Acceleration Method: HP SSD Smart Path
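The snapshot's geometry is self-consistent: RAID 1+0 over ten 800 GB SSDs means five mirrored pairs, so the full stripe is five 256 KB strips and the usable size is five drives' worth. A quick arithmetic check (plain Python, nothing controller-specific):

```python
drives, drive_gb, strip_kb = 10, 800, 256

mirror_pairs = drives // 2                 # RAID 1+0: one copy per pair
full_stripe_kb = strip_kb * mirror_pairs   # one strip per data drive
usable_tib = mirror_pairs * drive_gb * 1e9 / 2**40  # decimal GB -> TiB

assert full_stripe_kb == 1280              # "Full Stripe Size: 1280 KB"
assert round(usable_tib, 1) == 3.6         # "Size: 3.6 TB"
```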

Event Timeline

Restricted Application added a subscriber: Aklapper. Aug 15 2017, 9:38 AM
jcrespo renamed this task from Degraded RAID on db1078 to RAID crashed on db1078. Aug 15 2017, 9:45 AM
jcrespo added a project: DBA.
jcrespo updated the task description. (Show Details)
jcrespo added subscribers: Marostegui, Joe.
170815  9:26:51 [ERROR] InnoDB: Tried to read 16384 bytes at offset 20021248. Was only able to read 0.
2017-08-15 09:26:51 7f0b8bd0a700  InnoDB: Operating system error number 5 in a file operation.
InnoDB: Error number 5 means 'Input/output error'.
InnoDB: Some operating system error numbers are described at
InnoDB: http://dev.mysql.com/doc/refman/5.6/en/operating-system-error-codes.html
170815  9:26:51 [ERROR] InnoDB: File (unknown): 'read' returned OS error 105. Cannot continue operation
170815 09:28:20 mysqld_safe Number of processes running now: 0
170815 09:28:20 mysqld_safe mysqld restarted
170815  9:28:20 [Note] /opt/wmf-mariadb10/bin/mysqld (mysqld 10.0.23-MariaDB-log) starting as process 30180 ...
<_joe_> [44540604.714419] hpsa 0000:08:00.0: scsi 0:1:0:0: resetting logical  Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap+ En+ Exp=1
<_joe_> [44540622.920488] hpsa 0000:08:00.0: scsi 0:1:0:0: reset logical  completed successfully Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap+ En+ Exp=1
<_joe_> [44540632.810707] hpsa 0000:08:00.0: Acknowledging event: 0x40000030 (HP SSD Smart Path state change)
<_joe_> [Tue Aug 15 09:31:43 2017] hpsa 0000:08:00.0: scsi 0:0:3:0: removed Direct-Access     ATA      LK0800GEYMU      PHYS DRV SSDSmartPathCap- En- Exp=0
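The failed read in the MySQL log lines up exactly with InnoDB's 16 KB page size, and the dmesg timestamps bound how long just the logical-volume reset took. Checking the numbers (errno 5 is EIO on Linux):

```python
import errno

PAGE_SIZE = 16384                     # InnoDB default page size
offset = 20021248                     # from the error log above

assert offset % PAGE_SIZE == 0        # the read was page-aligned
assert offset // PAGE_SIZE == 1222    # the page InnoDB tried to fetch
assert errno.errorcode[5] == "EIO"    # "Input/output error"

# dmesg: reset issued at 44540604.71, completed at 44540622.92
reset_s = 44540622.920488 - 44540604.714419
assert 18 < reset_s < 19              # ~18 s for the reset alone
```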
jcrespo triaged this task as High priority. Aug 15 2017, 10:47 AM
jcrespo added a subscriber: Cmjohnson.

@Cmjohnson aside from the disk replacement, please open a ticket with HP asking whether a disk failure shouldn't be transparent with the current configuration, or whether we need to disable SmartPath and enable the RAID cache so it behaves like the Dell RAIDs.

A case for a new disk has been created. Your case was successfully submitted. Please note your Case ID: 5322179480 for future reference.

Regarding the DB crash: a crash because of a single disk failure is odd. I will need to take the server offline. Let me know when it's safe to do so.

> A case for a new disk has been created. Your case was successfully submitted. Please note your Case ID: 5322179480 for future reference.
>
> Regarding the DB crash: a crash because of a single disk failure is odd. I will need to take the server offline. Let me know when it's safe to do so.

I can have it offline for you in around 1h, is that ok?

Sure, sometime in the next few hours is fine.

> Regarding the DB crash: a crash because of a single disk failure is odd. I will need to take the server offline. Let me know when it's safe to do so.

Joe's thesis is that maybe the SmartPath setup doesn't provide real high availability, and that it would require the RAID cache to be enabled (which is odd indeed). It took around 1m30s for the RAID to respond, causing an OS IO error, which is enough for InnoDB to crash itself to prevent further consistency issues. Independently of the HP case, I would like at some point to perform a controlled test (on a non-critical server): physically extract a disk from an HP RAID while the server is under IO load and see if the crash can be replicated. That would tell us whether this is a one-off (bad luck) scenario with this controller or something that will happen on all HP servers with the current config.
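The pull test described above needs a way to observe the stall from userspace. A minimal sketch, assuming a plain file on the affected volume (`monitor_read_latency` is a hypothetical helper, not an existing tool): it times each read and records any that exceeds a threshold, which would catch a controller that stops responding for ~1m30s. A real test would open the file with O_DIRECT so the page cache doesn't mask the latency.

```python
import os
import time

def monitor_read_latency(path, block_size=16384, stall_threshold=5.0,
                         max_reads=1000):
    """Read blocks from `path`, returning (offset, seconds) for every
    read slower than `stall_threshold`.

    Hypothetical helper for the proposed disk-pull test; buffered IO
    is used here for simplicity, so cached reads will look fast.
    """
    stalls = []
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        for i in range(max_reads):
            offset = (i * block_size) % max(size - block_size, 1)
            f.seek(offset)
            t0 = time.monotonic()
            f.read(block_size)
            elapsed = time.monotonic() - t0
            if elapsed > stall_threshold:
                stalls.append((offset, elapsed))
    return stalls
```

Run against a file on the array while the disk is pulled; an empty result means no read stalled longer than the threshold.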

Mentioned in SAL (#wikimedia-operations) [2017-08-16T15:54:52Z] <marostegui> Stop MySQL and shutdown db1078 for HW checks - T173365

I verified the settings, everything appears normal.

Cmjohnson moved this task from Backlog to Up next on the ops-eqiad board.Aug 16 2017, 5:30 PM

@Cmjohnson do you think the disk will arrive today? I wouldn't want to leave this host depooled for the weekend :-(
We can always repool it even with a degraded array, but that's not ideal I guess.

Marostegui moved this task from Triage to In progress on the DBA board.Aug 21 2017, 8:20 AM

I would like to propose db1076 (s2) as a candidate host for the test once db1078 is back in the pool with the new disk.
db1076 belongs to s2, and there are two more powerful hosts there.
It is from the same batch as db1078, and they have the same controller and driver firmware versions.

If db1076 doesn't behave the same, I do believe we should test db1078 again and pull out a different disk to see how it behaves, so we know whether this was a one-time issue or something recurrent.

jcrespo added a subscriber: RobH. Aug 22 2017, 10:59 AM

Please @RobH @Cmjohnson help us move this forward: s3 is running with very reduced redundancy right now, and if this is confirmed to be a problem affecting a large batch of servers, it could have a huge impact on the availability of Wikimedia projects.

The disk was finally sent. HP added another report they wanted in addition to the AHS log. That report would have required powering the server off, which is ridiculous for a failed disk. I had to argue that they're wasting our time and the server is in a critical state. Anyway it shipped and I will swap as soon as I get it.

> The disk was finally sent. HP added another report they wanted in addition to the AHS log. That report would have required powering the server off, which is ridiculous for a failed disk. I had to argue that they're wasting our time and the server is in a critical state. Anyway it shipped and I will swap as soon as I get it.

The server is depooled at the moment, if you have a way to run that and get it sent to HP, we could power it off for you.

Thanks!

@Marostegui The ssd has been replaced. Please resolve after rebuild

Thank you, Chris!

@RobH @Cmjohnson if you are OK with that, we would like, at lower priority, to do some disk-degradation testing at some point in the future.

Change 373306 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Pool db1078 with low weight after maintenance

https://gerrit.wikimedia.org/r/373306

Change 373306 merged by Jcrespo:
[operations/mediawiki-config@master] mariadb: Pool db1078 with low weight after maintenance

https://gerrit.wikimedia.org/r/373306

Change 373363 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Repool db1078 with full weight after maintenance

https://gerrit.wikimedia.org/r/373363

> @Marostegui The ssd has been replaced. Please resolve after rebuild

Should we close this ticket and create a new one for testing another host to see how it behaves?
I don't mind either way :-)

Change 373363 merged by Jcrespo:
[operations/mediawiki-config@master] mariadb: Repool db1078 with full weight after maintenance

https://gerrit.wikimedia.org/r/373363

jcrespo closed this task as Resolved.Aug 24 2017, 4:42 PM
jcrespo assigned this task to Cmjohnson.

> Should we close this ticket and create a new one for testing another host and see its behaviour?

Let's just do it.

let me know which db you want to test and when?

Let us take some time to find a good candidate and create some fake load, and we will ping either you or Papaul on T174054.

Return shipping info for disk

UPS 1ZW0948Y9082750467