
Degraded RAID on ms-be2037
Closed, Resolved · Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host ms-be2037. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0

$ sudo /usr/local/lib/nagios/plugins/get_raid_status_md
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md1 : active (auto-read-only) raid1 sda2[0] sdb2[1]
      976320 blocks super 1.2 [2/2] [UU]
      
md0 : active raid1 sda1[0] sdb1[1](F)
      58559488 blocks super 1.2 [2/1] [U_]
      
unused devices: <none>
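
The snapshot above can also be checked programmatically. The following is a minimal sketch, not the actual `get_raid_status_md` plugin: it assumes the `/proc/mdstat` layout shown in this task, and the function name `degraded_arrays` is hypothetical.

```python
import re

# Sample input copied from the snapshot in this task.
MDSTAT = """\
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md1 : active (auto-read-only) raid1 sda2[0] sdb2[1]
      976320 blocks super 1.2 [2/2] [UU]

md0 : active raid1 sda1[0] sdb1[1](F)
      58559488 blocks super 1.2 [2/1] [U_]

unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Return {array: [members flagged (F)]} for arrays missing a member."""
    result = {}
    current = None
    failed = []
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+) : (.*)$", line)
        if header:
            current = header.group(1)
            # Members marked faulty by md carry an "(F)" suffix.
            failed = re.findall(r"(\w+)\[\d+\]\(F\)", header.group(2))
            continue
        # Status line, e.g. "[2/1] [U_]": expected vs. active members.
        status = re.search(r"\[(\d+)/(\d+)\] \[([U_]+)\]", line)
        if current and status:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected or "_" in status.group(3):
                result[current] = failed
            current = None
    return result


print(degraded_arrays(MDSTAT))  # → {'md0': ['sdb1']}
```

On the snapshot above this reports only md0 as degraded, with sdb1 as the faulty member, which matches the `[2/1] [U_]` status line.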

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jan 8 2018, 1:55 AM
Volans added a subscriber: fgiunchedi.
** 8 printk messages dropped ** [5424311.775321] sd 0:1:0:0: rejecting I/O to offline device
[5424311.832004] sd 0:1:0:0: rejecting I/O to offline device
** 8 printk messages dropped ** [5424311.832047] sd 0:1:0:0: rejecting I/O to offline device
** 9 printk messages dropped ** [5424311.887301] sd 0:1:0:0: rejecting I/O to offline device
** 9 printk messages dropped ** [5424311.943327] sd 0:1:0:0: rejecting I/O to offline device
[5424312.000904] sd 0:1:0:0: rejecting I/O to offline device
** 8 printk messages dropped ** [5424312.001157] sd 0:1:0:0: rejecting I/O to offline device
[5424312.054512] sd 0:1:0:0: rejecting I/O to offline device
** 8 printk messages dropped ** [5424312.054923] sd 0:1:0:0: rejecting I/O to offline device
[5424312.054927] sd 0:1:0:0: rejecting I/O to offline device
** 8 printk messages dropped ** [5424312.111138] sd 0:1:0:0: rejecting I/O to offline device
** 9 printk messages dropped ** [5424312.167841] sd 0:1:0:0: rejecting I/O to offline device
[5424312.167845] sd 0:1:0:0: rejecting I/O to offline device
** 8 printk messages dropped ** [5424312.223447] sd 0:1:0:0: rejecting I/O to offline device
** 3 printk messages dropped ** [5424312.283072] sd 0:1:0:0: rejecting I/O to offline device
[5424312.283076] sd 0:1:0:0: rejecting I/O to offline device
[5424312.283270] sd 0:1:0:0: rejecting I/O to offline device
[5424312.283276] sd 0:1:0:0: rejecting I/O to offline device
[5424312.283278] sd 0:1:0:0: rejecting I/O to offline device
[5424312.283281] sd 0:1:0:0: rejecting I/O to offline device
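
A log excerpt like the one above is easier to read when condensed. The sketch below is an illustration only, assuming the kernel message format shown in this task; the function name `summarize` is hypothetical. It counts the rejected-I/O messages per SCSI device and sums the messages that printk reports as dropped.

```python
import re
from collections import Counter

# A few lines copied from the excerpt in this task.
LOG = """\
** 8 printk messages dropped ** [5424311.775321] sd 0:1:0:0: rejecting I/O to offline device
[5424311.832004] sd 0:1:0:0: rejecting I/O to offline device
** 9 printk messages dropped ** [5424311.887301] sd 0:1:0:0: rejecting I/O to offline device
"""


def summarize(log_text):
    """Return ({scsi_address: rejected_count}, total_dropped_messages)."""
    rejected = Counter()
    dropped = 0
    for line in log_text.splitlines():
        drop = re.search(r"\*\* (\d+) printk messages dropped \*\*", line)
        if drop:
            dropped += int(drop.group(1))
        rej = re.search(
            r"sd (\d+:\d+:\d+:\d+): rejecting I/O to offline device", line
        )
        if rej:
            rejected[rej.group(1)] += 1
    return dict(rejected), dropped


print(summarize(LOG))  # → ({'0:1:0:0': 3}, 17)
```

Every message in the excerpt refers to the same device, 0:1:0:0, so a summary like this makes it clear that a single device went offline rather than several.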

Mentioned in SAL (#wikimedia-operations) [2018-01-08T09:46:31Z] <godog> reboot ms-be2037 - T184390

Slot 3 Port 1 : Smart Array P840 Controller - (4096 MB, V4.52) 14 Logical
Drive(s) - Operation Failed
 - 1719-Slot 3 Drive Array - A controller failure event occurred prior
   to this power-up.  (Previous lock up code = 0x13) Action: Install the
   latest controller firmware. If the problem persists, replace the
   controller.

Mentioned in SAL (#wikimedia-operations) [2018-01-08T09:53:34Z] <godog> Flashing Smart Array P840 in Slot 3 [ 4.52 -> 6.06 ] on ms-be2037 - T184390 T141756

fgiunchedi closed this task as Resolved. · Jan 8 2018, 9:59 AM
fgiunchedi claimed this task.

Looks like the controller locked up and mdadm kicked one disk out of the array? Upon reboot the SSDs show up as healthy (according to the controller, anyway). I've also upgraded the firmware just in case, since that was pending anyway.

=> set target controller slot=3

   "controller slot=3"

=> ld all show

Smart Array P840 in Slot 3

   array A

      logicaldrive 1 (447.1 GB, RAID 0, OK)

   array B

      logicaldrive 2 (447.1 GB, RAID 0, OK)

   array C

      logicaldrive 3 (3.6 TB, RAID 0, OK)

   array D

      logicaldrive 4 (3.6 TB, RAID 0, OK)

   array E

      logicaldrive 5 (3.6 TB, RAID 0, OK)

   array F

      logicaldrive 6 (3.6 TB, RAID 0, OK)

   array G

      logicaldrive 7 (3.6 TB, RAID 0, OK)

   array H

      logicaldrive 8 (3.6 TB, RAID 0, OK)

   array I

      logicaldrive 9 (3.6 TB, RAID 0, OK)

   array J

      logicaldrive 10 (3.6 TB, RAID 0, OK)

   array K

      logicaldrive 11 (3.6 TB, RAID 0, OK)

   array L

      logicaldrive 12 (3.6 TB, RAID 0, OK)

   array M

      logicaldrive 13 (3.6 TB, RAID 0, OK)

   array N

      logicaldrive 14 (3.6 TB, RAID 0, OK)

=> pd all show

Smart Array P840 in Slot 3

   array A

      physicaldrive 2I:4:1 (port 2I:box 4:bay 1, Solid State SATA, 480.1 GB, OK)

   array B

      physicaldrive 2I:4:2 (port 2I:box 4:bay 2, Solid State SATA, 480.1 GB, OK)

   array C

      physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SATA, 4000.7 GB, OK)

   array D

      physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SATA, 4000.7 GB, OK)

   array E

      physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SATA, 4000.7 GB, OK)

   array F

      physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SATA, 4000.7 GB, OK)

   array G

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SATA, 4000.7 GB, OK)

   array H

      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SATA, 4000.7 GB, OK)

   array I

      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SATA, 4000.7 GB, OK)

   array J

      physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SATA, 4000.7 GB, OK)

   array K

      physicaldrive 2I:2:1 (port 2I:box 2:bay 1, SATA, 4000.7 GB, OK)

   array L

      physicaldrive 2I:2:2 (port 2I:box 2:bay 2, SATA, 4000.7 GB, OK)

   array M

      physicaldrive 2I:2:3 (port 2I:box 2:bay 3, SATA, 4000.7 GB, OK)

   array N

      physicaldrive 2I:2:4 (port 2I:box 2:bay 4, SATA, 4000.7 GB, OK)

=>
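
The controller's per-drive status can be checked without reading the listing by eye. The following is a rough sketch assuming the `pd all show` output format shown above; the function names `drive_statuses` and `unhealthy` are hypothetical and not part of any HPE tool.

```python
import re

# Two entries copied from the "pd all show" output in this task.
PD_OUTPUT = """\
   array A

      physicaldrive 2I:4:1 (port 2I:box 4:bay 1, Solid State SATA, 480.1 GB, OK)

   array C

      physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SATA, 4000.7 GB, OK)
"""


def drive_statuses(text):
    """Map each physical drive ID to the last field in its parentheses."""
    drives = {}
    for line in text.splitlines():
        m = re.match(r"\s*physicaldrive (\S+) \((.*)\)\s*$", line)
        if m:
            fields = [f.strip() for f in m.group(2).split(",")]
            # The status is assumed to be the final comma-separated field.
            drives[m.group(1)] = fields[-1]
    return drives


def unhealthy(drives):
    """Return the sorted IDs of drives whose status is anything but OK."""
    return sorted(d for d, status in drives.items() if status != "OK")


statuses = drive_statuses(PD_OUTPUT)
print(statuses)             # → {'2I:4:1': 'OK', '1I:1:5': 'OK'}
print(unhealthy(statuses))  # → []
```

An empty list here only reflects the controller's own view of the drives, so it is worth cross-checking against the md array state as well.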
238482n375 removed fgiunchedi as the assignee of this task. · Jun 15 2018, 8:02 AM
238482n375 triaged this task as Lowest priority.
238482n375 moved this task from Next Up to In Code Review on the Analytics-Kanban board.
238482n375 edited subscribers, added: 238482n375; removed: Aklapper.

238482n375 set Security to Software security bug. · Jun 15 2018, 8:05 AM
238482n375 changed the visibility from "Public (No Login Required)" to "Custom Policy".

Restricted Application added a project: Security. · Jun 15 2018, 2:25 PM
Aklapper changed the visibility from "Custom Policy" to "Public (No Login Required)".