
db1127 possible storage problems
Closed, Resolved (Public)

Description

RAID controller logs:

Device ID: 0
Enclosure Index: 32
Slot Number: 0
CDB Length: 10
CDB Data:
002a 0000 00ba 003a 004c 0070 0000 0000 0010 0000 0000 0000 0000 0000 0000 0000
Sense Length: 18
Sense Data:
0070 0000 0006 0000 0000 0000 0000 000a 0000 0000 0000 0000 0029 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000

seqNum: 0x00000a31
Time: Fri May 13 20:19:36 2022

Code: 0x00000071
Class: 0
Locale: 0x02
Event Description: Unexpected sense: PD 04(e0x20/s4) Path 500056b3d6cab5c4, CDB: 2a 00 00 d0 4c 00 00 02 00 00, Sense: 6/29/00
Event Data:
===========
Device ID: 4
Enclosure Index: 32
Slot Number: 4
CDB Length: 10
CDB Data:
002a 0000 0000 00d0 004c 0000 0000 0002 0000 0000 0000 0000 0000 0000 0000 0000
Sense Length: 18
Sense Data:
0070 0000 0006 0000 0000 0000 0000 000a 0000 0000 0000 0000 0029 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000

seqNum: 0x00000a32
Time: Wed May 18 09:33:10 2022

Code: 0x00000071
Class: 0
Locale: 0x02
Event Description: Unexpected sense: PD 01(e0x20/s1) Path 500056b3d6cab5c1, CDB: 2a 00 57 39 86 00 00 02 00 00, Sense: 6/29/00
Event Data:
===========
Device ID: 1
Enclosure Index: 32
Slot Number: 1
CDB Length: 10
CDB Data:
002a 0000 0057 0039 0086 0000 0000 0002 0000 0000 0000 0000 0000 0000 0000 0000
Sense Length: 18
Sense Data:
0070 0000 0006 0000 0000 0000 0000 000a 0000 0000 0000 0000 0029 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000

seqNum: 0x00000a33
Time: Thu May 19 19:17:39 2022

Code: 0x00000071
Class: 0
Locale: 0x02
Event Description: Unexpected sense: PD 03(e0x20/s3) Path 500056b3d6cab5c3, CDB: 28 00 44 79 5b 38 00 00 10 00, Sense: 6/29/00
Event Data:
===========
Device ID: 3
Enclosure Index: 32
Slot Number: 3
CDB Length: 10
CDB Data:
0028 0000 0044 0079 005b 0038 0000 0000 0010 0000 0000 0000 0000 0000 0000 0000
Sense Length: 18
Sense Data:
0070 0000 0006 0000 0000 0000 0000 000a 0000 0000 0000 0000 0029 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000

seqNum: 0x00000a34
Time: Sat May 21 06:10:06 2022

Code: 0x00000071
Class: 0
Locale: 0x02
Event Description: Unexpected sense: PD 00(e0x20/s0) Path 500056b3d6cab5c0, CDB: 2a 00 04 3a dc 00 00 02 00 00, Sense: 6/29/00
Event Data:
===========
Device ID: 0
Enclosure Index: 32
Slot Number: 0
CDB Length: 10
CDB Data:
002a 0000 0004 003a 00dc 0000 0000 0002 0000 0000 0000 0000 0000 0000 0000 0000
Sense Length: 18
Sense Data:
0070 0000 0006 0000 0000 0000 0000 000a 0000 0000 0000 0000 0029 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
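
For reading the events above, here is a minimal Python sketch that decodes one "Event Description" line. It assumes the standard 10-byte SCSI CDB layout (opcode in byte 0, big-endian LBA in bytes 2-5, transfer length in bytes 7-8) and that the controller prints sense as key/ASC/ASCQ in hex. The function name `decode_event` and the lookup tables are illustrative, not part of any tool, and the tables cover only the values that appear in this log.

```python
import re

# Only the opcodes and sense values seen in the log above; names follow the SCSI spec.
OPCODES = {0x28: "READ(10)", 0x2A: "WRITE(10)"}
SENSE_KEYS = {0x6: "UNIT ATTENTION"}
ASC_ASCQ = {(0x29, 0x00): "POWER ON, RESET, OR BUS DEVICE RESET OCCURRED"}

# Hypothetical pattern written against the "Unexpected sense" lines in this task.
EVENT_RE = re.compile(
    r"PD (?P<pd>[0-9a-fA-F]{2})\(e0x(?P<encl>[0-9a-fA-F]+)/s(?P<slot>\d+)\).*?"
    r"CDB: (?P<cdb>(?:[0-9a-fA-F]{2} ?)+), "
    r"Sense: (?P<key>[0-9a-fA-F]+)/(?P<asc>[0-9a-fA-F]+)/(?P<ascq>[0-9a-fA-F]+)"
)


def decode_event(description):
    """Decode the drive slot, the 10-byte CDB and the sense triple of one event."""
    m = EVENT_RE.search(description)
    if m is None:
        raise ValueError("not an 'Unexpected sense' event description")
    cdb = bytes.fromhex(m["cdb"].replace(" ", ""))
    if len(cdb) != 10:
        raise ValueError("this sketch only handles 10-byte CDBs")
    key, asc, ascq = (int(m[name], 16) for name in ("key", "asc", "ascq"))
    return {
        "enclosure": int(m["encl"], 16),
        "slot": int(m["slot"]),
        "op": OPCODES.get(cdb[0], hex(cdb[0])),
        "lba": int.from_bytes(cdb[2:6], "big"),
        "blocks": int.from_bytes(cdb[7:9], "big"),
        "sense_key": SENSE_KEYS.get(key, hex(key)),
        "sense": ASC_ASCQ.get((asc, ascq), f"ASC {asc:#04x} / ASCQ {ascq:#04x}"),
    }


if __name__ == "__main__":
    line = ("Unexpected sense: PD 04(e0x20/s4) Path 500056b3d6cab5c4, "
            "CDB: 2a 00 00 d0 4c 00 00 02 00 00, Sense: 6/29/00")
    for field, value in decode_event(line).items():
        print(f"{field}: {value}")
```

Read this way, every event in the log is a READ(10) or WRITE(10) that came back with sense 6/29/00, that is, a unit attention reporting a power-on or reset of the drive rather than a media error.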
root@db1127:~# megacli -pdlist -a0 | grep -i error
Media Error Count: 0
Other Error Count: 5
Media Error Count: 0
Other Error Count: 3
Media Error Count: 0
Other Error Count: 0
Media Error Count: 0
Other Error Count: 2
Media Error Count: 0
Other Error Count: 1
Media Error Count: 0
Other Error Count: 3
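
The grep output above can be grouped per disk with a short Python sketch like the following. It assumes each drive contributes one "Media Error Count" line followed by one "Other Error Count" line, and that the pairs appear in the order `megacli -pdlist` lists the drives; the index it reports is the position in that listing, not a verified slot number. The function name `parse_error_counts` is illustrative.

```python
# Output of `megacli -pdlist -a0 | grep -i error` as pasted in this task.
SAMPLE = """\
Media Error Count: 0
Other Error Count: 5
Media Error Count: 0
Other Error Count: 3
Media Error Count: 0
Other Error Count: 0
Media Error Count: 0
Other Error Count: 2
Media Error Count: 0
Other Error Count: 1
Media Error Count: 0
Other Error Count: 3
"""


def parse_error_counts(grep_output):
    """Pair up Media/Other error count lines into one dict per listed drive."""
    disks = []
    for line in grep_output.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue  # skip anything that is not a "key: value" line
        name = name.strip()
        if name == "Media Error Count":
            # Assumption: a media line starts the entry for the next drive.
            disks.append({"media": int(value), "other": None})
        elif name == "Other Error Count" and disks:
            disks[-1]["other"] = int(value)
    return disks


if __name__ == "__main__":
    for index, counts in enumerate(parse_error_counts(SAMPLE)):
        flag = "  <-- non-zero" if counts["media"] or counts["other"] else ""
        print(f"drive #{index}: media={counts['media']} other={counts['other']}{flag}")
```

On this host the media error counts are all zero and only the "other" counters are non-zero, which fits the controller events being reset notifications rather than bad sectors.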

The host paged for replication lag and keeps falling behind.

Event Timeline

Marostegui moved this task from Triage to In progress on the DBA board.

Needs investigation. For now the host is disabled and its notifications are disabled too.

The host keeps lagging, so this looks like a hardware issue. I am investigating.

The iDRAC password also looks out of sync.

I have rebooted the host, and the disk error counters have been reset. So far the controller logs look clean again and MySQL is catching up.

We should probably do a firmware upgrade anyway, in case we need to contact support.

There are no more disk errors, so I am going to upgrade MariaDB to 10.6.8 (T308915) and repool the host. If this happens again, we can reopen.

Host being repooled automatically. Closing this for now.