Page MenuHomePhabricator

Alert on unmounted swift partitions
Open, Needs TriagePublic

Description

A few days ago we had another case on ms-be1022 where sdb failed a few commands shortly after reboot, however md was fine with it (the raid array was healthy) and swift-drive-audit unmounted the sdb3 partition because of the errors.

[   21.402450] sd 0:1:0:1: [sdb] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[   21.402455] sd 0:1:0:1: [sdb] tag#0 Sense Key : Aborted Command [current] 
[   21.402462] sd 0:1:0:1: [sdb] tag#0 Add. Sense: Information unit iuCRC error detected
[   21.402467] sd 0:1:0:1: [sdb] tag#0 CDB: Read(10) 28 00 00 13 9a e0 00 01 00 00
[   21.402470] blk_update_request: I/O error, dev sdb, sector 1284832
[   21.433211] md/raid1:md0: sdb1: rescheduling sector 1217248
[   21.462696] md/raid1:md0: redirecting sector 1217248 to other mirror: sda1
[   21.948002] Process accounting resumed
[   22.071853] sd 0:1:0:1: [sdb] tag#9 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[   22.071858] sd 0:1:0:1: [sdb] tag#9 Sense Key : Aborted Command [current] 
[   22.071862] sd 0:1:0:1: [sdb] tag#9 Add. Sense: Information unit iuCRC error detected
[   22.071866] sd 0:1:0:1: [sdb] tag#9 CDB: Read(10) 28 00 01 c1 0e 40 00 00 08 00
[   22.071869] blk_update_request: I/O error, dev sdb, sector 29429312
[   22.106236] md/raid1:md0: sdb1: rescheduling sector 29361728
[   22.137065] md/raid1:md0: redirecting sector 29361728 to other mirror: sda1
[   22.157770] x86/PAT: hpasmlited:1519 map pfn expected mapping type uncached-minus for [mem 0x79170000-0x79172fff], got write-back
[   22.157836] hpasmlited[1519]: segfault at 0 ip 000000000040605b sp 00007ffd6251a3d0 error 4 in hpasmlited[400000+3a000]
[   22.337399] sd 0:1:0:1: [sdb] tag#16 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[   22.337404] sd 0:1:0:1: [sdb] tag#16 Sense Key : Aborted Command [current] 
[   22.337409] sd 0:1:0:1: [sdb] tag#16 Add. Sense: Information unit iuCRC error detected
[   22.337413] sd 0:1:0:1: [sdb] tag#16 CDB: Read(10) 28 00 06 15 f2 20 00 00 08 00
[   22.337415] blk_update_request: I/O error, dev sdb, sector 102101536
[   22.371127] md/raid1:md0: sdb1: rescheduling sector 102033952
[   22.401712] md/raid1:md0: redirecting sector 102033952 to other mirror: sda1
[   22.513377] sd 0:1:0:1: [sdb] tag#1 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[   22.513391] sd 0:1:0:1: [sdb] tag#1 Sense Key : Aborted Command [current] 
[   22.513397] sd 0:1:0:1: [sdb] tag#1 Add. Sense: Information unit iuCRC error detected
[   22.513413] sd 0:1:0:1: [sdb] tag#1 CDB: Read(10) 28 00 06 0c 0f 00 00 01 00 00
[   22.513417] blk_update_request: I/O error, dev sdb, sector 101453568
[   22.547196] md/raid1:md0: sdb1: rescheduling sector 101385984
[   22.583196] md/raid1:md0: redirecting sector 101385984 to other mirror: sda1
[   24.082390] tg3 0000:02:00.0 eno1: Link is up at 1000 Mbps, full duplex
[   24.082393] tg3 0000:02:00.0 eno1: Flow control is off for TX and off for RX
[   24.082396] tg3 0000:02:00.0 eno1: EEE is disabled
[   24.082424] IPv6: ADDRCONF(NETDEV_CHANGE): eno1: link becomes ready
[   26.138790] sd 0:1:0:1: [sdb] tag#3 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[   26.139033] sd 0:1:0:1: [sdb] tag#3 Sense Key : Aborted Command [current] 
[   26.139281] sd 0:1:0:1: [sdb] tag#3 Add. Sense: Information unit iuCRC error detected
[   26.139285] sd 0:1:0:1: [sdb] tag#3 CDB: Read(10) 28 00 0b 77 7f 60 00 00 20 00
[   26.139288] blk_update_request: I/O error, dev sdb, sector 192380768
[   26.177308] XFS (sdb3): metadata I/O error: block 0x45d8f60 ("xfs_trans_read_buf_map") error 5 numblks 32
[   26.231104] XFS (sdb3): xfs_imap_to_bp: xfs_trans_read_buf() returned error -5.
[   26.389724] ip6_tables: (C) 2000-2006 Netfilter Core Team
[   28.107978] sd 0:1:0:1: [sdb] tag#4 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[   28.108426] sd 0:1:0:1: [sdb] tag#4 Sense Key : Aborted Command [current] 
[   28.108642] sd 0:1:0:1: [sdb] tag#4 Add. Sense: Information unit iuCRC error detected
[   28.108649] sd 0:1:0:1: [sdb] tag#4 CDB: Read(10) 28 00 07 1c 18 c0 00 00 08 00
[   28.108880] blk_update_request: I/O error, dev sdb, sector 119281856
[   32.107158] perf: interrupt took too long (3782 > 2500), lowering kernel.perf_event_max_sample_rate to 52750
[   32.757273] sd 0:1:0:1: [sdb] tag#4 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[   32.757512] sd 0:1:0:1: [sdb] tag#4 Sense Key : Aborted Command [current] 
[   32.757741] sd 0:1:0:1: [sdb] tag#4 Add. Sense: Information unit iuCRC error detected
[   32.757746] sd 0:1:0:1: [sdb] tag#4 CDB: Read(10) 28 00 08 b7 18 20 00 00 20 00
[   32.757749] blk_update_request: I/O error, dev sdb, sector 146216992
[   32.793044] XFS (sdb3): metadata I/O error: block 0x19d2820 ("xfs_trans_read_buf_map") error 5 numblks 
32
[   32.846378] XFS (sdb3): xfs_imap_to_bp: xfs_trans_read_buf() returned error -5.
[   33.262854] perf: interrupt took too long (6974 > 4727), lowering kernel.perf_event_max_sample_rate to 
28500
[   33.911284] hpsa 0000:08:00.0: Acknowledging event: 0xc0000000 (HP SSD Smart Path configuration change)
[   34.214012] hpsa 0000:08:00.0: scsi 0:1:0:0: updated Direct-Access     HP       LOGICAL VOLUME   RAID-0
 SSDSmartPathCap+ En+ Exp=1
[   34.214017] hpsa 0000:08:00.0: scsi 0:1:0:1: updated Direct-Access     HP       LOGICAL VOLUME   RAID-0
 SSDSmartPathCap+ En+ Exp=1
[   35.043374] sd 0:1:0:1: [sdb] tag#8 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[   35.043379] sd 0:1:0:1: [sdb] tag#8 Sense Key : Aborted Command [current] 
[   35.043383] sd 0:1:0:1: [sdb] tag#8 Add. Sense: Information unit iuCRC error detected
[   35.043387] sd 0:1:0:1: [sdb] tag#8 CDB: Read(10) 28 00 07 6e 0e e8 00 00 20 00
[   35.043389] blk_update_request: I/O error, dev sdb, sector 124653288
[   40.350247] sd 0:1:0:1: [sdb] tag#6 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[   40.350251] sd 0:1:0:1: [sdb] tag#6 Sense Key : Aborted Command [current] 
[   40.350255] sd 0:1:0:1: [sdb] tag#6 Add. Sense: Information unit iuCRC error detected
[   40.350263] sd 0:1:0:1: [sdb] tag#6 CDB: Read(10) 28 00 0d a6 46 a0 00 00 20 00
[   40.350265] blk_update_request: I/O error, dev sdb, sector 229000864
[   40.385820] XFS (sdb3): metadata I/O error: block 0x68c56a0 ("xfs_trans_read_buf_map") error 5 numblks 
32
[   40.439088] XFS (sdb3): xfs_imap_to_bp: xfs_trans_read_buf() returned error -5.
[   73.164231] sd 0:1:0:1: [sdb] tag#10 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[   73.164235] sd 0:1:0:1: [sdb] tag#10 Sense Key : Aborted Command [current] 
[   73.164240] sd 0:1:0:1: [sdb] tag#10 Add. Sense: Information unit iuCRC error detected
[   73.164245] sd 0:1:0:1: [sdb] tag#10 CDB: Read(10) 28 00 07 50 1b c8 00 00 20 00
[   73.164248] blk_update_request: I/O error, dev sdb, sector 122690504
[   79.703334] sd 0:1:0:1: [sdb] tag#19 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[   79.703339] sd 0:1:0:1: [sdb] tag#19 Sense Key : Aborted Command [current] 
[   79.703343] sd 0:1:0:1: [sdb] tag#19 Add. Sense: Information unit iuCRC error detected
[   79.703347] sd 0:1:0:1: [sdb] tag#19 CDB: Read(10) 28 00 07 57 27 18 00 00 40 00
[   79.703349] blk_update_request: I/O error, dev sdb, sector 123152152
[   79.797793] sd 0:1:0:1: [sdb] tag#6 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[   79.797796] sd 0:1:0:1: [sdb] tag#6 Sense Key : Aborted Command [current] 
[   79.797800] sd 0:1:0:1: [sdb] tag#6 Add. Sense: Information unit iuCRC error detected
[   79.797803] sd 0:1:0:1: [sdb] tag#6 CDB: Write(10) 2a 00 05 67 0a 00 00 02 00 00
[   79.797806] blk_update_request: I/O error, dev sdb, sector 90638848
[   89.717368] sd 0:1:0:1: [sdb] tag#24 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[   89.717372] sd 0:1:0:1: [sdb] tag#24 Sense Key : Aborted Command [current] 
[   89.717376] sd 0:1:0:1: [sdb] tag#24 Add. Sense: Information unit iuCRC error detected
[   89.717380] sd 0:1:0:1: [sdb] tag#24 CDB: Read(10) 28 00 07 54 ff c8 00 00 20 00
[   89.717382] blk_update_request: I/O error, dev sdb, sector 123011016
[   89.717382] blk_update_request: I/O error, dev sdb, sector 123011016
[   93.730369] perf: interrupt took too long (8848 > 8717), lowering kernel.perf_event_max_sample_rate to 
22500
[   96.904643] perf: interrupt took too long (11792 > 11060), lowering kernel.perf_event_max_sample_rate t
o 16750
[  113.076048] perf: interrupt took too long (14979 > 14740), lowering kernel.perf_event_max_sample_rate t
o 13250
[  113.386923] INFO: NMI handler (ghes_notify_nmi) took too long to run: 1.258 msecs
[  145.946025] XFS (sdb3): Unmounting Filesystem
[  163.617400] perf: interrupt took too long (20246 > 18723), lowering kernel.perf_event_max_sample_rate t
o 9750
[  264.669576] perf: interrupt took too long (26056 > 25307), lowering kernel.perf_event_max_sample_rate t
o 7500
[  316.150222] perf: interrupt took too long (33209 > 32570), lowering kernel.perf_event_max_sample_rate t
o 6000
[  432.176977] perf: interrupt took too long (43791 > 41511), lowering kernel.perf_event_max_sample_rate t
o 4500
[  682.427160] perf: interrupt took too long (54762 > 54738), lowering kernel.perf_event_max_sample_rate t
o 3500
[  796.533126] perf: interrupt took too long (71024 > 68452), lowering kernel.perf_event_max_sample_rate t
o 2750
[ 1085.764894] perf: interrupt took too long (89408 > 88780), lowering kernel.perf_event_max_sample_rate t
o 2000
[ 1376.715120] perf: interrupt took too long (111995 > 111760), lowering kernel.perf_event_max_sample_rate
 to 1750
[ 1945.956015] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 1.243 msecs
[ 2730.848907] perf: interrupt took too long (140897 > 139993), lowering kernel.perf_event_max_sample_rate
 to 1250
[ 2981.343975] INFO: NMI handler (ghes_notify_nmi) took too long to run: 1.414 msecs
[ 5057.478109] perf: interrupt took too long (176876 > 176121), lowering kernel.perf_event_max_sample_rate
 to 1000
[ 6441.538266] INFO: NMI handler (ghes_notify_nmi) took too long to run: 1.531 msecs
[ 7145.696648] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 1.372 msecs
[31764.090192] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 1.736 msecs
[31974.520758] perf: interrupt took too long (221870 > 221095), lowering kernel.perf_event_max_sample_rate
 to 750
[59233.752626] Process accounting resumed
[59752.857307] INFO: NMI handler (ghes_notify_nmi) took too long to run: 1.594 msecs

Event Timeline

This happened again on thanos-be1003 - where the sdm1 filesystem had errors and was unmounted, though no notifications specific notifications were issued. I've remounted the filesystem for now though

[20542046.608276] megaraid_sas 0000:3b:00.0: 2612 (698897181s/0x0002/FATAL) - Unrecoverable medium error during recovery on PD 0a(e0x20/s10) at 2c34ef79
[20542047.143016] sd 0:2:12:0: [sdm] tag#275 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[20542050.786273] megaraid_sas 0000:3b:00.0: 2613 (698897182s/0x0001/FATAL) - Uncorrectable medium error logged for VD 0c/b at 2c34ef79 (on PD 0a(e0x20/s10) at 2c34ef79)
[20542050.790018] sd 0:2:12:0: [sdm] tag#712 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[20542050.796914] sd 0:2:12:0: [sdm] tag#454 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[20542050.799423] sd 0:2:12:0: [sdm] tag#767 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[20542050.808946] sd 0:2:12:0: [sdm] tag#503 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[20542050.813917] sd 0:2:12:0: [sdm] tag#419 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[20542050.813928] sd 0:2:12:0: [sdm] tag#419 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[20542050.813932] sd 0:2:12:0: [sdm] tag#419 Sense Key : Medium Error [current] 
[20542050.813936] sd 0:2:12:0: [sdm] tag#419 Add. Sense: No additional sense information
[20542050.813940] sd 0:2:12:0: [sdm] tag#419 CDB: Read(16) 88 00 00 00 00 00 2c 34 ef 78 00 00 00 08 00 00
[20542050.813943] print_req_error: I/O error, dev sdm, sector 741666680
[20542050.820367] XFS (sdm1): metadata I/O error in "xfs_trans_read_buf_map" at daddr 0x2c34e778 len 8 error 5
[20542050.831178] sd 0:2:12:0: [sdm] tag#862 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[20542050.837820] sd 0:2:12:0: [sdm] tag#346 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[20542050.841866] sd 0:2:12:0: [sdm] tag#106 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[20542050.848251] sd 0:2:12:0: [sdm] tag#426 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[20542050.857233] sd 0:2:12:0: [sdm] tag#346 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[20542050.859415] sd 0:2:12:0: [sdm] tag#426 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[20542050.859420] sd 0:2:12:0: [sdm] tag#426 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[20542050.859423] sd 0:2:12:0: [sdm] tag#426 Sense Key : Medium Error [current] 
[20542050.859424] sd 0:2:12:0: [sdm] tag#426 Add. Sense: No additional sense information
[20542050.859426] sd 0:2:12:0: [sdm] tag#426 CDB: Read(16) 88 00 00 00 00 00 2c 34 ef 78 00 00 00 08 00 00
[20542050.859428] print_req_error: I/O error, dev sdm, sector 741666680
[20542050.865838] XFS (sdm1): metadata I/O error in "xfs_trans_read_buf_map" at daddr 0x2c34e778 len 8 error 5
[20545298.936324] XFS (sdm1): Unmounting Filesystem