
investigate new restbase machine disks timeouts
Closed, ResolvedPublic

Description

restbase1007 reported errors on sdb. To confirm, I'm running stress --hdd 100 --hdd-bytes 100M on /var/tmp (raid0), and sdb shows consistently higher latencies than sda:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    2.99   96.08    0.00    0.93

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  210.00     0.00 106664.00  1015.85    82.78  380.72    0.00  380.72   4.76 100.00
sda               0.00    78.00    3.00  161.00    12.00 76140.00   928.68    41.11  139.59   76.00  140.77   3.07  50.40

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    1.62   98.38    0.00    0.00

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     1.00    0.00  255.00     0.00 130560.00  1024.00   138.72  527.89    0.00  527.89   3.92 100.00
sda               0.00     0.00    5.00  314.00    20.00 160768.00  1008.08    42.56  186.90   16.80  189.61   2.86  91.20

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    1.35   98.65    0.00    0.00

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  223.00     0.00 114176.00  1024.00   139.05  587.68    0.00  587.68   4.48 100.00
sda               0.00     0.00    2.00  240.00     8.00 122880.00  1015.60    20.58   89.90   18.00   90.50   3.34  80.80

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.07    0.00    1.28   98.66    0.00    0.00

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  224.00     0.00 114688.00  1024.00   140.65  589.96    0.00  589.96   4.46 100.00
sda               0.00     0.00    3.00  238.00    12.00 121856.00  1011.35    21.48   89.15    0.00   90.27   3.07  74.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    1.47   98.53    0.00    0.00

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     1.00    0.00  255.00     0.00 130560.00  1024.00   136.62  558.04    0.00  558.04   3.92 100.00
sda               0.00     0.00    8.00  229.00    32.00 117248.00   989.70    24.83   94.97   34.50   97.08   4.05  96.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.07    0.00    2.67   97.26    0.00    0.00

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00    41.00    0.00  295.00     0.00 149680.00  1014.78    63.05  359.63    0.00  359.63   3.39 100.00
sda               0.00     0.00    8.00  215.00    32.00 110080.00   987.55    20.75   98.78   60.50  100.20   2.85  63.60
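For reference, a rough sketch of how this comparison can be reproduced (not copied verbatim from the box; assumes the stress and sysstat packages are installed):

# generate write load on the raid0 mount, then watch per-device latency
cd /var/tmp
stress --hdd 100 --hdd-bytes 100M &   # 100 hdd workers, each writing/removing ~100M of temp data in the cwd
iostat -x sda sdb 5                   # extended stats every 5s; compare await and %util per device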

Event Timeline

fgiunchedi claimed this task.
fgiunchedi raised the priority of this task from to Medium.
fgiunchedi updated the task description. (Show Details)
fgiunchedi added projects: RESTBase, acl*sre-team.
fgiunchedi added subscribers: gerritbot, Aklapper, GWicke and 3 others.

@Cmjohnson sdb seems replacement-worthy (sort of DOA?), what do you think?

fgiunchedi added a project: ops-eqiad.
fgiunchedi set Security to None.

@Cmjohnson I'd like to test a theory re: sdb: can you swap sda and sdb? I'd like to see if the error moves too.

[14908.351693] ata2.00: exception Emask 0x0 SAct 0x3fff000 SErr 0x0 action 0x6 frozen
[14908.360160] ata2.00: failed command: WRITE FPDMA QUEUED
[14908.366001] ata2.00: cmd 61/00:60:00:6c:2b/04:00:0a:00:00/40 tag 12 ncq 524288 out
         res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[14908.382691] ata2.00: status: { DRDY }
[14908.386778] ata2.00: failed command: WRITE FPDMA QUEUED
[14908.392604] ata2.00: cmd 61/00:68:00:70:2b/04:00:0a:00:00/40 tag 13 ncq 524288 out
         res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[14908.409293] ata2.00: status: { DRDY }
[14908.413378] ata2.00: failed command: WRITE FPDMA QUEUED
[14908.419212] ata2.00: cmd 61/00:70:00:74:2b/04:00:0a:00:00/40 tag 14 ncq 524288 out
         res 40/00:01:06:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[14908.435900] ata2.00: status: { DRDY }
[14908.439977] ata2.00: failed command: WRITE FPDMA QUEUED
[14908.445809] ata2.00: cmd 61/00:78:00:78:2b/04:00:0a:00:00/40 tag 15 ncq 524288 out
         res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[14908.462495] ata2.00: status: { DRDY }
[14908.466580] ata2.00: failed command: WRITE FPDMA QUEUED
[14908.472416] ata2.00: cmd 61/00:80:00:7c:2b/04:00:0a:00:00/40 tag 16 ncq 524288 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[14908.489103] ata2.00: status: { DRDY }
[14908.493187] ata2.00: failed command: WRITE FPDMA QUEUED
[14908.499015] ata2.00: cmd 61/d0:88:00:80:2b/02:00:0a:00:00/40 tag 17 ncq 368640 out
         res 40/00:01:06:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[14908.515702] ata2.00: status: { DRDY }
[14908.519785] ata2.00: failed command: WRITE FPDMA QUEUED
[14908.525618] ata2.00: cmd 61/10:90:f0:0b:2a/04:00:0a:00:00/40 tag 18 ncq 532480 out
         res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[14908.542304] ata2.00: status: { DRDY }
[14908.546388] ata2.00: failed command: WRITE FPDMA QUEUED
[14908.552220] ata2.00: cmd 61/00:98:00:10:2a/04:00:0a:00:00/40 tag 19 ncq 524288 out
         res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[14908.568904] ata2.00: status: { DRDY }
[14908.572988] ata2.00: failed command: WRITE FPDMA QUEUED
[14908.578819] ata2.00: cmd 61/00:a0:00:14:2a/04:00:0a:00:00/40 tag 20 ncq 524288 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[14908.595504] ata2.00: status: { DRDY }
[14908.599589] ata2.00: failed command: WRITE FPDMA QUEUED
[14908.605418] ata2.00: cmd 61/00:a8:00:18:2a/04:00:0a:00:00/40 tag 21 ncq 524288 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[14908.622104] ata2.00: status: { DRDY }
[14908.626186] ata2.00: failed command: WRITE FPDMA QUEUED
[14908.632017] ata2.00: cmd 61/00:b0:00:1c:2a/04:00:0a:00:00/40 tag 22 ncq 524288 out
         res 40/00:01:06:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[14908.648700] ata2.00: status: { DRDY }
[14908.652783] ata2.00: failed command: WRITE FPDMA QUEUED
[14908.658612] ata2.00: cmd 61/90:b8:98:75:03/00:00:3c:00:00/40 tag 23 ncq 73728 out
         res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[14908.675200] ata2.00: status: { DRDY }
[14908.679283] ata2.00: failed command: WRITE FPDMA QUEUED
[14908.685114] ata2.00: cmd 61/01:c0:08:08:00/00:00:00:00:00/40 tag 24 ncq 512 out
         res 40/00:01:06:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[14908.701509] ata2.00: status: { DRDY }
[14908.705593] ata2.00: failed command: WRITE FPDMA QUEUED
[14908.711423] ata2.00: cmd 61/08:c8:d0:f3:39/00:00:0d:00:00/40 tag 25 ncq 4096 out
         res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[14908.727914] ata2.00: status: { DRDY }
[14908.732000] ata2: hard resetting link
[14914.089436] ata2: link is slow to respond, please be patient (ready=0)
[14918.735718] ata2: COMRESET failed (errno=-16)
[14918.740588] ata2: hard resetting link
[14924.097707] ata2: link is slow to respond, please be patient (ready=0)
[14928.743965] ata2: COMRESET failed (errno=-16)
[14928.748825] ata2: hard resetting link
[14934.109942] ata2: link is slow to respond, please be patient (ready=0)
[14963.782766] ata2: COMRESET failed (errno=-16)
[14963.787636] ata2: limiting SATA link speed to 3.0 Gbps
[14963.787639] ata2: hard resetting link
[14968.808903] ata2: COMRESET failed (errno=-16)
[14968.813775] ata2: reset failed, giving up
[14968.818255] ata2.00: disabled
[14968.818265] ata2.00: device reported invalid CHS sector 0
[14968.818272] ata2.00: device reported invalid CHS sector 0
[14968.818276] ata2.00: device reported invalid CHS sector 0
[14968.818287] ata2.00: device reported invalid CHS sector 0
[14968.818296] ata2.00: device reported invalid CHS sector 0
[14968.818318] ata2: EH complete
[14968.818361] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[14968.818364] sd 1:0:0:0: [sdb] CDB: 
[14968.818366] Write(10): 2a 00 0d 39 f3 d0 00 00 08 00
[14968.818378] blk_update_request: I/O error, dev sdb, sector 221901776
[14968.818809] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[14968.818811] sd 1:0:0:0: [sdb] CDB: 
[14968.818817] Read(10): 28 00 00 05 d8 80 00 00 08 00
[14968.818819] blk_update_request: I/O error, dev sdb, sector 383104
[14968.818822] md/raid1:md0: sdb1: rescheduling sector 348288
[14968.838396] EXT4-fs warning (device dm-0): ext4_end_bio:317: I/O error -5 writing to inode 66198158 (offset 0 size 0 starting block 40272890)
[14968.838400] Buffer I/O error on device dm-0, logical block 40272890
[14968.845396] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[14968.845399] sd 1:0:0:0: [sdb] CDB: 
[14968.845400] Write(10): 2a 00 00 00 08 08 00 00 01 00
[14968.845408] blk_update_request: I/O error, dev sdb, sector 2056
[14968.847120] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[14968.847121] sd 1:0:0:0: [sdb] CDB: 
[14968.847126] Read(10): 28 00 43 87 ff d8 00 00 08 00
[14968.847128] blk_update_request: I/O error, dev sdb, sector 1132986328
[14968.847204] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[14968.847206] sd 1:0:0:0: [sdb] CDB: 
[14968.847211] Read(10): 28 00 43 87 ff d8 00 00 08 00
[14968.847213] blk_update_request: I/O error, dev sdb, sector 1132986328
[14968.849270] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[14968.849271] sd 1:0:0:0: [sdb] CDB: 
[14968.849276] Read(10): 28 00 43 87 ff d8 00 00 08 00
[14968.849278] blk_update_request: I/O error, dev sdb, sector 1132986328
[14968.849361] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[14968.849363] sd 1:0:0:0: [sdb] CDB: 
[14968.849368] Read(10): 28 00 43 87 ff d8 00 00 08 00
[14968.849370] blk_update_request: I/O error, dev sdb, sector 1132986328
[14968.849814] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[14968.849816] sd 1:0:0:0: [sdb] CDB: 
[14968.849822] Read(10): 28 00 43 87 ff d8 00 00 08 00
[14968.849824] blk_update_request: I/O error, dev sdb, sector 1132986328
[14968.849894] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[14968.849896] sd 1:0:0:0: [sdb] CDB: 
[14968.849901] Read(10): 28 00 43 87 ff d8 00 00 08 00
[14968.849903] blk_update_request: I/O error, dev sdb, sector 1132986328
[14968.895140] md: super_written gets error=-5, uptodate=0
[14968.895143] md/raid1:md0: Disk failure on sdb1, disabling device.
md/raid1:md0: Operation continuing on 1 devices.
[14968.908262] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[14968.908264] sd 1:0:0:0: [sdb] CDB: 
[14968.908265] Write(10): 2a 00 3c 03 75 98 00 00 90 00
[14968.908287] blk_update_request: I/O error, dev sdb, sector 1006859672
[14968.915535] EXT4-fs warning (device dm-0): ext4_end_bio:317: I/O error -5 writing to inode 66199774 (offset 3256872960 size 5632000 starting block 27430528)
[14968.915539] Buffer I/O error on device dm-0, logical block 27430528
[14968.915542] Aborting journal on device dm-0-8.
[14968.915548] EXT4-fs error (device dm-0) in ext4_reserve_inode_write:4737: Journal has aborted
[14968.915550] EXT4-fs error (device dm-0) in ext4_reserve_inode_write:4737: Journal has aborted
[14968.915554] EXT4-fs error (device dm-0) in ext4_reserve_inode_write:4737: Journal has aborted
[14968.915557] EXT4-fs error (device dm-0) in ext4_reserve_inode_write:4737: Journal has aborted
[14968.920870] EXT4-fs error (device dm-0) in ext4_dirty_inode:4856: Journal has aborted
[14968.925653] EXT4-fs error (device dm-0) in ext4_dirty_inode:4856: Journal has aborted
[14968.930506] EXT4-fs error (device dm-0) in ext4_da_write_end:2654: Journal has aborted
[14968.934124] EXT4-fs error (device dm-0) in ext4_dirty_inode:4856: Journal has aborted
[14968.937711] EXT4-fs error (device dm-0) in ext4_da_write_end:2654: Journal has aborted
[14968.941313] EXT4-fs error (device dm-0) in ext4_da_write_end:2654: Journal has aborted
[14968.944909] EXT4-fs error (device dm-0): ext4_journal_check_start:56: Detected aborted journal
[14968.944911] EXT4-fs (dm-0): Remounting filesystem read-only
[14968.952112] EXT4-fs error (device dm-0): ext4_journal_check_start:56: Detected aborted journal
[14968.952293] Buffer I/O error on dev dm-0, logical block 264766098, lost async page write
[14968.952299] Buffer I/O error on dev dm-0, logical block 264798605, lost async page write
[14968.952302] Buffer I/O error on dev dm-0, logical block 264800491, lost async page write
[14968.952305] Buffer I/O error on dev dm-0, logical block 264800938, lost async page write
[14968.952308] Buffer I/O error on dev dm-0, logical block 264801534, lost async page write
[14969.088790] Buffer I/O error on device dm-0, logical block 27430529
[14969.095788] Buffer I/O error on device dm-0, logical block 27430530
[14969.102783] Buffer I/O error on device dm-0, logical block 27430531
[14969.109776] Buffer I/O error on device dm-0, logical block 27430532
[14969.116772] Buffer I/O error on device dm-0, logical block 27430533
[14969.123768] Buffer I/O error on device dm-0, logical block 27430534
[14969.130762] Buffer I/O error on device dm-0, logical block 27430535
[14969.137751] Buffer I/O error on device dm-0, logical block 27430536
[14969.144824] EXT4-fs warning (device dm-0): ext4_end_bio:317: I/O error -5 writing to inode 66199774 (offset 3256872960 size 5632000 starting block 27430272)
[14969.144893] EXT4-fs warning (device dm-0): ext4_end_bio:317: I/O error -5 writing to inode 66199774 (offset 3256872960 size 5632000 starting block 27430016)
[14969.144962] EXT4-fs warning (device dm-0): ext4_end_bio:317: I/O error -5 writing to inode 66199774 (offset 3256872960 size 5632000 starting block 27429760)
[14969.145029] EXT4-fs warning (device dm-0): ext4_end_bio:317: I/O error -5 writing to inode 66199774 (offset 3256872960 size 5632000 starting block 27429374)
[14969.145037] EXT4-fs warning (device dm-0): ext4_end_bio:317: I/O error -5 writing to inode 66199774 (offset 3256872960 size 5632000 starting block 27429504)
[14969.145124] EXT4-fs warning (device dm-0): ext4_end_bio:317: I/O error -5 writing to inode 66203228 (offset 4826075136 size 5607424 starting block 27453312)
[14969.145176] EXT4-fs warning (device dm-0): ext4_end_bio:317: I/O error -5 writing to inode 66203228 (offset 4826075136 size 5607424 starting block 27453056)
[14969.145243] EXT4-fs warning (device dm-0): ext4_end_bio:317: I/O error -5 writing to inode 66203228 (offset 4826075136 size 5607424 starting block 27452800)
[14969.145530] md/raid1:md0: sdb1: rescheduling sector 372856
[14969.153289] md/raid1:md0: redirecting sector 348288 to other mirror: sda1
[14969.166818] md/raid1:md0: redirecting sector 372856 to other mirror: sda1
[14969.174643] RAID1 conf printout:
[14969.174648]  --- wd:1 rd:2
[14969.174652]  disk 0, wo:0, o:1, dev:sda1
[14969.174656]  disk 1, wo:1, o:0, dev:sdb1
[14969.212686] RAID1 conf printout:
[14969.212691]  --- wd:1 rd:2
[14969.212695]  disk 0, wo:0, o:1, dev:sda1
[14982.145973] scsi_io_completion: 176 callbacks suppressed
[14982.145982] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[14982.145985] sd 1:0:0:0: [sdb] CDB: 
[14982.145987] Read(10): 28 00 03 9b e0 08 00 00 08 00
[14982.145995] blk_update_request: 176 callbacks suppressed
[14982.145997] blk_update_request: I/O error, dev sdb, sector 60547080
[14982.154023] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[14982.154027] sd 1:0:0:0: [sdb] CDB: 
[14982.154029] Read(10): 28 00 00 00 08 08 00 00 08 00
[14982.154036] blk_update_request: I/O error, dev sdb, sector 2056
[15088.336337] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[15088.336342] sd 1:0:0:0: [sdb] CDB: 
[15088.336345] Read(10): 28 00 03 9b e0 08 00 00 08 00
[15088.336353] blk_update_request: I/O error, dev sdb, sector 60547080
[15088.344280] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[15088.344283] sd 1:0:0:0: [sdb] CDB: 
[15088.344285] Read(10): 28 00 00 00 08 08 00 00 08 00
[15088.344293] blk_update_request: I/O error, dev sdb, sector 2056
[15196.636154] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[15196.636160] sd 1:0:0:0: [sdb] CDB: 
[15196.636163] Read(10): 28 00 03 9b e0 08 00 00 08 00
[15196.636172] blk_update_request: I/O error, dev sdb, sector 60547080
[15196.644377] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[15196.644383] sd 1:0:0:0: [sdb] CDB: 
[15196.644385] Read(10): 28 00 00 00 08 08 00 00 08 00
[15196.644395] blk_update_request: I/O error, dev sdb, sector 2056
[15311.442495] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[15311.442500] sd 1:0:0:0: [sdb] CDB: 
[15311.442503] Read(10): 28 00 03 9b e0 08 00 00 08 00
[15311.442512] blk_update_request: I/O error, dev sdb, sector 60547080
[15311.450490] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[15311.450494] sd 1:0:0:0: [sdb] CDB: 
[15311.450496] Read(10): 28 00 00 00 08 08 00 00 08 00
[15311.450503] blk_update_request: I/O error, dev sdb, sector 2056
[15416.470708] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[15416.470713] sd 1:0:0:0: [sdb] CDB: 
[15416.470716] Read(10): 28 00 03 9b e0 08 00 00 08 00
[15416.470725] blk_update_request: I/O error, dev sdb, sector 60547080
[15416.478614] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[15416.478618] sd 1:0:0:0: [sdb] CDB: 
[15416.478619] Read(10): 28 00 00 00 08 08 00 00 08 00
[15416.478626] blk_update_request: I/O error, dev sdb, sector 2056
[15520.702441] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[15520.702446] sd 1:0:0:0: [sdb] CDB: 
[15520.702448] Read(10): 28 00 03 9b e0 08 00 00 08 00
[15520.702457] blk_update_request: I/O error, dev sdb, sector 60547080
[15520.710382] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[15520.710386] sd 1:0:0:0: [sdb] CDB: 
[15520.710387] Read(10): 28 00 00 00 08 08 00 00 08 00
[15520.710395] blk_update_request: I/O error, dev sdb, sector 2056
[15607.027018] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[15607.027024] sd 1:0:0:0: [sdb] CDB: 
[15607.027026] Read(10): 28 00 03 ac 12 30 00 00 08 00
[15607.027037] blk_update_request: I/O error, dev sdb, sector 61608496
[15607.034123] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[15607.034129] sd 1:0:0:0: [sdb] CDB: 
[15607.034131] Read(10): 28 00 03 ac 12 30 00 00 08 00
[15607.034141] blk_update_request: I/O error, dev sdb, sector 61608496
[15631.001224] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[15631.001230] sd 1:0:0:0: [sdb] CDB: 
[15631.001232] Read(10): 28 00 03 9b e0 08 00 00 08 00
[15631.001242] blk_update_request: I/O error, dev sdb, sector 60547080
[15631.009246] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[15631.009250] sd 1:0:0:0: [sdb] CDB: 
[15631.009252] Read(10): 28 00 00 00 08 08 00 00 08 00
[15631.009260] blk_update_request: I/O error, dev sdb, sector 2056
[15637.996790] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[15637.996795] sd 1:0:0:0: [sdb] CDB: 
[15637.996798] Read(10): 28 00 42 cd 4f e0 00 00 08 00
[15637.996807] blk_update_request: I/O error, dev sdb, sector 1120751584
[15638.004011] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[15638.004013] sd 1:0:0:0: [sdb] CDB: 
[15638.004015] Read(10): 28 00 42 cd 5e 60 00 00 08 00
[15638.004022] blk_update_request: I/O error, dev sdb, sector 1120755296
[15638.011281] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[15638.011286] sd 1:0:0:0: [sdb] CDB: 
[15638.011288] Read(10): 28 00 42 cd 5e 60 00 00 08 00
[15638.011298] blk_update_request: I/O error, dev sdb, sector 1120755296
[15638.018601] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[15638.018606] sd 1:0:0:0: [sdb] CDB: 
[15638.018608] Read(10): 28 00 42 cd 5e 60 00 00 08 00
[15638.018617] blk_update_request: I/O error, dev sdb, sector 1120755296
[15682.683699] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[15682.683706] sd 1:0:0:0: [sdb] CDB: 
[15682.683709] Read(10): 28 00 05 c3 e7 c0 00 00 40 00
[15682.683719] blk_update_request: I/O error, dev sdb, sector 96724928
[15682.690759] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[15682.690769] sd 1:0:0:0: [sdb] CDB: 
[15682.690771] Read(10): 28 00 05 c3 e7 c0 00 00 08 00
[15682.690780] blk_update_request: I/O error, dev sdb, sector 96724928
[15737.597892] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[15737.597897] sd 1:0:0:0: [sdb] CDB: 
[15737.597899] Read(10): 28 00 03 9b e0 08 00 00 08 00
[15737.597908] blk_update_request: I/O error, dev sdb, sector 60547080
[15737.605848] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[15737.605852] sd 1:0:0:0: [sdb] CDB: 
[15737.605854] Read(10): 28 00 00 00 08 08 00 00 08 00
[15737.605862] blk_update_request: I/O error, dev sdb, sector 2056
[15839.241034] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[15839.241040] sd 1:0:0:0: [sdb] CDB: 
[15839.241043] Read(10): 28 00 03 ac 12 30 00 00 08 00
[15839.241053] blk_update_request: I/O error, dev sdb, sector 61608496
[15844.047944] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[15844.047951] sd 1:0:0:0: [sdb] CDB: 
[15844.047953] Read(10): 28 00 5e cd fc 00 00 00 18 00
[15844.047964] blk_update_request: I/O error, dev sdb, sector 1590557696
[15844.055303] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[15844.055307] sd 1:0:0:0: [sdb] CDB: 
[15844.055309] Read(10): 28 00 5e cd fc 18 00 00 80 00
[15844.055317] blk_update_request: I/O error, dev sdb, sector 1590557720
[15844.062562] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[15844.062565] sd 1:0:0:0: [sdb] CDB: 
[15844.062567] Read(10): 28 00 5e cd fc 00 00 00 08 00
[15844.062575] blk_update_request: I/O error, dev sdb, sector 1590557696
[15844.070570] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[15844.070576] sd 1:0:0:0: [sdb] CDB: 
[15844.070578] Read(10): 28 00 5e cd fc 00 00 00 08 00
[15844.070588] blk_update_request: I/O error, dev sdb, sector 1590557696
[15844.879284] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[15844.879290] sd 1:0:0:0: [sdb] CDB: 
[15844.879292] Read(10): 28 00 03 9b e0 08 00 00 08 00
[15844.879302] blk_update_request: I/O error, dev sdb, sector 60547080
[15844.887318] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[15844.887321] sd 1:0:0:0: [sdb] CDB: 
[15844.887323] Read(10): 28 00 00 00 08 08 00 00 08 00
[15844.887331] blk_update_request: I/O error, dev sdb, sector 2056
[15845.370372] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[15845.370378] sd 1:0:0:0: [sdb] CDB: 
[15845.370381] Read(10): 28 00 5e cd fc e8 00 00 08 00
[15845.370391] blk_update_request: I/O error, dev sdb, sector 1590557928
[15845.377665] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[15845.377670] sd 1:0:0:0: [sdb] CDB: 
[15845.377673] Read(10): 28 00 5e cd fc e8 00 00 08 00
[15845.377683] blk_update_request: I/O error, dev sdb, sector 1590557928
[15848.446628] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[15848.446635] sd 1:0:0:0: [sdb] CDB: 
[15848.446637] Read(10): 28 00 5e cd fc 00 00 00 08 00
[15848.446648] blk_update_request: I/O error, dev sdb, sector 1590557696
[15857.575975] scsi_io_completion: 1 callbacks suppressed
[15857.575984] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[15857.575988] sd 1:0:0:0: [sdb] CDB: 
[15857.575990] Read(10): 28 00 42 c7 cb 60 00 00 08 00
[15857.575999] blk_update_request: 1 callbacks suppressed
[15857.576002] blk_update_request: I/O error, dev sdb, sector 1120389984
[15857.583204] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[15857.583207] sd 1:0:0:0: [sdb] CDB: 
[15857.583208] Read(10): 28 00 42 c7 cb a8 00 00 08 00
[15857.583217] blk_update_request: I/O error, dev sdb, sector 1120390056
[15857.590416] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[15857.590418] sd 1:0:0:0: [sdb] CDB: 
[15857.590419] Read(10): 28 00 42 c7 da 60 00 00 08 00
[15857.590427] blk_update_request: I/O error, dev sdb, sector 1120393824
[15857.597620] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[15857.597622] sd 1:0:0:0: [sdb] CDB: 
[15857.597624] Read(10): 28 00 42 c7 db 40 00 00 08 00
[15857.597631] blk_update_request: I/O error, dev sdb, sector 1120394048
[15857.597684] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[15857.597685] sd 1:0:0:0: [sdb] CDB: 
[15857.597690] Read(10): 28 00 42 c7 da 60 00 00 08 00
[15857.597692] blk_update_request: I/O error, dev sdb, sector 1120393824
[15857.597751] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[15857.597752] sd 1:0:0:0: [sdb] CDB: 
[15857.597756] Read(10): 28 00 42 c7 da 60 00 00 08 00
[15857.597757] blk_update_request: I/O error, dev sdb, sector 1120393824
[15858.607856] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[15858.607862] sd 1:0:0:0: [sdb] CDB: 
[15858.607865] Read(10): 28 00 42 c7 da 60 00 00 08 00
[15858.607874] blk_update_request: I/O error, dev sdb, sector 1120393824
[15858.948842] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[15858.948849] sd 1:0:0:0: [sdb] CDB: 
[15858.948851] Read(10): 28 00 42 c7 da 60 00 00 08 00
[15858.948861] blk_update_request: I/O error, dev sdb, sector 1120393824
[15859.661527] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[15859.661534] sd 1:0:0:0: [sdb] CDB: 
[15859.661536] Read(10): 28 00 42 c7 da 60 00 00 08 00
[15859.661547] blk_update_request: I/O error, dev sdb, sector 1120393824
[15925.920487] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[15925.920493] sd 1:0:0:0: [sdb] CDB: 
[15925.920496] Read(10): 28 00 43 87 ff d0 00 00 08 00
[15925.920506] blk_update_request: I/O error, dev sdb, sector 1132986320
[15925.927731] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[15925.927740] sd 1:0:0:0: [sdb] CDB: 
[15925.927741] Read(10): 28 00 43 87 ff d0 00 00 08 00
[15925.927750] blk_update_request: I/O error, dev sdb, sector 1132986320
[15926.934943] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[15926.934948] sd 1:0:0:0: [sdb] CDB: 
[15926.934950] Read(10): 28 00 42 cd 5e 60 00 00 08 00
[15926.934960] blk_update_request: I/O error, dev sdb, sector 1120755296
[15951.098013] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[15951.098019] sd 1:0:0:0: [sdb] CDB: 
[15951.098021] Read(10): 28 00 03 9b e0 08 00 00 08 00
[15951.098031] blk_update_request: I/O error, dev sdb, sector 60547080
[15951.106030] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[15951.106034] sd 1:0:0:0: [sdb] CDB: 
[15951.106036] Read(10): 28 00 00 00 08 08 00 00 08 00
[15951.106043] blk_update_request: I/O error, dev sdb, sector 2056
[16058.182193] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[16058.182199] sd 1:0:0:0: [sdb] CDB: 
[16058.182202] Read(10): 28 00 03 9b e0 08 00 00 08 00
[16058.182212] blk_update_request: I/O error, dev sdb, sector 60547080
[16058.190329] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[16058.190334] sd 1:0:0:0: [sdb] CDB: 
[16058.190337] Read(10): 28 00 00 00 08 08 00 00 08 00
[16058.190346] blk_update_request: I/O error, dev sdb, sector 2056
[16120.943238] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[16120.943244] sd 1:0:0:0: [sdb] CDB: 
[16120.943247] Read(10): 28 00 42 c7 da 60 00 00 08 00
[16120.943257] blk_update_request: I/O error, dev sdb, sector 1120393824
[16120.950526] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[16120.950530] sd 1:0:0:0: [sdb] CDB: 
[16120.950531] Read(10): 28 00 42 c7 da 60 00 00 08 00
[16120.950540] blk_update_request: I/O error, dev sdb, sector 1120393824
[16121.686477] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[16121.686482] sd 1:0:0:0: [sdb] CDB: 
[16121.686484] Read(10): 28 00 42 c7 da 60 00 00 08 00
[16121.686490] blk_update_request: I/O error, dev sdb, sector 1120393824
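Before the physical swap suggested above, a hedged sketch (illustrative, not from the ticket) for recording which serial number sits behind sda and sdb, so we can tell afterwards whether the errors follow the drive or the slot/cable; assumes smartmontools is installed:

for d in /dev/sda /dev/sdb; do
    echo "== $d =="
    smartctl -i "$d" | grep -E 'Device Model|Serial Number'
done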

These disks were purchased separately from the servers. I will need to talk with @RobH and see if we have any warranties.

The Samsung SSDs normally come with a 5-year warranty from Samsung itself.

On restbase1009 it was sda that failed, ending our sdb breakage streak but bringing our broken new nodes to 100%. It could still be a controller / cabling issue (which a disk swap could rule out), but I think a bad batch of disks is becoming more likely.
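A hedged aside on separating the two theories (illustrative only; assumes smartmontools): UDMA CRC errors in SMART usually point at cabling/backplane rather than the media, whereas reallocated or pending sectors point at the disk itself.

smartctl -A /dev/sda | grep -i -E 'CRC|Reallocated|Pending'   # attribute counters
smartctl -l error /dev/sda | head -n 20                       # recent ATA error log entries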

@Cmjohnson thanks! I'd also like to try swapping disks on, say, restbase1008 and see if the failures follow the disk.

Sadly I'm seeing failures for sda on restbase1008 too. @Cmjohnson, perhaps a cable check is in order; I don't think the swap will tell us much at this point.

[274243.031383] blk_update_request: I/O error, dev sda, sector 58593288
[274243.038607] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[274243.038613] sd 1:0:0:0: [sdb] CDB: 
[274243.038615] Read(10): 28 00 03 7e 10 08 00 00 08 00
[274243.038625] blk_update_request: I/O error, dev sdb, sector 58593288
[274243.047035] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[274243.047041] sd 0:0:0:0: [sda] CDB: 
[274243.047044] Read(10): 28 00 00 00 08 08 00 00 08 00
[274243.047054] blk_update_request: I/O error, dev sda, sector 2056
[274243.057910] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[274243.057915] sd 0:0:0:0: [sda] CDB: 
[274243.057918] Read(10): 28 00 03 9b e0 08 00 00 08 00
[274243.057927] blk_update_request: I/O error, dev sda, sector 60547080
[274243.065149] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[274243.065154] sd 1:0:0:0: [sdb] CDB: 
[274243.065156] Read(10): 28 00 03 9b e0 08 00 00 08 00
[274243.065166] blk_update_request: I/O error, dev sdb, sector 60547080
[274243.073576] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[274243.073582] sd 0:0:0:0: [sda] CDB: 
[274243.073584] Read(10): 28 00 03 7e 10 08 00 00 08 00
[274243.073595] blk_update_request: I/O error, dev sda, sector 58593288
[274243.080815] sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[274243.080820] sd 1:0:0:0: [sdb] CDB:
fgiunchedi renamed this task from "investigate restbase1007 sdb failure" to "investigate new restbase machine disks timeouts". Jun 22 2015, 10:24 PM

@Cmjohnson, change of plans! We should try with spare big SSDs first and see if the failures come back; please swap the current SSDs on restbase1008 with two spare big (>=500G) SSDs.

So we cannot send these back to our vendor (HP) for coverage. They have to go back to Samsung via the normal consumer process: their website, weeks of mailing, and waiting for repaired/replacement disks to arrive.

We've not done this before, so we'll have to look up the process on their website. We'll also have to provide proof of purchase, so we can create another task (with private permissions) to get that invoice from accounting.

I haven't followed all of the lead-up to this hardware deployment closely, but I was under the impression we already had a well-tested, reliable solution for high-performance SSD drives in scalable clusters that we've been happy with for some time: the Intel S3700 series. We have ~200 of those drives in production in the edge cache machines alone, and I don't know how many elsewhere. Why on earth did we switch to these Samsungs here?

@BBlack, we also have 18x1T Samsung 840 Evos in the Cassandra cluster, which have worked well so far and cost significantly less per TB (about 4x) than the Intels. This matters at the storage capacities we anticipate needing for whole-history storage.

@RobH, could we order a set of Evo 850 spares from a different batch, so that we can process the warranty asynchronously?

@fgiunchedi I have swapped the 2 disks with 800GB Intel S3700 SSDs.

> @BBlack, we also have 18x1T Samsung 840 Evos in the Cassandra cluster, which have worked well so far and cost significantly less per TB (about 4x) than the Intels. This matters at the storage capacities we anticipate needing for whole-history storage.

Worked well? We have zero-to-little data on that in the big picture. There are a lot of subtleties to picking a good SSD in terms of long-term performance for the workload and durability, related to how the vendor's internal over-provisioning and firmware algorithms work. I'm hearing stuff now about these drives and TRIM issues, Linux blacklisting, etc. Then there are the failure reports above.

Even if you don't care about the delay of resolving these issues in system config / kernel support, or about the unproven risks in their long-term use, it's basically choosing to offset a disk cost reduction with an increase in ops labor/time/pain (vs. the known-good solution).
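A hedged aside on the TRIM concerns (illustrative; assumes util-linux and hdparm are installed): one quick way to see what the kernel currently exposes for discard on a given drive, and whether the drive itself advertises TRIM.

lsblk -D /dev/sdb                     # DISC-GRAN/DISC-MAX of 0 means no discard support exposed
hdparm -I /dev/sdb | grep -i trim     # drive-reported TRIM capability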

I am going to RMA the 2 Samsung disks from restbase1008

@Cmjohnson we'll run a few tests today/tomorrow to rule out the disks vs. other components, btw; I'm not 100% sure the disks are faulty yet.

Update from T102015: we were able to reproduce the bootstrapping failures even with the Intel disks; those, however, haven't timed out or failed so far, so it does look like the disks are at fault for the timeouts here. We've observed similar timeouts/failures on other machines; I wonder if it makes sense to RMA all the disks? Also, I wonder what happens if they are not RMA-able for some reason.

@fgiunchedi, I think with 4/6 disks explicitly broken it's pretty clear that the entire batch of Samsung disks is DOA. I'd RMA them all, and perhaps order a new set plus 1-2 spares to speed up the process. Once the RMA has gone through, we'll then have 6+ spare large SSDs, some of which we can use towards the next expansion round.

@RobH, @fgiunchedi: Sorry if you sent a link already, but is there a procurement ticket for the SSDs? What is the status on that order?

https://rt.wikimedia.org/Ticket/Display.html?id=9473 tracks the order. Summary: 3 of the 6 are at eqiad, and the other 3 will be onsite either tomorrow or Friday.

@RobH, thanks!

@fgiunchedi, @Cmjohnson: It might be worth installing and possibly testing those new SSDs in one of the boxes that currently have broken disks. For testing we'd have to write 400+G, based on how the previous disks failed. I could give that a try, possibly using a tool like https://github.com/ncw/stressdisk.
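For completeness, a hedged sketch of how such a stressdisk run might look (subcommands and flags should be double-checked against the tool's own help):

# install per the project README (release binary or go toolchain), then:
stressdisk run /var/tmp       # fills the filesystem with check files, then reads them back and verifies
stressdisk clean /var/tmp     # removes the check files afterwards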

The new Samsung Pros have been added to restbase1007 and restbase1009. restbase1008 has the Intels.

I've reimaged 1007 and 1009, currently running stressdisk on /var/tmp

No errors were shown after 25h of testing (fill up to 100%, then read back the contents). The bytes read/written per second figures are skewed because they are calculated over the total elapsed time, whereas the first phase is only writes and the second phase is all reads:

2015/07/28 12:42:45 
Bytes read:      70298828 MByte ( 764.63 MByte/s)
Bytes written:    1769154 MByte (  19.24 MByte/s)
Errors:                 0
Elapsed time:  25h32m18.933220034s

2015/07/28 12:42:45 PASSED with no errors
restbase1007:~$
2015/07/28 12:42:53 
Bytes read:      77817580 MByte ( 849.08 MByte/s)
Bytes written:    1769160 MByte (  19.30 MByte/s)
Errors:                 0
Elapsed time:  25h27m29.665311837s

2015/07/28 12:42:53 PASSED with no errors
restbase1009:~$
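To make the skew mentioned above concrete (illustrative arithmetic only): both figures are total bytes divided by total elapsed time, so the write rate understates the throughput of the write-only fill phase. For restbase1007, 25h32m18s is about 91939 seconds:

echo 'scale=2; 1769154 / 91939' | bc    # ~19.24 MByte/s, matching the reported write figure
echo 'scale=2; 70298828 / 91939' | bc   # ~764.6 MByte/s, in line with the reported read figure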

> I am going to RMA the 2 Samsung disks from restbase1008

@Cmjohnson the new SSDs seem to be working fine from our testing; you can RMA the first batch of SSDs we got.

@Cmjohnson also, please swap the Intel SSDs on restbase1008 with the new ones we got last, thanks!

Swapped the SSDs in restbase1008 -- will create a new task to RMA the other SSDs.