
Broken disk on aqs1001.eqiad.wmnet
Closed, Resolved · Public

Description

Hi!

aqs1001.eqiad.wmnet's dmesg looks like this:

[4255326.341853] sd 0:0:7:0: [sdh] FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
[4255326.341861] sd 0:0:7:0: [sdh] CDB:
[4255326.341864] Write(10): 2a 00 00 00 08 30 00 00 08 00
[4255326.341876] blk_update_request: I/O error, dev sdh, sector 2096
[4255326.348697] md: super_written gets error=-5, uptodate=0
[4255326.348706] md/raid10:md2: Disk failure on sdh1, disabling device.
md/raid10:md2: Operation continuing on 11 devices.
[4255326.362540] sd 0:0:7:0: [sdh] FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
[4255326.362543] sd 0:0:7:0: [sdh] CDB:
[4255326.362545] Write(10): 2a 00 00 00 08 20 00 00 08 00
[4255326.362554] blk_update_request: I/O error, dev sdh, sector 2080
[4255326.369376] md: super_written gets error=-5, uptodate=0
[4255355.290999] RAID10 conf printout:
[4255355.291007]  --- wd:11 rd:12
[4255355.291012]  disk 0, wo:0, o:1, dev:sda3
[4255355.291015]  disk 1, wo:0, o:1, dev:sdb3
[4255355.291018]  disk 2, wo:0, o:1, dev:sdc1
[4255355.291021]  disk 3, wo:0, o:1, dev:sdd1
[4255355.291023]  disk 4, wo:0, o:1, dev:sde1
[4255355.291026]  disk 5, wo:0, o:1, dev:sdf1
[4255355.291029]  disk 6, wo:0, o:1, dev:sdg1
[4255355.291031]  disk 7, wo:1, o:0, dev:sdh1
[4255355.291034]  disk 8, wo:0, o:1, dev:sdi1
[4255355.291036]  disk 9, wo:0, o:1, dev:sdj1
[4255355.291039]  disk 10, wo:0, o:1, dev:sdk1
[4255355.291041]  disk 11, wo:0, o:1, dev:sdl1
[4255355.303817] RAID10 conf printout:
[4255355.303823]  --- wd:11 rd:12
[4255355.303826]  disk 0, wo:0, o:1, dev:sda3
[4255355.303828]  disk 1, wo:0, o:1, dev:sdb3
[4255355.303829]  disk 2, wo:0, o:1, dev:sdc1
[4255355.303830]  disk 3, wo:0, o:1, dev:sdd1
[4255355.303832]  disk 4, wo:0, o:1, dev:sde1
[4255355.303833]  disk 5, wo:0, o:1, dev:sdf1
[4255355.303835]  disk 6, wo:0, o:1, dev:sdg1
[4255355.303836]  disk 8, wo:0, o:1, dev:sdi1
[4255355.303838]  disk 9, wo:0, o:1, dev:sdj1
[4255355.303839]  disk 10, wo:0, o:1, dev:sdk1
[4255355.303840]  disk 11, wo:0, o:1, dev:sdl1
[4300246.061109] Process accounting resumed
[4386640.568324] Process accounting resumed
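The log above shows md marking sdh1 failed and continuing on 11 of 12 devices (wd:11 rd:12). On the host itself, `cat /proc/mdstat` or `mdadm --detail /dev/md2` confirms which member dropped out; as a minimal sketch, the failed member can be pulled out of the mdstat line like this (the sample line below is a hypothetical reconstruction of md2's state, not captured output):

```shell
# Hypothetical /proc/mdstat line for md2 after the failure; on the real
# host this would come from: grep '^md2' /proc/mdstat
mdstat='md2 : active raid10 sdl1[11] sdk1[10] sdj1[9] sdi1[8] sdh1[7](F) sdg1[6] sdf1[5] sde1[4] sdd1[3] sdc1[2] sdb3[1] sda3[0]'

# Members that md has flagged as failed carry an "(F)" suffix
failed=$(echo "$mdstat" | grep -o '[a-z]*[0-9]*\[[0-9]*\](F)' | cut -d'[' -f1)
echo "$failed"   # sdh1
```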

We'd need to replace the disk if possible.

Thanks!

Luca

Event Timeline

elukey created this task. Mar 24 2016, 9:42 AM

BTW, we will be replacing these nodes soon, see T124947.

Not sure how soon we'll get those in, or if the disk will need to be replaced before we do.

The server is out of warranty, but I do have spare disks on-site if you want to replace it. I do see your request for servers, but it could be some time before they arrive. No downtime is needed, but IIRC you have these in SW RAID, so I may need help identifying the physical drive.
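Since the controller exposes these disks as JBOD, the SCSI address in dmesg is one way to map the kernel name to a physical slot. A hedged sketch follows; the enclosure ID 32 is taken from the PhysDrv[32:7] notation used elsewhere on this controller, and should be verified with `megacli -EncInfo -a0` before blinking anything:

```shell
# dmesg names the failing disk "sd 0:0:7:0 [sdh]": host:bus:target:lun.
# On a JBOD backplane the target ID typically matches the physical slot.
scsi_addr='0:0:7:0'
slot=$(echo "$scsi_addr" | cut -d: -f3)
echo "$slot"    # 7

# Blink the slot LED for the on-site tech (enclosure ID 32 assumed):
#   megacli -PdLocate -start -physdrv[32:7] -a0
#   megacli -PdLocate -stop  -physdrv[32:7] -a0
# Cross-check via the drive serial before pulling it:
#   smartctl -i /dev/sdh | grep -i serial
```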

@Cmjohnson: we decided to go ahead and replace the disk if possible! If needed, we'll help identify the drive (@Ottomata might be the one, due to timezone constraints).

Thanks!

I know we're supposed to convert these to SSDs soon, but I would sleep a lot easier if we fixed the disk. If another one fails we'll lose a lot of data and have to backfill for months.

Ja let’s do this. @cmjohnson1 ja?!

faidon triaged this task as High priority. Apr 7 2016, 1:55 PM

Chris reached out to me; he will be in the datacenter on Tuesday when I am online (he’s off today and tomorrow, I’m off Monday).

@Cmjohnson has swapped the disk. Faidon helped get the device to show up by running:

megacli -CfgForeign -Scan -a0
There are 1 foreign configuration(s) on controller 0.
...
megacli -CfgForeign -Clear -a0
Foreign configuration 0 is cleared on controller 0.

I then formatted /dev/sdh1 as Linux RAID and added it to the array via:

mdadm --manage /dev/md2 --add /dev/sdh1
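For anyone retracing this: the "formatted as Linux RAID" step is commonly done by cloning the partition table from a healthy member, and the --add then triggers an automatic resync visible in /proc/mdstat. A sketch of both, where the sfdisk source disk and the progress line are assumptions rather than captured output from this host:

```shell
# Clone the partition layout from a healthy member onto the new disk
# (assumes sdg shares the intended layout; destructive on sdh):
#   sfdisk -d /dev/sdg | sfdisk /dev/sdh

# After `mdadm --manage /dev/md2 --add /dev/sdh1`, md resyncs.
# Hypothetical progress line as it would appear in /proc/mdstat:
progress='      [==>..................]  recovery = 12.4% (58291200/465129792) finish=93.7min speed=72345K/sec'
pct=$(echo "$progress" | grep -o '[0-9.]*%')
echo "$pct"   # 12.4%
```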

Just for the record, after clearing the foreign config, megacli -PDMakeJBOD -PhysDrv\[32:7\] -a0 was also needed.
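For completeness: once the resync finishes, the array state can be checked for a clean rebuild with `mdadm --detail /dev/md2`. A minimal parsing sketch, where the sample lines are hypothetical and not captured from this host:

```shell
# Hypothetical excerpt of `mdadm --detail /dev/md2` after resync completes
detail='State : clean
Active Devices : 12
Working Devices : 12
Failed Devices : 0'

state=$(echo "$detail" | grep '^State' | awk '{print $3}')
echo "$state"   # clean
```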

Cmjohnson closed this task as Resolved. Apr 15 2016, 1:40 PM

The disk has been replaced, resolving.