We have noticed that 14 of our ceph hosts are having sector errors on all the hard drives:
{P52904}
These errors are increasing over time, so it seems the drives are degrading (see the `changed` lines):
{P52903}
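The per-drive counters behind those `changed` lines can be read directly from SMART. A minimal sketch of filtering the sector-health attributes; the here-doc stands in for real `sudo smartctl -A /dev/sdX` output, and the sample raw values are illustrative only:

```shell
# Sample SMART attribute table (in real use: sudo smartctl -A /dev/sdb)
smartctl_output=$(cat <<'EOF'
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       12
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Always       -       8
EOF
)
# Print attribute name and raw value for the counters that indicate
# surface degradation (reallocated, pending, uncorrectable sectors)
echo "$smartctl_output" | awk '$1==5 || $1==197 || $1==198 {print $2, $NF}'
```

Watching the raw values of attributes 5/197/198 over time is what the `changed` lines above are tracking.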
All these machines were bought in two batches (together they make up the entirety of both batches), so it might be a bad hard drive batch:
{T291987}
{T283888}
These hosts are in service and we can't take them all out at the same time, so we'll have to coordinate the replacement/debugging.
Note that one of the drives (cloudcephosd1034-sdh) was replaced in {T316673} and it has not had any errors.
I'll fill in the details with the logs/debugging from the wiki shortly.
Thanks!
List of affected hosts:
cloudcephosd1021 - back online
cloudcephosd1022 - back online
cloudcephosd1023 - back online
cloudcephosd1024 - back online
cloudcephosd1025 - back online
cloudcephosd1026 - back online
cloudcephosd1027 - back online
cloudcephosd1028 - back online
cloudcephosd1029 - back online
cloudcephosd1030 - back online
cloudcephosd1031 - back online
cloudcephosd1032 - back online
cloudcephosd1033 - back online
cloudcephosd1034 - back online
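Since the affected hosts are numbered consecutively, a sweep of the SMART counters can be scripted. A dry-run sketch that only prints the per-host commands (it assumes SSH access to each host, and the drive letters to check would need adjusting per host):

```shell
# Generate the per-host smartctl check commands for the 14 affected hosts
# (dry run: commands are printed, not executed)
for n in $(seq 21 34); do
  host="cloudcephosd10${n}"
  echo "ssh ${host} sudo smartctl -A /dev/sdb"
done
```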
== Current status
* cloudcephosd1030 will have 4 replacement disks shipped in; the current ones will be swapped out and shipped back
** Smartctl + performance tests of the new drives (taking advantage of them being empty, and for future reference):
{P58861}
* cloudcephosd1034 will be used to run some performance tests on the original drives and the replacement drive
```
sudo fio \
--ioengine=libaio \
--direct=1 \
--numjobs=1 \
--time_based=1 \
--runtime=120 \
--rw=randread \
--bs=512k \
--iodepth=8 \
--name=test \
--group_reporting=1 \
--filename=/dev/sdb
```
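To compare runs between the original and replacement drives, the bandwidth figure can be pulled out of fio's summary line. A sketch against a sample summary line (the numbers are made up; the line format matches fio's `group_reporting` output):

```shell
# Sample fio summary line (illustrative values only)
fio_summary='  read: IOPS=490, BW=245MiB/s (257MB/s)(28.7GiB/120002msec)'
# Extract just the BW= figure for side-by-side comparison across drives
echo "$fio_summary" | grep -o 'BW=[^ ]*'
```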