We have noticed that 14 of our ceph hosts are having sector errors on all the hard drives:
{P52904}
These errors are increasing over time, so it seems the drives are degrading (see the `changed` lines):
{P52903}
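The per-drive counters behind those `changed` lines can be read directly from SMART. A minimal sketch of filtering the sector-health attributes; the here-doc stands in for real `sudo smartctl -A /dev/sdX` output, and the sample raw values are illustrative only:

```shell
# Sample SMART attribute table (in real use: sudo smartctl -A /dev/sdb)
smartctl_output=$(cat <<'EOF'
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       12
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Always       -       8
EOF
)
# Print attribute name and raw value for the counters that indicate
# surface degradation (reallocated, pending, uncorrectable sectors)
echo "$smartctl_output" | awk '$1==5 || $1==197 || $1==198 {print $2, $NF}'
```

Watching the raw values of attributes 5/197/198 over time is what the `changed` lines above are tracking.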
All these machines were bought in two batches (together they make up the entirety of both batches), so it might be a bad hard drive batch:
{T291987}
{T283888}
These hosts are in service and we can't take them all out at the same time, so we'll have to coordinate the replacement/debugging.
Note that one of the drives (cloudcephosd1034-sdh) was replaced in {T316673} and it has not had any errors.
I'll fill in the details with the logs/debugging from the wiki shortly.
Thanks!
List of affected hosts:
cloudcephosd1021 - back online
cloudcephosd1022 - back online
cloudcephosd1023 - back online
cloudcephosd1024 - back online
cloudcephosd1025 - back online
cloudcephosd1026 - back online
cloudcephosd1027 - back online
cloudcephosd1028 - back online
cloudcephosd1029 - back online
cloudcephosd1030 - back online
cloudcephosd1031 - back online
cloudcephosd1032 - back online
cloudcephosd1033 - back online
cloudcephosd1034 - back online
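Since the affected hosts are numbered consecutively, a sweep of the SMART counters can be scripted. A dry-run sketch that only prints the per-host commands (it assumes SSH access to each host, and the drive letters to check would need adjusting per host):

```shell
# Generate the per-host smartctl check commands for the 14 affected hosts
# (dry run: commands are printed, not executed)
for n in $(seq 21 34); do
  host="cloudcephosd10${n}"
  echo "ssh ${host} sudo smartctl -A /dev/sdb"
done
```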
== Current status
* cloudcephosd1030 will have 4 replacement disks shipped in; the current ones will be swapped out and shipped back
** Smartctl + performance tests of the new drives (taking advantage of them being empty, and for future reference):
{P58861}
* cloudcephosd1034 will be used to run some performance tests on the original drives and the replacement drive
```
sudo fio \
--ioengine=libaio \
--direct=1 \
--numjobs=1 \
--time_based=1 \
--runtime=120 \
--rw=randread \
--bs=512k \
--iodepth=8 \
--name=test \
--group_reporting=1 \
--filename=/dev/sdb
```
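To compare runs between the original and replacement drives, the bandwidth figure can be pulled out of fio's summary line. A sketch against a sample summary line (the numbers are made up; the line format matches fio's `group_reporting` output):

```shell
# Sample fio summary line (illustrative values only)
fio_summary='  read: IOPS=490, BW=245MiB/s (257MB/s)(28.7GiB/120002msec)'
# Extract just the BW= figure for side-by-side comparison across drives
echo "$fio_summary" | grep -o 'BW=[^ ]*'
```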