We have noticed that 14 of our Ceph hosts are reporting sector errors on all of their hard drives:
{P52904}
These errors are increasing over time, so it seems the drives are degrading (see the `changed` lines):
{P52903}
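As a rough illustration of what the `changed` lines capture, here is a minimal Python sketch that compares two snapshots of per-drive sector error counts and reports the drives whose count went up. This is not the script that produced the paste above; the function name `increased_counts`, the drive names and the numbers are all hypothetical.

```python
# Minimal sketch: compare two snapshots of per-drive sector error counts.
# All drive names and counts below are made up for illustration only.

def increased_counts(before, after):
    """Return {drive: (old, new)} for drives whose error count went up."""
    changed = {}
    for drive, new in after.items():
        # A drive missing from the earlier snapshot is treated as having 0 errors.
        old = before.get(drive, 0)
        if new > old:
            changed[drive] = (old, new)
    return changed


if __name__ == "__main__":
    before = {"sda": 8, "sdb": 0, "sdc": 16}            # hypothetical earlier snapshot
    after = {"sda": 24, "sdb": 0, "sdc": 16, "sdd": 8}  # hypothetical later snapshot
    for drive, (old, new) in sorted(increased_counts(before, after).items()):
        print(f"{drive}: {old} -> {new}")
```

A drive whose count only ever stays flat would not show up here, which is the distinction between a one-off bad sector and a drive that keeps degrading.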
The affected machines were bought in two batches, and they account for every machine in both, so this might be a bad batch of hard drives:
{T291987}
{T283888}
These hosts are in service and we can't take them all out at the same time, so we'll have to coordinate the replacement and debugging.
I'll fill in the details with the logs and debugging notes from the wiki in a bit.
Thanks!
List of affected hosts:
cloudcephosd1021 - back online
cloudcephosd1022 - back online
cloudcephosd1023 - back online
cloudcephosd1024 - back online
cloudcephosd1025 - back online
cloudcephosd1026 - drained, ready for upgrade
cloudcephosd1027 - drained, ready for upgrade
cloudcephosd1028 - drained, ready for upgrade
cloudcephosd1029 - to be drained
cloudcephosd1030 - to be drained
cloudcephosd1031 - to be drained
cloudcephosd1032 - to be drained
cloudcephosd1033 - to be drained
cloudcephosd1034 - to be drained