
Smartctl disk defects on kafka1012
Closed, Resolved · Public · 3 Estimated Story Points

Description

Today, after rebooting kafka1012, I had to run fsck on /dev/sdf1 to correct some filesystem issues. I then ran smartctl -a across the data disks, with this result:

elukey@kafka1012:~$ for el in `df -h | grep spool | cut -d " " -f 1`; do echo $el; sudo smartctl -a $el | grep defect; done
/dev/sdg1
Elements in grown defect list: 0
/dev/sdd1
Elements in grown defect list: 0
/dev/sdl1
Elements in grown defect list: 0
/dev/sde1
Elements in grown defect list: 0
/dev/sdk1
Elements in grown defect list: 0
/dev/sdh1
Elements in grown defect list: 0
/dev/sdc1
Elements in grown defect list: 0
/dev/sdj1
Elements in grown defect list: 0
/dev/sdi1
Elements in grown defect list: 0
/dev/sdf1
Elements in grown defect list: 1425
/dev/sda3
Elements in grown defect list: 0
/dev/sdb3
Elements in grown defect list: 0
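The one-liner above can be wrapped into a small helper that flags only the disks with a non-zero grown defect list, so a bad disk like /dev/sdf1 stands out immediately. This is a hypothetical sketch (the function name `check_defects` is mine, not part of any tooling here); it only assumes the "Elements in grown defect list: N" line format that smartctl prints for SAS disks:

```shell
# Hypothetical helper: given a device name and its smartctl -a output,
# warn when the grown defect list is non-zero.
check_defects() {
  # $1 = device name, $2 = smartctl output text
  local count
  count=$(printf '%s\n' "$2" | awk -F': ' '/Elements in grown defect list/ {print $2}')
  if [ "${count:-0}" -gt 0 ]; then
    echo "WARNING: $1 has $count grown defects"
  else
    echo "OK: $1"
  fi
}
```

In the loop from the description one would call it as `check_defects "$el" "$(sudo smartctl -a $el)"`, which for /dev/sdf1 would print the warning with the 1425 count.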

The disks seem to be JBOD:

elukey@kafka1012:~$ sudo megacli -AdpAllInfo -aALL

                Device Present
                ================
Virtual Drives    : 0
  Degraded        : 0
  Offline         : 0
Physical Devices  : 14
  Disks           : 12
  Critical Disks  : 0
  Failed Disks    : 0
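Since the disks are JBOD, each OS device maps to a single physical drive, and the on-site tech will need the physical slot number rather than the /dev name. A hedged sketch of how one might summarize slot and state from `sudo megacli -PDList -aALL` output (the `Slot Number` and `Firmware state` field names are what megacli typically prints, but verify against the actual output; the sample text in the test stands in for the real command):

```shell
# Hypothetical helper: pair up "Slot Number" and "Firmware state" lines
# from megacli -PDList output read on stdin, one drive per output line.
summarize_pds() {
  grep -E '^(Slot Number|Firmware state)' | paste - -
}
```

Usage would be `sudo megacli -PDList -aALL | summarize_pds`, then match the failing drive's serial number against the smartctl output to be sure the right slot is pulled.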

The disk is back in service for now, but it would be wise to replace it before it fails completely. Since this host is part of the Analytics Kafka cluster, we need to schedule downtime for the server.

Event Timeline

@Cmjohnson: Hi! Any idea if we could replace the disk during the next two weeks? Thanks!

Ya, anytime. We can stop this server with no service downtime; we just have to be ready to do it.

@elukey We can do it whenever you want. I have disks on-site. Let me know a
good day and time.

Let's do it today. Apparently analytics1049 has a bad disk too, so maybe we can do them together! Ping either elukey or me when you are online and ready at the datacenter.

Ottomata triaged this task as Medium priority. Jun 9 2016, 2:07 PM
Ottomata set the point value for this task to 3.
Ottomata moved this task from Next Up to Done on the Analytics-Kanban board.