
Smartctl errors for one kafka1012 disk
Closed, Resolved | Public | 3 Estimated Story Points

Description

Received from cron:

This message was generated by the smartd daemon running on:

   host name:  kafka1012
   DNS domain: eqiad.wmnet

The following warning/error was logged by the smartd daemon:

Device: /dev/sdh, SMART Failure: FAILURE PREDICTION THRESHOLD EXCEEDED: ascq=0x5

Device info:
[SEAGATE  ST32000444SS     KS68], lu id: 0x5000c50025fcf02f, S/N: 9WM3F298, 2.00 TB

Double checked on the host:

elukey@kafka1012:~$ for el in `df -h | grep spool | cut -d " " -f 1`; do echo $el; sudo smartctl -a $el | grep defect; done
/dev/sdg1
Elements in grown defect list: 0
/dev/sdj1
Elements in grown defect list: 0
/dev/sdb1
Elements in grown defect list: 0
/dev/sdi1
Elements in grown defect list: 0
/dev/sdk1
Elements in grown defect list: 0
/dev/sdl1
Elements in grown defect list: 0
/dev/sdf1
Elements in grown defect list: 0
/dev/sdd1
Elements in grown defect list: 0
/dev/sde1
Elements in grown defect list: 0
/dev/sda3
Elements in grown defect list: 0
/dev/sdc3
Elements in grown defect list: 0
/dev/sdh1
Elements in grown defect list: 4061
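
The check above can also be written to report only the disks with a non-empty grown defect list. This is a minimal sketch, not what was run on the host: the function names `grown_defects` and `report_failing_disks` are hypothetical, and it assumes the data partitions are mounted under a path containing "spool" and that `smartctl -a` prints the "Elements in grown defect list" line as shown above.

```shell
#!/bin/sh
# Extract the grown defect count from `smartctl -a` output read on stdin.
# Exits non-zero if the line is absent (e.g. a device that does not report it).
grown_defects() {
    awk -F': *' '/Elements in grown defect list/ { print $2; found = 1 }
                 END { if (!found) exit 1 }'
}

# Print only the devices whose grown defect list is non-empty.
# Assumption: the partitions of interest are mounted under a path containing "spool".
report_failing_disks() {
    df -P | awk '/spool/ { print $1 }' | while read -r dev; do
        count=$(sudo smartctl -a "$dev" | grown_defects) || continue
        if [ "$count" -gt 0 ]; then
            echo "$dev: $count grown defects"
        fi
    done
}
```

With the output above, calling `report_failing_disks` would be expected to list only /dev/sdh1.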

The host is scheduled to be decommissioned within the next two quarters, but I'd prefer to swap the disk in advance to avoid any service disruption.

Event Timeline

This system is out of warranty, and will require onsite spare disks to be used as replacement.

Yes please, do we need approvals or can we proceed?

No approvals needed, it's what shelf spares are for. I just commented to save Chris the time of looking up warranty info.

@elukey, I have plenty of disks on-site...just let me know which slot number.

Mentioned in SAL (#wikimedia-operations) [2017-07-18T17:16:32Z] <ottomata> stopping kafka broker on kafka1012 to replace disk T168927

Ottomata added a subscriber: Ottomata.

Disk replaced with a spare. Mounted as /var/spool/kafka/h with UUID=247e0397-066b-4b5c-b6c3-cacd1ecf8cdd.

Kafka is back up and is replicating missing data from other brokers. Thanks, y'all!

Mentioned in SAL (#wikimedia-operations) [2017-07-18T18:02:24Z] <ottomata> stopping kafka on kafka1012 again, i think we swapped the wrong disk T168927

Ah, we accidentally swapped the wrong disk. My fault.

We put the good one back in, took the defective one out, and put the spare back in the other slot. So /var/spool/kafka/g now has UUID 247e0397-066b-4b5c-b6c3-cacd1ecf8cdd and is resyncing.
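
One precaution against pulling the wrong drive is to confirm that the device about to be replaced reports the serial number from the smartd alert (9WM3F298 for /dev/sdh in the description). The following is only a sketch under assumptions: the task does not say how the mix-up happened, the function names `serial_of` and `confirm_disk` are hypothetical, and it assumes `smartctl -i` prints a "Serial number:" line for the device. Mapping a device to a physical slot depends on the controller and is not covered here.

```shell
#!/bin/sh
# Extract the serial number from `smartctl -i` output read on stdin.
# The label is matched case-insensitively ("Serial number" / "Serial Number").
serial_of() {
    awk -F': *' 'tolower($1) == "serial number" { print $2; exit }'
}

# Succeed only if the device reports the expected serial number.
confirm_disk() {
    dev=$1
    expected=$2
    actual=$(sudo smartctl -i "$dev" | serial_of)
    if [ "$actual" = "$expected" ]; then
        echo "$dev matches serial $expected"
    else
        echo "$dev reports serial '$actual', expected '$expected'" >&2
        return 1
    fi
}
```

For example, `confirm_disk /dev/sdh 9WM3F298` could be run before stopping the broker, and the serial printed on the drive label compared once the disk is out of its slot.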

Nuria set the point value for this task to 3.