Smartctl errors for one kafka1012 disk
Closed, Resolved · Public · 3 Story Points

Description

Received from cron:

This message was generated by the smartd daemon running on:

   host name:  kafka1012
   DNS domain: eqiad.wmnet

The following warning/error was logged by the smartd daemon:

Device: /dev/sdh, SMART Failure: FAILURE PREDICTION THRESHOLD EXCEEDED: ascq=0x5

Device info:
[SEAGATE  ST32000444SS     KS68], lu id: 0x5000c50025fcf02f, S/N: 9WM3F298, 2.00 TB

Double checked on the host:

elukey@kafka1012:~$ for el in `df -h | grep spool | cut -d " " -f 1`; do echo $el; sudo smartctl -a $el | grep defect; done
/dev/sdg1
Elements in grown defect list: 0
/dev/sdj1
Elements in grown defect list: 0
/dev/sdb1
Elements in grown defect list: 0
/dev/sdi1
Elements in grown defect list: 0
/dev/sdk1
Elements in grown defect list: 0
/dev/sdl1
Elements in grown defect list: 0
/dev/sdf1
Elements in grown defect list: 0
/dev/sdd1
Elements in grown defect list: 0
/dev/sde1
Elements in grown defect list: 0
/dev/sda3
Elements in grown defect list: 0
/dev/sdc3
Elements in grown defect list: 0
/dev/sdh1
Elements in grown defect list: 4061
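
With this many disks, the one nonzero count is easy to miss by eye. The following is a minimal sketch, not something that was run on the host: `flag_grown_defects` is a hypothetical helper name, and it assumes the input is exactly the alternating "device line, defect line" output shown above.

```shell
# Hypothetical helper: reads the loop's output on stdin and prints
# "<device> <count>" for every device whose grown defect count is
# greater than the limit given as $1 (default 0).
flag_grown_defects() {
  awk -v limit="${1:-0}" '
    # Remember the most recent device line, e.g. /dev/sdh1.
    /^\/dev\// { dev = $1; next }
    # The count is the last field of the defect line.
    /Elements in grown defect list:/ {
      count = $NF + 0
      if (count > limit) print dev, count
    }
  '
}
```

Piping the loop above into `flag_grown_defects` would then print only the devices that need attention, which for the output in this task is the single line for /dev/sdh1.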

The host is scheduled to be decommed during the next two quarters but I'd prefer to swap the disk in advance to avoid any service disruption.

Event Timeline

elukey created this task. · Jun 27 2017, 7:54 AM
Restricted Application added a subscriber: Aklapper. · Jun 27 2017, 7:54 AM
CitoplasmaX moved this task from Backlog (Later) to Incoming on the Analytics board.
Cmjohnson moved this task from Backlog to Up next on the ops-eqiad board. · Jun 29 2017, 3:13 PM
RobH added a subscriber: RobH. · Jun 29 2017, 6:42 PM

This system is out of warranty, and will require onsite spare disks to be used as replacement.

Yes please, do we need approvals or can we proceed?

RobH added a comment. · Jun 30 2017, 3:52 PM

No approvals needed, it's what shelf spares are for. I just commented to save Chris the time of looking up warranty info.

Nuria moved this task from Incoming to Radar on the Analytics board. · Jul 3 2017, 4:08 PM

@elukey, I have plenty of disks on-site...just let me know which slot number.

Mentioned in SAL (#wikimedia-operations) [2017-07-18T17:16:32Z] <ottomata> stopping kafka broker on kafka1012 to replace disk T168927

Ottomata added a subscriber: Ottomata.

Disk replaced with a spare. Mounted as /var/spool/kafka/h with UUID=247e0397-066b-4b5c-b6c3-cacd1ecf8cdd.

Kafka is back up and is replicating missing data from other brokers. Thanks y'all!

Ottomata moved this task from Next Up to Done on the Analytics-Kanban board. · Jul 18 2017, 5:35 PM

Mentioned in SAL (#wikimedia-operations) [2017-07-18T18:02:24Z] <ottomata> stopping kafka on kafka1012 again, i think we swapped the wrong disk T168927

Ah, we accidentally swapped the wrong disk. My fault.

We put the good one back in, took the defective one out, and put the spare back in the other slot. So /var/spool/kafka/g now has UUID 247e0397-066b-4b5c-b6c3-cacd1ecf8cdd and is resyncing.
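
One way to guard against this kind of mix-up is to compare the serial number the drive reports with the one in the smartd mail (S/N: 9WM3F298) before pulling anything. This is only a sketch and was not part of the actual procedure: `serial_from_smartctl` is a hypothetical helper, and it assumes `smartctl -i` prints a "Serial number:" line for the device (the label's capitalization differs between SCSI and ATA drives, so the match is case-insensitive).

```shell
# Hypothetical helper: reads `smartctl -i` output on stdin and prints
# the value of the first "Serial number:" line, matching the label
# case-insensitively.
serial_from_smartctl() {
  awk -F': *' 'tolower($1) == "serial number" { print $2; exit }'
}
```

For example, `sudo smartctl -i /dev/sdh | serial_from_smartctl` could be checked against 9WM3F298, and the same serial read off the label of the physical disk in the slot about to be swapped.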

Nuria closed this task as Resolved. · Jul 24 2017, 8:40 PM
Nuria set the point value for this task to 3.