
SMART errors on kafka1012.eqiad.wmfnet
Closed, Resolved · Public

Description

Hello!

We received a SMART notification about kafka1012.eqiad.wmfnet, and the dmesg reports this:

[19958.049571] sd 0:0:5:0: [sdf] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[19958.049583] sd 0:0:5:0: [sdf] Sense Key : Aborted Command [current]
[19958.049591] sd 0:0:5:0: [sdf] Add. Sense: Ack/nak timeout
[19958.049593] sd 0:0:5:0: [sdf] CDB:
[19958.049596] Read(10): 28 00 96 80 0a 18 00 00 08 00
[19958.049604] blk_update_request: I/O error, dev sdf, sector 2524973592
[38257.716894] sd 0:0:5:0: [sdf] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[38257.716899] sd 0:0:5:0: [sdf] Sense Key : Aborted Command [current]
[38257.716903] sd 0:0:5:0: [sdf] Add. Sense: Ack/nak timeout
[38257.716905] sd 0:0:5:0: [sdf] CDB:
[38257.716907] Read(10): 28 00 54 00 0b d0 00 00 38 00
[38257.716914] blk_update_request: I/O error, dev sdf, sector 1409289168

elukey@kafka1012:~$ cat /proc/mounts | grep sdf
/dev/sdf1 /var/spool/kafka/f ext4 rw,noatime,data=writeback 0 0

We use the sdf1 partition for the Kafka broker's log data, and we should pay attention since this host has already caused https://phabricator.wikimedia.org/T125084 :D
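
For reference, a rough sketch of how the drive's SMART state can be queried directly (whether it has to be addressed through the MegaRAID controller with -d megaraid,N depends on how the disk is exposed, and the device number below is only a placeholder):

# overall health verdict plus attribute/error counters for the suspect drive
sudo smartctl -H -A /dev/sdf
# the same drive addressed through the controller, if it is not visible directly
sudo smartctl -H -A -d megaraid,5 /dev/bus/0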

Thanks!

Luca

Event Timeline

elukey raised the priority of this task from to Needs Triage.
elukey updated the task description. (Show Details)
elukey added a project: ops-eqiad.
elukey added subscribers: elukey, Ottomata.
Restricted Application added subscribers: StudiesWorld, Aklapper.

Change 267243 had a related patch set uploaded (by Mforns):
Remove kafka1012 from EventLogging brokers array

https://gerrit.wikimedia.org/r/267243

Kafka has been stopped on the node; no services are actively running on it any more.
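
Roughly what that amounted to (a sketch; the service name "kafka" is an assumption):

# stop the broker and confirm no Kafka process is left running
sudo service kafka stop
ps aux | grep -i '[k]afka'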

@Cmjohnson are you in the DC today? Can we get this disk swapped asap?
Thanks!

Ottomata set Security to None.

Change 267243 merged by Ottomata:
Remove kafka1012 from EventLogging brokers array

https://gerrit.wikimedia.org/r/267243

Next steps before closing:

  1. disk replaced
  2. bring the host/service up and running again
  3. evaluate the following reverts (see the sketch after this list):
     - https://gerrit.wikimedia.org/r/267243 (EventLogging)
     - https://gerrit.wikimedia.org/r/#/c/267295/
     - https://gerrit.wikimedia.org/r/#/c/267022 (related to https://phabricator.wikimedia.org/T125084)
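
Roughly how one of those reverts would look from the command line (a sketch; the branch name and commit hash are placeholders, and it assumes git-review is configured for the repository):

git checkout -b revert-eventlogging-brokers origin/production
git revert <merged-commit-sha>   # creates a new commit undoing the original change
git review                       # uploads the revert to Gerrit for review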

Temporarily assigned to Cmjohnson.

We keep receiving emails from smartd about failures on other drives on kafka1012, so I ran short SMART self-tests (smartctl -t short) on two of the disks and didn't find any evident errors. I started the long test just in case; it is still running (it will complete in a few hours).
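
Roughly the commands involved (a sketch; results end up in the drive's self-test log):

# run the quick self-test first, then the extended one
sudo smartctl -t short /dev/sdf
sudo smartctl -t long /dev/sdf
# check progress and the outcome once they finish
sudo smartctl -l selftest /dev/sdf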

I took a look at syslog and found a lot of entries like these:

root@kafka1012:/home/elukey# zgrep -i smart /var/log/syslog* | grep -B 2 "Sending"

/var/log/syslog.6.gz:Jan 29 12:40:38 kafka1012 smartd[928]: Device: /dev/sdk, failed to read Temperature
/var/log/syslog.6.gz:Jan 29 12:40:57 kafka1012 smartd[928]: Device: /dev/bus/0 [megaraid_disk_10], failed to read Temperature
/var/log/syslog.6.gz:Jan 29 13:10:28 kafka1012 smartd[928]: Sending warning via /usr/share/smartmontools/smartd-runner to root ...
--
/var/log/syslog.6.gz:Jan 29 14:40:36 kafka1012 smartd[928]: Device: /dev/sdk, failed to read Temperature
/var/log/syslog.6.gz:Jan 29 14:40:52 kafka1012 smartd[928]: Device: /dev/bus/0 [megaraid_disk_10], failed to read Temperature
/var/log/syslog.6.gz:Jan 29 15:10:26 kafka1012 smartd[928]: Sending warning via /usr/share/smartmontools/smartd-runner to root ...
--
/var/log/syslog.7.gz:Jan 28 18:10:38 kafka1012 smartd[928]: Device: /dev/sdf, failed to read Temperature
/var/log/syslog.7.gz:Jan 28 21:40:57 kafka1012 smartd[928]: Device: /dev/bus/0 [megaraid_disk_10], failed to read SMART values
/var/log/syslog.7.gz:Jan 28 21:40:57 kafka1012 smartd[928]: Sending warning via /usr/share/smartmontools/smartd-runner to root ...
--
/var/log/syslog.7.gz:Jan 29 02:10:56 kafka1012 smartd[928]: Device: /dev/bus/0 [megaraid_disk_10], failed to read Temperature
/var/log/syslog.7.gz:Jan 29 02:40:37 kafka1012 smartd[928]: Device: /dev/sdk, failed to read SMART values
/var/log/syslog.7.gz:Jan 29 02:40:37 kafka1012 smartd[928]: Sending warning via /usr/share/smartmontools/smartd-runner to root ...

This error is not present in other nodes from what I can see. The configuration of the disks is:

root@kafka1012:/home/elukey# df -h
Filesystem      Size  Used Avail Use% Mounted on
udev             10M     0   10M   0% /dev
tmpfs           9.5G  418M  9.1G   5% /run
/dev/md0         28G  5.3G   21G  21% /
tmpfs            24G     0   24G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs            24G     0   24G   0% /sys/fs/cgroup
/dev/sdc1       1.8T  104G  1.7T   6% /var/spool/kafka/c
/dev/sdj1       1.8T  621G  1.2T  34% /var/spool/kafka/j
/dev/sdd1       1.8T  608G  1.2T  34% /var/spool/kafka/d
/dev/sdi1       1.8T  535G  1.3T  30% /var/spool/kafka/i
/dev/sdh1       1.8T  603G  1.3T  33% /var/spool/kafka/h
/dev/sdk1       1.8T  525G  1.3T  29% /var/spool/kafka/k
/dev/sde1       1.8T  8.6G  1.8T   1% /var/spool/kafka/e
/dev/sdg1       1.8T  532G  1.3T  29% /var/spool/kafka/g
/dev/sdb3       1.8T 1023G  783G  57% /var/spool/kafka/b
/dev/sda3       1.8T  515G  1.3T  29% /var/spool/kafka/a
/dev/sdf1       1.8T  1.1T  733G  61% /var/spool/kafka/f
/dev/sdl1       1.8T  619G  1.2T  34% /var/spool/kafka/l

And I guess that the megaraid devices on /dev/bus/0 (megaraid_disk_10, etc.) are the /dev/sdX disks. Could it be a problem with the MegaRAID controller? We definitely had issues with this host and the data served to its consumers, so something is happening, but I don't know what.
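
One way to cross-check the controller's view of the disks (a sketch; the binary may be packaged as megacli or MegaCli64):

# list physical drives with their slot/device IDs, error counters and state
sudo megacli -PDList -aALL | grep -E 'Slot Number|Device Id|Media Error Count|Firmware state'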

I do have a disk on-site if still needed. The MegaRAID controller shows all disks as good, so we'll have to figure out the correct disk to replace.

The long tests finished; the only thing I can see is that /dev/sdf (the one throwing I/O errors) has a bad sector in its defect list:

elukey@kafka1012:~$ for el in `df -h | grep spool | cut -d " " -f 1`; do echo $el; sudo smartctl -a $el | grep defect; done
/dev/sdc1
Elements in grown defect list: 0
/dev/sdj1
Elements in grown defect list: 0
/dev/sdd1
Elements in grown defect list: 0
/dev/sdi1
Elements in grown defect list: 0
/dev/sdh1
Elements in grown defect list: 0
/dev/sdk1
Elements in grown defect list: 0
/dev/sde1
Elements in grown defect list: 0
/dev/sdg1
Elements in grown defect list: 0
/dev/sdb3
Elements in grown defect list: 0
/dev/sda3
Elements in grown defect list: 0
/dev/sdf1
Elements in grown defect list: 1   <==================
/dev/sdl1
Elements in grown defect list: 0

@Cmjohnson: are the "temperature" warnings just noise? I am asking out of ignorance, I have never seen them before.

@Ottomata: could it be possible that a single bad sector caused the whole mess? I am a bit puzzled.

@Cmjohnson: if the missing temperature warnings are fine from your point of view, I would replace the /dev/sdf drive since it is the only one showing a sign of degradation (the bad sector).

Thanks!!!

Status:

After failing to reboot with the swapped disk, @Cmjohnson and I tried to modify some RAID controller settings to get /dev/sdf to show up with the new disk. We failed at this. We then put the old disk back in, booted, and then @Cmjohnson swapped the disk again. We will continue working on this in the morning.

I don't fully follow the problem, but it seems the RAID controller needs some settings changed, and perhaps @apergos has done this before for swift, a long time ago.
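
For reference, if the disks are presented to the OS as single-drive RAID0 virtual drives (an assumption), getting a swapped disk to show up usually means clearing any foreign configuration left by the old disk and re-creating the virtual drive for that slot. A rough sketch with placeholder enclosure:slot numbers:

# check for and clear foreign configuration from the replaced disk
sudo megacli -CfgForeign -Scan -a0
sudo megacli -CfgForeign -Clear -a0
# re-create a single-disk RAID0 virtual drive for the new disk ([32:5] is a placeholder)
sudo megacli -CfgLdAdd -r0 [32:5] -a0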

Oook, this morning /dev/sdf is present and looks fine. Proceeding...

Change 268689 had a related patch set uploaded (by Ottomata):
Reenable kafka1012 broker

https://gerrit.wikimedia.org/r/268689

Change 268689 merged by Ottomata:
Reenable kafka1012 broker

https://gerrit.wikimedia.org/r/268689

Oook, looking good. The broker is back up and it looks to be recovering.
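
One quick way to watch the recovery (a sketch, assuming the stock Kafka tooling; the ZooKeeper address is a placeholder):

# the list of under-replicated partitions should shrink back to empty as kafka1012 catches up
kafka-topics.sh --describe --under-replicated-partitions --zookeeper <zookeeper-host>:2181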

Yes, looking good! Thank you!