
Disk sde likely failing on analytics1032
Closed, Resolved · Public

Description

Got an alert of a downed datanode on analytics1032.

The datanode log says:

org.apache.hadoop.util.DiskChecker$DiskErrorException: Too many failed volumes - current valid volumes: 11, volumes configured: 12, volumes failed: 1, volume failures tolerated: 0
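
For reference, the counts in that exception can be read mechanically. The sketch below is only an illustration, not Hadoop's own code: it assumes the message keeps the exact wording quoted above, and the function names are hypothetical. The "volume failures tolerated" value normally comes from the `dfs.datanode.failed.volumes.tolerated` setting in hdfs-site.xml; with it at 0, a single failed volume is enough to stop the datanode.

```python
import re

# Matches the counts in the DiskChecker message quoted in this task.
# Assumption: the message wording is exactly as shown above.
PATTERN = re.compile(
    r"current valid volumes: (\d+), volumes configured: (\d+), "
    r"volumes failed: (\d+), volume failures tolerated: (\d+)"
)


def parse_volume_status(message):
    """Return the four volume counts as a dict, or None if the line does not match."""
    match = PATTERN.search(message)
    if match is None:
        return None
    valid, configured, failed, tolerated = map(int, match.groups())
    return {
        "valid": valid,
        "configured": configured,
        "failed": failed,
        "tolerated": tolerated,
    }


def exceeds_tolerance(status):
    """True when more volumes have failed than the configuration tolerates."""
    return status["failed"] > status["tolerated"]
```

For the line above this gives 11 valid, 12 configured, 1 failed and 0 tolerated, so `exceeds_tolerance` is true, which matches the datanode going down.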

syslog says:

Sep  9 02:27:20 analytics1032 kernel: [3584610.310985] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Sep  9 02:27:20 analytics1032 kernel: [3584610.310987] sd 0:2:4:0: [sde] CDB:
...

Event Timeline

Ottomata created this task. Sep 9 2016, 2:37 AM

Also, megacli shows:

$ sudo megacli -PDList -aAll

...
Enclosure Device ID: 32
Slot Number: 3
...
Firmware state: Failed
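
To pick the failed slot out of the full listing without scrolling, something like the sketch below could be used. It is a rough illustration with a hypothetical function name, and it assumes each drive block contains the "Enclosure Device ID", "Slot Number" and "Firmware state" lines in that order, as in the excerpt above; other fields are ignored.

```python
def failed_drives(pdlist_output):
    """Return drives whose firmware state does not start with "Online".

    Input is the text printed by `megacli -PDList -aAll`. Assumption: every
    drive block ends with its "Firmware state" line, as in the excerpt above.
    Any non-Online state (not only "Failed") is reported.
    """
    drives = []
    current = {}
    for line in pdlist_output.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "key: value" line
        key, value = key.strip(), value.strip()
        if key == "Enclosure Device ID":
            current = {"enclosure": value}  # start of a new drive block
        elif key == "Slot Number":
            current["slot"] = value
        elif key == "Firmware state":
            current["state"] = value
            drives.append(current)
            current = {}
    return [d for d in drives if not d["state"].startswith("Online")]
```

Fed the output above, this would report enclosure 32, slot 3 with state "Failed".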

Mentioned in SAL [2016-09-09T07:17:10Z] <elukey> puppet disabled on analytics1032, Hadoop services stopped - T145170

elukey added a comment. Sep 9 2016, 8:55 AM

kern.log, syslog and the jmxtrans log kept accumulating errors, which ended up filling the disks. The main cause seemed to be a "du" process launched by the "hdfs" user. I killed it and the errors stopped.

The host is now "quiet"; next we need to check the disk and probably swap it :)

The disk on analytics1032 has failed. I replaced the failed disk, cleared the cache and added the disk back; all disks are back online.

@analytics1032:~# megacli -PDList -aALL |grep "Firmware state"
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up

elukey closed this task as Resolved. Sep 9 2016, 1:26 PM

Created the partition, rebooted the host (since we don't have UUIDs) and enabled puppet. All good!

Thanks @Cmjohnson!

I used an on-site spare to swap the disk and ordered a new one from Dell.

Congratulations: Work Order SR935921121 was successfully submitted.

Thanks, you two! So much action between my bedtime and my coffee! :D :D