Analytics1028 hdfs daemon died because of disk errors
Closed, Resolved · Public

Description

On Analytics1028:

[Sun Mar  5 12:12:39 2017] EXT4-fs warning (device sdj1): ext4_end_bio:317: I/O error -5 writing to inode 205259304 (offset 3952640 size 61440 starting block 650475380)
[Sun Mar  5 12:12:39 2017] Buffer I/O error on device sdj1, logical block 650475108
[Sun Mar  5 12:12:39 2017] Buffer I/O error on device sdj1, logical block 650475109
[Sun Mar  5 12:12:39 2017] Buffer I/O error on device sdj1, logical block 650475110
[Sun Mar  5 12:12:39 2017] Buffer I/O error on device sdj1, logical block 650475111
[Sun Mar  5 12:12:39 2017] Buffer I/O error on device sdj1, logical block 650475112
[Sun Mar  5 12:12:39 2017] Buffer I/O error on device sdj1, logical block 650475113
[Sun Mar  5 12:12:39 2017] Buffer I/O error on device sdj1, logical block 650475114
[Sun Mar  5 12:12:39 2017] Buffer I/O error on device sdj1, logical block 650475115
[Sun Mar  5 12:12:39 2017] Buffer I/O error on device sdj1, logical block 650475116
[Sun Mar  5 12:12:39 2017] Buffer I/O error on device sdj1, logical block 650475117
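
For triage, an excerpt like the one above can be condensed into a per-device summary. The following is only a rough sketch: it assumes the kernel message format shown in this task, and the function name `summarize_buffer_errors` is made up for illustration.

```python
import re
from collections import defaultdict

# Matches kernel lines such as:
#   Buffer I/O error on device sdj1, logical block 650475108
BUFFER_ERROR = re.compile(
    r"Buffer I/O error on device (?P<dev>\w+), logical block (?P<block>\d+)"
)


def summarize_buffer_errors(lines):
    """Return {device: (error_count, lowest_block, highest_block)}.

    Lines that are not buffer I/O errors (for example the EXT4-fs
    warning) are ignored.
    """
    blocks = defaultdict(list)
    for line in lines:
        match = BUFFER_ERROR.search(line)
        if match:
            blocks[match.group("dev")].append(int(match.group("block")))
    return {dev: (len(b), min(b), max(b)) for dev, b in blocks.items()}


# Two of the lines reported in this task, used as sample input.
sample = [
    "[Sun Mar  5 12:12:39 2017] Buffer I/O error on device sdj1, logical block 650475108",
    "[Sun Mar  5 12:12:39 2017] Buffer I/O error on device sdj1, logical block 650475117",
]
for dev, (count, low, high) in summarize_buffer_errors(sample).items():
    print(f"{dev}: {count} buffer I/O errors, logical blocks {low}-{high}")
```

Errors confined to a single device, as here, point at one failing disk rather than a controller or cabling problem, although that conclusion should still be checked against the hardware logs.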

This host is delicate because it is a journal node (there are only three in the entire Hadoop cluster), so before doing maintenance on it we need to coordinate and triple-check that the other two journal nodes are up and running (to avoid things like a Hadoop HDFS Master shutdown).
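
The constraint above comes down to majority arithmetic: with three journal nodes, the HDFS master needs at least two of them reachable to keep writing edits. Here is a minimal sketch of that pre-maintenance check, assuming the health of each journal node has already been determined by other means. The function name and the two hosts other than analytics1028 are hypothetical placeholders, not the real cluster members.

```python
def safe_to_stop(journal_nodes, target):
    """Return True if stopping `target` still leaves a journal node majority.

    journal_nodes maps hostname -> True if its journal node is healthy.
    The majority is computed over all configured journal nodes, so for
    three nodes at least two must remain healthy.
    """
    quorum = len(journal_nodes) // 2 + 1
    remaining = sum(
        1 for host, healthy in journal_nodes.items()
        if healthy and host != target
    )
    return remaining >= quorum


# Hypothetical status map: only analytics1028 is named in this task,
# the other two hostnames are placeholders.
status = {
    "analytics1028": True,
    "journal-host-b": True,
    "journal-host-c": True,
}
print(safe_to_stop(status, "analytics1028"))
```

If either of the other two journal nodes is already down, the check returns False and the maintenance should wait.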

elukey created this task. Mar 5 2017, 10:17 AM
Restricted Application added a subscriber: Aklapper. Mar 5 2017, 10:17 AM

Mentioned in SAL (#wikimedia-operations) [2017-03-05T10:19:06Z] <elukey> disabled puppet on analytics1028 to avoid puppet to start the HDFS daemon (T159632)

I also stopped the YARN NodeManager but not the JournalNode; I will probably move the JournalNode to analytics1029 tomorrow.

Mentioned in SAL (#wikimedia-operations) [2017-03-06T10:24:45Z] <elukey> (shamefully) replaced /etc/init.d/hadoop-hdfs-datanode script with "exit 0" to prevent the HDFS datanode daemon to start on analytics1028 (broken disk) and leave the rest running (puppet included) - T159632

Nuria moved this task from Incoming to Radar on the Analytics board. Mar 6 2017, 4:29 PM

@Cmjohnson can we take care of this this week?

Swapped the disk out with a spare on-site. The server is still under warranty, so I requested a new disk to be sent through Dell.

Confirmed: Request 946033545 was successfully submitted.

Mentioned in SAL (#wikimedia-analytics) [2017-03-28T14:30:02Z] <elukey> analytics1028 back serving traffic - T159632

elukey closed this task as Resolved. Mar 28 2017, 2:31 PM
elukey claimed this task.