It seems that the following happens on an-master1001 when the ResourceManager fails over to an-master1002:
```
2018-10-12 06:05:00,758 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 9887ms for sessionid 0x50664009c3b100f0, closing socket connection and attempting reconnect
2018-10-12 06:05:00,761 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 8445ms
No GCs detected
2018-10-12 06:05:00,868 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Watcher event type: None with state:Disconnected for path:null for Service org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in state org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: STARTED
2018-10-12 06:05:00,868 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session disconnected
2018-10-12 06:05:00,868 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session disconnected. Entering neutral mode...
2018-10-12 06:05:00,869 WARN org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService: Lost contact with Zookeeper. Transitioning to standby in 10000 ms if connection is not reestablished.
```
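So the ~8.4s pause is long enough that the ZooKeeper client misses its heartbeats, the embedded elector goes into neutral mode, and the RM starts the 10000 ms countdown before transitioning to standby. If I read it correctly, that window comes from the RM's ZooKeeper session timeout, whose default matches the 10000 ms in the log. As a sketch only (we have not changed this anywhere), bumping it in yarn-site.xml would make the RM more tolerant of short pauses; the value below is an assumption for illustration:

```
<!-- Sketch only: raise the ZK session timeout used by the RM's embedded
     elector so a GC/host pause of ~8-9s does not immediately trigger a
     failover. The 10000 ms in the log above is the default of this property. -->
<property>
  <name>yarn.resourcemanager.zk-timeout-ms</name>
  <value>20000</value>
</property>
```

This of course trades faster failover for more tolerance to pauses, so it is only worth considering if the pauses turn out to be benign.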
This looks similar to what happened a while ago with the Hadoop HDFS Namenode, which forced us to change the GC collector in use. In this case I don't see a clear stress pattern in the GC timing graphs, but it needs a more careful review.
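For reference, a minimal sketch of what a similar change for the ResourceManager could look like, assuming we mirrored the Namenode approach with G1 plus GC logging (the flags and paths below are an assumption for illustration, not what is currently deployed):

```
# Sketch of a yarn-env.sh tweak (illustration only): switch the
# ResourceManager JVM to G1GC and log GC activity so pauses like the
# ~8445ms one above can be correlated with (or ruled out as) GC.
export YARN_RESOURCEMANAGER_OPTS="$YARN_RESOURCEMANAGER_OPTS \
  -XX:+UseG1GC \
  -XX:MaxGCPauseMillis=200 \
  -Xloggc:/var/log/hadoop-yarn/rm-gc.log \
  -XX:+PrintGCDetails -XX:+PrintGCDateStamps"
```

Note that the JvmPauseMonitor line says "No GCs detected", so the pause may well be host-level (I/O, scheduling, etc.) rather than GC, which is why the more careful review is needed before changing anything.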