
JVM pauses cause Yarn master to failover
Closed, Resolved · Public · 5 Estimated Story Points

Description

It seems that the following happens on an-master1001 when it fails over to an-master1002:

2018-10-12 06:05:00,758 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 9887ms for sessionid 0x50664009c3b100f0, closing socket connection and attempting reconnect
2018-10-12 06:05:00,761 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 8445ms
No GCs detected
2018-10-12 06:05:00,868 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Watcher event type: None with state:Disconnected for path:null for Service org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in state org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: STARTED
2018-10-12 06:05:00,868 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session disconnected
2018-10-12 06:05:00,868 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session disconnected. Entering neutral mode...
2018-10-12 06:05:00,869 WARN org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService: Lost contact with Zookeeper. Transitioning to standby in 10000 ms if connection is not reestablished.

This looks similar to what happened a while ago with the Hadoop HDFS Namenode, which forced us to change the GC algorithm in use. In this case I don't see a clear stress pattern in the GC timing graphs, but it needs a more careful review.
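
One way to do that review would be to grep the RM's GC log for long stop-the-world events. The log path and the -XX:+PrintGCApplicationStoppedTime flag below are assumptions (I haven't verified either on the host), so this is only a sketch:

# Sketch only: assumes the RM JVM runs with -XX:+PrintGCApplicationStoppedTime
# and writes its GC log to this (hypothetical) path.
sudo grep 'Total time for which application threads were stopped' /var/log/hadoop-yarn/yarn-yarn-resourcemanager-an-master1001-gc.log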

Event Timeline

elukey triaged this task as High priority.Oct 14 2018, 8:54 AM
elukey created this task.

Mentioned in SAL (#wikimedia-operations) [2018-10-14T08:54:45Z] <elukey> restart Yarn resource manager on an-master1002 to force an-master1001 to take the leadership back - T206943

Change 467286 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Raise the Yarn Hadoop Resource Manager heap min/max to 4G

https://gerrit.wikimedia.org/r/467286

Change 467286 merged by Elukey:
[operations/puppet@production] Raise the Yarn Hadoop Resource Manager heap min/max to 4G

https://gerrit.wikimedia.org/r/467286

I was about to send another code change for the GC, but then I took another look at the logs in the description and realized that I had missed an important bit:

2018-10-12 06:05:00,761 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 8445ms
No GCs detected

That finally makes sense with what I have observed, namely no sign of stress in the GC logs. It means that the JVM was paused for ~8s by something else, maybe the OS?
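
A quick way to quantify this from the RM log (the same file grepped later in this task) is to count how many of the detected pauses the monitor could not attribute to GC:

# Count pauses that JvmPauseMonitor could NOT attribute to GC activity.
sudo grep -A 1 'JvmPauseMonitor' /var/log/hadoop-yarn/yarn-yarn-resourcemanager-an-master1001.log | grep -c 'No GCs detected'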

For reference, this is what https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/util/JvmPauseMonitor.html does:

Class which sets up a simple thread which runs in a loop sleeping for a short interval of time. If the sleep takes significantly longer than its target time, it implies that the JVM or host machine has paused processing, which may cause other problems. If such a pause is detected, the thread logs a message.
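
In other words, it is roughly the equivalent of the following shell loop (an illustration of the idea only, not the actual Hadoop code):

# Illustration only: sleep for 500ms in a loop and warn when the elapsed
# wall-clock time is much larger than requested, i.e. the process/host stalled.
while true; do
  start=$(date +%s%N)
  sleep 0.5
  elapsed_ms=$(( ($(date +%s%N) - start) / 1000000 ))
  if [ "$elapsed_ms" -gt 1500 ]; then
    echo "Detected pause of approximately $((elapsed_ms - 500))ms"
  fi
done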

It also happened on the 16th, but didn't lead to any failover:

2018-10-16 22:24:26,205 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: Reserved container container_e91_1539594093291_4925_01_049760 on node host: analytics1057.eqiad.wmnet:8041 #containers=21 available=1136 used=51200 for application application_1539594093291_4925
2018-10-16 22:24:26,207 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 8683ms for sessionid 0x50664009c3b1017d, closing socket connection and attempting reconnect
2018-10-16 22:24:26,207 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 8947ms for sessionid 0x5166400a8eac013c, closing socket connection and attempting reconnect
2018-10-16 22:24:26,226 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 7225ms
No GCs detected

For reference, https://www.slideshare.net/HadoopSummit/operating-and-supporting-apache-hbase-best-practices-and-improvements (slide 15) shows a similar problem on HBase that was related to the kernel driver for the disk controller.

Mentioned in SAL (#wikimedia-operations) [2018-10-28T17:30:44Z] <elukey> restart yarn resource manager on an-master1002 to force failover to an-master1001 - T206943

Happened again today:

2018-10-28 11:26:45,938 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 11085ms for sessionid 0x50664009c3b1017d, closing socket connection and attempting reconnect
2018-10-28 11:26:45,939 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 8072ms
No GCs detected
2018-10-28 11:26:45,939 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 11450ms for sessionid 0x5166400a8eac013c, closing socket connection and attempting reconnect
2018-10-28 11:26:46,039 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session disconnected. Entering neutral mode...
2018-10-28 11:26:46,039 WARN org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService: Lost contact with Zookeeper. Transitioning to standby in 10000 ms if connection is not reestablished.
2018-10-28 11:26:46,040 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation.
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLos

It also seems to be happening to the HDFS Namenode, but not at the same time:

elukey@an-master1001:~$  sudo grep -rni "org.apache.hadoop.util.JvmPauseMonitor" /var/log/hadoop-hdfs/hadoop-hdfs-namenode-an-master1001.log -A 1
137384:2018-10-28 14:27:30,927 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 6504ms
137385-No GCs detected
--
351649:2018-10-28 17:28:11,956 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 5908ms
351650-No GCs detected

Logs around 11:26 on the Zookeeper nodes:

conf1004:

elukey@conf1004:~$ sudo grep "11:26" /var/log/zookeeper/zookeeper.log
2018-10-28 11:26:45,939 - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream exception
2018-10-28 11:26:45,939 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1033] - Closed socket connection for client /2620:0:861:104:10:64:5:26:35288 which had sessionid 0x50664009c3b1017d
2018-10-28 11:26:46,000 - INFO  [SessionTracker:ZooKeeperServer@358] - Expiring session 0x5166400a8eac013c, timeout of 10000ms exceeded
2018-10-28 11:26:46,001 - INFO  [SessionTracker:ZooKeeperServer@358] - Expiring session 0x50664009c3b1017d, timeout of 10000ms exceeded
2018-10-28 11:26:46,001 - INFO  [ProcessThread(sid:1104 cport:-1)::PrepRequestProcessor@487] - Processed session termination for sessionid: 0x5166400a8eac013c
2018-10-28 11:26:46,001 - INFO  [ProcessThread(sid:1104 cport:-1)::PrepRequestProcessor@487] - Processed session termination for sessionid: 0x50664009c3b1017d
2018-10-28 11:26:46,040 - INFO  [ProcessThread(sid:1104 cport:-1)::PrepRequestProcessor@649] - Got user-level KeeperException when processing sessionid:0x5166400a8eac0365 type:create cxid:0x1 zxid:0x30000cd7a7 txntype:-1 reqpath:n/a Error Path:/rmstore Error:KeeperErrorCode = NodeExists for /rmstore
2018-10-28 11:26:46,042 - INFO  [ProcessThread(sid:1104 cport:-1)::PrepRequestProcessor@649] - Got user-level KeeperException when processing sessionid:0x5166400a8eac0365 type:create cxid:0x2 zxid:0x30000cd7a8 txntype:-1 reqpath:n/a Error Path:/rmstore/ZKRMStateRoot Error:KeeperErrorCode = NodeExists for /rmstore/ZKRMStateRoot
2018-10-28 11:26:46,053 - INFO  [ProcessThread(sid:1104 cport:-1)::PrepRequestProcessor@596] - Got user-level KeeperException when processing sessionid:0x5166400a8eac0365 type:multi cxid:0x4 zxid:0x30000cd7aa txntype:-1 reqpath:n/a aborting remaining multi ops. Error Path:/rmstore/ZKRMStateRoot/RM_ZK_FENCING_LOCK Error:KeeperErrorCode = NoNode for /rmstore/ZKRMStateRoot/RM_ZK_FENCING_LOCK
2018-10-28 11:26:46,057 - INFO  [ProcessThread(sid:1104 cport:-1)::PrepRequestProcessor@649] - Got user-level KeeperException when processing sessionid:0x5166400a8eac0365 type:create cxid:0x5 zxid:0x30000cd7ab txntype:-1 reqpath:n/a Error Path:/rmstore/ZKRMStateRoot/RMAppRoot Error:KeeperErrorCode = NodeExists for /rmstore/ZKRMStateRoot/RMAppRoot
2018-10-28 11:26:46,059 - INFO  [ProcessThread(sid:1104 cport:-1)::PrepRequestProcessor@649] - Got user-level KeeperException when processing sessionid:0x5166400a8eac0365 type:create cxid:0x6 zxid:0x30000cd7ac txntype:-1 reqpath:n/a Error Path:/rmstore/ZKRMStateRoot/RMDTSecretManagerRoot Error:KeeperErrorCode = NodeExists for /rmstore/ZKRMStateRoot/RMDTSecretManagerRoot
2018-10-28 11:26:46,061 - INFO  [ProcessThread(sid:1104 cport:-1)::PrepRequestProcessor@649] - Got user-level KeeperException when processing sessionid:0x5166400a8eac0365 type:create cxid:0x7 zxid:0x30000cd7ad txntype:-1 reqpath:n/a Error Path:/rmstore/ZKRMStateRoot/RMDTSecretManagerRoot/RMDTMasterKeysRoot Error:KeeperErrorCode = NodeExists for /rmstore/ZKRMStateRoot/RMDTSecretManagerRoot/RMDTMasterKeysRoot
2018-10-28 11:26:46,063 - INFO  [ProcessThread(sid:1104 cport:-1)::PrepRequestProcessor@649] - Got user-level KeeperException when processing sessionid:0x5166400a8eac0365 type:create cxid:0x8 zxid:0x30000cd7ae txntype:-1 reqpath:n/a Error Path:/rmstore/ZKRMStateRoot/RMDTSecretManagerRoot/RMDelegationTokensRoot Error:KeeperErrorCode = NodeExists for /rmstore/ZKRMStateRoot/RMDTSecretManagerRoot/RMDelegationTokensRoot
2018-10-28 11:26:46,065 - INFO  [ProcessThread(sid:1104 cport:-1)::PrepRequestProcessor@649] - Got user-level KeeperException when processing sessionid:0x5166400a8eac0365 type:create cxid:0x9 zxid:0x30000cd7af txntype:-1 reqpath:n/a Error Path:/rmstore/ZKRMStateRoot/RMDTSecretManagerRoot/RMDTSequentialNumber Error:KeeperErrorCode = NodeExists for /rmstore/ZKRMStateRoot/RMDTSecretManagerRoot/RMDTSequentialNumber
2018-10-28 11:26:46,067 - INFO  [ProcessThread(sid:1104 cport:-1)::PrepRequestProcessor@649] - Got user-level KeeperException when processing sessionid:0x5166400a8eac0365 type:create cxid:0xa zxid:0x30000cd7b0 txntype:-1 reqpath:n/a Error Path:/rmstore/ZKRMStateRoot/AMRMTokenSecretManagerRoot Error:KeeperErrorCode = NodeExists for /rmstore/ZKRMStateRoot/AMRMTokenSecretManagerRoot
2018-10-28 11:26:46,116 - INFO  [ProcessThread(sid:1104 cport:-1)::PrepRequestProcessor@649] - Got user-level KeeperException when processing sessionid:0x52664009c28503d5 type:create cxid:0x1 zxid:0x30000cd7b4 txntype:-1 reqpath:n/a Error Path:/yarn-leader-election/analytics-hadoop/ActiveStandbyElectorLock Error:KeeperErrorCode = NodeExists for /yarn-leader-election/analytics-hadoop/ActiveStandbyElectorLock

conf1005:

elukey@conf1005:~$ grep "2018-10-28 11:26" /var/log/zookeeper/zookeeper.log
2018-10-28 11:26:45,939 - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream exception
2018-10-28 11:26:45,940 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1033] - Closed socket connection for client /10.64.5.26:59198 which had sessionid 0x5166400a8eac013c
2018-10-28 11:26:46,035 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /10.64.21.110:55026
2018-10-28 11:26:46,037 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@928] - Client attempting to establish new session at /10.64.21.110:55026
2018-10-28 11:26:46,038 - INFO  [CommitProcessor:1105:ZooKeeperServer@673] - Established session 0x5166400a8eac0365 with negotiated timeout 10000 for client /10.64.21.110:55026
2018-10-28 11:26:46,038 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@952] - got auth packet /10.64.21.110:55026
2018-10-28 11:26:46,039 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@986] - auth success /10.64.21.110:55026
2018-10-28 11:26:46,039 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@952] - got auth packet /10.64.21.110:55026
2018-10-28 11:26:46,039 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@986] - auth success /10.64.21.110:55026
2018-10-28 11:26:46,101 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /2620:0:861:104:10:64:5:26:52674
2018-10-28 11:26:46,101 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@921] - Client attempting to renew session 0x50664009c3b1017d at /2620:0:861:104:10:64:5:26:52674
2018-10-28 11:26:46,101 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:Learner@108] - Revalidating client: 0x50664009c3b1017d
2018-10-28 11:26:46,103 - INFO  [QuorumPeer[myid=1105]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@666] - Invalid session 0x50664009c3b1017d for client /2620:0:861:104:10:64:5:26:52674, probably expired
2018-10-28 11:26:46,103 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1033] - Closed socket connection for client /2620:0:861:104:10:64:5:26:52674 which had sessionid 0x50664009c3b1017d

conf1006:

elukey@conf1006:~$ grep "2018-10-28 11:26:" /var/log/zookeeper/zookeeper.log
2018-10-28 11:26:46,110 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /2620:0:861:104:10:64:5:26:51566
2018-10-28 11:26:46,111 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@928] - Client attempting to establish new session at /2620:0:861:104:10:64:5:26:51566
2018-10-28 11:26:46,112 - INFO  [CommitProcessor:1106:ZooKeeperServer@673] - Established session 0x52664009c28503d5 with negotiated timeout 10000 for client /2620:0:861:104:10:64:5:26:51566

My impression is that the errors registered are due to the failover workflow, since they happen right after the RM JVM stalls.

Change 470327 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet/cdh@master] Allow tuning of the Yarn RM's zookeeper session timeout

https://gerrit.wikimedia.org/r/470327

Change 470327 merged by Elukey:
[operations/puppet/cdh@master] Allow tuning of the Yarn RM's zookeeper session timeout

https://gerrit.wikimedia.org/r/470327

Change 470336 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::hadoop::common: raise Yarn zookeeper timeout to 20s

https://gerrit.wikimedia.org/r/470336

Change 470336 merged by Elukey:
[operations/puppet@production] profile::hadoop::common: raise Yarn zookeeper timeout to 20s

https://gerrit.wikimedia.org/r/470336

Mentioned in SAL (#wikimedia-operations) [2018-10-29T08:56:46Z] <elukey> restart yarn on an-master100[1,2] to pick up new zookeeper timeout settings (10s -> 20s) - T206943
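
To double check that the new value is picked up on both masters, something like the following should show 20000 (milliseconds); I'm assuming the cdh module renders it as the yarn.resourcemanager.zk-timeout-ms property in /etc/hadoop/conf/yarn-site.xml:

# Assumption: the Puppet/cdh change renders the RM ZK session timeout as
# yarn.resourcemanager.zk-timeout-ms in yarn-site.xml.
sudo grep -A 1 'yarn.resourcemanager.zk-timeout-ms' /etc/hadoop/conf/yarn-site.xml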

Since yesterday I have been running sar -d -p 2 on an-master1001, and this is interesting:

elukey@an-master1001:~$  sudo grep -rni "org.apache.hadoop.util.JvmPauseMonitor" /var/log/hadoop-yarn/yarn-yarn-resourcemanager-an-master1001.log.2 -A 1 
2018-10-30 01:37:39,872 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC):  pause of approximately 1894ms

01:37:27 AM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
01:37:29 AM       sda      5.50      0.00     69.00     12.55      0.00      0.00      0.00      0.00
01:37:29 AM       sdb      5.50      0.00     69.00     12.55      0.00      0.00      0.00      0.00
01:37:29 AM       md1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
01:37:29 AM       md0      4.50      0.00     68.00     15.11      0.00      0.00      0.00      0.00
01:37:29 AM       md2      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
01:37:29 AM an--master1001--vg-lvol0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00

01:37:29 AM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
01:37:31 AM       sda    136.00      8.00 242443.50   1782.73     66.40    338.88      3.59     48.80
01:37:31 AM       sdb    130.00      0.00 232203.50   1786.18     69.10    352.68      3.75     48.80
01:37:31 AM       md1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
01:37:31 AM       md0      0.50      0.00     32.00     64.00      0.00      0.00      0.00      0.00
01:37:31 AM       md2   1043.00      0.00 2100992.00   2014.37      0.00      0.00      0.00      0.00
01:37:31 AM an--master1001--vg-lvol0   1026.50      0.00 2100992.00   2046.75   1070.98     78.25      0.60     61.80

01:37:31 AM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
01:37:33 AM       sda    225.00      4.50 458752.00   2038.92    135.62    594.00      4.44    100.00
01:37:33 AM       sdb    235.50      0.00 482304.00   2048.00    126.60    557.45      4.25    100.00
01:37:33 AM       md1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
01:37:33 AM       md0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
01:37:33 AM       md2     12.00      0.00     96.00      8.00      0.00      0.00      0.00      0.00
01:37:33 AM an--master1001--vg-lvol0     12.00      0.00     96.00      8.00   1614.60  40974.17     83.33    100.00

01:37:33 AM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
01:37:35 AM       sda    242.50      0.00 496640.00   2048.00    135.46    557.40      4.12    100.00
01:37:35 AM       sdb    229.50      0.00 470016.00   2048.00    145.56    600.20      4.36    100.00
01:37:35 AM       md1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
01:37:35 AM       md0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
01:37:35 AM       md2      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
01:37:35 AM an--master1001--vg-lvol0      0.00      0.00      0.00      0.00   1156.06      0.00      0.00    100.00

01:37:35 AM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
01:37:37 AM       sda    239.00      0.00 487425.50   2039.44    135.18    577.51      4.18    100.00
01:37:37 AM       sdb    221.00      0.00 450561.50   2038.74    163.91    703.40      4.52    100.00
01:37:37 AM       md1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
01:37:37 AM       md0      8.00      0.00     64.00      8.00      0.00      0.00      0.00      0.00
01:37:37 AM       md2      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
01:37:37 AM an--master1001--vg-lvol0      0.00      0.00      0.00      0.00    699.26      0.00      0.00    100.00

01:37:37 AM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
01:37:39 AM       sda    206.00      0.00 415928.00   2019.07     70.82    439.56      4.38     90.20
01:37:39 AM       sdb    229.00      0.00 464896.00   2030.11    117.12    650.58      4.37    100.00
01:37:39 AM       md1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
01:37:39 AM       md0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
01:37:39 AM       md2      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
01:37:39 AM an--master1001--vg-lvol0      0.00      0.00      0.00      0.00    253.48      0.00      0.00    100.00

01:37:39 AM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
01:37:41 AM       sda     45.50     12.50    747.00     16.69      0.01      0.31      0.31      1.40
01:37:41 AM       sdb     45.00      0.00   1955.00     43.44      0.02     66.67      0.36      1.60
01:37:41 AM       md1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
01:37:41 AM       md0      9.00      0.00    408.00     45.33      0.00      0.00      0.00      0.00
01:37:41 AM       md2     74.00      0.00    332.00      4.49      0.00      0.00      0.00      0.00
01:37:41 AM an--master1001--vg-lvol0     73.00      0.00    332.00      4.55      0.07   1415.89      0.22      1.60

01:37:41 AM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
01:37:43 AM       sda      2.50      0.00      7.00      2.80      0.00      0.00      0.00      0.00
01:37:43 AM       sdb      2.50      0.00      7.00      2.80      0.00      0.00      0.00      0.00
01:37:43 AM       md1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
01:37:43 AM       md0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
01:37:43 AM       md2      1.00      0.00      4.00      4.00      0.00      0.00      0.00      0.00
01:37:43 AM an--master1001--vg-lvol0      1.00      0.00      4.00      4.00      0.00      0.00      0.00      0.00

01:37:43 AM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
01:37:45 AM       sda      5.50      0.00    245.00     44.55      0.00      0.36      0.36      0.20
01:37:45 AM       sdb      5.50      0.00    245.00     44.55      0.00      0.00      0.00      0.00
01:37:45 AM       md1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
01:37:45 AM       md0      4.00      0.00     32.00      8.00      0.00      0.00      0.00      0.00
01:37:45 AM       md2     26.50      0.00    208.00      7.85      0.00      0.00      0.00      0.00
01:37:45 AM an--master1001--vg-lvol0     26.00      0.00    208.00      8.00      0.00      0.08      0.08      0.20

So the stall seems to be caused by something doing heavy I/O and probably saturating the disk controllers. I'll try to use pidstat to figure out if we can track it down to a specific process.
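
Concretely, something along these lines, sampling per-process disk I/O every 2 seconds with full command lines so the various JVMs can be told apart:

# Per-process disk I/O every 2 seconds; -l prints the full command line.
sudo pidstat -dl 2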

elukey@an-master1001:~$  sudo grep -rni "JvmPauseMonitor" /var/log/hadoop-yarn/yarn-yarn-resourcemanager-an-master1001.log* -A 1
/var/log/hadoop-yarn/yarn-yarn-resourcemanager-an-master1001.log:388219:2018-11-03 04:46:41,237 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 3241ms
/var/log/hadoop-yarn/yarn-yarn-resourcemanager-an-master1001.log-388220-No GCs detected
04:46:38 AM   UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s iodelay  Command
04:46:43 AM     0         1      0.00      0.82      0.00       0  /sbin/init
04:46:43 AM     0       512      0.00     23.00      0.00      43  jbd2/md0-8
04:46:43 AM     0       720      0.00     12.32      0.00     274  jbd2/dm-0-8
04:46:43 AM   116      1162      0.00  20602.87      0.00       0  /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -Dproc_namenode -Xmx1000m -Dhadoop.log.dir=/var/log/hadoop-hdfs -Dhadoop.log.fil
04:46:43 AM     0     23213      0.00      0.00      0.00     223  kworker/u513:0
04:46:43 AM     0     37067      0.00      0.00      0.00      43  kworker/u513:2
04:46:43 AM     0     85555      0.00      0.82      0.00       0  /usr/sbin/nrpe -c /etc/nagios/nrpe.cfg -d
04:46:43 AM   114    173625      0.00      3.29      0.00       0  /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -Dproc_resourcemanager -Xmx2048m -Dhadoop.log.dir=/var/log/hadoop-yarn -Dyarn.lo
04:46:43 AM     0    217466      0.00      0.82      0.00     285  pidstat -dl 2

The pidstat snapshot is clear: the kworker threads (placeholders for the kernel's I/O activity) show iodelay, together with jbd2 (journaling), exactly when the Hadoop Namenode writes heavily for a certain amount of time.
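
For longer captures, the heavy writers are easier to spot by filtering on the kB_wr/s column (the 6th field of the pidstat -dl output above, counting the AM/PM marker as its own field), e.g. keeping only processes writing more than ~10 MB/s:

# Keep only samples where a process writes more than ~10 MB/s
# ($6+0 forces a numeric comparison, so header lines are skipped).
sudo pidstat -dl 2 | awk '$6+0 > 10000'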

elukey lowered the priority of this task from High to Low.Nov 20 2018, 1:43 PM

We bumped the Xmx/Xms settings of the HDFS Namenode to 12G (from 8G) for unrelated reasons, and I haven't seen any more pauses since then. The increased Zookeeper timeout has also helped. For the moment I don't see any further action items, so I think we are good to close.

elukey set the point value for this task to 5.