While debugging something else, I noticed that the HDFS namenodes have been showing frequent and long GC pauses, starting at precise moments in time:
- an-master1002 -> 2019-07-11T22:40 UTC
- an-master1001 -> 2019-07-12T19:00 UTC
https://grafana.wikimedia.org/d/000000585/hadoop?orgId=1&from=1562792702280&to=1563035215585
There seems to be a clear cut from basically no old gen collections to sustained old gen collections. We are clearly missing monitors for this use case, but first we need to figure out what's happening. I tried restarting the namenode on an-master1002 (current standby) as a test, but the old gen collections didn't stop.
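
As a rough idea of what a monitor for this could check, here is a minimal sketch in Python. It assumes we can get the cumulative old gen collection count sampled at fixed intervals (for example from the JVM metrics behind the dashboard above). The function name, the `min_consecutive` threshold and the sample data are all hypothetical, not an existing check.

```python
from typing import List, Optional


def first_sustained_old_gen_activity(
    counts: List[int], min_consecutive: int = 3
) -> Optional[int]:
    """Find where sustained old gen collections start.

    `counts` is a cumulative old gen collection counter sampled at fixed
    intervals (assumption). Returns the index of the first sample of a run
    of at least `min_consecutive` consecutive intervals in which the counter
    increased, or None if there is no such run.
    """
    run_start = None
    run_len = 0
    for i in range(1, len(counts)):
        delta = counts[i] - counts[i - 1]
        if delta > 0:
            if run_len == 0:
                run_start = i
            run_len += 1
            if run_len >= min_consecutive:
                return run_start
        else:
            # No new collections, or the counter went backwards
            # (e.g. after a JVM restart): the run is broken.
            run_len = 0
    return None


# Hypothetical samples: quiet at first, then collections in every interval.
samples = [0, 0, 0, 0, 1, 3, 6, 10]
print(first_sustained_old_gen_activity(samples))
```

A single isolated old gen collection does not trigger this; only a run of consecutive intervals with new collections does, which is the pattern visible in the graphs.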