
Frequent/Long GC old gen collections for HDFS namenodes on an-master100[1,2]
Closed, Resolved · Public · 5 Estimated Story Points

Description

I noticed while debugging something else that the HDFS namenodes have been showing frequent and long GC pauses, starting at precise moments in time:

  • an-master1002 -> 2019-07-11T22:40 UTC
  • an-master1001 -> 2019-07-12T19:00 UTC

https://grafana.wikimedia.org/d/000000585/hadoop?orgId=1&from=1562792702280&to=1563035215585

There seems to be a clear cut-over from basically no old gen collections to sustained old gen collections. We are clearly missing monitoring for this use case, but first we'd need to figure out what's happening. I tried to restart the namenode on an-master1002 (the current standby) as a test, but the old gen collections didn't stop.
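
For reference, a quick way to confirm this on the hosts themselves would be jstat against the namenode JVM. This is only a sketch: the main class name and the hdfs user are the stock Hadoop defaults, not details taken from this task.

    # Print old gen occupancy (O) and full/old collection count (FGC) every 5s.
    # A steadily growing FGC with O hovering around the occupancy threshold
    # matches the pattern visible in the Grafana graphs.
    NN_PID=$(pgrep -f 'org.apache.hadoop.hdfs.server.namenode.NameNode')
    sudo -u hdfs jstat -gcutil "$NN_PID" 5000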

Event Timeline

elukey triaged this task as High priority.Jul 22 2019, 8:38 AM
elukey created this task.
elukey renamed this task from Frequent/Long GC old gen collections for an-master100[1,2] to Frequent/Long GC old gen collections for HDFS namenodes on an-master100[1,2].Jul 22 2019, 8:38 AM
elukey updated the task description.

Some useful links:

Found the explanation, and added a useful graph to the Hadoop dashboard:

https://grafana.wikimedia.org/d/000000585/hadoop?panelId=87&fullscreen&orgId=1&from=now-30d&to=now

In our GC settings for the CMS old gen we have -XX:CMSInitiatingOccupancyFraction=70, which is more or less the commonly recommended value for this kind of old gen collector. The above graph clearly shows that over time we crossed the 70% mark, so the GC does exactly what it is told to do: run old gen collections until occupancy drops below 70%. This keeps happening over and over.

Given how the Namenodes work, I think it is safe to go up to 80%. When that threshold is crossed as well, we'll probably need to think about expanding the heap size to something like 16G.
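
For illustration, this is the kind of JVM option set involved; the actual values are managed in Puppet, and the snippet below is only a sketch of a stock hadoop-env.sh with the proposed 80% threshold, plus -XX:+UseCMSInitiatingOccupancyOnly so that CMS honours the configured fraction instead of its own heuristics:

    # Sketch only: append CMS old gen settings to the NameNode JVM options.
    export HADOOP_NAMENODE_OPTS="${HADOOP_NAMENODE_OPTS} \
      -XX:+UseConcMarkSweepGC \
      -XX:+UseParNewGC \
      -XX:CMSInitiatingOccupancyFraction=80 \
      -XX:+UseCMSInitiatingOccupancyOnly"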

Change 524730 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_cluster::hadoop::master/standby: tune CMS GC

https://gerrit.wikimedia.org/r/524730

Change 524730 merged by Elukey:
[operations/puppet@production] role::analytics_cluster::hadoop::master/standby: tune CMS GC

https://gerrit.wikimedia.org/r/524730

Mentioned in SAL (#wikimedia-operations) [2019-07-22T09:32:32Z] <elukey> restart hadoop hdfs namenode on an-master1002 to apply new GC settings - T228620

Mentioned in SAL (#wikimedia-operations) [2019-07-22T09:47:58Z] <elukey> failover + restart of Hadoop HDFS namenode on an-master1001 to apply GC settings - T228620
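
Roughly, the restart/failover sequence looks like the following; the service unit name and NameNode service IDs are guesses for illustration, not taken from the SAL entries:

    # 1) On the standby (an-master1002): restart to pick up the new GC settings.
    sudo systemctl restart hadoop-hdfs-namenode
    # 2) Check states and fail the active role over to an-master1002, so that
    #    an-master1001 can be restarted without interrupting HDFS.
    sudo -u hdfs hdfs haadmin -getServiceState an-master1001-eqiad-wmnet
    sudo -u hdfs hdfs haadmin -failover an-master1001-eqiad-wmnet an-master1002-eqiad-wmnet
    # 3) Restart the now-standby namenode on an-master1001.
    sudo systemctl restart hadoop-hdfs-namenode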

The issue seems fixed! I'd say that the last step is to create an alarm for https://grafana.wikimedia.org/d/000000585/hadoop?panelId=87&fullscreen&orgId=1&from=now-30d&to=now: if it crosses 80% we'd like to know, because at that stage the heap will probably not be enough.
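
The metric behind that panel can also be read straight off the namenode's JMX servlet. A sketch of the kind of check such a monitor could run (HTTP port and pool name assume a Hadoop 2.x NameNode running the CMS collector; the real monitor was implemented via Puppet below):

    # Old gen occupancy as a percentage: used / max of the "CMS Old Gen" pool.
    curl -s 'http://localhost:50070/jmx?qry=java.lang:type=MemoryPool,name=CMS%20Old%20Gen' \
      | jq '.beans[0].Usage | .used / .max * 100'
    # Alert when this stays above 80 for a sustained period.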

Change 524783 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::hadoop::master/standby: add CMS old gen monitors

https://gerrit.wikimedia.org/r/524783

Change 524783 merged by Elukey:
[operations/puppet@production] profile::hadoop::master/standby: add CMS old gen monitors

https://gerrit.wikimedia.org/r/524783

elukey changed the point value for this task from 0 to 5.Jul 23 2019, 9:37 AM
elukey moved this task from In Code Review to Done on the Analytics-Kanban board.

Change 526613 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_cluster::hadoop::master|standby: increase NN heap size

https://gerrit.wikimedia.org/r/526613

Change 526613 merged by Elukey:
[operations/puppet@production] role::analytics_cluster::hadoop::master|standby: increase NN heap size

https://gerrit.wikimedia.org/r/526613
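
Again only a sketch of what the heap bump amounts to in a stock hadoop-env.sh; the 16G figure is the placeholder floated earlier in the task, not necessarily the value that was merged:

    # Sketch: raise the NameNode heap; the actual sizing lives in the Puppet role.
    export HADOOP_NAMENODE_OPTS="${HADOOP_NAMENODE_OPTS} -Xms16g -Xmx16g"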