
Frequent/Long GC old gen collections for HDFS namenodes on an-master100[1,2]
Closed, Resolved · Public · 5 Estimated Story Points

Description

I noticed while debugging something else that the HDFS namenodes have been showing frequent and long GC pauses, starting at precise moments in time:

  • an-master1002 -> 2019-07-11T22:40 UTC
  • an-master1001 -> 2019-07-12T19:00 UTC

https://grafana.wikimedia.org/d/000000585/hadoop?orgId=1&from=1562792702280&to=1563035215585

There seems to be a clear cut-over from basically no old gen collections to sustained old gen collections. We are clearly missing monitoring for this use case, but first we'd need to figure out what's happening. I tried to restart the namenode on an-master1002 (the current standby) as a test, but the old gen collections didn't stop.
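
For reference, a quick way to confirm this on the hosts themselves would be jstat against the namenode JVM. This is only a sketch: the main class name and the hdfs user are the stock Hadoop defaults, not details taken from this task.

    # Print old gen occupancy (O) and full/old collection count (FGC) every 5s.
    # A steadily growing FGC with O hovering around the occupancy threshold
    # matches the pattern visible in the Grafana graphs.
    NN_PID=$(pgrep -f 'org.apache.hadoop.hdfs.server.namenode.NameNode')
    sudo -u hdfs jstat -gcutil "$NN_PID" 5000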

Event Timeline

elukey triaged this task as High priority.Jul 22 2019, 8:38 AM
elukey created this task.
elukey renamed this task from Frequent/Long GC old gen collections for an-master100[1,2] to Frequent/Long GC old gen collections for HDFS namenodes on an-master100[1,2].Jul 22 2019, 8:38 AM
elukey updated the task description.

Some useful links:

Found the explanation, and added a useful graph to the Hadoop dashboard:

https://grafana.wikimedia.org/d/000000585/hadoop?panelId=87&fullscreen&orgId=1&from=now-30d&to=now

In our GC settings for the CMS old gen we have -XX:CMSInitiatingOccupancyFraction=70, which is more or less the commonly recommended value for this kind of old gen collector. The above graph clearly shows that over time we crossed the 70% mark, so the GC does exactly what it is told to do: run old gen collections until occupancy drops below 70%. This keeps happening over and over.

Given how the Namenodes work, I think it is safe to go up to 80%. When that threshold is crossed as well, we'll probably need to think about expanding the heap size to something like 16G.
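
For illustration, this is the kind of JVM option set involved; the actual values are managed in Puppet, and the snippet below is only a sketch of a stock hadoop-env.sh with the proposed 80% threshold, plus -XX:+UseCMSInitiatingOccupancyOnly so that CMS honours the configured fraction instead of its own heuristics:

    # Sketch only: append CMS old gen settings to the NameNode JVM options.
    export HADOOP_NAMENODE_OPTS="${HADOOP_NAMENODE_OPTS} \
      -XX:+UseConcMarkSweepGC \
      -XX:+UseParNewGC \
      -XX:CMSInitiatingOccupancyFraction=80 \
      -XX:+UseCMSInitiatingOccupancyOnly"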

Change 524730 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_cluster::hadoop::master/standby: tune CMS GC

https://gerrit.wikimedia.org/r/524730

Change 524730 merged by Elukey:
[operations/puppet@production] role::analytics_cluster::hadoop::master/standby: tune CMS GC

https://gerrit.wikimedia.org/r/524730

Mentioned in SAL (#wikimedia-operations) [2019-07-22T09:32:32Z] <elukey> restart hadoop hdfs namenode on an-master1002 to apply new GC settings - T228620

Mentioned in SAL (#wikimedia-operations) [2019-07-22T09:47:58Z] <elukey> failover + restart of Hadoop HDFS namenode on an-master1001 to apply GC settings - T228620
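
Roughly, the restart/failover sequence looks like the following; the service unit name and NameNode service IDs are guesses for illustration, not taken from the SAL entries:

    # 1) On the standby (an-master1002): restart to pick up the new GC settings.
    sudo systemctl restart hadoop-hdfs-namenode
    # 2) Check states and fail the active role over to an-master1002, so that
    #    an-master1001 can be restarted without interrupting HDFS.
    sudo -u hdfs hdfs haadmin -getServiceState an-master1001-eqiad-wmnet
    sudo -u hdfs hdfs haadmin -failover an-master1001-eqiad-wmnet an-master1002-eqiad-wmnet
    # 3) Restart the now-standby namenode on an-master1001.
    sudo systemctl restart hadoop-hdfs-namenode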

The issue seems fixed! I'd say that the last step is to create an alarm for https://grafana.wikimedia.org/d/000000585/hadoop?panelId=87&fullscreen&orgId=1&from=now-30d&to=now: if it crosses 80% we'd like to know, because at that stage the heap will probably not be enough.
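
The metric behind that panel can also be read straight off the namenode's JMX servlet. A sketch of the kind of check such a monitor could run (HTTP port and pool name assume a Hadoop 2.x NameNode running the CMS collector; the real monitor was implemented via Puppet below):

    # Old gen occupancy as a percentage: used / max of the "CMS Old Gen" pool.
    curl -s 'http://localhost:50070/jmx?qry=java.lang:type=MemoryPool,name=CMS%20Old%20Gen' \
      | jq '.beans[0].Usage | .used / .max * 100'
    # Alert when this stays above 80 for a sustained period.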

Change 524783 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::hadoop::master/standby: add CMS old gen monitors

https://gerrit.wikimedia.org/r/524783

Change 524783 merged by Elukey:
[operations/puppet@production] profile::hadoop::master/standby: add CMS old gen monitors

https://gerrit.wikimedia.org/r/524783

elukey changed the point value for this task from 0 to 5.Jul 23 2019, 9:37 AM
elukey moved this task from In Code Review to Done on the Analytics-Kanban board.

Change 526613 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_cluster::hadoop::master|standby: increase NN heap size

https://gerrit.wikimedia.org/r/526613

Change 526613 merged by Elukey:
[operations/puppet@production] role::analytics_cluster::hadoop::master|standby: increase NN heap size

https://gerrit.wikimedia.org/r/526613
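
Again only a sketch of what the heap bump amounts to in a stock hadoop-env.sh; the 16G figure is the placeholder floated earlier in the task, not necessarily the value that was merged:

    # Sketch: raise the NameNode heap; the actual sizing lives in the Puppet role.
    export HADOOP_NAMENODE_OPTS="${HADOOP_NAMENODE_OPTS} -Xms16g -Xmx16g"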