
Update Zookeeper heap usage configuration and set alarms
Closed, Resolved · Public · 3 Story Points

Description

While Moritz was restarting Zookeeper on conf1001 for JVM security upgrades, I noticed a weird pattern in memory usage over the past 90 days:

It is also nice to see the huge drop in heap usage after Moritz's restart of Zookeeper on conf1001:

Last but not least, I don't see any -Xmx JVM parameter. Does that mean the maximum heap size is being chosen by the JVM itself?
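One way to answer the -Xmx question on a host is to inspect the arguments of the running JVM. The following is a minimal sketch, not the tooling used in this task: the function name `max_heap_bytes` and the sample argument lists are hypothetical, and it assumes that when the flag is repeated the last occurrence is the one that takes effect.

```python
import re

# Matches -Xmx<size> with an optional k/m/g suffix, e.g. -Xmx1g or -Xmx512M.
_XMX = re.compile(r"^-Xmx(\d+)([kKmMgG]?)$")
_UNITS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def max_heap_bytes(jvm_args):
    """Return the explicit heap cap in bytes, or None if no -Xmx flag is present.

    Assumption: if -Xmx appears more than once, the last occurrence wins.
    """
    result = None
    for arg in jvm_args:
        match = _XMX.match(arg)
        if match:
            size, unit = match.groups()
            result = int(size) * _UNITS[unit.lower()]
    return result


# Hypothetical argument lists, for illustration only.
print(max_heap_bytes(["java", "-Xms256m", "-Xmx1g"]))  # explicit cap set
print(max_heap_bytes(["java", "-cp", "zookeeper.jar"]))  # no cap: JVM default applies
```

A `None` result corresponds to the situation described above: no explicit cap, so the JVM falls back to its own default maximum heap size.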

Details

Related Gerrit Patches:
operations/puppet : production · Fix thresholds for Zookeeper Heap usage alarms
operations/puppet : production · Fix Zookeeper's alarm for heap usage
operations/puppet : production · Fix the Zookeeper JVM Heap usage alarm
operations/puppet : production · Add experimental JVM Heap usage alarm to Zookeeper prod instances
operations/puppet : production · Set maximum JVM heap size for Zookeeper
operations/puppet : production · Update the zookeeper module
operations/puppet/zookeeper : master · Add default JAVA_OPTS to the zookeeper server class

Event Timeline

elukey created this task. · Feb 13 2017, 2:29 PM
Restricted Application added a subscriber: Aklapper. · Feb 13 2017, 2:29 PM
elukey triaged this task as Medium priority. · Feb 13 2017, 2:29 PM
elukey updated the task description. · Feb 13 2017, 2:54 PM
elukey updated the task description. · Feb 13 2017, 3:07 PM

Change 337413 had a related patch set uploaded (by Elukey):
Add default JAVA_OPTS to the zookeeper server class

https://gerrit.wikimedia.org/r/337413

Moritz completed the restarts and the Heap usage pattern changed on all the nodes, so this is probably something to expect with the current settings. I'd like to add the -Xmx option to the conf100[123] JVMs anyway, and maybe think about testing another Garbage Collector? The druid100[123] hosts are running another cluster of ZK, they might be good candidates for the tests.

elukey moved this task from Next Up to In Progress on the Analytics-Kanban board.
elukey removed subscribers: MoritzMuehlenhoff, Pchelolo, mobrovac.

Change 337413 merged by Ottomata:
Add default JAVA_OPTS to the zookeeper server class

https://gerrit.wikimedia.org/r/337413

Change 337792 had a related patch set uploaded (by Elukey):
Update the zookeeper module

https://gerrit.wikimedia.org/r/337792

Change 337792 merged by Elukey:
Update the zookeeper module

https://gerrit.wikimedia.org/r/337792

Change 337797 had a related patch set uploaded (by Elukey):
Set maximum JVM heap size for Zookeeper

https://gerrit.wikimedia.org/r/337797

elukey moved this task from Backlog to Analytics Backlog on the User-Elukey board. · Feb 23 2017, 1:07 PM

Change 337797 merged by Elukey:
Set maximum JVM heap size for Zookeeper

https://gerrit.wikimedia.org/r/337797

Change 340719 had a related patch set uploaded (by Elukey):
[operations/puppet] Add experimental JVM Heap usage alarm to Zookeeper prod instances

https://gerrit.wikimedia.org/r/340719

Change 340719 merged by Elukey:
[operations/puppet] Add experimental JVM Heap usage alarm to Zookeeper prod instances

https://gerrit.wikimedia.org/r/340719

elukey moved this task from In Progress to Done on the Analytics-Kanban board. · Mar 2 2017, 12:26 PM

Change 340743 had a related patch set uploaded (by Elukey):
[operations/puppet] Fix the Zookeeper JVM Heap usage alarm

https://gerrit.wikimedia.org/r/340743

Change 340743 merged by Elukey:
[operations/puppet] Fix the Zookeeper JVM Heap usage alarm

https://gerrit.wikimedia.org/r/340743

Nuria renamed this task from Zookeeper heap usage patterns to Update Zookeeper heap usage configuration and set alarms. · Mar 8 2017, 8:01 PM
Nuria closed this task as Resolved.
Nuria set the point value for this task to 3.
Dzahn reopened this task as Open. · Apr 14 2017, 12:29 AM
Dzahn added a subscriber: Dzahn.

re-opening since Icinga has many alerts:

https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=Zookeeper+Node+JVM

Some are CRIT, some UNKNOWN, all for "Zookeeper node JVM Heap usage".

The common issue is that the Graphite checks report either "No valid datapoints found" or "100% of data above the critical threshold", but then end up not being actionable and just get ACKed permanently.

They have a comment saying "wrong metric in the alarm".
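To make the two failure messages concrete, here is a rough sketch of how a percentage-over-threshold check on a Graphite series can end up in each state. It is an illustration under stated assumptions, not the actual Icinga plugin: the function `check_heap_series`, its parameters, and the exact message wording are hypothetical.

```python
def check_heap_series(datapoints, warning, critical, percentage=100):
    """Classify a metric series by the share of datapoints above each threshold.

    datapoints: list of numbers, with None for missing samples.
    percentage: share of valid datapoints (0-100) that must exceed a
                threshold before the corresponding state is raised.
    """
    valid = [value for value in datapoints if value is not None]
    if not valid:
        # A wrong metric path typically produces an empty or all-None series.
        return "UNKNOWN", "No valid datapoints found"

    def percent_over(threshold):
        return 100.0 * sum(1 for value in valid if value > threshold) / len(valid)

    over_critical = percent_over(critical)
    if over_critical >= percentage:
        return "CRITICAL", f"{over_critical:.0f}% of data above the critical threshold"

    over_warning = percent_over(warning)
    if over_warning >= percentage:
        return "WARNING", f"{over_warning:.0f}% of data above the warning threshold"

    return "OK", "Less than the required share of data is above the thresholds"


print(check_heap_series([None, None, None], warning=80, critical=90))
print(check_heap_series([95, 97, 99], warning=80, critical=90))
```

In this sketch a misnamed metric leads to UNKNOWN, while thresholds expressed in the wrong unit or scale for the metric lead to a permanent CRITICAL, which matches the two symptoms listed above.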

Mentioned in SAL (#wikimedia-operations) [2017-04-14T09:43:34Z] <elukey> temporarily set sysctl -w net.ipv4.ip_local_port_range="15000 64000" on mw1306 (jobrunner) as test - (rollback: sysctl -w net.ipv4.ip_local_port_range="32768 60999") - T157968

Change 348206 had a related patch set uploaded (by Elukey):
[operations/puppet@production] Fix Zookeeper's alarm for heap usage

https://gerrit.wikimedia.org/r/348206

Change 348206 merged by Elukey:
[operations/puppet@production] Fix Zookeeper's alarm for heap usage

https://gerrit.wikimedia.org/r/348206

Multiple PEBKACs from my side:

  1. I permanently acked the alarms in Icinga without realizing it; in my ignorance I thought it was only resetting the alarm. This led to the false hope that all the monitors were fine (and to problem 2).
  2. The codfw metrics had eqiad hardcoded in them, which caused the missing datapoints.
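The second problem is the kind of bug that a per-site template avoids. The sketch below is only an illustration: the metric path layout, the function `heap_metric`, and the host name `conf2001` are hypothetical and are not taken from the actual Puppet change.

```python
# Datacenters named in this task.
SITES = ("eqiad", "codfw")


def heap_metric(site, host, template="zookeeper.{site}.{host}.jvm.heap.used"):
    """Build a metric path for a host, deriving the site instead of hardcoding it.

    The template is a made-up example of a Graphite-style path.
    """
    if site not in SITES:
        raise ValueError(f"unknown site: {site}")
    return template.format(site=site, host=host)


# A hardcoded "eqiad" would make the codfw check query a series that does not exist.
print(heap_metric("eqiad", "conf1001"))
print(heap_metric("codfw", "conf2001"))
```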

The main remaining issue now is that the alarms are in CRITICAL state even though theoretically they shouldn't be. Digging a bit more to see if I can find the last issue (another PEBKAC for sure).

Change 348214 had a related patch set uploaded (by Elukey):
[operations/puppet@production] Fix thresholds for Zookeeper Heap usage alarms

https://gerrit.wikimedia.org/r/348214

Change 348214 merged by Elukey:
[operations/puppet@production] Fix thresholds for Zookeeper Heap usage alarms

https://gerrit.wikimedia.org/r/348214

elukey closed this task as Resolved. · Apr 14 2017, 1:45 PM

@Dzahn thanks a lot for the heads up, the issues should be fixed now. My ignorance about Icinga led me to false assumptions, I'll be more careful in the future.

I have now removed the acks from Icinga, and everything is green. Please re-open if anything is missing in your opinion!

Dzahn added a comment. · Apr 14 2017, 5:41 PM

@elukey thank you for fixing :) They all look green now. I'll comment if I see them again.