PuppetDB misbehaving on 2017-07-15
Open, Low, Public

Description

@elukey found puppetdb misbehaving today, with puppet runs failing throughout the fleet. The first non-INFO level log entry for this day is:

2017-07-15 08:32:38,151 WARN  [o.e.j.s.HttpChannel] /v3/commands?checksum=6fec653112b6b1faf7a57b53d1876afbb1961be2
java.lang.OutOfMemoryError: Java heap space

followed by a ton of

java.lang.OutOfMemoryError: Java heap space

PuppetDB had been running since Jun 29 and at the time of debugging had a VIRT of 10052304 KiB (~10G) and an RSS of 4568496 KiB (>4G). The default max heap size is ~3.9G:

java -XX:+PrintFlagsFinal -version |grep MaxHeapSize
    uintx MaxHeapSize                              := 4208984064                          {product}

We should increase the JVM max heap size via -Xmx.
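A minimal sketch of what that could look like, assuming the heap size ends up in the JAVA_ARGS the service is started with (the exact file/template is whatever our puppet module renders):

# Illustrative only: an explicit 6G heap instead of the ~3.9G ergonomic default,
# plus a heap dump on OOM for post-mortem analysis.
JAVA_ARGS="-Xmx6G -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/puppetdb/puppetdb-oom.hprof"

# Verify the effective limit afterwards:
java -Xmx6G -XX:+PrintFlagsFinal -version | grep MaxHeapSize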

Despite configuration to have the JVM killed on an OutOfMemoryError, this did not work due to broken quoting of the hook command:

Jul 15 08:28:21 nitrogen puppetdb[576]: #   Executing /bin/sh -c ""kill -9 puppetdb" -Xmx -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/
Jul 15 08:28:21 nitrogen puppetdb[576]: sh: 1: kill -9 puppetdb: not found
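For reference, a sketch of what the hook would have to look like to work: the whole command quoted as one JVM option value, using the JVM's %p placeholder (expanded to its own pid) instead of the literal string "puppetdb" that /bin/sh cannot resolve to a process:

# Sketch only; %p is substituted by the JVM with its process id.
-XX:OnOutOfMemoryError="kill -9 %p"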

It seems Java 8u92 added two new settings for this: -XX:+ExitOnOutOfMemoryError and -XX:+CrashOnOutOfMemoryError. We should investigate these as well, since we are running 8u131.
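A sketch of the relevant arguments, should we go that route (both flags exist from 8u92 onwards, so they would be usable on 8u131; only one of the two is needed):

# ExitOnOutOfMemoryError: the JVM exits on the first OOM, so a systemd
# Restart= policy can bring the service back up cleanly.
# CrashOnOutOfMemoryError: the JVM aborts and writes a crash report instead.
-XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/puppetdb/puppetdb-oom.hprof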

I've restarted puppetdb manually and CPU and memory usage dropped back to normal.

Related Objects

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Jul 15 2017, 10:26 AM
akosiaris triaged this task as High priority. Jul 15 2017, 10:27 AM
akosiaris added a subscriber: Joe.

Mentioned in SAL (#wikimedia-operations) [2017-07-15T10:42:05Z] <elukey> puppetdb restarted on nitrogen, puppet agents re-enabled - T170740

Change 366229 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] puppetdb: Bump Java Heap max size to 6GB

https://gerrit.wikimedia.org/r/366229

It would be great in my opinion to monitor JVM metrics before and after the change that Alex proposed:

elukey@nitrogen:~$ curl localhost:8080/v4/metrics/mbean/java.lang:type=Memory
{
  "Verbose" : false,
  "ObjectPendingFinalizationCount" : 0,
  "HeapMemoryUsage" : {
    "committed" : 4136108032,
    "init" : 264241152,
    "max" : 4136108032,
    "used" : 3385310192
  },
  "NonHeapMemoryUsage" : {
    "committed" : 186138624,
    "init" : 2555904,
    "max" : -1,
    "used" : 137232872
  },
  "ObjectName" : "java.lang:type=Memory"
}

elukey@nitrogen:~$ curl localhost:8080/v4/metrics/mbean/java.lang:type=Threading
{
  "ThreadCount" : 74,
  "AllThreadIds" : "[J@67d6c19e",
  "ObjectMonitorUsageSupported" : true,
  "PeakThreadCount" : 91,
  "ThreadAllocatedMemoryEnabled" : true,
  "ThreadContentionMonitoringSupported" : true,
  "DaemonThreadCount" : 32,
  "CurrentThreadCpuTime" : 162647820,
  "ObjectName" : "java.lang:type=Threading",
  "ThreadCpuTimeEnabled" : true,
  "ThreadCpuTimeSupported" : true,
  "SynchronizerUsageSupported" : true,
  "TotalStartedThreadCount" : 18075,
  "ThreadAllocatedMemorySupported" : true,
  "ThreadContentionMonitoringEnabled" : false,
  "CurrentThreadUserTime" : 140000000,
  "CurrentThreadCpuTimeSupported" : true
}

Not sure if this is already tracked in another task; if so, apologies :)

The heap used fluctuates between 3.3 and 3.8 GB; the GC is probably constantly reclaiming memory in small runs. At the time of the outage the logs also show java.lang.OutOfMemoryError: GC overhead limit exceeded, so this outage was probably caused by HeapMemoryUsage crossing a critical threshold that leads to thrashing?

Checked the patch that Alex wrote and realized that we don't set the heap_size parameter, ending up with an empty -Xmx:

puppetdb 30718 54.4 27.6 9986036 4547056 ?     Ssl  Jul15 3858:51 /usr/bin/java -XX:OnOutOfMemoryError="kill -9 puppetdb" -Xmx -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/puppetdb/puppetdb-oom.hprof -Djava.security.egd=file:/dev/urandom -cp /usr/share/puppetdb/puppetdb.jar clojure.main -m com.puppetlabs.puppetdb.core services -c /etc/puppetdb/conf.d

It shouldn't be an issue but better to remove this inconsistency :)

So we had a small hiccup today in which puppetdb responded with 28 503s between 16:20:13 and 16:20:39 UTC; of those, 17 were POSTs to update the hosts' facts, and we had a bit of failure spam on IRC. It recovered by itself.

After some investigation with @BBlack and @fgiunchedi we found that:

  1. The JAVA_ARGS variable that is used in ExecStart in the systemd unit at modules/puppetmaster/templates/initscripts/puppetdb.systemd.erb seems to be expanded as a single command-line argument, hence basically failing to set all those options
  2. The heap_size variable doesn't seem to be set anywhere in the puppet tree, and it generates an invalid empty -Xmx option

Under those assumptions it is perfectly possible that PuppetDB had reached a memory limit: it seems the default inherited limit is 4GB and it's using slightly more than 4GB of RES. Also, without those options in place we don't get the out-of-memory restart and heap dump, and we don't have GC logging set up.
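For completeness, a sketch of the HotSpot 8 GC logging flags we could add once the JAVA_ARGS handling is fixed (path and rotation sizes are placeholders):

# Illustrative GC logging setup with rotation so the log does not grow unbounded.
-Xloggc:/var/log/puppetdb/gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=20M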

It might be worth looking at the EnvironmentFile systemd option too.
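A minimal sketch of how that could look (paths illustrative); note that the unquoted $JAVA_ARGS form is what makes systemd word-split the variable into separate arguments, whereas ${JAVA_ARGS} passes it as a single one, which is problem (1) above:

# /etc/default/puppetdb (illustrative location)
JAVA_ARGS="-Xmx6G -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/puppetdb/puppetdb-oom.hprof"

# puppetdb.service fragment (sketch)
[Service]
EnvironmentFile=/etc/default/puppetdb
ExecStart=/usr/bin/java $JAVA_ARGS -Djava.security.egd=file:/dev/urandom -cp /usr/share/puppetdb/puppetdb.jar clojure.main -m com.puppetlabs.puppetdb.core services -c /etc/puppetdb/conf.d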

So, things fell over again with a ton of puppetfail spam. As a stopgap, I've done the following:

  1. Disabled the agent on nitrogen
  2. Edited the puppetdb.service systemd unit file manually, so that the relevant part is now literal and avoids problem (1) in the previous comment (and left off the previous oomkill command, because the quoting was causing issues there), and also added a real 6G heap limit, so the ExecStart= now temporarily reads:
ExecStart=/usr/bin/java -Xmx6G -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/puppetdb/puppetdb-oom.hprof -Djava.security.egd=file:/dev/urandom -cp /usr/share/puppetdb/puppetdb.jar clojure.main -m com.puppetlabs.puppetdb.core services -c /etc/puppetdb/conf.d
  3. Restarted the puppetdb service with this new unit file, and it seems to be working so far...

Mentioned in SAL (#wikimedia-operations) [2017-07-27T10:14:13Z] <_joe_> restarting puppetdb on nihal, T170740

Change 366229 merged by Giuseppe Lavagetto:
[operations/puppet@production] puppetdb: Bump Java Heap max size to 6GB

https://gerrit.wikimedia.org/r/366229

akosiaris lowered the priority of this task from High to Low. Aug 17 2017, 9:10 AM

I am lowering the priority on this one. The one interesting thing left to do is to create a parser for the /metrics endpoint @elukey pointed out above and export those metrics to Prometheus/Graphite.
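A minimal sketch of such an exporter, assuming jq is available and a reachable Graphite plaintext-protocol endpoint (host, port and metric prefix are placeholders):

# Poll the PuppetDB JMX metrics endpoint and feed heap usage to Graphite.
used=$(curl -s localhost:8080/v4/metrics/mbean/java.lang:type=Memory | jq '.HeapMemoryUsage.used')
echo "puppetdb.jvm.heap.used ${used} $(date +%s)" | nc -q1 graphite.example.org 2003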

Mentioned in SAL (#wikimedia-operations) [2018-01-18T11:06:27Z] <volans> disabled puppet on tegmen to test impact on puppetdb - T170740

The puppetdb grafana dashboard (and its related monitoring config for nitrogen/nihal) were added in https://phabricator.wikimedia.org/T184796

We had OOMs also with puppet disabled on tegmen, so that's not the culprit.