
puppetdb on deployment-puppetdb03 keeps getting OOMKilled
Open, Needs Triage · Public

Description

Following the replacement of puppetdb02 (stretch, old version of puppetdb) with a buster one in T243226: Upgrade puppet in deployment-prep (Puppet agent broken in Beta Cluster), puppetdb on deployment-puppetdb03 keeps getting OOM-killed:

# service puppetdb status
● puppetdb.service - Puppet data warehouse server
   Loaded: loaded (/lib/systemd/system/puppetdb.service; enabled; vendor preset: enabled)
   Active: failed (Result: signal) since Wed 2020-03-18 12:15:11 UTC; 11h ago
     Docs: man:puppetdb(8)
           file:/usr/share/doc/puppetdb/index.markdown
  Process: 20795 ExecStart=/usr/bin/java $JAVA_ARGS -Djava.security.egd=/dev/urandom -XX:OnOutOfMemoryError=kill -9 %p -cp /usr/share/puppetdb/puppetdb.jar clojure.main -m puppetlabs.puppetdb.core service
 Main PID: 20795 (code=killed, signal=KILL)

Mar 18 12:15:11 deployment-puppetdb03 systemd[1]: puppetdb.service: Main process exited, code=killed, status=9/KILL
Mar 18 12:15:11 deployment-puppetdb03 systemd[1]: puppetdb.service: Failed with result 'signal'.
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
Mar 18 12:15:11 deployment-puppetdb03 kernel: [4838203.377081] [  20795]   118 20795  1606110   219063  2199552        0             0 java
Mar 18 12:15:11 deployment-puppetdb03 kernel: [4838203.377143] Out of memory: Kill process 20795 (java) score 429 or sacrifice child
Mar 18 12:15:11 deployment-puppetdb03 kernel: [4838203.378570] Killed process 20795 (java) total-vm:6424440kB, anon-rss:876252kB, file-rss:0kB, shmem-rss:0kB
Mar 18 12:15:11 deployment-puppetdb03 kernel: [4838203.394636] oom_reaper: reaped process 20795 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

You can just log into the box, run service puppetdb start, wait a moment, and it'll come back. But this should not be necessary.
Do we really need to increase the size of this box from small? It shouldn't be holding that much data or be taking up 2 GB of RAM. Though a JVM is involved...
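
A quick way to sanity-check that question on the box itself is to compare the JVM's resident set against the instance's total memory (standard Debian tooling, nothing project-specific):

# resident memory of the puppetdb JVM vs. total memory on the instance
ps -C java -o pid,rss,vsz,args
free -m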

Event Timeline

Krenair created this task. Mar 19 2020, 12:04 AM
Restricted Application added a subscriber: Aklapper. Mar 19 2020, 12:04 AM

This happened again over the weekend. I've restarted it.

And again today.

Krenair claimed this task. Aug 3 2020, 11:57 PM

Replacing with a medium instance, deployment-puppetdb04.

dpifke added a subscriber: dpifke. Aug 18 2020, 9:16 PM

Restarted again: https://sal.toolforge.org/log/aexZA3QBj_Bg1xd3wLvx (I'll reference this task next time, now that I know it exists.)

Adding Restart=always to the systemd unit would also avoid having to wait for someone to restart it manually.
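
A minimal sketch of such an override as a local systemd drop-in (RestartSec here is an extra guess to avoid a tight restart loop; whether to do this by hand or through the puppet module instead is a separate question):

# systemctl edit puppetdb  -->  /etc/systemd/system/puppetdb.service.d/override.conf
[Service]
Restart=always
RestartSec=30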

There appears to be an unused configuration file which already includes this:

https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/puppetdb/templates/puppetdb.service.erb

It doesn't seem to be referenced where the service is defined:

https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/puppetdb/manifests/app.pp

Mentioned in SAL (#wikimedia-releng) [2020-08-25T18:03:37Z] <dpifke> Restarted puppetdb on deployment-puppetdb03 (T248041)

Mentioned in SAL (#wikimedia-releng) [2020-08-27T23:19:47Z] <dpifke> Restarted puppetdb on deployment-puppetdb03 (T248041)

Mentioned in SAL (#wikimedia-operations) [2020-09-08T13:25:49Z] <mateusbs17> Restarted puppetdb on deployment-puppetdb03 (T248041)

Mentioned in SAL (#wikimedia-releng) [2020-09-09T08:43:28Z] <hashar> Restarted puppetdb on deployment-puppetdb03 (T248041)

hashar added a subscriber: hashar. Wed, Sep 9, 8:44 AM

The instance only has 2GB RAM. Maybe the instance flavor can just be changed to get more RAM and then restarted, else we would need to rebuild it from scratch :\

Change 626109 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] deployment-prep: reduce puppetdb memory usage

https://gerrit.wikimedia.org/r/626109

Change 626109 abandoned by Hashar:
[operations/puppet@production] deployment-prep: reduce puppetdb memory usage

Reason:
Looking at the instance, it has:

/etc/postgresql/11/main/tuning.conf:shared_buffers = 600MB

Which is set in Horizon https://horizon.wikimedia.org/project/instances/05f311e9-1ef4-4acd-b467-adb59f6c2f93/

profile::puppetdb::database::shared_buffers: 600MB

https://gerrit.wikimedia.org/r/626109

hashar added a comment. Wed, Sep 9, 8:59 AM

The PostgreSQL tunings are:

/etc/postgresql/11/main/tuning.conf
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
effective_cache_size = 8GB
work_mem = 192MB
wal_buffers = 8MB
shared_buffers = 600MB
max_connections = 120
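
For rough context (assuming the instance still has 2 GB of RAM, and counting the -Xmx4G heap limit mentioned below), these limits add up to far more than the box provides, even though effective_cache_size is only a planner hint and work_mem/maintenance_work_mem are per-operation rather than up-front allocations:

shared_buffers         600 MB   allocated up front by postgres
maintenance_work_mem  1024 MB   per maintenance operation (VACUUM, CREATE INDEX, ...)
work_mem               192 MB   per sort/hash node, per connection (max_connections = 120)
JVM heap limit        4096 MB   -Xmx4G on the puppetdb process
instance RAM          2048 MB   total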

Mentioned in SAL (#wikimedia-releng) [2020-09-09T09:03:48Z] <hashar> deployment-puppetdb03 set profile::puppetdb::jvm_opts: -Xmx256m via Horizon and restarted Puppetdb # T248041

hashar added a comment. Wed, Sep 9, 9:04 AM

The java process is now running with -Xmx256m (it was -Xmx4G); that should help.
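
A quick way to confirm the running process actually picked up the new heap limit after the restart (a generic check, not project tooling):

ps -C java -o args= | grep -o -- '-Xmx[^ ]*'
# expected output after the change: -Xmx256m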

bd808 added a comment. Wed, Sep 9, 5:16 PM

The java process is now running with -Xmx256m (it was -Xmx4G); that should help.

If this tuning by @hashar works, we should probably consider changing the defaults for Cloud VPS deploys via ops/puppet.git:hieradata/cloud.yaml.
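
A sketch of what that default could look like in hieradata/cloud.yaml, mirroring the per-instance override above (whether -Xmx256m is the right project-wide value is exactly what would need confirming first):

# hieradata/cloud.yaml
profile::puppetdb::jvm_opts: '-Xmx256m'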