puppetdb on deployment-puppetdb03 keeps getting OOMKilled
Closed, Resolved · Public

Description

Following the replacement of puppetdb02 (stretch, running an old version of puppetdb) with a buster instance in T243226: Upgrade puppet in deployment-prep (Puppet agent broken in Beta Cluster), puppetdb on the new instance keeps getting OOM-killed:

# service puppetdb status
● puppetdb.service - Puppet data warehouse server
   Loaded: loaded (/lib/systemd/system/puppetdb.service; enabled; vendor preset: enabled)
   Active: failed (Result: signal) since Wed 2020-03-18 12:15:11 UTC; 11h ago
     Docs: man:puppetdb(8)
           file:/usr/share/doc/puppetdb/index.markdown
  Process: 20795 ExecStart=/usr/bin/java $JAVA_ARGS -Djava.security.egd=/dev/urandom -XX:OnOutOfMemoryError=kill -9 %p -cp /usr/share/puppetdb/puppetdb.jar clojure.main -m puppetlabs.puppetdb.core service
 Main PID: 20795 (code=killed, signal=KILL)

Mar 18 12:15:11 deployment-puppetdb03 systemd[1]: puppetdb.service: Main process exited, code=killed, status=9/KILL
Mar 18 12:15:11 deployment-puppetdb03 systemd[1]: puppetdb.service: Failed with result 'signal'.
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
Mar 18 12:15:11 deployment-puppetdb03 kernel: [4838203.377081] [  20795]   118 20795  1606110   219063  2199552        0             0 java
Mar 18 12:15:11 deployment-puppetdb03 kernel: [4838203.377143] Out of memory: Kill process 20795 (java) score 429 or sacrifice child
Mar 18 12:15:11 deployment-puppetdb03 kernel: [4838203.378570] Killed process 20795 (java) total-vm:6424440kB, anon-rss:876252kB, file-rss:0kB, shmem-rss:0kB
Mar 18 12:15:11 deployment-puppetdb03 kernel: [4838203.394636] oom_reaper: reaped process 20795 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

You can just log into the box, run service puppetdb start, wait a moment, and it'll come back, but this should not be necessary.
Do we really need to increase this box from the small flavor? It shouldn't be holding that much data or need 2GB of RAM. Though a JVM is involved...

Event Timeline

This happened again over the weekend. I've restarted it.

Replacing it with a medium instance, deployment-puppetdb04.

Restarted again: https://sal.toolforge.org/log/aexZA3QBj_Bg1xd3wLvx (I'll reference this task next time, now that I know it exists.)

Adding Restart=always to the systemd unit would also avoid having to wait for someone to manually restart it.
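
For reference, a minimal sketch of such an override as a systemd drop-in (the RestartSec value is illustrative, not taken from the puppet module):

# /etc/systemd/system/puppetdb.service.d/override.conf (created e.g. via systemctl edit puppetdb)
[Service]
Restart=always
RestartSec=30

followed by systemctl daemon-reload and a restart of the unit. Systemd's default start-rate limiting (StartLimitBurst=5 within StartLimitIntervalSec=10s) would still stop a unit that dies repeatedly within seconds, but that is unlikely here since the OOM kills are hours apart.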

There appears to be an unused configuration file which includes this:

https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/puppetdb/templates/puppetdb.service.erb

It doesn't seem to be referenced where the service is defined:

https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/puppetdb/manifests/app.pp

Mentioned in SAL (#wikimedia-releng) [2020-08-25T18:03:37Z] <dpifke> Restarted puppetdb on deployment-puppetdb03 (T248041)

Mentioned in SAL (#wikimedia-releng) [2020-08-27T23:19:47Z] <dpifke> Restarted puppetdb on deployment-puppetdb03 (T248041)

Mentioned in SAL (#wikimedia-operations) [2020-09-08T13:25:49Z] <mateusbs17> Restarted puppetdb on deployment-puppetdb03 (T248041)

Mentioned in SAL (#wikimedia-releng) [2020-09-09T08:43:28Z] <hashar> Restarted puppetdb on deployment-puppetdb03 (T248041)

The instance only has 2GB of RAM. Maybe the instance flavor can just be changed to get more RAM and the instance restarted; otherwise we would need to rebuild it from scratch :\

Change 626109 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] deployment-prep: reduce puppetdb memory usage

https://gerrit.wikimedia.org/r/626109

Change 626109 abandoned by Hashar:
[operations/puppet@production] deployment-prep: reduce puppetdb memory usage

Reason:
Looking at the instance, it has:

/etc/postgresql/11/main/tuning.conf:shared_buffers = 600MB

Which is set in Horizon https://horizon.wikimedia.org/project/instances/05f311e9-1ef4-4acd-b467-adb59f6c2f93/

profile::puppetdb::database::shared_buffers: 600MB

https://gerrit.wikimedia.org/r/626109

The postgresql tunings are:

/etc/postgresql/11/main/tuning.conf
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
effective_cache_size = 8GB
work_mem = 192MB
wal_buffers = 8MB
shared_buffers = 600MB
max_connections = 120

Mentioned in SAL (#wikimedia-releng) [2020-09-09T09:03:48Z] <hashar> deployment-puppetdb03 set profile::puppetdb::jvm_opts: -Xmx256m via Horizon and restarted Puppetdb # T248041

The java process is now running with -Xmx256m (it was -Xmx4G); that should help.

If this tuning by @hashar works, we should probably consider changing the defaults for Cloud VPS deploys via ops/puppet.git:hieradata/cloud.yaml.
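
A minimal sketch of what such a default could look like in hieradata/cloud.yaml, reusing the profile::puppetdb::jvm_opts key and value set via Horizon above (whether this key is actually honoured when set at the cloud.yaml level is an assumption):

# hieradata/cloud.yaml (sketch)
# Cap the puppetdb JVM heap on Cloud VPS instances, which have far less RAM
# than production puppetdb hosts.
profile::puppetdb::jvm_opts: '-Xmx256m'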

alex@alex-laptop:~$ ssh deployment-puppetdb03
Linux deployment-puppetdb03 4.19.0-11-amd64 #1 SMP Debian 4.19.146-1 (2020-09-17) x86_64
Debian GNU/Linux 10 (buster)
deployment-puppetdb03 is a PuppetDB server (puppetmaster::puppetdb (postgres master))
The last Puppet run was at Thu Nov 26 20:45:39 UTC 2020 (1890 minutes ago). 
Last puppet commit: 
Last login: Sun Jul 26 10:45:20 2020 from 172.16.1.136
krenair@deployment-puppetdb03:~$ sudo service puppetdb  status
● puppetdb.service - Puppet data warehouse server
   Loaded: loaded (/lib/systemd/system/puppetdb.service; enabled; vendor preset: enabled)
   Active: failed (Result: signal) since Thu 2020-11-26 21:15:48 UTC; 1 day 7h ago
     Docs: man:puppetdb(8)
           file:/usr/share/doc/puppetdb/index.markdown
 Main PID: 519 (code=killed, signal=KILL)

Nov 26 21:15:48 deployment-puppetdb03 systemd[1]: puppetdb.service: Main process exited, code=killed, status=9/KILL
Nov 26 21:15:48 deployment-puppetdb03 systemd[1]: puppetdb.service: Failed with result 'signal'.
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
krenair@deployment-puppetdb03:~$ sudo dmesg | grep -i oom
[4768043.879807] postgres invoked oom-killer: gfp_mask=0x6280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null), order=0, oom_score_adj=0
[4768043.879849]  oom_kill_process.cold.30+0xb/0x1cf
[4768043.879850]  ? oom_badness+0x23/0x140
[4768043.879909] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[4768043.890672] oom_reaper: reaped process 519 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

Based on dmesg, the OOM hasn't happened since the beginning of October, so that is an improvement. Apparently some request to postgres required more memory than was available, which triggered the OOM killer.
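
Since raw dmesg timestamps are seconds since boot, something like the following (a sketch, not the exact commands run above) shows wall-clock times for the kernel OOM messages:

sudo dmesg -T | grep -i 'out of memory'       # human-readable timestamps (best effort, can drift)
sudo journalctl -k | grep -i 'out of memory'  # same messages from the journal (current boot only)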

The deployment-puppetdb03 instance has just 2 GB of memory; I guess we can get it resized to a slightly larger flavor? Do note there is also a deployment-puppetdb04 instance, which comes with 4 GB of memory. I have no idea what those instances are for, though.

Mentioned in SAL (#wikimedia-releng) [2020-12-14T08:34:12Z] <hashar> deployment-prep restart puppetdb process on deployment-puppetdb03 # T248041

Mentioned in SAL (#wikimedia-releng) [2020-12-17T18:32:43Z] <dpifke> Restart puppetdb on deployment-puppetdb03 (T248041)

The deployment-puppetdb03 instance has just 2 GB of memory, I guess we can get it resized to a slightly larger flavor?

Requested via T270420 since that is apparently possible :]

hashar claimed this task.

The instance went from 2 GB to 4 GB:

deployment-puppetdb03_more_mem.png (476×915 px, 38 KB)

And as a side effect, it does less I/O per second:

deployment-puppetdb03_iops.png (469×900 px, 47 KB)