Page MenuHomePhabricator

toolsbeta: puppet failing on multiple hosts
Closed, ResolvedPublic

Description

There's many alerts triggering for toolsbeta complaining that puppet was unable to get resources from https://prometheus-alerts.wmcloud.org/?q=%40receiver%21%3Dblackhole:

 Copy link to this group
Silence this group
alertname: PuppetAgentNoResources
41
summary: No Puppet resources found on instance toolsbeta-sgewebgen-10-2 on project toolsbeta
a few seconds agoinstance: toolsbeta-sgewebgen-10-2
summary: No Puppet resources found on instance toolsbeta-docker-imagebuilder-01 on project toolsbeta
a few seconds agoinstance: toolsbeta-docker-imagebuilder-01
summary: No Puppet resources found on instance toolsbeta-test-k8s-etcd-18 on project toolsbeta
a few seconds agoinstance: toolsbeta-test-k8s-etcd-18
summary: No Puppet resources found on instance toolsbeta-sgegrid-master on project toolsbeta
3 minutes agoinstance: toolsbeta-sgegrid-master
summary: No Puppet resources found on instance toolsbeta-test-k8s-etcd-17 on project toolsbeta
3 minutes agoinstance: toolsbeta-test-k8s-etcd-17
5 of 41

It seems puppetdb is not replying:

root@toolsbeta-docker-registry-02:~# run-puppet-agent
Warning: Unable to fetch my node definition, but the agent run will continue:
Warning: Error 500 on SERVER: Server Error: Could not retrieve facts for toolsbeta-docker-registry-02.toolsbeta.eqiad1.wikimedia.cloud: Failed to find facts from PuppetDB at puppet:8140: Failed to execute '/pdb/query/v4/nodes/toolsbeta-docker-registry-02.toolsbeta.eqiad1.wikimedia.cloud/facts' on at least 1 of the following 'server_urls': https://toolsbeta-puppetdb-02.toolsbeta.eqiad.wmflabs
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Failed to execute '/pdb/cmd/v1?checksum=8b3b756527a874121998cd45f6cb036cb087eeab&version=5&certname=toolsbeta-docker-registry-02.toolsbeta.eqiad1.wikimedia.cloud&command=replace_facts&producer-timestamp=2022-12-09T08:33:33.337Z' on at least 1 of the following 'server_urls': https://toolsbeta-puppetdb-02.toolsbeta.eqiad.wmflabs
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run

looking

Event Timeline

dcaro changed the task status from Open to In Progress.
dcaro triaged this task as High priority.
dcaro moved this task from To refine to Doing on the User-dcaro board.

Puppetdb had failed, though there's not a lot in the logs:

root@toolsbeta-puppetdb-02:~# journalctl -u puppetdb
-- Logs begin at Thu 2022-12-08 06:36:26 UTC, end at Fri 2022-12-09 08:36:39 UTC. --
Dec 09 07:19:11 toolsbeta-puppetdb-02 systemd[1]: puppetdb.service: Main process exited, code=killed, status=9/KILL
Dec 09 07:19:11 toolsbeta-puppetdb-02 systemd[1]: puppetdb.service: Failed with result 'signal'.

Manually started it and now it works:

root@toolsbeta-puppetdb-02:~# systemctl start puppetdb

root@toolsbeta-puppetdb-02:~# journalctl -u puppetdb -f
-- Logs begin at Thu 2022-12-08 09:55:26 UTC. --
...
Dec 09 08:38:03 toolsbeta-puppetdb-02 java[7118]: 08:38:03.283 [main] DEBUG puppetlabs.trapperkeeper.bootstrap - Loading bootstrap config from classpath: 'jar:file:/usr/share/puppetdb/puppetdb.jar!/bootstrap.cfg'
Dec 09 08:38:21 toolsbeta-puppetdb-02 java[7118]: WARNING: An illegal reflective access operation has occurred
Dec 09 08:38:21 toolsbeta-puppetdb-02 java[7118]: WARNING: Illegal reflective access by clojure.lang.InjectedInvoker/0x00000008400c3440 (file:/usr/share/java/clojure.jar) to method sun.nio.ch.ChannelInputStream.close()
Dec 09 08:38:21 toolsbeta-puppetdb-02 java[7118]: WARNING: Please consider reporting this to the maintainers of clojure.lang.InjectedInvoker/0x00000008400c3440
Dec 09 08:38:21 toolsbeta-puppetdb-02 java[7118]: WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
Dec 09 08:38:21 toolsbeta-puppetdb-02 java[7118]: WARNING: All illegal access operations will be denied in a future release

It was killed by the oom:

root@toolsbeta-puppetdb-02:~# dmesg -T
...
[Fri Dec  9 07:18:59 2022] VM Periodic Tas invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
...
[Fri Dec  9 07:18:59 2022] Out of memory: Kill process 698 (java) score 356 or sacrifice child
[Fri Dec  9 07:18:59 2022] Killed process 698 (java) total-vm:6428804kB, anon-rss:726284kB, file-rss:0kB, shmem-rss:0kB

Looking if puppet tried to start it again or not

It did not, because it depends on itself xd

root@toolsbeta-puppetdb-02:~# systemctl stop puppetdb

root@toolsbeta-puppetdb-02:~# run-puppet-agent
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Failed to execute '/pdb/cmd/v1?checksum=1265a4910d45c141600a1934f07dc34109960e66&version=5&certname=toolsbeta-puppetdb-02.toolsbeta.eqiad.wmflabs&command=replace_facts&producer-timestamp=2022-12-09T08:43:36.763Z' on at least 1 of the following 'server_urls': https://toolsbeta-puppetdb-02.toolsbeta.eqiad.wmflabs
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run

Mentioned in SAL (#wikimedia-cloud) [2022-12-09T08:45:44Z] <dcaro> manually started puppetdb after killed by oom (T324812)

Change 866552 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] puppetdb: restart through systemd if service dies

https://gerrit.wikimedia.org/r/866552

The above patch should avoid this from hapenning again.

Change 866552 merged by David Caro:

[operations/puppet@production] puppetdb: restart through systemd if service dies

https://gerrit.wikimedia.org/r/866552

Deployed and tested, OOM should not break puppetdb again :)

dcaro moved this task from Doing to Done on the User-dcaro board.