
[cloudvps] 2024-05-01 cloudinfra puppetserver got out of space
Closed, DuplicatePublic

Description

The server ran out of space, making all client runs fail:

root@clouddb-services-puppetserver-1:~# run-puppet-agent
Info: Using environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: No space left on device - /var/lib/puppetserver/server_data/facts/clouddb-services-puppetserver-1.clouddb-services.eqiad1.wikimedia.cloud.json20240601-3260158-zrafhm.lock
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run

Event Timeline

dcaro triaged this task as High priority.Jun 1 2024, 1:35 PM
dcaro created this task.

Root got out of space:

root@cloudinfra-cloudvps-puppetserver-1:~# df -h
Filesystem      Size  Used Avail Use% Mounted on
udev             17G     0   17G   0% /dev
tmpfs           3.4G  592K  3.4G   1% /run
/dev/sdb1        20G   19G  124K 100% /
tmpfs            17G     0   17G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
/dev/sdb15      124M   12M  113M  10% /boot/efi
/dev/sda        9.8G  4.8G  4.6G  51% /srv
tmpfs           3.4G     0  3.4G   0% /run/user/0
tmpfs           3.4G     0  3.4G   0% /run/user/25603

Most of the usage comes from puppetserver reports:

root@cloudinfra-cloudvps-puppetserver-1:~# du -hs /var/lib/puppetserver/* | sort -h
4.0K    /var/lib/puppetserver/bucket
4.0K    /var/lib/puppetserver/facts.d
4.0K    /var/lib/puppetserver/jruby-gems
4.0K    /var/lib/puppetserver/lib
4.0K    /var/lib/puppetserver/locales
4.0K    /var/lib/puppetserver/preview
4.0K    /var/lib/puppetserver/state
4.0K    /var/lib/puppetserver/yaml
199M    /var/lib/puppetserver/server_data
11G     /var/lib/puppetserver/reports

There's one directory per client there, with a couple taking 10x the size of the rest:

root@cloudinfra-cloudvps-puppetserver-1:~# du -hs /var/lib/puppetserver/reports/* | sort -h
...
57M     /var/lib/puppetserver/reports/wikistats-bookworm.wikistats.eqiad1.wikimedia.cloud
325M    /var/lib/puppetserver/reports/wikilabels-staging-02.wikilabels.eqiad1.wikimedia.cloud
347M    /var/lib/puppetserver/reports/wikilabels-03.wikilabels.eqiad1.wikimedia.cloud

The root partition is only 20G; maybe we should provision a bigger one, or better, move the reports to a different partition with more space?
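A sketch of the "move it to another partition" option, relocating the reports onto the /srv partition via a bind mount so the puppetserver config does not need to change (the /srv/puppetserver-reports path is hypothetical, and this assumes puppetserver can be stopped briefly):

```shell
# Stop puppetserver so no reports are written mid-move
systemctl stop puppetserver

# Move the existing reports onto the bigger /srv partition
mkdir -p /srv/puppetserver-reports
mv /var/lib/puppetserver/reports/* /srv/puppetserver-reports/

# Bind-mount the new location over the old path, and persist it in fstab
mount --bind /srv/puppetserver-reports /var/lib/puppetserver/reports
echo '/srv/puppetserver-reports /var/lib/puppetserver/reports none bind 0 0' >> /etc/fstab

systemctl start puppetserver
```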

Removed some old puppetserver logs and journal logs (root@cloudinfra-cloudvps-puppetserver-1:/var/log# journalctl --vacuum-time=5d), which freed enough space to probably last until Monday.
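For reference, the rough shape of that cleanup (the journal vacuum is the command quoted above; the du survey and the log-pruning find, including its path and age threshold, are a sketch of what was done by hand):

```shell
# See what's eating the root partition, one filesystem, one level at a time
du -xhd1 /var | sort -h

# Drop rotated/compressed puppetserver logs older than a few days
find /var/log/puppetserver -name '*.gz' -mtime +3 -delete

# Trim the systemd journal down to the last 5 days
journalctl --vacuum-time=5d
```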

root@cloudinfra-cloudvps-puppetserver-1:/var/log# df -h
Filesystem      Size  Used Avail Use% Mounted on
udev             17G     0   17G   0% /dev
tmpfs           3.4G  584K  3.4G   1% /run
/dev/sdb1        20G   17G  1.9G  90% /
tmpfs            17G     0   17G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
/dev/sdb15      124M   12M  113M  10% /boot/efi
/dev/sda        9.8G  4.8G  4.6G  51% /srv
tmpfs           3.4G     0  3.4G   0% /run/user/0
tmpfs           3.4G     0  3.4G   0% /run/user/25603

Puppet runs are getting through again.

I'll silence the alerts for 1h, and check back later to see if any are still firing.

Interestingly enough, today there's even more free space (somehow 2G of puppet reports got cleared up):

root@cloudinfra-cloudvps-puppetserver-1:~# du -hs /var/lib/puppetserver/* | sort -h
4.0K    /var/lib/puppetserver/bucket
4.0K    /var/lib/puppetserver/facts.d
4.0K    /var/lib/puppetserver/jruby-gems
4.0K    /var/lib/puppetserver/lib
4.0K    /var/lib/puppetserver/locales
4.0K    /var/lib/puppetserver/preview
4.0K    /var/lib/puppetserver/state
4.0K    /var/lib/puppetserver/yaml
205M    /var/lib/puppetserver/server_data
9.0G    /var/lib/puppetserver/reports

From puppet docs (https://www.puppet.com/docs/puppet/8/report#report-store):

store

Stores the yaml report in the configured reportdir. By default, this is the report processor Puppet uses. These files collect quickly — one every half hour — so be sure to perform maintenance on them if you use this report.

So I guess we should have a custom cleanup timer of sorts

There it is:

Mon 2024-06-03 14:47:21 UTC 6h left             Mon 2024-06-03 06:47:20 UTC 1h 50min ago remove_old_puppet_reports.timer                 remove_old_puppet_reports.service

It's set to 16h:

root@cloudinfra-cloudvps-puppetserver-1:~# systemctl status remove_old_puppet_reports.service
○ remove_old_puppet_reports.service - Clears out older puppet reports.
     Loaded: loaded (/lib/systemd/system/remove_old_puppet_reports.service; static)
     Active: inactive (dead) since Mon 2024-06-03 06:47:21 UTC; 1h 51min ago
TriggeredBy: ● remove_old_puppet_reports.timer
       Docs: https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
    Process: 2825670 ExecStart=/usr/bin/find /var/lib/puppetserver/reports -type f -mmin +960 -delete (code=exited, status=0/SUCCESS)
   Main PID: 2825670 (code=exited, status=0/SUCCESS)
        CPU: 354ms

Jun 03 06:47:20 cloudinfra-cloudvps-puppetserver-1 systemd[1]: Starting remove_old_puppet_reports.service - Clears out older puppet reports....
Jun 03 06:47:21 cloudinfra-cloudvps-puppetserver-1 systemd[1]: remove_old_puppet_reports.service: Deactivated successfully.
Jun 03 06:47:21 cloudinfra-cloudvps-puppetserver-1 systemd[1]: Finished remove_old_puppet_reports.service - Clears out older puppet reports..

root@cloudinfra-cloudvps-puppetserver-1:~# echo $((960/60))
16

Hmm, the timer runs every 12h, so at the peak, it has 16+12=28h of reports.

root@cloudinfra-cloudvps-puppetserver-1:~# ls /var/lib/puppetserver/reports/ | wc
    847     847   48518

847 clients * 2 reports/hour * 28 hours * 8MB/report (max size I have seen for a report) ≃ 370 GB

We only have 20GB for everything, so that's clearly a bad estimate of the report size (8MB is probably way above the usual).

root@cloudinfra-cloudvps-puppetserver-1:~# df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1        20G   15G  3.8G  81% /
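Re-deriving the estimate from observed numbers instead of the 8MB guess: ~9GB across the 15002 reports counted below gives an average report size closer to ~600KB, and scaling the ~17-reports-per-host window up to the 28h peak stays well under 20GB:

```shell
# Average report size from the observed totals (9GB across 15002 files)
avg_kb=$(( 9 * 1024 * 1024 / 15002 ))   # KB per report
echo "${avg_kb} KB/report"               # -> 629 KB/report

# Peak usage: scale the observed 9GB (17 reports/host) to 28 reports/host
peak_gb=$(( 9 * 28 / 17 ))
echo "~${peak_gb} GB at peak"            # -> ~14 GB at peak
```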

I guess we can run the cleanup every hour; that should reduce peak usage by ~40% (from 28 reports per host down to 17).
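Until a puppet change lands, a local systemd drop-in could bump the timer frequency (a sketch; this assumes the shipped unit uses an OnCalendar schedule, which the empty first line resets before setting the new one):

```shell
# Drop-in override to run the report cleanup hourly instead of every 12h
mkdir -p /etc/systemd/system/remove_old_puppet_reports.timer.d
cat > /etc/systemd/system/remove_old_puppet_reports.timer.d/override.conf <<'EOF'
[Timer]
OnCalendar=
OnCalendar=hourly
EOF
systemctl daemon-reload
systemctl restart remove_old_puppet_reports.timer
```

Note this would get reverted by puppet if the unit file is managed there, hence the proper fix being a puppet patch.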

Change #1038296 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] puppetserver: allow configuring the report cleanup frequency

https://gerrit.wikimedia.org/r/1038296

Let's see how many reports per-host we have right now:

root@cloudinfra-cloudvps-puppetserver-1:~# ls /var/lib/puppetserver/reports/ | wc
    847     847   48518

root@cloudinfra-cloudvps-puppetserver-1:~# find /var/lib/puppetserver/reports/ -type f  | wc
  15002   15002 1445701

root@cloudinfra-cloudvps-puppetserver-1:~# echo $((15002/847))
17

Ok, so the timer just ran <2h ago:

root@cloudinfra-cloudvps-puppetserver-1:~# systemctl status remove_old_puppet_reports.service | grep Active
     Active: inactive (dead) since Mon 2024-06-03 06:47:21 UTC; 1h 59min ago

Yep, seems sound to me; will try to run the timer more often and see how that goes.

Change #1038296 abandoned by David Caro:

[operations/puppet@production] puppetserver: allow configuring the report cleanup frequency

Reason:

replaced by a more drastic https://gerrit.wikimedia.org/r/c/operations/puppet/+/1037812

https://gerrit.wikimedia.org/r/1038296