
[cloudvps] 2024-05-01 cloudinfra puppetserver got out of space
Closed, DuplicatePublic

Description

The server ran out of space, making all client runs fail:

root@clouddb-services-puppetserver-1:~# run-puppet-agent
Info: Using environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: No space left on device - /var/lib/puppetserver/server_data/facts/clouddb-services-puppetserver-1.clouddb-services.eqiad1.wikimedia.cloud.json20240601-3260158-zrafhm.lock
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run

Event Timeline

dcaro triaged this task as High priority.Jun 1 2024, 1:35 PM
dcaro created this task.

Root got out of space:

root@cloudinfra-cloudvps-puppetserver-1:~# df -h
Filesystem      Size  Used Avail Use% Mounted on
udev             17G     0   17G   0% /dev
tmpfs           3.4G  592K  3.4G   1% /run
/dev/sdb1        20G   19G  124K 100% /
tmpfs            17G     0   17G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
/dev/sdb15      124M   12M  113M  10% /boot/efi
/dev/sda        9.8G  4.8G  4.6G  51% /srv
tmpfs           3.4G     0  3.4G   0% /run/user/0
tmpfs           3.4G     0  3.4G   0% /run/user/25603

Most of the usage comes from puppetserver reports:

root@cloudinfra-cloudvps-puppetserver-1:~# du -hs /var/lib/puppetserver/* | sort -h
4.0K    /var/lib/puppetserver/bucket
4.0K    /var/lib/puppetserver/facts.d
4.0K    /var/lib/puppetserver/jruby-gems
4.0K    /var/lib/puppetserver/lib
4.0K    /var/lib/puppetserver/locales
4.0K    /var/lib/puppetserver/preview
4.0K    /var/lib/puppetserver/state
4.0K    /var/lib/puppetserver/yaml
199M    /var/lib/puppetserver/server_data
11G     /var/lib/puppetserver/reports

There's one directory per client there, with a couple taking 10x the size of the rest:

root@cloudinfra-cloudvps-puppetserver-1:~# du -hs /var/lib/puppetserver/reports/* | sort -h
...
57M     /var/lib/puppetserver/reports/wikistats-bookworm.wikistats.eqiad1.wikimedia.cloud
325M    /var/lib/puppetserver/reports/wikilabels-staging-02.wikilabels.eqiad1.wikimedia.cloud
347M    /var/lib/puppetserver/reports/wikilabels-03.wikilabels.eqiad1.wikimedia.cloud

The root partition is only 20G; maybe we should provision a bigger one, or better, move the reports to a different partition with more space?
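A sketch of the "move it to another partition" option, relocating the reports onto the /srv partition via a bind mount so the puppetserver config does not need to change (the /srv/puppetserver-reports path is hypothetical, and this assumes puppetserver can be stopped briefly):

```shell
# Stop puppetserver so no reports are written mid-move
systemctl stop puppetserver

# Move the existing reports onto the bigger /srv partition
mkdir -p /srv/puppetserver-reports
mv /var/lib/puppetserver/reports/* /srv/puppetserver-reports/

# Bind-mount the new location over the old path, and persist it in fstab
mount --bind /srv/puppetserver-reports /var/lib/puppetserver/reports
echo '/srv/puppetserver-reports /var/lib/puppetserver/reports none bind 0 0' >> /etc/fstab

systemctl start puppetserver
```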

Removed some old puppetserver logs and journal logs (root@cloudinfra-cloudvps-puppetserver-1:/var/log# journalctl --vacuum-time=5d), which freed enough space to probably last until Monday.
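For reference, the rough shape of that cleanup (the journal vacuum is the command quoted above; the du survey and the log-pruning find, including its path and age threshold, are a sketch of what was done by hand):

```shell
# See what's eating the root partition, one filesystem, one level at a time
du -xhd1 /var | sort -h

# Drop rotated/compressed puppetserver logs older than a few days
find /var/log/puppetserver -name '*.gz' -mtime +3 -delete

# Trim the systemd journal down to the last 5 days
journalctl --vacuum-time=5d
```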

root@cloudinfra-cloudvps-puppetserver-1:/var/log# df -h
Filesystem      Size  Used Avail Use% Mounted on
udev             17G     0   17G   0% /dev
tmpfs           3.4G  584K  3.4G   1% /run
/dev/sdb1        20G   17G  1.9G  90% /
tmpfs            17G     0   17G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
/dev/sdb15      124M   12M  113M  10% /boot/efi
/dev/sda        9.8G  4.8G  4.6G  51% /srv
tmpfs           3.4G     0  3.4G   0% /run/user/0
tmpfs           3.4G     0  3.4G   0% /run/user/25603

Puppet runs are getting through again.

I'll silence the alerts for 1h, and check back later to see if any are still firing.

Interestingly enough, today there's even more free space (somehow 2G of puppet reports got cleared up):

root@cloudinfra-cloudvps-puppetserver-1:~# du -hs /var/lib/puppetserver/* | sort -h
4.0K    /var/lib/puppetserver/bucket
4.0K    /var/lib/puppetserver/facts.d
4.0K    /var/lib/puppetserver/jruby-gems
4.0K    /var/lib/puppetserver/lib
4.0K    /var/lib/puppetserver/locales
4.0K    /var/lib/puppetserver/preview
4.0K    /var/lib/puppetserver/state
4.0K    /var/lib/puppetserver/yaml
205M    /var/lib/puppetserver/server_data
9.0G    /var/lib/puppetserver/reports

From puppet docs (https://www.puppet.com/docs/puppet/8/report#report-store):

store

Stores the yaml report in the configured reportdir. By default, this is the report processor Puppet uses. These files collect quickly — one every half hour — so be sure to perform maintenance on them if you use this report.

So I guess we should have a custom cleanup timer of sorts

There it is:

Mon 2024-06-03 14:47:21 UTC 6h left             Mon 2024-06-03 06:47:20 UTC 1h 50min ago remove_old_puppet_reports.timer                 remove_old_puppet_reports.service

It's set to 16h:

root@cloudinfra-cloudvps-puppetserver-1:~# systemctl status remove_old_puppet_reports.service
○ remove_old_puppet_reports.service - Clears out older puppet reports.
     Loaded: loaded (/lib/systemd/system/remove_old_puppet_reports.service; static)
     Active: inactive (dead) since Mon 2024-06-03 06:47:21 UTC; 1h 51min ago
TriggeredBy: ● remove_old_puppet_reports.timer
       Docs: https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
    Process: 2825670 ExecStart=/usr/bin/find /var/lib/puppetserver/reports -type f -mmin +960 -delete (code=exited, status=0/SUCCESS)
   Main PID: 2825670 (code=exited, status=0/SUCCESS)
        CPU: 354ms

Jun 03 06:47:20 cloudinfra-cloudvps-puppetserver-1 systemd[1]: Starting remove_old_puppet_reports.service - Clears out older puppet reports....
Jun 03 06:47:21 cloudinfra-cloudvps-puppetserver-1 systemd[1]: remove_old_puppet_reports.service: Deactivated successfully.
Jun 03 06:47:21 cloudinfra-cloudvps-puppetserver-1 systemd[1]: Finished remove_old_puppet_reports.service - Clears out older puppet reports..

root@cloudinfra-cloudvps-puppetserver-1:~# echo $((960/60))
16

Hmm, the timer runs every 12h, so at the peak, it has 16+12=28h of reports.

root@cloudinfra-cloudvps-puppetserver-1:~# ls /var/lib/puppetserver/reports/ | wc
    847     847   48518

847 clients * 2 reports/hour * 28 hours * 8MB/report (max size I have seen for a report) ≃ 370 GB

We only have 20GB for everything, so that's clearly a bad estimate of the report size (8MB is probably way above the usual).

root@cloudinfra-cloudvps-puppetserver-1:~# df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1        20G   15G  3.8G  81% /
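Re-deriving the estimate from observed numbers instead of the 8MB guess: ~9GB across the 15002 reports counted below gives an average report size closer to ~600KB, and scaling the ~17-reports-per-host window up to the 28h peak stays well under 20GB:

```shell
# Average report size from the observed totals (9GB across 15002 files)
avg_kb=$(( 9 * 1024 * 1024 / 15002 ))   # KB per report
echo "${avg_kb} KB/report"               # -> 629 KB/report

# Peak usage: scale the observed 9GB (17 reports/host) to 28 reports/host
peak_gb=$(( 9 * 28 / 17 ))
echo "~${peak_gb} GB at peak"            # -> ~14 GB at peak
```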

I guess we can run the cleanup every hour; that should reduce peak usage by ~40% (from 28 reports per host down to 17).
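Until a puppet change lands, a local systemd drop-in could bump the timer frequency (a sketch; this assumes the shipped unit uses an OnCalendar schedule, which the empty first line resets before setting the new one):

```shell
# Drop-in override to run the report cleanup hourly instead of every 12h
mkdir -p /etc/systemd/system/remove_old_puppet_reports.timer.d
cat > /etc/systemd/system/remove_old_puppet_reports.timer.d/override.conf <<'EOF'
[Timer]
OnCalendar=
OnCalendar=hourly
EOF
systemctl daemon-reload
systemctl restart remove_old_puppet_reports.timer
```

Note this would get reverted by puppet if the unit file is managed there, hence the proper fix being a puppet patch.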

Change #1038296 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] puppetserver: allow configuring the report cleanup frequency

https://gerrit.wikimedia.org/r/1038296

Let's see how many reports per-host we have right now:

root@cloudinfra-cloudvps-puppetserver-1:~# ls /var/lib/puppetserver/reports/ | wc
    847     847   48518

root@cloudinfra-cloudvps-puppetserver-1:~# find /var/lib/puppetserver/reports/ -type f  | wc
  15002   15002 1445701

root@cloudinfra-cloudvps-puppetserver-1:~# echo $((15002/847))
17

Ok, so the timer just ran <2h ago:

root@cloudinfra-cloudvps-puppetserver-1:~# systemctl status remove_old_puppet_reports.service | grep Active
     Active: inactive (dead) since Mon 2024-06-03 06:47:21 UTC; 1h 59min ago

Yep, seems sound to me; will try to run the timer more often and see how that goes.

Change #1038296 abandoned by David Caro:

[operations/puppet@production] puppetserver: allow configuring the report cleanup frequency

Reason:

replaced by a more drastic https://gerrit.wikimedia.org/r/c/operations/puppet/+/1037812

https://gerrit.wikimedia.org/r/1038296