
Create a cron to clean clientbucket every day or hour
Closed, Resolved (Public)

Description

Hi, could clientbucket be cleaned regularly with a cron? I found that on an instance in labs the files in that folder were using a lot of storage; I cleaned an estimated 5 GB of files from it.

Example:

-r--r----- 1 root 49M Jan 18 18:20 /var/lib/puppet/clientbucket/6/e/2/c/d/4/9/8/6e2cd498423a6d69b20ecbce78c2611e/contents
-r--r----- 1 root 49M Jan 18 18:21 /var/lib/puppet/clientbucket/f/4/5/f/9/a/2/4/f45f9a24a2c91508a4d6823b7d68048d/contents
-r--r----- 1 root 66M Jan 18 18:22 /var/lib/puppet/clientbucket/f/0/2/d/9/d/5/6/f02d9d567a008692c67c5891b657e015/contents
-r--r----- 1 root 49M Jan 18 18:20 /var/lib/puppet/clientbucket/a/7/d/9/0/d/1/0/a7d90d10d1a912e7bd3b5a65b03f3da8/contents
-r--r----- 1 root 29M Jan 18 18:19 /var/lib/puppet/clientbucket/e/b/a/4/a/f/2/4/eba4af2418e8379f3d9a163452d4b163/contents

The files were from January. I didn't even realise there was a client bucket folder until today, when I tried to free up some storage.
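For anyone checking whether an instance is affected, the bucket's total size and its largest entries can be inspected like this (shell sketch; the 10M threshold is arbitrary):

sudo du -sh /var/lib/puppet/clientbucket
sudo find /var/lib/puppet/clientbucket -type f -size +10M -exec ls -lh {} +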

Event Timeline

Yes, please. Multiple of my labs instances have run out of space in /var, which basically blocks "dpkg". As a non-Puppet expert, it took me a while to figure out how to solve this problem. Looking at the solution Mozilla has implemented, this sounds quite trivial.

Looking at the solution Mozilla has implemented, this sounds quite trivial.

Using Puppet's "tidy" resource itself seemed like an option, but people pointed out that it uses considerably more CPU than a cron/systemd timer running find.

Made the change to use a systemd timer that deletes files older than 14 days, but as an opt-in that affected users can enable in Hiera. I don't think we can actually know whether anyone is using the client file bucket, though it seems unlikely.
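Under the hood this presumably amounts to a find invocation along these lines (sketch; only the 14-day retention mentioned above is taken from the change, the exact flags are an assumption):

# delete clientbucket entries not modified in the last 14 days
sudo find /var/lib/puppet/clientbucket -type f -mtime +14 -delete
# optionally prune the empty hash directories left behind
sudo find /var/lib/puppet/clientbucket -type d -empty -delete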

Change 635406 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] base/labs: add systemd timer to clean puppet client bucket

https://gerrit.wikimedia.org/r/635406

@Paladox Please see the change above. Still interested in this?

Change 635406 merged by Dzahn:
[operations/puppet@production] base/labs: add systemd timer to clean puppet client bucket

https://gerrit.wikimedia.org/r/635406

dzahn@wikistats-dancing-goat:~$ sudo systemctl start cleanup_puppet_client_bucket.timer
dzahn@wikistats-dancing-goat:~$ sudo systemctl status cleanup_puppet_client_bucket.timer
● cleanup_puppet_client_bucket.timer - Periodic execution of cleanup_puppet_client_bucket.service
   Loaded: loaded (/lib/systemd/system/cleanup_puppet_client_bucket.timer; enabled; vendor preset: enabled)
   Active: active (waiting) since Wed 2020-10-28 17:56:21 UTC; 5min ago
  Trigger: Thu 2020-10-29 17:56:24 UTC; 23h left

Oct 28 17:56:21 wikistats-dancing-goat systemd[1]: Started Periodic execution of cleanup_puppet_client_bucket.servic

^ This is the instance I added it to for testing / as an example.

On another instance it was a no-op, of course.

@Paladox Wanna try it?

  • sudo du -hs /var/lib/puppet/clientbucket
  • add "profile::base::labs::cleanup_puppet_client_bucket: true" to your test instance in Hiera
  • run puppet
  • sudo systemctl status cleanup_puppet_client_bucket.timer
  • sudo systemctl start cleanup_puppet_client_bucket.timer
  • sudo systemctl status cleanup_puppet_client_bucket.timer
  • sudo du -hs /var/lib/puppet/clientbucket

?:)
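To confirm it actually fires, the timer schedule and the service's last run can also be checked afterwards (sketch, assuming the unit names shown above):

sudo systemctl list-timers cleanup_puppet_client_bucket.timer
sudo journalctl -u cleanup_puppet_client_bucket.service -n 20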

This should be resolved; giving it back to the original requestor Paladox to confirm.

@jbond @Dzahn I've been bitten by this problem in production two or three times as well (today with an-launcher1002). Would it be worth having a generic cleanup timer everywhere? (I haven't found one.)

Change 715220 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] create a generic class to clean the puppet client bucket

https://gerrit.wikimedia.org/r/715220

@jbond @Dzahn I've been bitten by this problem in production two or three times as well (today with an-launcher1002). Would it be worth having a generic cleanup timer everywhere? (I haven't found one.)

@elukey Sure: https://gerrit.wikimedia.org/r/c/operations/puppet/+/715220

No issue with the change. However @elukey, looking at an-launcher1002 there is 22 GB of space free; if the filebucket is growing enough to take up that much space (even 1 GB is very high for this folder), then something else is probably wrong.

I did a quick look and found that the following systems have a large clientbucket:

  • authdns[1001,2001].wikimedia.org,dns[1001-1002,2001-2002,3001-3002,4001-4002,5001-5002].wikimedia.org: 7.4G
  • an-test-coord1001.eqiad.wmnet: 2.4G
  • puppetmaster2001.codfw.wmnet,puppetmaster1001.eqiad.wmnet: 4.7G

It seems an-test-coord is/was managing files under /var/log/camus/, which is causing the growth.

an-test-coord1001
$ find  /var/lib/puppet/clientbucket -type f -size +100M | while read line ; do  cat "$(dirname ${line})"/paths ; done | uniq
/var/log/camus/camus-eventgate-analytics_events/eventgate-analytics_events.log.1
/var/log/camus/camus-webrequest/webrequest.log
/var/log/camus/camus-webrequest/webrequest.log.1
/var/log/camus/camus-eventlogging/eventlogging.log.1
/var/log/camus/camus-eventgate-main_events/eventgate-main_events.log.1

For dns it's /etc/gdnsd/geoip/GeoIP2-City.mmdb:

authdns1001
$ find  /var/lib/puppet/clientbucket -type f -size +100M | while read line ; do  cat "$(dirname ${line})"/paths ; done | uniq
/etc/gdnsd/geoip/GeoIP2-City.mmdb

And the puppetmasters seem to be having issues with lots of small files, i.e. files that change on nearly every run:

puppetmaster1001
$  find  /var/lib/puppet/clientbucket -type f -name paths -exec cat {} + | sort | uniq -c | sort -n | tail -10
     18 /usr/local/sbin/smart-data-dump
     19 /etc/apache2/sites-available/40-puppet.conf
     20 /usr/lib/nagios/plugins/check_microcode
     24 /etc/rsyslog.lookup.d/lookup_table_output.json
     29 /usr/local/bin/puppet-merge
     47 /etc/ferm/conf.d/00_defs
   1816 /srv/config-master/ssh-fingerprints.txt
   2585 /srv/config-master/known_hosts.rsa
   2617 /srv/config-master/known_hosts.ed25519
   2631 /srv/config-master/known_hosts.ecdsa
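To see which managed paths account for the most space (rather than the most revisions), the per-entry contents sizes can be summed per recorded path, e.g. (sketch; assumes GNU find, the contents/paths layout shown above, and recorded paths without spaces):

find /var/lib/puppet/clientbucket -type f -name contents -printf '%s %h\n' \
  | while read size dir; do echo "$size $(cat "$dir/paths")"; done \
  | awk '{sum[$2]+=$1} END {for (p in sum) printf "%12d %s\n", sum[p], p}' \
  | sort -n | tail -10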

Change 715228 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/puppet@production] P:configmaster: don't backup known-hosts and fingerprint files

https://gerrit.wikimedia.org/r/715228

Change 715230 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/puppet@production] P:dns::auth::config: use backup false for GeoIP2-City.mmdb

https://gerrit.wikimedia.org/r/715230

Change 715230 merged by Jbond:

[operations/puppet@production] P:dns::auth::config: use backup false for GeoIP2-City.mmdb

https://gerrit.wikimedia.org/r/715230

I have fixed the issues on authdns and puppetmaster

@elukey Next time you see the issue on an-launcher1002, can you run the two one-liners used above (for authdns and puppetmaster) to get an idea of which resource(s) are causing the issue?

@jbond sure! Question - is there a problem with the /var/log/camus directories on an-test-coord1001? I am trying to understand if we have something misconfigured or if it is puppet's fault :D

Change 715228 merged by Jbond:

[operations/puppet@production] P:configmaster: don't backup known-hosts and fingerprint files

https://gerrit.wikimedia.org/r/715228

Change 715220 merged by Dzahn:

[operations/puppet@production] create a generic class to clean the puppet client bucket

https://gerrit.wikimedia.org/r/715220

@jbond sure! Question - is there a problem with the /var/log/camus directories on an-test-coord1001? I am trying to understand if we have something misconfigured or if it is puppet's fault :D

Sorry, I completely missed this response :/ I'm not sure there is a problem, and I couldn't see anything in Puppet relating to that directory or those files. However, at some point Puppet created backups of files in /var/log/camus, which means it must have been managing files in there at some point; perhaps that Puppet code has since been removed?
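One way to check whether the agent still manages anything under that path is to grep the cached catalog on the host (sketch; the catalog location assumes the /var/lib/puppet vardir seen above, and the certname matching hostname -f is an assumption):

# does the last compiled catalog still reference /var/log/camus?
sudo grep -o '/var/log/camus[^"]*' \
  /var/lib/puppet/client_data/catalog/$(hostname -f).json | sort -u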

Change 719293 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/puppet@production] P:puppet: Add alerting for large files in client bucket

https://gerrit.wikimedia.org/r/719293

Change 719293 merged by Jbond:

[operations/puppet@production] P:puppet: Add alerting for large files in client bucket

https://gerrit.wikimedia.org/r/719293

Mentioned in SAL (#wikimedia-operations) [2021-10-19T21:34:10Z] <mutante> mwmaint1002 - delete large files over 100MB from puppet clientbucket. sudo /usr/bin/find /var/lib/puppet/clientbucket/ -type f -size +100M -delete | fixed Icinga alert: RECOVERY - Check for large files in client bucket on mwmaint1002 is OK: OK: T165885

@jbond I had an actual alert for this on mwmaint1002 and looked up whether we apply this in production or only in cloud so far, or whether I should opt the host in via Hiera. In the end I just manually deleted the files as above and that cleared the alert. Should we say in the "runbook" link for the check that users should do this, or should we just apply the puppetized job (to mwmaint)?

@Paladox Since it's now possible to opt in to this and get a timer (see T165885#6585808 if you still want to try it), is this ticket resolved for you?

@jbond Was there a reason you wanted to keep this open?

A "cron" (timer) has been created. So it could be called resolved. The only thing is that this is opt-in and not automatically for all and I am not sure if it should or we just keep it this way.

Anyone who wants this timer can use "profile::base::labs::cleanup_puppet_client_bucket: true" in Hiera. ->T165885#6585808

@jbond Any thoughts on this from your end?

@Paladox: Per emails from Sep18 and Oct20 and https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup , I am resetting the assignee of this task because there has not been progress lately (please correct me if I am wrong!). Resetting the assignee avoids the impression that somebody is already working on this task. It also allows others to potentially work towards fixing this task. Please claim this task again when you plan to work on it (via Add Action...Assign / Claim in the dropdown menu) - it would be welcome. Thanks for your understanding!

Dzahn claimed this task.

I am going to be bold and call this resolved, based on my previous comments. We created a Hiera key that anyone can use for this if they want to; it's just opt-in per project or instance.

And separately I am fairly certain Paladox isn't currently invested in this any longer.

@Paladox If I am wrong about any of this, please feel free to just reopen this any time you like.