
Recreate integration-puppetmaster with new image (/var/ is too small)
Closed, Resolved, Public

Description

https://tools.wmflabs.org/nagf/?project=integration#h_integration-puppetmaster_disk

There's a constant stream of writes to /var on this instance. Every few hours the usage drops (a trim, or compression?), but never back to the previous low. It keeps growing, and since the disk is only about 2GB, it is now too full.

Notification Type: PROBLEM

Service: Free space - all mounts
Host: integration-puppetmaster
Address: 10.68.16.96
State: WARNING

Date/Time: Fri 23 Jan 23:44:42 UTC 2015

Additional Info:

WARNING: integration.integration-puppetmaster.diskspace._var.byte_percentfree.value (<30.00%)

Screen_Shot_2015-01-23_at_16.28.39.png (543×876 px, 134 KB)

Event Timeline

Krinkle raised the priority of this task from to Unbreak Now!.
Krinkle updated the task description.
Krinkle added a subscriber: Krinkle.

We should probably just reimage this server.

I am pretty sure that is due to the puppet reports under /var/lib/puppet/reports. A cron job (added in T75472) runs three times a day and deletes the YAML reports older than 360 minutes:

27 0,8,16 * * * find /var/lib/puppet/reports -type f -mmin +360 -delete
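To check what that job would remove before it runs, the same find expression can be used without -delete. This is only a sketch: count_stale_reports is a made-up helper name, and the test assumes GNU find.

```shell
#!/bin/bash
# count_stale_reports DIR MINUTES   (hypothetical helper, not part of puppet)
# Print how many regular files under DIR were last modified more than
# MINUTES minutes ago, i.e. the files the cron job's find would delete.
count_stale_reports() {
    find "$1" -type f -mmin +"$2" | wc -l
}

# Example, on the puppetmaster:
#   count_stale_reports /var/lib/puppet/reports 360
```

Since the job only runs at 00:27, 08:27 and 16:27, a report that is just under 360 minutes old at one run survives until the next run 480 minutes later, so reports can sit on disk for up to about 14 hours.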

Seems consistent with the bytes used for /var

integration-puppetmaster_var_disk_usage (309×887 px, 30 KB)

So yeah, let's delete the instance and rebuild it using the new image. We will need to update the configuration of all instances with the new puppet and salt fingerprints, then restart / clean up the puppet and salt agents on all instances.

hashar renamed this task from Disk space "/var" full on integration-puppetmaster to Recreate integration-puppetmaster with new image (/var/ is too small). Feb 6 2015, 9:21 AM
hashar lowered the priority of this task from Unbreak Now! to Medium.
hashar set Security to None.

Lowered the priority because this is not doing any harm besides the annoying disk alarms.

Krinkle raised the priority of this task from Medium to High.

Ran into this error on local instances a bunch of times today.

Warning: Error 400 on SERVER: cannot generate tempfile `/var/lib/puppet/yaml/node/i-000008ce.eqiad.wmflabs.yaml20150226-16857-1arpl16-9'

Caused by integration-puppetmaster having /var disk full (1.8GB of 1.9GB). Biggest offender was /var/log/puppet/reports at 1.1GB. Purged manually for now.

@hashar: The "Last day" and "Last month" graphs show a deceptively steady up/down pattern. Beware that it does not go as far down as it goes up each time. This can be seen in the year view. Over time it is slowly clogging up, which is why puppetmaster goes down every other month. Getting a larger disk should stretch that to a longer interval, but there's definitely a non-trivial amount of space being occupied that keeps growing without proper bounds. Any idea what that might be?

Screen_Shot_2015-02-27_at_00.08.31.png (533×860 px, 122 KB)

The puppet master yaml report files are being garbage collected by a cronjob. That has been done by YuviPanda because /var was filling quickly.

The value is currently set to 360 minutes at https://wikitech.wikimedia.org/wiki/Hiera:Integration :

---
"puppetmaster::scripts::keep_reports_minutes": 360

As I said before, that is neither the problem nor the solution. In the short-term graphs the line alternates between up and down. That is indeed the puppet log being created and the cron job clearing it.

However, there is an additional build-up of data that is not purged. Zoom out to the "Last year" graph, in which the hourly up/down isn't so prominent, and the disk is clearly going downhill.
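One way to find what keeps growing might be to snapshot per-directory usage at two points in time and compare the snapshots. The sketch below assumes GNU du; snapshot_usage, usage_growth and the snapshot file names are invented for illustration.

```shell
#!/bin/bash
TAB=$(printf '\t')

# snapshot_usage DIR
# Print "<path><TAB><KiB used>" for DIR and its subdirectories up to two
# levels deep, staying on one filesystem, sorted by path.
snapshot_usage() {
    du -xk --max-depth=2 "$1" 2>/dev/null \
        | awk -F'\t' '{ print $2 "\t" $1 }' \
        | LC_ALL=C sort -t "$TAB" -k1,1
}

# usage_growth OLD_SNAPSHOT NEW_SNAPSHOT
# Print "<KiB grown><TAB><path>" for every path present in both snapshots
# whose usage increased, largest growth first.
usage_growth() {
    LC_ALL=C join -t "$TAB" "$1" "$2" \
        | awk -F'\t' '$3 > $2 { printf "%d\t%s\n", $3 - $2, $1 }' \
        | sort -rn
}

# Example (snapshot paths are placeholders):
#   snapshot_usage /var > /root/var-usage.old    # today
#   snapshot_usage /var > /root/var-usage.new    # a week later
#   usage_growth /root/var-usage.old /root/var-usage.new | head
```

Taking both snapshots right after the cleanup cron job has run should keep the short-term report churn out of the comparison.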

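Something is apparently broken. The installed cron job keeps 2160 minutes (36 hours) of reports instead of the 360 minutes configured in Hiera: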

integration-puppetmaster# crontab -l -u puppet
27 0,8,16 * * * find /var/lib/puppet/reports -type f -mmin +2160 -delete
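A quick way to spot this kind of drift is to pull the -mmin value out of the installed crontab and compare it with the Hiera setting. A rough sketch follows; both function names are made up, and the expected value has to be passed in by hand.

```shell
#!/bin/bash
# keep_minutes_from_crontab   (hypothetical helper)
# Read crontab text on stdin and print the value given to find's -mmin
# test on the puppet reports cleanup line.
keep_minutes_from_crontab() {
    sed -n 's|.*find /var/lib/puppet/reports .*-mmin +\([0-9][0-9]*\).*|\1|p'
}

# check_keep_minutes EXPECTED   (hypothetical helper)
# Read crontab text on stdin; succeed only if the installed value
# matches EXPECTED.
check_keep_minutes() {
    installed=$(keep_minutes_from_crontab)
    if [ "$installed" = "$1" ]; then
        echo "ok: $installed"
    else
        echo "mismatch: installed=${installed:-none} expected=$1"
        return 1
    fi
}

# Example, as root on the puppetmaster:
#   crontab -l -u puppet | check_keep_minutes 360
```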

I have deleted some of them to recover some disk space:

root@integration-puppetmaster:/var/lib/puppet/reports# find /var/lib/puppet/reports -type f -mmin +160 -delete
root@integration-puppetmaster:/var/lib/puppet/reports# df -h .
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda2       1.9G  1.1G  799M  57% /var
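For comparing against the alert threshold (less than 30% free), the free percentage can be derived from df directly. This is a sketch only: percent_free is a made-up name, and the monitoring metric may well be computed differently (for example with respect to blocks reserved for root).

```shell
#!/bin/bash
# percent_free MOUNTPOINT   (hypothetical helper)
# Print available space on the filesystem holding MOUNTPOINT as an
# integer percentage of its total size, using POSIX-format df output.
percent_free() {
    df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 * 100 / $2 }'
}

# Example:
#   [ "$(percent_free /var)" -lt 30 ] && echo "/var is below 30% free"
```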

Change 193825 had a related patch set uploaded (by Hashar):
contint: keep 180 min of puppet reports

https://gerrit.wikimedia.org/r/193825

hashar removed hashar as the assignee of this task. Mar 2 2015, 2:57 PM
Krinkle lowered the priority of this task from High to Medium. Mar 2 2015, 3:33 PM

I have deleted integration-puppetmaster. I will have to reapply the changes I5335ea7cbfba33e84b3ddc6e3dd83a7232b8acfd and I30e5bfeac398e0f88e538c75554439fe82fcc1cf on operations/puppet.

It is being recreated as a m1.small (1 CPU) instance: Created instance i-000008fb with image "ubuntu-14.04-trusty" and hostname i-000008fb.eqiad.wmflabs.

I have recreated the instance and applied the two changes mentioned above.

We have to reestablish the cert connection. The only reliable way I found is to run the following on each puppet client:

sudo -s
rm -fR /var/lib/puppet/client/ssl/
puppet agent -tv

Then sign the certificate on the puppetmaster:

puppet cert sign --all

And run puppet agent -tv again on the client.
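The client-side part could be wrapped in a small script, roughly as below. This is a sketch using only the commands listed above; the reset_agent_cert function and the DRY_RUN switch are invented here, and the script still has to be run as root on each client.

```shell
#!/bin/bash
SSL_DIR=/var/lib/puppet/client/ssl/

# run CMD...   (hypothetical helper)
# Execute the command, or only print it when DRY_RUN=1 is set.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# reset_agent_cert   (hypothetical helper)
# Drop the client's old SSL state and request a new certificate.
reset_agent_cert() {
    run rm -fR "$SSL_DIR"
    # This first run typically exits non-zero because the new certificate
    # request has not been signed on the puppetmaster yet.
    run puppet agent -tv || true
}

# Example:
#   DRY_RUN=1 reset_agent_cert    # show what would happen
#   reset_agent_cert              # do it; then sign on the puppetmaster
#                                 # and run "puppet agent -tv" again
```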

hashar claimed this task.

A new instance has been created reusing the same name (integration-puppetmaster). All nodes have been migrated properly.

The new instance was recreated on Trusty, which ships a different puppet version that is not supported by ops. So we want to downgrade to Precise, see T94927: Downgrade intergration-puppetmaster back to Ubuntu Precise (re-create instance).

Change 193825 abandoned by Hashar:
contint: keep 180 min of puppet reports

Reason:
We could entirely disable reporting instead :D

https://gerrit.wikimedia.org/r/193825