
cloudvirt1024: /srv full 99%
Closed, Resolved · Public

Description

Icinga alerted that /srv on cloudvirt1024 is 99% full. This is confirmed, and apparently related to backy2:

aborrero@cloudvirt1024:~ $ df -h
Filesystem             Size  Used Avail Use% Mounted on
udev                   252G     0  252G   0% /dev
tmpfs                   51G  3.0G   48G   6% /run
/dev/sda1               84G   38G   43G  47% /
tmpfs                  252G     0  252G   0% /dev/shm
tmpfs                  5.0M     0  5.0M   0% /run/lock
tmpfs                  252G     0  252G   0% /sys/fs/cgroup
/dev/mapper/tank-data   14T   14T  238G  99% /srv
tmpfs                   51G     0   51G   0% /run/user/18194
aborrero@cloudvirt1024:/srv $ sudo du -h --max-depth=1
720K	./053c5d13-093d-4775-a3ca-cc6f610b0180
2.4G	./_base
0	./locks
14T	./backy2
14T	.
aborrero@cloudvirt1024:/srv $ sudo backy2 ls | wc -l
2340

We are storing many backups; we should probably purge old ones.
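
A quick way to see which VMs account for those versions (a hypothetical one-liner; it assumes backy2's machine-readable output is '|'-separated with the version name in field 2):

# Count backy2 versions per VM name. Assumption: `backy2 -ms ls` prints
# '|'-separated machine-readable rows with the version name in field 2.
sudo backy2 -ms ls | awk -F'|' '{print $2}' | sort | uniq -c | sort -rn | head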

Event Timeline

The purge_vm_backup systemd timer (which runs wmcs-purge-backups) is working fine; the problem seems to be that the script doesn't purge enough data.
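
To rule out the timer itself, its state can be checked like this (a sketch; the unit name purge_vm_backup comes from above, and the matching .service name is an assumption):

# Confirm the timer is enabled and when it last/next fires
systemctl list-timers purge_vm_backup.timer
# Recent runs of the unit it triggers (assumed to be purge_vm_backup.service)
journalctl -u purge_vm_backup.service --since "2 days ago"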

I see all VMs have a snapshot that expires tomorrow. Perhaps this is just a matter of adjusting how long we keep snapshots.

I'm trying to gain a bit of space by forcing the expiration of the oldest snapshot of each VM, using yesterday's date (2020-12-03) as the new expiration:

aborrero@cloudvirt1024:~$ for i in $(sudo backy2 -ms ls | awk -F'|' '{print $2}' | uniq ) ; do sudo backy2 expire $(sudo backy2 -ms ls | grep $i | head -1 | awk -F'|' '{print $6}') 2020-12-03 ; done
    INFO: [backy2.logging] $ /usr/bin/backy2 expire 97ae1e56-311d-11eb-b5a6-b02628295df0 2020-12-03
    INFO: [backy2.logging] Backy complete.

    INFO: [backy2.logging] $ /usr/bin/backy2 expire 11da150a-3122-11eb-816d-b02628295df0 2020-12-03
    INFO: [backy2.logging] Backy complete.
[..]
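
For reference, the one-liner above unrolled into a more readable form (same behavior; it relies on backy2's '|'-separated machine output having the VM name in field 2 and the version UID in field 6, with rows ordered oldest-first so head -1 picks the oldest snapshot):

for vm in $(sudo backy2 -ms ls | awk -F'|' '{print $2}' | uniq); do
    # oldest version UID for this VM (field 6 of the machine-readable row)
    uid=$(sudo backy2 -ms ls | grep "$vm" | head -1 | awk -F'|' '{print $6}')
    # backdate its expiration so the purge script considers it expired
    sudo backy2 expire "$uid" 2020-12-03
done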

Mentioned in SAL (#wikimedia-cloud) [2020-12-04T10:23:42Z] <arturo> setting expiration to 2020-12-03 to the oldest backy snapshot of every VM in cloudvirt1024 (T269419)

Mentioned in SAL (#wikimedia-cloud) [2020-12-04T10:28:48Z] <arturo> manually running wmcs-purge-backups on cloudvirt1024 (T269419)

No luck with that; the purge script can't free any space:

aborrero@cloudvirt1024:~ $ sudo wmcs-purge-backups 
[...]
    INFO: [backy2.logging] $ /usr/bin/backy2 rm 54f4ad4e-31fb-11eb-9506-b02628295df0
    INFO: [backy2.logging] Removed backup version 54f4ad4e-31fb-11eb-9506-b02628295df0 with 5120 blocks.
    INFO: [backy2.logging] Backy complete.

    INFO: [backy2.logging] $ /usr/bin/backy2 cleanup
    INFO: [backy2.logging] Deleting false positives...
    INFO: [backy2.logging] Deleting false positives: done. Now deleting blocks.
    INFO: [backy2.logging] 0 delete candidate blocks found.
    INFO: [backy2.logging] Deleted 0 blocks.
    INFO: [backy2.logging] Backy complete.
aborrero@cloudvirt1024:~ $ df -h
Filesystem             Size  Used Avail Use% Mounted on
udev                   252G     0  252G   0% /dev
tmpfs                   51G  3.0G   48G   6% /run
/dev/sda1               84G   38G   43G  47% /
tmpfs                  252G     0  252G   0% /dev/shm
tmpfs                  5.0M     0  5.0M   0% /run/lock
tmpfs                  252G     0  252G   0% /sys/fs/cgroup
/dev/mapper/tank-data   14T   13T  612G  96% /srv
tmpfs                   51G     0   51G   0% /run/user/18194

aborrero moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.

I just found this in the backy2 documentation:

In order to provide parallelism (i.e. multiple backy2 processes at the same time), backy2 needs to prevent race-conditions between adding a delete-candidate to the list and actually removing its data. That’s why a cleanup will only remove data blocks once they’re on the list of delete-candidates for more than 1 hour.

So I might try the cleanup again in an hour to see if it actually deletes data.
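
That explains the "Deleted 0 blocks" above: a rm followed immediately by cleanup frees nothing. The working sequence is roughly (illustrative, with a placeholder UID):

sudo backy2 rm <version-uid>   # marks the version's blocks as delete candidates
sleep 3700                     # candidates must sit on the list for > 1 hour
sudo backy2 cleanup            # only now are the data blocks actually removed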

Mentioned in SAL (#wikimedia-cloud) [2020-12-04T11:25:40Z] <arturo> icinga downtime cloudvirt1024 for 6 days, to avoid paging noises (T269419)

Mentioned in SAL (#wikimedia-cloud) [2020-12-04T12:12:49Z] <arturo> manually running wmcs-purge-backups again on cloudvirt1024 (T269419)

aborrero lowered the priority of this task from High to Medium. (Dec 4 2020, 12:14 PM)

Well, this time, after an hour had passed, I did manage to delete some data:

aborrero@cloudvirt1024:~ $ sudo wmcs-purge-backups
[...]
    INFO: [backy2.logging] Deleted 178615 blocks.
    INFO: [backy2.logging] Backy complete.
aborrero@cloudvirt1024:~ $ df -h
Filesystem             Size  Used Avail Use% Mounted on
udev                   252G     0  252G   0% /dev
tmpfs                   51G  3.0G   48G   6% /run
/dev/sda1               84G   38G   43G  47% /
tmpfs                  252G     0  252G   0% /dev/shm
tmpfs                  5.0M     0  5.0M   0% /run/lock
tmpfs                  252G     0  252G   0% /sys/fs/cgroup
/dev/mapper/tank-data   14T   13T  1.1T  93% /srv
tmpfs                   51G     0   51G   0% /run/user/18194

This is probably a result of actual growth in the backed-up projects. We should have gotten warned at 80% and alerted at 90% though -- I wonder why we're only hearing about this at 99%?
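
For comparison, a stock Nagios/Icinga disk check with those thresholds would look roughly like this (illustrative only, not the actual WMF configuration; check_disk takes free-space thresholds, so 80%/90% used maps to 20%/10% free):

# Warn below 20% free (80% used), critical below 10% free (90% used)
/usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /srv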

Change 645368 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] wmcs instance backups: rearrange backups and reduce storage time

https://gerrit.wikimedia.org/r/645368

Change 645368 merged by Andrew Bogott:
[operations/puppet@production] wmcs instance backups: rearrange backups and reduce storage time

https://gerrit.wikimedia.org/r/645368

This happened again just now; I reran the purge script:

dcaro@cloudvirt1024:~$ sudo wmcs-purge-backups
...
    INFO: [backy2.logging] Deleted 582800 blocks.
    INFO: [backy2.logging] Backy complete.

That freed some space:

dcaro@cloudvirt1024:~$ df -h
Filesystem             Size  Used Avail Use% Mounted on
udev                   252G     0  252G   0% /dev
tmpfs                   51G  3.2G   48G   7% /run
/dev/sda1               84G   38G   43G  48% /
tmpfs                  252G     0  252G   0% /dev/shm
tmpfs                  5.0M     0  5.0M   0% /run/lock
tmpfs                  252G     0  252G   0% /sys/fs/cgroup
/dev/mapper/tank-data   14T   12T  2.3T  84% /srv
tmpfs                   51G     0   51G   0% /run/user/25603

Mentioned in SAL (#wikimedia-cloud) [2020-12-10T11:56:10Z] <dcaro> Freed some space on cloudvirt1024 by running the purge script (T269419)

Mentioned in SAL (#wikimedia-cloud) [2020-12-13T09:11:06Z] <_dcaro> running backup purge script on cloudvirt1024 (T269419)

Mentioned in SAL (#wikimedia-operations) [2020-12-14T09:45:02Z] <aborrero@cumin1001> START - Cookbook sre.hosts.downtime for 6 days, 0:00:00 on cloudvirt1024.eqiad.wmnet with reason: T269419

Mentioned in SAL (#wikimedia-operations) [2020-12-14T09:45:05Z] <aborrero@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6 days, 0:00:00 on cloudvirt1024.eqiad.wmnet with reason: T269419

Mentioned in SAL (#wikimedia-cloud) [2020-12-14T09:45:21Z] <arturo> icinga downtime cloudvirt1024 for 6 days (T269419)

Mentioned in SAL (#wikimedia-cloud) [2020-12-14T17:36:27Z] <dcaro> removing invalid backups that have a valid copy (T269419)

Mentioned in SAL (#wikimedia-cloud) [2020-12-14T17:41:59Z] <dcaro> The removal freed ~12GB (still 100% usage :S) (T269419)

Mentioned in SAL (#wikimedia-cloud) [2020-12-16T09:31:23Z] <dcaro> removing invalid backups from cloudvirt1024 (196 in total) (T269419)
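
The invalid-backup removal mentioned above could look roughly like this sketch (hypothetical: it assumes the machine-readable ls output carries a validity flag in field 7 and the version UID in field 6; the real run also verified that a valid copy existed for the same VM, which is not shown here):

# Remove versions flagged invalid, then reclaim their blocks.
# Check the exact representation of the validity flag before running!
for uid in $(sudo backy2 -ms ls | awk -F'|' '$7 == "invalid" {print $6}'); do
    sudo backy2 rm "$uid"
done
sudo backy2 cleanup   # frees blocks once they have aged > 1h on the candidate list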

Some preliminary info about what's being backed up. Note that the sizes do not take deduplication into account (which complicates things quite a bit); a sketch of how such numbers could be gathered follows the lists below:

Total Size: 14853120 MB (~14.2 TiB)
Number of projects: 114
Top 10 projects by size:
    16.34% - dumps
    6.83% - wikidumpparse
    2.59% - devtools
    1.72% - cvn
    1.55% - wikidata-history-query-service
    1.45% - rcm
    1.45% - osmit
    1.45% - reading-web-staging
    1.38% - codereview
    1.38% - videowiki
Top 10 VMs by size:
    5.79% - humaniki-prod(1f7a940f-e3b1-480b-84ba-dd2fa6855755)
    4.31% - dumps-5(590da924-9d2d-4c43-a5b4-7f502ff209df)
    3.45% - dumps-4(287bd5f0-20c7-4623-ba99-4865d56f8ac7)
    3.45% - dumps-3(a8ed02b4-7f80-4cd9-8310-a86f72b374d4)
    2.59% - dumps-1(303b87ba-0294-4f87-ae69-0026408fa068)
    1.72% - dumps-2(2b6786ca-ffa8-459e-82ad-c36435ea0f7e)
    1.55% - wdhqs-1(59e95aa5-ae70-47b7-9af8-f5c3f913a992)
    1.38% - crm-wikimania(84fdcc52-fd07-42ab-bf4c-14273378dcbc)
    1.38% - pst(8beacc89-d604-4fa3-af56-2e1bbc704854)
    0.83% - pub2(035e1323-825a-486e-b355-2a296ae7f540)
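
The per-VM breakdown above could be gathered with something along these lines (a sketch, not the actual script used; it assumes '|'-separated machine output with the version name in field 2 and the size in bytes in field 5, and it sums every version per name, ignoring deduplication as noted above):

# Total backed-up bytes per VM name, biggest first (dedup ignored)
sudo backy2 -ms ls \
    | awk -F'|' '{ size[$2] += $5 }
                 END { for (n in size) printf "%12d MB  %s\n", size[n]/1024/1024, n }' \
    | sort -rn | head -10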

Change 650178 had a related patch set uploaded (by David Caro; owner: David Caro):
[operations/puppet@production] [wmcs] Move some heavy backups to cloudvirt1026

https://gerrit.wikimedia.org/r/650178

Change 650178 merged by David Caro:
[operations/puppet@production] [wmcs] Move some heavy backups to cloudvirt1026

https://gerrit.wikimedia.org/r/650178

Change 650776 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Keystone: disable logging to /var/log/keystone/

https://gerrit.wikimedia.org/r/650776

Backups for the dumps and wikidumpparse projects have been moved to cloudvirt1026; both hosts are now at ~75% usage, so I'm closing this.