
CI backups on contint1001 generating 6 GB of file metadata (not happening before), potentially slowing down a recovery or making it impossible
Closed, Resolved · Public

Description

Currently, the contint1001.wikimedia.org-Monthly-1st-Sun-production-contint job is running. I detected an unusual amount of disk space used on backup1001, which is weird, because no backup data should be stored on the Bacula director.

I checked, and the cause seems to be a huge number of file attributes being generated on the director (>17 million), which will presumably be stored in the database for later recovery. A large number of files is normally something we can scale to without problems, but this seems to be an unusual increase in the number of files backed up, to the point of being noticeable on backup1001's disk space graphs (backup1001 doesn't store any file content!). This is new, and didn't happen the last time a full backup of contint1001 was run. A huge amount of metadata (normally due to a large number of files being backed up) is not an issue by itself, but it will probably require a large amount of time to recover from.

Because this is a new pattern (almost 6 GB of file metadata, just the file names and timestamps), I wonder if this is accidental. Maybe paths that were not intended are being backed up, a new service is being backed up, or a large number of files has been accidentally generated. In any case, either removing the files at the origin (if they happen to be leftovers), tarring (archiving) old unused files, creating ignore filters on backups for files that are not useful, or splitting the backup process into a few independent paths may help speed up later recoveries (or, in the worst-case scenario, make them practically possible).

For example, s3 databases usually have hundreds of thousands of files, one per table and per database. Knowing that most of them will only ever be accessed per database rather than individually, we tar them into just a few thousand archives, which are faster to back up and recover later than handling many small files (which can take a long time, as each file attribute has to be written to and read from the database).
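A minimal sketch of that approach, assuming a hypothetical layout of one dump directory per database (paths and names are illustrative, not the actual ones):

# Hypothetical layout: /srv/backups/dumps/<database>/ holding one file per table.
cd /srv/backups/dumps
mkdir -p ../archives
for db in */; do
    # Pack the per-table files of each database into a single archive, so the
    # backup only tracks one file entry per database instead of hundreds of
    # thousands of individual table files.
    tar -czf "../archives/${db%/}.tar.gz" "$db"
done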

This requires research first.

Event Timeline

I thought it could have been due to the Jenkins build history, but that is stored on the primary server, which is currently contint2001.

contint1001 is a Jenkins agent and its main usage is building Docker images for the Release Pipeline. Inode usage shows /srv has 19M inodes in use, which seems consistent with the surge of file attributes on the director.

contint1001:~$ df -hi
Filesystem           Inodes IUsed IFree IUse% Mounted on
udev                   7.9M   518  7.9M    1% /dev
tmpfs                  7.9M   987  7.9M    1% /run
/dev/mapper/vg0-root   4.7M  124K  4.6M    3% /
tmpfs                  7.9M     2  7.9M    1% /dev/shm
tmpfs                  7.9M     4  7.9M    1% /run/lock
tmpfs                  7.9M    17  7.9M    1% /sys/fs/cgroup
/dev/mapper/vg0-srv     89M   19M   71M   21% /srv
tmpfs                  7.9M    13  7.9M    1% /run/user/498
tmpfs                  7.9M    11  7.9M    1% /run/user/1010
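
To narrow down which directory under /srv actually accounts for those inodes, something along these lines could be run on the host (a sketch; the /srv/docker/overlay2 path is a guess based on Docker's data root apparently living under /srv):

# Per-directory inode counts under /srv, largest last
du --inodes --max-depth=1 /srv 2>/dev/null | sort -n
# Drill into the Docker overlay2 layers (assumes the data root is /srv/docker)
du --inodes --max-depth=1 /srv/docker/overlay2 2>/dev/null | sort -n | tail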

Docker uses overlay2, so each container has a fully expanded image on disk, with each file represented individually. There are 533 leftover containers there that should have been removed. We previously had the issue of containers being left behind after builds ( T235680#6320780 ); I guess the same issue is happening again, but it is a side track.
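For reference, the leftover containers could be counted and cleaned up with standard Docker commands (a sketch; the actual cleanup is being tracked separately):

# Count stopped containers left behind by builds
docker ps --all --filter status=exited --quiet | wc -l
# Remove all stopped containers and their writable layers
docker container prune --force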

contint1001 has the role ci::master, which loads profile::ci::backup, which eventually refers to:

bacula::director::fileset { 'contint':
    includes => [ '/srv', '/var/lib/zuul', '/var/lib/jenkins' ],
    excludes => [ '/srv/jenkins/builds', '/var/lib/jenkins/builds', ],
}

The Jenkins builds are excluded since we can afford to lose them. The same should be done for the Docker partition (images we care about are published to the Docker registry; anything left is transient).

Change 719466 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] contint: do not backup /srv/docker

https://gerrit.wikimedia.org/r/719466

hashar triaged this task as Medium priority.

Thanks for the research. This ticket was more of "I've seen something weird going on with the CI hosts, so I am going to raise it", so I appreciate the quick response and debugging.

Change 719466 merged by Dzahn:

[operations/puppet@production] contint: do not backup /srv/docker

https://gerrit.wikimedia.org/r/719466

@Dzahn merged the change and /srv/docker should be ignored from now on. @jcrespo do you have a way to trigger a full backup to confirm the millions of extra files are gone? Else we would have to wait for the next monthly backup.

I can (run command), but given we have recently generated a huge full backup, I would prefer to avoid generating another very large one so soon, and instead assume the issue is fixed, resolve this ticket, and ping you/reopen this task if it happens again next month.
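For the record, a manual full run could presumably be issued from the Bacula console along these lines (the job name is taken from the task description; the exact syntax depends on the director setup):

# Hypothetical invocation on the director; adjust the job name as needed
echo 'run job="contint1001.wikimedia.org-Monthly-1st-Sun-production-contint" level=Full yes' | bconsole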

Sounds good to me. Thank you @jcrespo :]