L10n cache files building up on backup deploy hosts
Open, Low, Public


Files under /srv/mediawiki-staging are kept in sync between the current deploy host and backup hosts via a cronjob in operations/puppet (scap-master-sync). This job explicitly excludes l10n cache files, which now take up a large amount of space on deploy2001.codfw.wmnet.

Do we need to be excluding these files? Perhaps not.

dduvall@deploy2001:/srv/mediawiki-staging$ du -ch php-* | tail -n 1
213G	total
dduvall@deploy2001:/srv/mediawiki-staging$ du -sh php-* | sort -nk 2
1.7G	php-1.32.0-wmf.10
1.7G	php-1.32.0-wmf.12
1.7G	php-1.32.0-wmf.13
1.7G	php-1.32.0-wmf.14
1.7G	php-1.32.0-wmf.15
1.7G	php-1.32.0-wmf.16
1.7G	php-1.32.0-wmf.18
1.7G	php-1.32.0-wmf.19
1.7G	php-1.32.0-wmf.20
1.7G	php-1.32.0-wmf.22
1.7G	php-1.32.0-wmf.23
1.7G	php-1.32.0-wmf.24
1.7G	php-1.32.0-wmf.26
1.7G	php-1.32.0-wmf.4
1.7G	php-1.32.0-wmf.5
1.7G	php-1.32.0-wmf.6
1.7G	php-1.32.0-wmf.7
1.7G	php-1.32.0-wmf.8
1.7G	php-1.33.0-wmf.1
1.7G	php-1.33.0-wmf.12
1.7G	php-1.33.0-wmf.13
1.7G	php-1.33.0-wmf.14
1.7G	php-1.33.0-wmf.16
1.7G	php-1.33.0-wmf.17
1.7G	php-1.33.0-wmf.18
1.7G	php-1.33.0-wmf.19
1.7G	php-1.33.0-wmf.2
1.7G	php-1.33.0-wmf.20
1.7G	php-1.33.0-wmf.21
1.7G	php-1.33.0-wmf.22
1.7G	php-1.33.0-wmf.23
1.7G	php-1.33.0-wmf.24
1.7G	php-1.33.0-wmf.25
1.7G	php-1.33.0-wmf.3
1.7G	php-1.33.0-wmf.4
1.7G	php-1.33.0-wmf.6
1.7G	php-1.33.0-wmf.8
1.7G	php-1.34.0-wmf.1
1.8G	php-1.34.0-wmf.10
1.8G	php-1.34.0-wmf.11
1.8G	php-1.34.0-wmf.13
1.8G	php-1.34.0-wmf.14
1.8G	php-1.34.0-wmf.15
1.8G	php-1.34.0-wmf.16
1.8G	php-1.34.0-wmf.17
1.8G	php-1.34.0-wmf.19
1.8G	php-1.34.0-wmf.20
1.8G	php-1.34.0-wmf.21
1.8G	php-1.34.0-wmf.22
1.8G	php-1.34.0-wmf.23
1.8G	php-1.34.0-wmf.24
1.8G	php-1.34.0-wmf.25
1.8G	php-1.34.0-wmf.3
1.8G	php-1.34.0-wmf.4
1.8G	php-1.34.0-wmf.5
1.8G	php-1.34.0-wmf.6
1.8G	php-1.34.0-wmf.7
1.8G	php-1.34.0-wmf.8
1.8G	php-1.35.0-wmf.1
1.8G	php-1.35.0-wmf.10
1.8G	php-1.35.0-wmf.11
1.8G	php-1.35.0-wmf.14
1.8G	php-1.35.0-wmf.15
1.8G	php-1.35.0-wmf.16
1.8G	php-1.35.0-wmf.18
1.8G	php-1.35.0-wmf.19
1.8G	php-1.35.0-wmf.2
1.8G	php-1.35.0-wmf.20
1.8G	php-1.35.0-wmf.21
1.8G	php-1.35.0-wmf.22
1.8G	php-1.35.0-wmf.23
1.8G	php-1.35.0-wmf.24
1.8G	php-1.35.0-wmf.25
1.8G	php-1.35.0-wmf.26
1.8G	php-1.35.0-wmf.27
1.8G	php-1.35.0-wmf.3
1.8G	php-1.35.0-wmf.4
1.8G	php-1.35.0-wmf.5
1.8G	php-1.35.0-wmf.8
1.9G	php-1.35.0-wmf.28
1.9G	php-1.35.0-wmf.30
1.9G	php-1.35.0-wmf.31
1.9G	php-1.35.0-wmf.32
1.9G	php-1.35.0-wmf.34
1.9G	php-1.35.0-wmf.35
1.9G	php-1.35.0-wmf.36
1.9G	php-1.35.0-wmf.37
1.9G	php-1.35.0-wmf.38
1.9G	php-1.35.0-wmf.39
1.9G	php-1.35.0-wmf.40
1.9G	php-1.35.0-wmf.41
1.9G	php-1.36.0-wmf.1
1.9G	php-1.36.0-wmf.10
1.9G	php-1.36.0-wmf.11
1.9G	php-1.36.0-wmf.13
1.9G	php-1.36.0-wmf.14
1.9G	php-1.36.0-wmf.16
1.9G	php-1.36.0-wmf.18
1.9G	php-1.36.0-wmf.2
1.9G	php-1.36.0-wmf.20
1.9G	php-1.36.0-wmf.21
1.9G	php-1.36.0-wmf.3
1.9G	php-1.36.0-wmf.4
1.9G	php-1.36.0-wmf.5
1.9G	php-1.36.0-wmf.6
1.9G	php-1.36.0-wmf.8
1.9G	php-1.36.0-wmf.9
2.0G	php-1.36.0-wmf.22
2.0G	php-1.36.0-wmf.25
2.0G	php-1.36.0-wmf.26
2.0G	php-1.36.0-wmf.27
2.0G	php-1.36.0-wmf.28
2.0G	php-1.36.0-wmf.29
2.0G	php-1.36.0-wmf.30
6.2G	php-1.36.0-wmf.32
6.3G	php-1.36.0-wmf.31
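
To see how much of that total is l10n cache rather than code, the .cdb files can be summed on their own. This is a minimal sketch, assuming GNU find and du; the helper name cdb_usage is made up for illustration, and the cache/l10n/*.cdb layout is taken from the exclude pattern quoted in this task.

```shell
#!/bin/sh
# Hypothetical helper: total size of l10n CDB files below a staging directory.
# Assumes GNU du (--files0-from) and the cache/l10n/*.cdb layout from this task.
cdb_usage() {
    dir=${1:-/srv/mediawiki-staging}
    # NUL-separated file names keep odd characters in paths from breaking du.
    find "$dir" -type f -path '*/cache/l10n/*.cdb' -print0 |
        du -ch --files0-from=- | tail -n 1
}
```

Calling `cdb_usage /srv/mediawiki-staging` prints a single "total" line that can be compared with the `du -ch php-*` figure. With GNU sort, `du -sh php-* | sort -h` orders the per-version listing by its human-readable sizes.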

Event Timeline

The /usr/local/bin/scap-master-sync command syncs /srv/mediawiki-staging and /srv/patches _from_ another deployment server.

The following rsync options are used:

    --archive --delete-delay --delay-updates --compress --delete \
    --exclude="**/cache/l10n/*.cdb" \
    --exclude="*.swp" \

Running that on deploy2001 to pull from deploy1001 showed a lot of this:

cannot delete non-empty directory: php-1.32.0-wmf.20/cache/l10n
cannot delete non-empty directory: php-1.32.0-wmf.20/cache/l10n
cannot delete non-empty directory: php-1.32.0-wmf.20/cache
cannot delete non-empty directory: php-1.32.0-wmf.20/cache
cannot delete non-empty directory: php-1.32.0-wmf.20
< mutante> !log deploy2001 - scap-master-sync from deploy1001 runs and attempts to --delete files to stay in sync but fails to do so because *.cdb files are in cache dirs and rsync does not want to delete non-empty directories, this leads to build up of the size of /srv/mediawiki-staging to 10 times the size of eqiad

< mutante> !log deploy2001 2/2 - because rsync is --delete but also --exclude="**/cache/l10n/*.cdb" --exclude="*.swp"  you can't expect /srv/mediawiki-staging to be the same size on 2 servers
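
The effect described in those two log entries can be reproduced in a scratch directory. This is a minimal sketch, assuming rsync is installed; the function name and all directory and file names are made up, and only the --archive, --delete and --exclude options quoted above are used.

```shell
#!/bin/sh
# Reproduce the leftover-directory effect in a throwaway directory.
# Names below are illustrative, not taken from the real hosts.
demo_excluded_delete() {
    work=$(mktemp -d) || return 1
    mkdir -p "$work/src/php-new" "$work/dst/php-old/cache/l10n"
    echo code > "$work/src/php-new/index.php"
    echo code > "$work/dst/php-old/index.php"
    echo data > "$work/dst/php-old/cache/l10n/l10n_cache-en.cdb"

    # php-old is gone from the source, so --delete tries to remove it on the
    # destination, but the excluded .cdb file is protected from deletion and
    # keeps its parent directories non-empty.
    rsync --archive --delete --exclude='**/cache/l10n/*.cdb' \
        "$work/src/" "$work/dst/" 2>&1

    # Show which files are left on the destination after the sync.
    (cd "$work/dst" && find . -type f | sort)
    rm -rf "$work"
}
```

In this sketch the .cdb file under php-old should survive while php-old/index.php is removed, which matches the "cannot delete non-empty directory" messages above. rsync also has a --delete-excluded option that removes excluded files on the receiving side, but it would presumably remove the CDB files of versions still in use as well, so it is not an obvious drop-in fix.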

This came up in the context of creating new deployment servers on buster (deploy1002, deploy2002 -> T265963) and wanting to keep all the /srv/ data in sync before switching over.

For /srv/deployment there is already a fully automatic setup in puppet with rsync --delete that does not exclude files, so it is really identical on all (currently 4) servers and they all pull from deploy1001.

For /srv/mediawiki, /srv/mediawiki-staging and /srv/patches this was not the case. scap-master-sync handles the latter two, but for /srv/mediawiki-staging it fails to delete old files because of the issue described here, and for /srv/patches it works but nothing runs it automatically.

For that last part, I made it so that /srv/patches should be handled like /srv/deployment.

Mentioned in SAL (#wikimedia-operations) [2021-02-26T20:29:48Z] <mutante> deploy2001 - /srv/mediawiki-staging sudo find . -name *.cdb delete - deleted 190 GB of old cdb files (T275826 T265963)

Not very high priority now that I have manually deleted the old files, but this should still be fixed so that they do not build up again.
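
A cleanup like the one logged above can be previewed before anything is removed. This is a minimal sketch, assuming GNU find; the function name prune_cdb and its --delete switch are made up for illustration. The pattern is quoted so that the shell does not expand *.cdb itself, and -delete is find's built-in action.

```shell
#!/bin/sh
# Hypothetical cleanup helper: list l10n CDB files by default and
# remove them only when --delete is passed as the second argument.
prune_cdb() {
    dir=$1
    mode=${2:-}
    if [ "$mode" = "--delete" ]; then
        find "$dir" -type f -path '*/cache/l10n/*.cdb' -delete
    else
        # Dry run: print what would be removed.
        find "$dir" -type f -path '*/cache/l10n/*.cdb' -print
    fi
}
```

Running `prune_cdb /srv/mediawiki-staging/php-1.32.0-wmf.20` lists the candidates, and adding `--delete` removes them. Pointing it at a single old version directory rather than the whole tree leaves the caches of versions still in use untouched.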

Change 667919 had a related patch set uploaded (by Ahmon Dancy; owner: Ahmon Dancy):
[operations/puppet@production] scap-master-sync: Don't exclude CDB files

Change 667919 merged by Dzahn:
[operations/puppet@production] scap-master-sync: Don't exclude CDB files

Reverted, as that broke all deployments; see P14572.

@Dzahn If scap-master-sync is what is used to keep deploy2002 in sync with deploy1002, and scap-master-sync excludes **/cache/l10n/*.cdb, how are these files ending up on deploy2002?

Self-answering: scap causes this to happen during sync-world.

Mentioned in SAL (#wikimedia-operations) [2021-03-19T18:46:09Z] <mutante> deploy2002 - disable puppet, copy modified version of scap-master-sync over it that does not --exclude="**/cache/l10n/*.cdb" (for T275826)

I think maybe the reason these are not backed up in this way is because they're large binaries that are slow to sync, and they are automatically re-created by Scap anyway if/when the other host comes into use, right?

During regular syncs, Scap does indeed instruct all destinations (app servers, and presumably the other deploy host as well) to create these files as derivatives of the JSON files. So in general they would not actually be missing at all.

I guess the only problem for this ticket, then, might be not so much the syncing of these files, but more specifically the deletion of these files. If deletion of l10n caches is generally a deployhost-local action, then indeed these would linger on the other deploy hosts. But is that actually true? I would expect the deletion of such files to also be applied to all appserver destinations, right? How come the deploy host isn't getting the deletion event at the "regular" time?

Aye, I missed the /srv/mediawiki vs. mediawiki-staging distinction. This issue is specifically about mediawiki-staging. That is indeed not a Scap destination, so deletions would not normally be picked up there. The question then is why we create these files on a deploy host in the first place. I guess the answer is local testing, and so that mwscript and similar tools can be used on the deploy host against the staged code without first deploying to the fleet and without having to, e.g., scap pull to yourself (which, last I checked, is not supported on a deploy host anyway).

So yeah, syncing these files and deleting them as needed automatically seems like a useful thing to do. I can't think of a reason not to, especially since it doesn't happen continuously and presumably speed isn't a major concern for such a background task?

fwiw, this isn't just "on deploy hosts", this is also on individual appservers. For example, I just did a scap pull on mw2255 after it had hardware maintenance, to bring it up to the current version before repooling, and still got a bunch of "cannot delete non-empty directory" messages from rsync while that was happening. There is one directory for each old MW version, and inside are the l10n cdb files mentioned here.
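
Directories that rsync could not remove for this reason can be spotted by looking for version directories that hold nothing except l10n CDB files. This is a minimal sketch, assuming GNU find (for -quit) and the php-* naming shown in this task; list_cdb_only_dirs is a made-up name.

```shell
#!/bin/sh
# Hypothetical check: print php-* directories whose only regular files are
# cache/l10n/*.cdb, i.e. likely leftovers of an rsync --delete that was
# blocked by the exclude rule.
list_cdb_only_dirs() {
    base=${1:-/srv/mediawiki}
    for d in "$base"/php-*; do
        [ -d "$d" ] || continue
        # First regular file that is NOT an l10n CDB; empty if there is none.
        other=$(find "$d" -type f ! -path '*/cache/l10n/*.cdb' -print -quit)
        # Require at least one CDB so that empty directories are not reported.
        cdb=$(find "$d" -type f -path '*/cache/l10n/*.cdb' -print -quit)
        if [ -z "$other" ] && [ -n "$cdb" ]; then
            printf '%s\n' "$d"
        fi
    done
}
```

The output is only a list of candidates; whether a listed directory is safe to remove still needs to be checked against the versions currently deployed.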

We proceeded with the wider work without fixing this task, so I'll remove it as a blocker.