
Automatically clean up unused wmfXX versions
Closed, Resolved · Public

Description

Old versions take up space, and not only the code but also the l10n files. For example, in the 1.23wmf17 checkout the full l10n cache (CDB and JSON files) consumes 1.6 GB of disk space.

Remembering to clean these up is annoying and error-prone. It should be automated.

Bryan started the work with https://gerrit.wikimedia.org/r/#/c/118337/ "Add script to cleanup l10n cache for an inactive MediaWiki version".
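
For illustration, a minimal sketch of the kind of cleanup such a script performs, assuming (hypothetically) that branches live under /srv/mediawiki-staging/php-<version> with the l10n cache in cache/l10n, and that the set of still-active versions comes from something like wikiversions.json; this is not the actual Gerrit change.

#!/usr/bin/env python3
# Sketch only: clean the l10n cache of a MediaWiki version no wiki uses anymore.
# Paths and the active_versions() source are assumptions, not scap's real code.
import argparse
import shutil
from pathlib import Path

STAGING = Path("/srv/mediawiki-staging")


def active_versions():
    # Placeholder: in production this would be derived from wikiversions.json.
    return {"1.23wmf18", "1.23wmf19"}


def clean_l10n(version, dry_run=True):
    if version in active_versions():
        raise SystemExit(f"{version} is still active; refusing to clean")
    cache_dir = STAGING / f"php-{version}" / "cache" / "l10n"
    if not cache_dir.is_dir():
        print(f"nothing to do: {cache_dir} does not exist")
        return
    print(f"removing {cache_dir}")
    if not dry_run:
        shutil.rmtree(cache_dir)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("version", help="e.g. 1.23wmf17")
    parser.add_argument("--really", action="store_true", help="actually delete")
    args = parser.parse_args()
    clean_l10n(args.version, dry_run=not args.really)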


Version: wmf-deployment
Severity: normal

Related Objects

Status      Subtype             Assigned          Task
Declined                        None
Resolved                        demon
Declined                        mmodell
Invalid                         None
Resolved                        mmodell
Resolved                        Jdforrester-WMF
Declined                        mmodell
Resolved                        mmodell
Resolved                        mmodell
Resolved                        mmodell
Resolved                        dduvall
Resolved                        Krinkle
Resolved                        Krinkle
Resolved                        mmodell
Duplicate                       Krinkle
Resolved                        Krinkle
Resolved                        Krinkle
Resolved    PRODUCTION ERROR    MaxSem
Resolved                        Krinkle
Resolved                        Krinkle
Resolved                        Krinkle
Resolved                        Krinkle
Resolved                        Krinkle

Event Timeline

bzimport raised the priority of this task to Medium. (Nov 22 2014, 3:57 AM)
bzimport added a project: Deployments.
bzimport set Reference to bz71313.
bzimport added a subscriber: Unknown Object (MLST).

This is easier now than before, e.g.: scap clean 1.28.0-wmf.9

> This is easier now than before, e.g.: scap clean 1.28.0-wmf.9

I'm inclined to actually close this as resolved. This probably shouldn't be fully automated, and with the logic properly hidden it's trivial to do the work.

> I'm inclined to actually close this as resolved. This probably shouldn't be fully automated, and with the logic properly hidden it's trivial to do the work.

It does seem like things are getting cleaned up reasonably right now. The .5 and .6 branches could probably be killed (our anon cache window is ~30 days now, right?), but there isn't an excessive number of old versions hanging around.

tin:~
bd808$ ls -ldthr /srv/mediawiki-staging/php-*
drwxrwxr-x 16 demon           wikidev 4.0K Dec 15 14:15 /srv/mediawiki-staging/php-1.29.0-wmf.5/
drwxrwxr-x 16 twentyafterfour wikidev 4.0K Jan  5 00:33 /srv/mediawiki-staging/php-1.29.0-wmf.6/
drwxrwsr-x 16 thcipriani      wikidev 4.0K Jan 18 15:08 /srv/mediawiki-staging/php-1.29.0-wmf.7/
drwxrwsr-x 16 demon           wikidev 4.0K Jan 21 19:58 /srv/mediawiki-staging/php-1.29.0-wmf.8/
drwxrwsr-x 16 twentyafterfour wikidev 4.0K Feb  1 14:21 /srv/mediawiki-staging/php-1.29.0-wmf.9/
drwxrwsr-x 16 twentyafterfour wikidev 4.0K Feb  6 20:34 /srv/mediawiki-staging/php-1.29.0-wmf.10/

Back in the olden days, the big space hogs that caused issues on hosts with smaller disks were the l10n cache files. It looks like those are still hanging around much longer than needed:

tin:~
bd808$ ls -ldthr /srv/mediawiki-staging/php-*/cache/l10n/upstream/l10n_cache-en.cdb.json
-rw-r--r-- 1 l10nupdate l10nupdate 3.2M Dec 15 01:31 /srv/mediawiki-staging/php-1.29.0-wmf.5/cache/l10n/upstream/l10n_cache-en.cdb.json
-rw-r--r-- 1 l10nupdate l10nupdate 3.3M Jan  5 02:18 /srv/mediawiki-staging/php-1.29.0-wmf.6/cache/l10n/upstream/l10n_cache-en.cdb.json
-rw-r--r-- 1 l10nupdate l10nupdate 3.3M Jan 18 20:13 /srv/mediawiki-staging/php-1.29.0-wmf.7/cache/l10n/upstream/l10n_cache-en.cdb.json
-rw-r--r-- 1 l10nupdate l10nupdate 3.3M Jan 23 02:44 /srv/mediawiki-staging/php-1.29.0-wmf.8/cache/l10n/upstream/l10n_cache-en.cdb.json
-rw-r--r-- 1 l10nupdate l10nupdate 3.3M Feb  2 02:20 /srv/mediawiki-staging/php-1.29.0-wmf.9/cache/l10n/upstream/l10n_cache-en.cdb.json
-rw-r--r-- 1 l10nupdate l10nupdate 3.3M Feb  7 02:18 /srv/mediawiki-staging/php-1.29.0-wmf.10/cache/l10n/upstream/l10n_cache-en.cdb.json

A full scap checks all of these files in the rsync. Getting rid of them sooner rather than later takes a small bit of load off of all of the MW hosts during a scap and can speed things up in a measurable way. "Soon" we will have git transport for all of this, which should make that mostly moot by using pre-computed diffs.

I agree that automatic cleanup is tricky, though, and probably harder to get right at this point than it's worth. Deciding if a branch purge is safe takes knowledge of the last time each branch would have been associated with an anon page view that was added to Varnish, and some idea of the reasonable maximum time that Varnish should be holding on to HTML that has a branch-versioned reference to static content in it.
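
(A minimal sketch of that safety check, with an assumed 30-day Varnish window and a hypothetical last_served_from_branch input; the real values would come from Varnish configuration and deployment history.)

from datetime import datetime, timedelta, timezone

# Sketch only: both the 30-day window and last_served_from_branch are assumptions.
VARNISH_MAX_TTL = timedelta(days=30)


def safe_to_purge(last_served_from_branch, now=None):
    # A purge is safe only once any HTML still cached in Varnish that references
    # this branch's static assets has aged out of the cache window.
    now = now or datetime.now(timezone.utc)
    return now - last_served_from_branch >= VARNISH_MAX_TTL


# Example: a branch last served to anonymous users 35 days ago is past the window.
print(safe_to_purge(datetime.now(timezone.utc) - timedelta(days=35)))  # True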

> I'm inclined to actually close this as resolved. This probably shouldn't be fully automated, and with the logic properly hidden it's trivial to do the work.

> It does seem like things are getting cleaned up reasonably right now. The .5 and .6 branches could probably be killed (our anon cache window is ~30 days now, right?), but there isn't an excessive number of old versions hanging around.

The current practice is retaining the previous 5 branches. That's technically 35 days of going back; keeping only 4 branches would put us at 28 days. It's probably fine, but I've been paranoid up to now. Lowering that cache TTL to 28 days would make it an even 4 weeks (which is easier to count than 30 days, tbh). Cf. T140921: Reduce static asset time on disk from five trains' worth to two.
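
(For reference, the arithmetic behind those numbers, assuming the usual cadence of one train branch per week:)

TRAIN_CADENCE_DAYS = 7  # assumption: one wmf branch is cut per week

for branches_kept in (5, 4):
    days = branches_kept * TRAIN_CADENCE_DAYS
    print(f"{branches_kept} previous branches ~= {days} days of history")
# 5 previous branches ~= 35 days of history
# 4 previous branches ~= 28 days of history (an even 4 weeks)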

> Back in the olden days, the big space hogs that caused issues on hosts with smaller disks were the l10n cache files. It looks like those are still hanging around much longer than needed:
>
> tin:~
> bd808$ ls -ldthr /srv/mediawiki-staging/php-*/cache/l10n/upstream/l10n_cache-en.cdb.json
> -rw-r--r-- 1 l10nupdate l10nupdate 3.2M Dec 15 01:31 /srv/mediawiki-staging/php-1.29.0-wmf.5/cache/l10n/upstream/l10n_cache-en.cdb.json
> -rw-r--r-- 1 l10nupdate l10nupdate 3.3M Jan  5 02:18 /srv/mediawiki-staging/php-1.29.0-wmf.6/cache/l10n/upstream/l10n_cache-en.cdb.json
> -rw-r--r-- 1 l10nupdate l10nupdate 3.3M Jan 18 20:13 /srv/mediawiki-staging/php-1.29.0-wmf.7/cache/l10n/upstream/l10n_cache-en.cdb.json
> -rw-r--r-- 1 l10nupdate l10nupdate 3.3M Jan 23 02:44 /srv/mediawiki-staging/php-1.29.0-wmf.8/cache/l10n/upstream/l10n_cache-en.cdb.json
> -rw-r--r-- 1 l10nupdate l10nupdate 3.3M Feb  2 02:20 /srv/mediawiki-staging/php-1.29.0-wmf.9/cache/l10n/upstream/l10n_cache-en.cdb.json
> -rw-r--r-- 1 l10nupdate l10nupdate 3.3M Feb  7 02:18 /srv/mediawiki-staging/php-1.29.0-wmf.10/cache/l10n/upstream/l10n_cache-en.cdb.json
>
> A full scap checks all of these files in the rsync. Getting rid of them sooner rather than later takes a small bit of load off of all of the MW hosts during a scap and can speed things up in a measurable way. "Soon" we will have git transport for all of this, which should make that mostly moot by using pre-computed diffs.

So there's an option in scap clean called --l10n-only. This should be documented; when I trimmed down the instructions I ended up removing a bit more than I should have. I'll fix that up in a bit.

> I agree that automatic cleanup is tricky, though, and probably harder to get right at this point than it's worth. Deciding if a branch purge is safe takes knowledge of the last time each branch would have been associated with an anon page view that was added to Varnish, and some idea of the reasonable maximum time that Varnish should be holding on to HTML that has a branch-versioned reference to static content in it.

So, my thought was encoding this logic in the clean plugin but erring on the side of caution: do it based on the number of branches, not dates. "If the branch is older than $current - 5, delete it; if the branch is older than the last one, delete its i18n." This would be correct most of the time, but would keep us from breaking things when we keep a branch for 2 (or more) weeks. This would also mean a deployer just types scap clean and it DWIMs.
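
(A hedged sketch of that count-based policy; the list ordering, the five-branch window, and the action names are assumptions about what such a clean plugin might do, not scap's actual implementation.)

# Sketch only: plan cleanup actions from branch position, not dates.
# "branches" is assumed to be every checked-out wmf branch, oldest to newest.
def plan_cleanup(branches):
    current_idx = len(branches) - 1
    actions = []
    for idx, branch in enumerate(branches):
        if idx < current_idx - 5:
            actions.append((branch, "delete branch"))        # older than current - 5
        elif idx < current_idx:
            actions.append((branch, "delete l10n cache"))    # keep code, drop i18n
    return actions


for branch, action in plan_cleanup([f"1.29.0-wmf.{n}" for n in range(3, 11)]):
    print(branch, "->", action)
# 1.29.0-wmf.3 and wmf.4 -> delete branch
# 1.29.0-wmf.5 through wmf.9 -> delete l10n cache (wmf.10, the current branch, is untouched)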

Change 336730 had a related patch set uploaded (by Chad):
Scap clean: Rework --l10n-only into --keep-static

https://gerrit.wikimedia.org/r/336730

Change 336730 merged by jenkins-bot:
Scap clean: Rework --l10n-only into --keep-static

https://gerrit.wikimedia.org/r/336730

scap clean does all of this. New bugs should be opened if there are issues with it.

demon claimed this task.