Add cache key information to metadata json
Open, Medium, Public

Description

			// Input for cache key
			$cacheOptions = [
				'code' => $code,
				'lang' => $options['lang'],
				'note-language' => $options['note-language'],
				'raw'  => $options['raw'],
				'ExtVersion' => self::CACHE_VERSION,
				'LyVersion' => self::getLilypondVersion(),
			];

			$imageCacheName = Wikimedia\base_convert( sha1( serialize( $cacheOptions ) ), 16, 36, 31 );

The above generates the cache key, which is all well and good, but the files left on disk give us no way of knowing, for example, which version of lilypond generated them.

So when lilypond is upgraded, files are regenerated, and the old files stay on disk forever (currently), with no easy way to know which are the old or new versions
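
One way to address this, sketched in Python rather than the extension's PHP, is to write the cache key inputs into the metadata JSON alongside the derived key. The `build_metadata` helper, the `cacheKey` and `cacheOptions` field names, and the use of JSON in place of PHP's `serialize()` are all assumptions for illustration, so keys produced here will not match the ones Score generates.

```python
import hashlib
import json

BASE36_DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def to_base36(hex_digest: str, pad: int = 31) -> str:
    """Convert a hex string to zero-padded lowercase base 36."""
    number = int(hex_digest, 16)
    digits = ""
    while number:
        number, remainder = divmod(number, 36)
        digits = BASE36_DIGITS[remainder] + digits
    return digits.rjust(pad, "0")


def build_metadata(cache_options: dict) -> dict:
    """Return the cache key together with the inputs that produced it."""
    # Assumption: JSON with sorted keys stands in for PHP serialize().
    serialized = json.dumps(cache_options, sort_keys=True)
    sha1_hex = hashlib.sha1(serialized.encode("utf-8")).hexdigest()
    return {
        "cacheKey": to_base36(sha1_hex),
        # Hypothetical field: keeping the inputs lets a cleanup job read
        # the lilypond version straight from the file on disk.
        "cacheOptions": cache_options,
    }


example = build_metadata({
    "code": "\\relative c' { c d e f }",
    "lang": "lilypond",
    "note-language": None,
    "raw": False,
    "ExtVersion": 1,         # placeholder for self::CACHE_VERSION
    "LyVersion": "2.22.0",   # placeholder for getLilypondVersion()
})
print(json.dumps(example, indent=2))
```

With the inputs stored next to the key, a cleanup job could compare `LyVersion` against the installed version instead of relying on file timestamps alone.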

Event Timeline

Krinkle added subscribers: fgiunchedi, Krinkle.

+1 for adding the CACHE_VERSION to the file path. That way, once the parser cache has rolled over after 30 days, all previous files can be wiped, knowing that any page view needing one would be a miss and would regenerate it as needed.

Score currently has its own custom way of directory sharding (using 1+1 character subdirs instead of 1+2 chars). I don't know whether this still matters. My limited understanding of Swift suggests that it doesn't: any iteration over all files from a given wiki or extension would go by the overall "score/" prefix, so it should not matter whether the name contains slashes. The same seems true for other large wiki farms and other object stores. If NFS is still considered a reasonable middle ground for medium-size wikis outside WMF, we could keep it, I guess. But maybe it's not needed?

@fgiunchedi Given an upcoming change to Score's files stored in Swift, do you have a preference for whether and how much subdirectory sharding is applied to its files? E.g. would score/cache23/abcdef0123456789/abcdef.png be fine, or would you prefer something more like score/cache23/a/ab/abcdef0123456789/abcdef.png or score/a/ab/abcdef0123456789_cache23/abcdef.png?

Thank you for reaching out @Krinkle, your understanding is correct: within a container what matters for iteration over files is the prefix.

In this case I think it makes sense to go with score/<version>/a/ab/abcdef0123456789/abcdef.<type>, both to be able to operate in batches on a version and to keep symmetry with media storage and its 1+2 scheme.
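
That layout can be sketched as a small path builder. This is an illustration in Python, not Score's code: the `score_path` and `is_stale` helper names are hypothetical, and the `cache23` version label is borrowed from the example paths earlier in the thread.

```python
def score_path(version: str, cache_key: str, file_name: str) -> str:
    """Build score/<version>/a/ab/<cache_key>/<file_name> (1+2 sharding)."""
    if len(cache_key) < 2:
        raise ValueError("cache key too short to shard")
    return "/".join([
        "score",
        version,
        cache_key[0],    # first shard level: 1 character
        cache_key[:2],   # second shard level: 2 characters
        cache_key,
        file_name,
    ])


def is_stale(object_name: str, current_version: str) -> bool:
    """True if the object lives under a different version prefix."""
    return not object_name.startswith(f"score/{current_version}/")


path = score_path("cache23", "abcdef0123456789", "abcdef.png")
print(path)                       # score/cache23/a/ab/abcdef0123456789/abcdef.png
print(is_stale(path, "cache24"))  # True
```

Putting the version first means a whole generation of files shares one prefix, so a batch delete only needs that prefix rather than a scan of every object.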

Not strictly in scope for this task but I thought I'd mention it: score is a single container AFAICT (global-data-score-render) with 3.8M files and 300G (~78k average file size). If performance becomes a problem we can also consider sharding the container into multiple ones like big wikis, e.g. global-data-score-render.aa (for comparison, commons thumbs containers have ~128k average file size, at ~7M files and ~800G per container). In that case the two-letter sharding comes in handy to match media storage.

So when lilypond is upgraded, files are regenerated, and the old files stay on disk forever (currently), with no easy way to know which are the old or new versions

I note this isn't currently true. I created rESCRf30683d0d935: Add maintenance script to delete old Score files as part of the issues to clear things up. I never wired it into puppet for automated runs (because Score was disabled, so it wasn't creating any more files at the time).

I'm running it now (as of writing; I started it this morning) up to 20200101000000 (which I'm sure I did before)... I will report back before running it with a newer date too. It might be worth looking at the stats again at that point.

fgiunchedi triaged this task as Medium priority. Aug 30 2021, 8:08 AM

Ok, I have now deleted everything before 20210101000000

reedy@mwmaint1002:~$ mwscript extensions/Score/maintenance/GetLYFiles.php --wiki=enwiki --date=20210101000000
Total files (all extensions): 251527
53321 ly files created on or after 20210101000000

Not strictly in scope for this task but I thought I'd mention it: score is a single container AFAICT (global-data-score-render) with 3.8M files and 300G (~78k average file size). If performance becomes a problem we can also consider sharding the container into multiple ones like big wikis, e.g. global-data-score-render.aa (for comparison, commons thumbs containers have ~128k average file size, at ~7M files and ~800G per container). In that case the two-letter sharding comes in handy to match media storage.

I'd be curious to see some updated stats on this now.

We're down to roughly 250k files and 15G:

$ swift stat global-data-score-render
       Container: global-data-score-render
       Objects: 251960
       Bytes: 15426769106

(available with mw swift credentials, and the swift python client)
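
As a quick check on those figures, the average object size follows from the two counters. The sketch below, in Python, parses the output quoted above from a string rather than calling Swift; the `parse_stat` helper is hypothetical.

```python
STAT_OUTPUT = """\
       Container: global-data-score-render
       Objects: 251960
       Bytes: 15426769106
"""


def parse_stat(text: str) -> dict:
    """Turn 'Key: value' lines into a dict of stripped strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


stats = parse_stat(STAT_OUTPUT)
objects = int(stats["Objects"])
total_bytes = int(stats["Bytes"])
average = total_bytes // objects
print(f"{objects} objects, {total_bytes / 1e9:.1f} GB, ~{average} bytes each")
```

That works out to roughly 61,000 bytes per object, compared with the ~78k average quoted before the cleanup.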