During scap sync-world, scap does the following to prepare or update L10N files for each active wiki version:
- If CDB files already exist for a wiki version (e.g., they're from the prior train), they are all copied into a subdirectory of /tmp. This is 2.1GB of copied data.
- rebuildLocalisationCache.php is executed, pointing at the temp directory. If CDB files exist there (because they were copied in the prior step), they are scanned to ensure that they are up to date. Files that don't need to change remain unaffected and files that do need to change are reconstructed and replaced atomically (on an individual basis). This is a very quick operation if the source files haven't changed (the usual case). If CDB files do not exist, they are created in the temp dir using multiple threads. This takes a minute or two to complete.
- The CDB files in the temp dir are copied (as the l10nupdate user) to /srv/mediawiki-staging/php-<vers>/cache/l10n and the temp directory is deleted. In the usual case where last week's wiki version is still active, this results in copying 2.1GB of unchanged files back to where they originally resided.
- The finalized CDB files are read to generate their contents in JSON format, resulting in 2.2GB of JSON files being created. Additional information is saved to avoid recreating a JSON file if its associated CDB file hasn't changed.
- Target nodes are instructed to pull from the deploy server (and/or proxies).
- Target nodes are instructed to run scap cdb-rebuild. This causes a node to read the JSON l10n files to generate CDB files on the node.
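The "avoid recreating a JSON file if its CDB file hasn't changed" step above could look roughly like the following. This is only a sketch: the ticket does not say what "additional information" scap saves, so the MD5 sidecar file, the function names, and the `read_entries` callable are all assumptions made for illustration, not scap's actual code.

```python
import hashlib
import json
import os


def file_md5(path):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def export_if_changed(cdb_path, json_path, read_entries):
    """Regenerate json_path only when cdb_path's checksum has changed.

    read_entries is a hypothetical callable that returns the CDB's
    key/value pairs as a dict; a real CDB reader is not shown here.
    Returns True if the JSON file was (re)written, False if it was kept.
    """
    checksum_path = json_path + ".md5"  # hypothetical sidecar file
    current = file_md5(cdb_path)
    if os.path.exists(json_path) and os.path.exists(checksum_path):
        with open(checksum_path) as f:
            if f.read().strip() == current:
                return False  # CDB unchanged, keep the existing JSON
    with open(json_path, "w") as f:
        json.dump(read_entries(cdb_path), f)
    with open(checksum_path, "w") as f:
        f.write(current)
    return True
```

Even with the skip, every CDB file still has to be read in full to compute its checksum, so this step is never free.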
Problems with this process:
- Copying CDB files back and forth from /tmp is inefficient and unnecessary. They can be processed in-place. rebuildLocalisationCache.php operates sanely and never updates a CDB file in place. It always writes to a new file and performs an atomic rename to update a file.
- There is no need to create scap-specific JSON versions of CDB files. All of the information needed to generate the CDB files is available on target nodes (the l10n .json files in the source code). Target nodes can just run rebuildLocalisationCache.php to rebuild CDB files.
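The write-to-a-new-file-then-rename pattern is what makes in-place processing safe: a reader sees either the old file or the new one, never a partial write. Below is a minimal Python illustration of that pattern. It is not the code in rebuildLocalisationCache.php (which is PHP and not shown in this ticket), and the `atomic_write` name is made up for this example.

```python
import os
import tempfile


def atomic_write(path, data):
    """Write data (bytes) to path so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory (same filesystem) as
    # the target; a rename across filesystems is not atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # atomic on POSIX within one filesystem
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        os.unlink(tmp_path)
        raise
```

Because the target is only ever swapped by a rename, there is no window in which a concurrent reader could open a half-written file, so a staging copy under /tmp adds no extra safety.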
Proposal:
- Don't copy CDB files to /tmp on the deploy server. This has been implemented in scap, and the updated scap has been deployed to the beta cluster. After ironing out a few issues it looks good, and the beta-scap-sync-world job now runs faster than it ever has.
- Change scap cdb-rebuild to run rebuildLocalisationCache.php. This didn't work out because rebuildLocalisationCache.php must be run as www-data but the output files must be owned by mwdeploy. This could be worked around by running it as www-data and then copying the resulting files as the mwdeploy user, but that works against the "increase efficiency" goal of this ticket.