
Improve efficiency of scap l10n operations
Closed, Resolved | Public | 5 Estimated Story Points

Description

During scap sync-world, scap does the following to prepare or update L10N files for each active wiki version:

  • If CDB files already exist for a wiki version (e.g., they're from the prior train), they are all copied into a subdirectory of /tmp. This is 2.1GB of copied data.
  • rebuildLocalisationCache.php is executed, pointing at the temp directory. If CDB files exist there (because they were copied in the prior step), they are scanned to ensure that they are up to date. Files that don't need to change remain unaffected and files that do need to change are reconstructed and replaced atomically (on an individual basis). This is a very quick operation if the source files haven't changed (the usual case). If CDB files do not exist, they are created in the temp dir using multiple threads. This takes a minute or two to complete.
  • The CDB files in the temp dir are copied (as the l10nupdate user) to /srv/mediawiki-staging/php-<vers>/cache/l10n and the temp directory is deleted. In the usual case where last week's wiki version is still active, this results in copying 2.1GB of unchanged files back to where they originally resided.
  • The finalized CDB files are read to generate their contents in JSON format, resulting in 2.2GB of JSON files being created. Additional information is saved to avoid recreating a JSON file if its associated CDB file hasn't changed.
  • Target nodes are instructed to pull from the deploy server (and/or proxies).
  • Target nodes are instructed to run scap cdb-rebuild. This causes a node to read the JSON l10n files to generate CDB files on the node.
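
The "avoid recreating a JSON file if its associated CDB file hasn't changed" step can be illustrated with a small sketch. This is not scap's actual code: the function names, the use of SHA-256, and the idea of a sidecar "stamp" file holding the digest are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's contents."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        # Read in 1 MiB chunks so large CDB files are not loaded whole.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def needs_regeneration(source: Path, stamp: Path) -> bool:
    """True if `source` differs from the digest recorded in `stamp` (hypothetical sidecar file)."""
    if not stamp.exists():
        return True
    return stamp.read_text().strip() != file_digest(source)


def record_digest(source: Path, stamp: Path) -> None:
    """Remember the current digest of `source` for the next run."""
    stamp.write_text(file_digest(source) + "\n")
```

A caller would check needs_regeneration() before rebuilding a derived file and call record_digest() after a successful rebuild.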

Problems with this process:

  • Copying CDB files back and forth from /tmp is inefficient and unnecessary. They can be processed in-place. rebuildLocalisationCache.php operates sanely and never updates a CDB file in place. It always writes to a new file and performs an atomic rename to update a file.
  • There is no need to create scap-specific JSON versions of CDB files. All of the information needed to generate the CDB files is available on target nodes (the l10n .json files in the source code). Target nodes can just run rebuildLocalisationCache.php to rebuild CDB files.
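
The write-to-a-new-file-then-rename pattern that makes in-place processing safe can be sketched as follows. This is a generic Python illustration of the technique, not the PHP code in rebuildLocalisationCache.php; atomic_write and the ".tmp-" prefix are hypothetical names.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace `path` so readers see either the old or the new file, never a partial one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must be on the same filesystem as the target,
    # otherwise the final rename is not atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes are on disk before renaming
        os.replace(tmp_path, path)  # atomic on POSIX when both paths share a filesystem
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

Note that mkstemp creates the file with mode 0600, so a real tool would likely adjust permissions before the rename.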

Proposal:

  • Don't copy CDB files to /tmp on the deploy server. This has been implemented in scap and an updated scap has been deployed to the beta cluster. After ironing out a few issues it looks good, and the beta-scap-sync-world job now runs faster than it ever has.
  • Change scap cdb-rebuild to run rebuildLocalisationCache.php. This didn't work out due to the fact that rebuildLocalisationCache.php must be run as www-data but the output files must be owned by mwdeploy. This could be worked around by running it as www-data and then copying the resulting files as the mwdeploy user, but that works against the "increase efficiency" goal of this ticket.

Event Timeline

Restricted Application added a subscriber: Aklapper. Nov 8 2021, 5:50 PM

Change 737495 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[mediawiki/tools/scap@master] Add cdb_rebuild_using_rebuildLocalisationCache config option

https://gerrit.wikimedia.org/r/737495

Note that the cdb->json->rsync->json->cdb dance is expected to go away "soon" per T99740: Use static php array files for l10n cache at WMF (instead of CDB).

dancy changed the task status from Open to In Progress. Nov 10 2021, 8:55 PM

Change 738461 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/puppet@production] mediawiki: Ensure mwdeploy user is a member of the www-data group

https://gerrit.wikimedia.org/r/738461

Change 738453 had a related patch set uploaded (by Thcipriani; author: Ahmon Dancy):

[mediawiki/tools/scap@master] Avoid copying L10N files from/to /tmp

https://gerrit.wikimedia.org/r/738453

Change 738954 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[mediawiki/tools/train-dev@master] deploy: Ensure mwdeploy user is a member of the www-data group

https://gerrit.wikimedia.org/r/738954

Change 738453 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[mediawiki/tools/scap@master] Avoid copying L10N files from/to /tmp on deploy server

https://gerrit.wikimedia.org/r/738453

Change 738453 merged by jenkins-bot:

[mediawiki/tools/scap@master] Avoid copying L10N files to/from /tmp on deploy server

https://gerrit.wikimedia.org/r/738453

Change 739040 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[mediawiki/tools/scap@master] sync_master: Use same CDB rebuild technique as main deploy server

https://gerrit.wikimedia.org/r/739040

Change 739040 merged by jenkins-bot:

[mediawiki/tools/scap@master] sync_master: Use same CDB rebuild technique as main deploy server

https://gerrit.wikimedia.org/r/739040

Change 739620 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/puppet@production] beta::autoupdater Don't mess with ${stage_dir}/php-master/cache/l10n

https://gerrit.wikimedia.org/r/739620

Change 739620 merged by Dzahn:

[operations/puppet@production] beta::autoupdater Don't mess with ${stage_dir}/php-master/cache/l10n

https://gerrit.wikimedia.org/r/739620

Change 737495 abandoned by Ahmon Dancy:

[mediawiki/tools/scap@master] Add cdb_rebuild_using_rebuildLocalisationCache config option

Reason:

not working out securitywise

https://gerrit.wikimedia.org/r/737495

Change 738461 abandoned by Ahmon Dancy:

[operations/puppet@production] mediawiki: Ensure mwdeploy user is a member of the www-data group

Reason:

Not working out securitywise

https://gerrit.wikimedia.org/r/738461

Change 738954 abandoned by Ahmon Dancy:

[mediawiki/tools/train-dev@master] deploy: Ensure mwdeploy user is a member of the www-data group

Reason:

https://gerrit.wikimedia.org/r/738954

dancy triaged this task as Medium priority.
dancy updated the task description.

Change 739874 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[mediawiki/tools/scap@master] scap clean: Delete cache/l10n separately

https://gerrit.wikimedia.org/r/739874

Change 739874 abandoned by Ahmon Dancy:

[mediawiki/tools/scap@master] scap clean: Delete cache/l10n separately

Reason:

only affects obsolete code.

https://gerrit.wikimedia.org/r/739874

Change 739907 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[mediawiki/tools/scap@master] scap clean / scap prep mods for T295304

https://gerrit.wikimedia.org/r/739907

Change 739907 merged by jenkins-bot:

[mediawiki/tools/scap@master] scap clean / scap prep mods for T295304

https://gerrit.wikimedia.org/r/739907

hashar reopened this task as Open. Edited Nov 19 2021, 10:27 AM
hashar subscribed.

The deployment-prep job has been broken since 11/19 06:35 UTC (https://integration.wikimedia.org/ci/job/beta-scap-sync-world/):

06:35:13 Updating LocalisationCache for master using 6 thread(s)
06:35:31 Last output:
cp: cannot create regular file '/srv/mediawiki-staging/php-master/cache/l10n/l10n_cache-ab.cdb': Permission denied
cp: cannot create regular file '/srv/mediawiki-staging/php-master/cache/l10n/l10n_cache-abs.cdb': Permission denied
cp: cannot create regular file '/srv/mediawiki-staging/php-master/cache/l10n/l10n_cache-ace.cdb': Permission denied
cp: cannot create regular file '/srv/mediawiki-staging/php-master/cache/l10n/l10n_cache-ady-cyrl.cdb': Permission denied
cp: cannot create regular file '/srv/mediawiki-staging/php-master/cache/l10n/l10n_cache-ady.cdb': Permission denied

The l10n directory has drwxr-xr-x 3 www-data www-data

$ stat /srv/mediawiki-staging/php-master/cache/l10n/
  File: /srv/mediawiki-staging/php-master/cache/l10n/
  Size: 20480     	Blocks: 40         IO Block: 4096   directory
Device: fd00h/64768d	Inode: 1886732     Links: 3
Access: (0755/drwxr-xr-x)  Uid: (   33/www-data)   Gid: (   33/www-data)
Access: 2021-11-18 20:06:09.832989652 +0000
Modify: 2021-11-18 20:06:09.064930001 +0000
Change: 2021-11-18 20:06:09.064930001 +0000
 Birth: -

And a file is: -rw-r--r-- 1 www-data www-data

$ stat /srv/mediawiki-staging/php-master/cache/l10n/l10n_cache-ab.cdb
  File: /srv/mediawiki-staging/php-master/cache/l10n/l10n_cache-ab.cdb
  Size: 5767286   	Blocks: 11272      IO Block: 4096   regular file
Device: fd00h/64768d	Inode: 1888622     Links: 1
Access: (0644/-rw-r--r--)  Uid: (   33/www-data)   Gid: (   33/www-data)
Access: 2021-11-18 20:06:10.849068568 +0000
Modify: 2021-11-18 20:05:15.564776387 +0000
Change: 2021-11-18 20:05:15.564776387 +0000
 Birth: -

What I witnessed is that the Debian unattended-upgrades system triggered an upgrade of scap. From `/var/log/apt/term.log`:

Log started: 2021-11-19  06:24:50
Preparing to unpack .../python-pyasn1_0.4.2-3~bpo9+1~wmf1_all.deb ...
Unpacking python-pyasn1 (0.4.2-3~bpo9+1~wmf1) over (0.1.9-2) ...
Preparing to unpack .../python3-pyasn1_0.4.2-3~bpo9+1~wmf1_all.deb ...
Unpacking python3-pyasn1 (0.4.2-3~bpo9+1~wmf1) over (0.1.9-2) ...
Preparing to unpack .../archives/scap_4.0.3-2_all.deb ...
Unpacking scap (4.0.3-2) over (4.0.3-1+0~20211117232016.94~1.gbpafe1e5) ...
Setting up python-pyasn1 (0.4.2-3~bpo9+1~wmf1) ...
Setting up python3-pyasn1 (0.4.2-3~bpo9+1~wmf1) ...
Setting up scap (4.0.3-2) ...
Processing triggers for man-db (2.7.6.1-2) ...
Log ended: 2021-11-19  06:24:55

So scap went from 4.0.3-1+0~20211117232016.94~1.gbpafe1e5 to the new 4.0.3-2, which comes from stretch-wikimedia/main.

I moved /srv/mediawiki-staging/php-master/cache/l10n to l10n-old and ran Puppet, which did not do anything. I then triggered a new build at https://integration.wikimedia.org/ci/job/beta-scap-sync-world/27878/console

Its output shows:

10:48:05 10:48:05 Started l10n-update
10:48:05 10:48:05 Bootstrapping l10n cache for master
10:48:06 10:48:06 Updating ExtensionMessages-master.php
10:48:07 10:48:07 Updating LocalisationCache for master using 6 thread(s)
10:48:53 10:48:53 Last output:
10:48:53 cp: target '/srv/mediawiki-staging/php-master/cache/l10n' is not a directory

And it fails because l10n is a regular file, not a directory!

$ stat l10n
  File: l10n
  Size: 4601428   	Blocks: 8992       IO Block: 4096   regular file
Device: fd00h/64768d	Inode: 1715810     Links: 1
Access: (0644/-rw-r--r--)  Uid: (10002/l10nupdate)   Gid: (10002/l10nupdate)
Access: 2021-11-19 10:48:06.639924401 +0000
Modify: 2021-11-19 10:48:06.647925035 +0000
Change: 2021-11-19 10:48:06.647925035 +0000
 Birth: -

I manually created the l10n directory as root:root with 777 permissions, which seems to have done the trick. I later changed it to be owned by l10nupdate:l10nupdate with permissions 0755, just like in production.

To be done:

  • neither scap nor puppet creates the l10n directory (might be a configuration issue with deployment-prep?)
  • find out why an l10n file is created rather than a directory
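
Both failures above (a directory that cp cannot write to, then a regular file where a directory was expected) show up in the same stat fields. A small diagnostic sketch follows; describe_l10n_path is a hypothetical helper written for this illustration, not part of scap or Puppet.

```python
import os
import stat


def describe_l10n_path(path: str) -> str:
    """Summarize what exists at `path`: missing, not a directory, or a directory with its mode and owner."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return "missing"
    if not stat.S_ISDIR(st.st_mode):
        # This is the state that made cp fail with "is not a directory".
        return "not a directory"
    return "directory mode %04o uid %d gid %d" % (
        stat.S_IMODE(st.st_mode),
        st.st_uid,
        st.st_gid,
    )
```

Running it against /srv/mediawiki-staging/php-master/cache/l10n before the copy step would distinguish the two failure modes without reading the cp errors.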

Mentioned in SAL (#wikimedia-releng) [2021-11-19T11:09:27Z] <hashar> deployment-prep: fixed l10n permission issue that caused scap to abort early since 6:35 UTC # T295304

As of https://gerrit.wikimedia.org/r/c/mediawiki/tools/scap/+/738453 (which is included in 4.0.3-1+0~20211117232016.94~1.gbpafe1e5) the cache/l10n directory is supposed to be owned by www-data and if it doesn't exist, rebuildLocalisationCache.php will create it (as a directory).

The primary problem here is that scap was auto-downgraded. It looks like that happened on some beta hosts:
deployment-deploy01.deployment-prep.eqiad1.wikimedia.cloud,deployment-echostore01.deployment-prep.eqiad1.wikimedia.cloud,deployment-imagescaler03.deployment-prep.eqiad1.wikimedia.cloud,deployment-kafka-jumbo-[1-2].deployment-prep.eqiad1.wikimedia.cloud,deployment-kafka-main-[1-2].deployment-prep.eqiad1.wikimedia.cloud,deployment-maps08.deployment-prep.eqiad1.wikimedia.cloud,deployment-ores01.deployment-prep.eqiad1.wikimedia.cloud,deployment-restbase03.deployment-prep.eqiad1.wikimedia.cloud,deployment-snapshot02.deployment-prep.eqiad1.wikimedia.cloud,deployment-webperf[11-12].deployment-prep.eqiad1.wikimedia.cloud

But these stayed at the version of scap that I deployed:
deployment-deploy03.deployment-prep.eqiad1.wikimedia.cloud,deployment-eventlog08.deployment-prep.eqiad1.wikimedia.cloud,deployment-jobrunner04.deployment-prep.eqiad1.wikimedia.cloud,deployment-mediawiki[11-12].deployment-prep.eqiad1.wikimedia.cloud,deployment-mwmaint02.deployment-prep.eqiad1.wikimedia.cloud,deployment-parsoid12.deployment-prep.eqiad1.wikimedia.cloud,deployment-sessionstore04.deployment-prep.eqiad1.wikimedia.cloud,deployment-snapshot03.deployment-prep.eqiad1.wikimedia.cloud

Not sure why this has happened on some hosts but not all, but the explanation is simple-ish:

taavi@deployment-deploy01:~$ apt-cache policy scap
scap:
  Installed: 4.0.3-2
  Candidate: 4.0.3-1+0~20211118193528.98~1.gbp383008
  Version table:
 *** 4.0.3-2 1001
       1001 http://apt.wikimedia.org/wikimedia stretch-wikimedia/main amd64 Packages
        100 /var/lib/dpkg/status
     4.0.3-1+0~20211118193528.98~1.gbp383008 1500
       1500 http://deployment-deploy01.deployment-prep.eqiad.wmflabs/repo stretch-deployment-prep/main amd64 Packages
       1500 http://deployment-deploy01.deployment-prep.eqiad.wmflabs/repo stretch-deployment-prep/main all Packages

apt.wikimedia.org has scap 4.0.3-2, which is a greater version number than 4.0.3-1+0~something. Unattended-upgrades on Cloud VPS has been configured to upgrade all packages (including ones from apt.wikimedia.org) by default.
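
Why 4.0.3-2 outranks 4.0.3-1+0~something can be checked with a simplified sketch of the Debian version ordering rules (epoch, then upstream version, then revision, with "~" sorting before everything). This is an illustration only: the function names are made up here, and dpkg itself (for example `dpkg --compare-versions`) remains the authority.

```python
def _order(ch: str) -> int:
    """Sort key for one non-digit character ('' stands for end of string)."""
    if ch == "~":
        return -1  # tilde sorts before everything, even end of string
    if ch == "":
        return 0
    if ch.isalpha():
        return ord(ch)  # letters sort before other symbols
    return ord(ch) + 256


def _compare_part(a: str, b: str) -> int:
    """Compare two upstream-version or revision strings; returns -1, 0 or 1."""
    i = j = 0
    while i < len(a) or j < len(b):
        # Compare the non-digit prefixes character by character.
        while (i < len(a) and not a[i].isdigit()) or (j < len(b) and not b[j].isdigit()):
            ca = a[i] if i < len(a) and not a[i].isdigit() else ""
            cb = b[j] if j < len(b) and not b[j].isdigit() else ""
            if _order(ca) != _order(cb):
                return -1 if _order(ca) < _order(cb) else 1
            i += 1
            j += 1
        # Then compare the following digit runs numerically (empty run counts as 0).
        si, sj = i, j
        while i < len(a) and a[i].isdigit():
            i += 1
        while j < len(b) and b[j].isdigit():
            j += 1
        na, nb = int(a[si:i] or "0"), int(b[sj:j] or "0")
        if na != nb:
            return -1 if na < nb else 1
    return 0


def _split(version: str):
    """Split '[epoch:]upstream[-revision]' into (epoch, upstream, revision)."""
    epoch, rest = "0", version
    if ":" in version:
        epoch, _, rest = version.partition(":")
    upstream, sep, revision = rest.rpartition("-")
    if not sep:  # no hyphen: the whole string is the upstream version
        upstream, revision = rest, ""
    return int(epoch), upstream, revision


def compare_versions(a: str, b: str) -> int:
    """Return -1, 0 or 1 if version a sorts before, equal to, or after b."""
    ea, ua, ra = _split(a)
    eb, ub, rb = _split(b)
    if ea != eb:
        return -1 if ea < eb else 1
    return _compare_part(ua, ub) or _compare_part(ra, rb)
```

With the versions from this task, the upstream parts are both 4.0.3, so the revisions decide: "2" beats "1+0~..." on the first number, while "2+0~..." beats plain "2" because "+" sorts after end of string.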

The primary problem here is that scap was auto-downgraded. It looks like that happened on some beta hosts:
deployment-deploy01.deployment-prep.eqiad1.wikimedia.cloud,deployment-echostore01.deployment-prep.eqiad1.wikimedia.cloud,deployment-imagescaler03.deployment-prep.eqiad1.wikimedia.cloud,deployment-kafka-jumbo-[1-2].deployment-prep.eqiad1.wikimedia.cloud,deployment-kafka-main-[1-2].deployment-prep.eqiad1.wikimedia.cloud,deployment-maps08.deployment-prep.eqiad1.wikimedia.cloud,deployment-ores01.deployment-prep.eqiad1.wikimedia.cloud,deployment-restbase03.deployment-prep.eqiad1.wikimedia.cloud,deployment-snapshot02.deployment-prep.eqiad1.wikimedia.cloud,deployment-webperf[11-12].deployment-prep.eqiad1.wikimedia.cloud

All Debian Buster hosts above.

But these stayed at the version of scap that I deployed:
deployment-deploy03.deployment-prep.eqiad1.wikimedia.cloud,deployment-eventlog08.deployment-prep.eqiad1.wikimedia.cloud,deployment-jobrunner04.deployment-prep.eqiad1.wikimedia.cloud,deployment-mediawiki[11-12].deployment-prep.eqiad1.wikimedia.cloud,deployment-mwmaint02.deployment-prep.eqiad1.wikimedia.cloud,deployment-parsoid12.deployment-prep.eqiad1.wikimedia.cloud,deployment-sessionstore04.deployment-prep.eqiad1.wikimedia.cloud,deployment-snapshot03.deployment-prep.eqiad1.wikimedia.cloud

All Debian Stretch hosts above.

I upgraded the scap package on beta hosts to 4.0.3-2+0~20211119163357.100~1.gbp08fad4. This version number is greater than 4.0.3-2, so it shouldn't get auto-downgraded again.

Some improvements were made but I couldn't do everything. Closing as resolved.