Page MenuHomePhabricator

Investigate scap fails to create /srv/mediawiki-staging/php-1.43.0-wmf.20/cache/l10n
Closed, ResolvedPublic

Description

When running the MediaWiki train (T366965), scap failed to rebuild the localisation cache.

10:16:56 Running rebuildLocalisationCache.php as www-data
10:16:57 RuntimeException from line 88 of /srv/mediawiki-staging/php-1.43.0-wmf.20/includes/language/LCStoreCDB.php: Unable to create the localisation store directory "/srv/mediawiki-staging/php-1.43.0-wmf.20/cache/l10n"
10:16:57 #0 /srv/mediawiki-staging/php-1.43.0-wmf.20/includes/language/LocalisationCache.php(1285): LCStoreCDB->startWrite('en')
10:16:57 #1 /srv/mediawiki-staging/php-1.43.0-wmf.20/includes/language/LocalisationCacheBulkLoad.php(103): LocalisationCache->recache('en')
10:16:57 #2 /srv/mediawiki-staging/php-1.43.0-wmf.20/maintenance/rebuildLocalisationCache.php(256): LocalisationCacheBulkLoad->recache('en')
10:16:57 #3 /srv/mediawiki-staging/php-1.43.0-wmf.20/maintenance/rebuildLocalisationCache.php(188): RebuildLocalisationCache->doRebuild(Array, Object(LocalisationCacheBulkLoad), false)
10:16:57 #4 /srv/mediawiki-staging/php-1.43.0-wmf.20/maintenance/includes/MaintenanceRunner.php(703): RebuildLocalisationCache->execute()
10:16:57 #5 /srv/mediawiki-staging/php-1.43.0-wmf.20/maintenance/run.php(51): MediaWiki\Maintenance\MaintenanceRunner->run()
10:16:57 #6 /srv/mediawiki-staging/multiversion/MWScript.php(158): require_once('/srv/mediawiki-...')
10:16:57 #7 {main}
10:16:57 Failed.
10:16:57 Finished l10n-update (duration: 00m 01s)
10:16:57 Unhandled error:
Traceback (most recent call last):
  File "/var/lib/scap/scap/lib/python3.9/site-packages/scap/cli.py", line 593, in run
    exit_status = app.main(app.extra_arguments)
  File "/var/lib/scap/scap/lib/python3.9/site-packages/scap/main.py", line 1014, in main
    return super().main(*extra_args)
  File "/var/lib/scap/scap/lib/python3.9/site-packages/scap/main.py", line 138, in main
    self._update_caches()
  File "/var/lib/scap/scap/lib/python3.9/site-packages/scap/main.py", line 1039, in _update_caches
    tasks.update_localization_cache(version, self)
  File "/var/lib/scap/scap/lib/python3.9/site-packages/scap/utils.py", line 328, in context_wrapper
    return func(*args, **kwargs)
  File "/var/lib/scap/scap/lib/python3.9/site-packages/scap/tasks.py", line 617, in update_localization_cache
    _call_rebuildLocalisationCache(
  File "/var/lib/scap/scap/lib/python3.9/site-packages/scap/tasks.py", line 553, in _call_rebuildLocalisationCache
    _rebuild("LCStoreCDB", "cdb")
  File "/var/lib/scap/scap/lib/python3.9/site-packages/scap/tasks.py", line 531, in _rebuild
    utils.sudo_check_call(
  File "/var/lib/scap/scap/lib/python3.9/site-packages/scap/utils.py", line 328, in context_wrapper
    return func(*args, **kwargs)
  File "/var/lib/scap/scap/lib/python3.9/site-packages/scap/utils.py", line 462, in sudo_check_call
    raise subprocess.CalledProcessError(proc.returncode, cmd)
subprocess.CalledProcessError: Command '/usr/local/bin/mwscript rebuildLocalisationCache.php --wiki=aawiki --force-version "1.43.0-wmf.20" --no-progress --store-class=LCStoreCDB --threads=22 --lang en  --quiet ' returned non-zero exit status 1.

Event Timeline

hashar triaged this task as Unbreak Now! priority.Aug 27 2024, 10:30 AM

/srv/mediawiki-staging/php-1.43.0-wmf.20/cache/ is owned by mwpresync:deployment with permissions rwxrwsr-x / 2775.

That differs from the previous weeks permissions which had drwxrwsrwx / 2777 (aka writable by others). I took /srv/mediawiki-staging/php-1.43.0-wmf.19/cache/ as a reference.

So surely the l10n directory creation failed!

I have manually changed the file mode using:

sudo -u mwpresync chmod o+w /srv/mediawiki-staging/php-1.43.0-wmf.20/cache/
hashar claimed this task.

php-1.43.0-wmf.18/cache/ and php-1.43.0-wmf.19/cache/ were created at 3:02am UTC in the last couple weeks so. scap has:

cache_dir = os.path.join(dest_dir, "cache")
if os.geteuid() == os.stat(cache_dir).st_uid:
    logger.debug("Making cache dir world writable")
    os.chmod(cache_dir, os.stat(cache_dir).st_mode | 0o777)

If I look for Making cache dir world writable I see entry around 3am for August 20 or August 13th but there is none last night. The chmod has never occurred, the cache directory is thus not writable by others and rebuildLocalisationCache.php can't write to it since it runs as www-data.

The reason is the branch cut had failed. The automatic presync did not pass, from systemctl status train-presync:

  Active: inactive (dead) since Tue 2024-08-27 03:00:34 UTC; 7h ago
 Process: 4019974 ExecStart=/usr/local/bin/systemd-timer-mail-wrapper --subject train-presync --mail-to releng@lists.wikimedia.org /usr/bin/scap stage-train -Dfull_image_build:True --yes auto (code=exited, status=70)
Main PID: 4019974 (code=exited, status=70)

That is the script invoking scap stage-train and thus scap prep and thus doing the chmod 777 cache. Instead I went to use scape train which does NOT run that step.

dancy lowered the priority of this task from Unbreak Now! to Medium.
thcipriani renamed this task from scap fails to create /srv/mediawiki-staging/php-1.43.0-wmf.20/cache/l10n to Investigate scap fails to create /srv/mediawiki-staging/php-1.43.0-wmf.20/cache/l10n.Aug 28 2024, 5:18 PM
thcipriani subscribed.

Question for investigation:

  • What were the steps leading to this state?
    • How did you check out the code for this week's train if not by using scap prep?
    • The output above from rebuildLocalisationCache—what command was being run at that time?

Branch cut fails

The automatic branch cut failed: build 243, which I have marked to be kept indefinitely. It ran:

./branch.py --core --core-bundle wmf_core --bundle wmf_branch --branchpoint HEAD --core-version 1.43.0-wmf.20 --push-option l=Code-Review+2 --task T366965 wmf/1.43.0-wmf.20

Which has created the wmf branches in Gerrit for all repositories by using the HTTP REST API.

The script then clones mediawiki/core and checkout the new wmf/1.43.0-wmf.20, adds all repositories as submodules, craft a commit and then pushes it with: git push origin HEAD:refs/for/wmf/1.43.0-wmf.20 --push-option=l=Code-Review+2. That was rejected with:

Output: fatal: could not read Username for 'https://gerrit.wikimedia.org': No such device or address

The reason is T373441.

Status

At this point mediawiki/core has a wmf/1.43.0-wmf.20 which has been created from master. The branch does not have any submodule registered since the commit has not been pushed.

The branch cut occurred at 1:00 UTC.

Presync fails

I wake up, check my emails and my inbox looks like (times are CEST / UTC+2):

branch_cut_emails.png (114×875 px, 49 KB)

The culprit build # 243 is following another failure and I automatically discard it as if it was the branch cut pretest that failed. It is known to be flappy due to various reasons (changes to scap, Gerrit, CI test failing etc).

The FAIL: train-presync is the systemd timer that failed with:

Its return value was 70 and emitted the following output:

03:00:07 Initializing stage-train auto mode
03:00:07 Retrieving train information...
03:00:10 Using version 1.43.0-wmf.20
03:00:10 ----------------------------
1. Starting: prep
03:00:10 Started scap prep 1.43.0-wmf.20
03:00:10 Copying patches from /srv/patches/1.43.0-wmf.19 to /srv/patches/1.43.0-wmf.20
03:00:10 Clone https://gerrit.wikimedia.org/r/mediawiki/core (wmf/1.43.0-wmf.20 branch) in /srv/mediawiki-staging/php-1.43.0-wmf.20
03:00:34 https://gerrit.wikimedia.org/r/mediawiki/core checked out at commit 1faf18d657043dcff8b5f67946f83db188ee6707
03:00:34 Finished scap prep 1.43.0-wmf.20 (duration: 00m 23s)
/srv/mediawiki-staging/php-1.43.0-wmf.20/.gitmodules does not exist. Did the train branch commit get merged?
03:00:34 stage-train failed: <CalledProcessError> Command '['/usr/bin/scap', 'prep', '1.43.0-wmf.20', '-D', 'full_image_build:True']' returned non-zero exit status 1.

That runs scap stage-train which does clone mediawiki/core wmf branch. It then rightfully aborts since the branch does not have a .gitmodules file (since the commit was never sent to Gerrit):

/srv/mediawiki-staging/php-1.43.0-wmf.20/.gitmodules does not exist. Did the train branch commit get merged?

But I have dismissed that [RelEng] FAIL: train-presync email cause it is immediately followed by a [RelEng] PASS: train-clean. My brain deduplicated the emails solely based on the first three stems of the subject. I have thus assumed the automatic system worked like a charm and I have probably processed those emails in a mere couple seconds.

Status

At this point, the deployment server has mediawiki/core checked out with wmf/1.43.0-wmf.20.

Incomplete state

Since scap prep has aborted, there are bunch of steps that are not conducted. The logic is in https://gitlab.wikimedia.org/repos/releng/scap/-/blob/master/scap/plugins/prep.py and the following code is not executed:

# This is only needed while people still do manual checkout
# manipulation (a practice which needs to end).
update_update_strategy(dest_dir)

logger.debug("Creating LocalSettings.php stub")
write_settings_stub(os.path.join(dest_dir, "LocalSettings.php"))
    
cache_dir = os.path.join(dest_dir, "cache")
if os.geteuid() == os.stat(cache_dir).st_uid:
    logger.debug("Making cache dir world writable")
    os.chmod(cache_dir, os.stat(cache_dir).st_mode | 0o777)

if apply_patches:
    with utils.suppress_backtrace():
        args = [
            self.get_script_path(),
            "apply-patches",
            "--abort-git-am-on-fail",
            "--train",
            branch,
        ] + self.format_passthrough_args()
        subprocess.check_call(args)
    SecurityPatches.fix_mtimes(pre_patch_state)

logger.info("MediaWiki %s successfully checked out." % checkout_version)

I don't know what update_update_strategy() is for. I am assuming LocalSettings.php got created later on and the patches definitely got applied.

The bit we really missed is:

cache_dir = os.path.join(dest_dir, "cache")
if os.geteuid() == os.stat(cache_dir).st_uid:
    logger.debug("Making cache dir world writable")
    os.chmod(cache_dir, os.stat(cache_dir).st_mode | 0o777)

scap train fails

I then run scap train for the testwikis (or group 0). It fails and I file this task.

The issue is that mediawiki/core cache directory wasn't made world writable and thus rebuildLocalisationCache.php was unable to write to it (it runs as www-data).

It has however created the LocalSettings.php since on the deployment server it is marked as 10:06UTC:

-rw-rw-r-- 1 hashar deployment 170 Aug 27 10:06 /srv/mediawiki-staging/php-1.43.0-wmf.20/LocalSettings.php

Which would be:

10:06 	<hashar@deploy1003> 	Started scap sync-world: testwikis to 1.43.0-wmf.20 refs T366965

dancy opened https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/424

scap prep: Make mediawiki cache dir world writable immediately after cloning

Got it.

I see two problems:

  1. Automatic branch cut jenkins job assumption that mediawiki/core mainline will not change between the time it creates the branch and the time it adds submodules to the branch and pushes
  2. scap prep does checkout and then checks for .gitmodules before updating permissions on cache. A missing .gitmodules will mean that cache cannot be written to which means l10n will fail.

And a question:

  • Should the netrc check in make-release/automatic-branch-cut fail harder/be more strict?
    • Cutting the branch uses requests
      • requests uses $NETRC and falls back to $HOME/.netrc and falls back to $HOME/_netrc
    • Adding submodules uses git push
    • automatic-branch-cut tries to make these two consistent.
    • If it can't make those two consistent, it seems reasonable to fail there, rather than move forward when either branch cutting or submodule pushing will fail.

Here are my overlong notes about what happened:

Code and where it runs:

  • Releases jenkins
    • Runs the Automatic branch cut job defined in jenkins-deploy
    • Automatic branch cut uses release repo's make-release/automatic-branch-cut script
    • make-release/automatic-branch-cut uses the python requests library's get_netrc_auth
  • Deployment server

Sequence of events:

UTC Overnight
  1. 2024-08-26 20:45:00: Releases Jenkins Automatic branch cut job was updated to set NETRC environment variable instead of netrc_file (commit)
  2. 2024-08-27 01:00:00: Automatic branch cut job triggered
  3. 2024-08-27 01:00:12: Job is able to create branch because requests get_netrc_auth checks NETRC and falls back to ~/.netrc.
  4. 2024-08-27 01:08:24: git push fails because it uses libcurl which looks for ~/.netrc which is no longer present.
  5. 2024-08-27 03:00:00: Deployment server /usr/bin/scap stage-train -Dfull_image_build:True --yes auto
  6. 2024-08-27 03:00:34 "Finished scap prep 1.43.0-wmf.20 (duration: 00m 23s) /srv/mediawiki-staging/php-1.43.0-wmf.20/.gitmodules does not exist. Did the train branch commit get merged?"

At this point, scap stage train has:

  • Checked the current train phab task for the version
  • Checked that the branch exists in gerrit
  • Set up the next security patch directory for this version
  • You have a /srv/mediawiki-staging/php-1.43.0-wmf.20 with MediaWiki/core checked out to wmf/1.43.0-wmf.20 BUT it has not submodules.
UTC Early morning
  1. 2024-08-27 08:42:00: Revert Automatic branch cut job change (commit)
  2. 2024-08-27 09:03:00: Re-ran Automatic branch cut job manually. This failed to push due to ERROR: Implicit Merge of xxxxxxx <commit>. To https://gerrit.wikimedia.org/r/mediawiki/core. This was caused because the branch script checks out mediawiki/core's master branch, adds submodules, and tries to push it as a new version. But the branch was created already. So this meant the job was trying to add in things merged to master after the branch cut (for example this commit merged at 2024-08-27 at 01:00 UTC)
  3. 2024-08-27 09:25:32: !log train: fast forwarded mediawiki/core wmf/1.43.0-wmf.20 from 1faf18d6570 to ef87455d7c3 # T366965
  4. 2024-08-27 09:25:00: Re-ran job, this time it works
  5. 2024-08-27 10:05:37: scap train which runs deploy-promote which runs scap prep auto
  6. 2024-08-27 10:06:54: Runtime exception due to permission problems.
  7. 2024-08-27 10:33:44: hashar I went with sudo -u mwpresync chmod o+w /srv/mediawiki-staging/php-1.43.0-wmf.20/cache/

thcipriani merged https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/424

scap prep: Make mediawiki cache dir world writable immediately after cloning