
'scap pull' stopped working on appservers?
Closed, Resolved · Public

Description

We normally use "scap pull" on appservers after hardware maintenance (like T205240) or other downtime to make sure they are in sync with other appservers.

This appears to have stopped working. On mw2181 it fails like this, with mwscript missing:

[mw2181:~] $ scap pull
18:14:34 Copying from deployment.codfw.wmnet to mw2181.codfw.wmnet
18:14:34 Started rsync common
cannot delete non-empty directory: php-1.33.0-wmf.23/cache/l10n
cannot delete non-empty directory: php-1.33.0-wmf.23/cache/l10n
cannot delete non-empty directory: php-1.33.0-wmf.23/cache
cannot delete non-empty directory: php-1.33.0-wmf.23/cache
cannot delete non-empty directory: php-1.33.0-wmf.23
cannot delete non-empty directory: php-1.32.0-wmf.3/cache/l10n
cannot delete non-empty directory: php-1.32.0-wmf.3/cache/l10n
cannot delete non-empty directory: php-1.32.0-wmf.3/cache
cannot delete non-empty directory: php-1.32.0-wmf.3/cache
cannot delete non-empty directory: php-1.32.0-wmf.3
18:14:52 Finished rsync common (duration: 00m 17s)
18:14:52 Started scap-cdb-rebuild
18:14:53 Finished scap-cdb-rebuild (duration: 00m 00s)
18:14:53 Running refreshMessageBlobs.php for each wiki
18:14:53 Last output:
sudo: /usr/local/bin/mwscript: command not found
18:14:53 Unhandled error:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/scap/cli.py", line 342, in run
    exit_status = app.main(app.extra_arguments)
  File "/usr/lib/python2.7/dist-packages/scap/main.py", line 713, in main
    tasks.clear_message_blobs()
  File "/usr/lib/python2.7/dist-packages/scap/utils.py", line 402, in context_wrapper
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/scap/tasks.py", line 809, in clear_message_blobs
    '/usr/local/bin/mwscript '
  File "/usr/lib/python2.7/dist-packages/scap/utils.py", line 402, in context_wrapper
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/scap/utils.py", line 497, in sudo_check_call
    raise subprocess.CalledProcessError(proc.returncode, cmd)
CalledProcessError: Command '/usr/local/bin/mwscript extensions/WikimediaMaintenance/refreshMessageBlobs.php' returned non-zero exit status 1
18:14:53 pull failed: <CalledProcessError> Command '/usr/local/bin/mwscript extensions/WikimediaMaintenance/refreshMessageBlobs.php' returned non-zero exit status 1
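
The failure above comes down to sudo not finding /usr/local/bin/mwscript on the host. A minimal sketch of a pre-flight check for that condition, in Python since scap is Python. `command_available` is a hypothetical helper for illustration, not something that exists in scap:

```python
import os


def command_available(path):
    """Return True if `path` is an existing, executable regular file."""
    return os.path.isfile(path) and os.access(path, os.X_OK)


# Path taken from the traceback above; the helper itself is hypothetical.
MWSCRIPT = "/usr/local/bin/mwscript"

if not command_available(MWSCRIPT):
    print("mwscript is not installed on this host: %s" % MWSCRIPT)
```

Run on an affected appserver, a check like this would presumably report the missing script before the refreshMessageBlobs step is attempted.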

Event Timeline

Is this a circular dependency? mwscript (== /usr/local/bin/mwscript) is not part of the appserver base?

refreshMessageBlobs was added in T222539

One of two solutions:

  • Install scap::scripts on all appservers rather than only on canary appservers
  • Rethink how this is included in scap

Rethinking how this works in scap:

  • refreshMessageBlobs clears the global cache; it has no effect specific to an individual appserver
  • we need to run refreshMessageBlobs after any localisation update is deployed (that is, after scap sync, which is already happening)
  • we run it as part of scap pull to aid testing via mwdebug servers
  • purging the blob store on scap pull means the data is purged from the global cache
  • if the global cache is repopulated by requests to appservers, there may be little reason to run this on scap pull: any request to a normal appserver that arrives before the test request to mwdebug would repopulate the cache with the currently deployed information rather than the messageBlob you're testing

So either I'm misunderstanding why we'd want to run refreshMessageBlobs, I'm misunderstanding how the messageBlobs cache is repopulated, or we can remove this from scap pull.
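
To make the ordering argument concrete, here is a toy model in Python. Everything in it is hypothetical (the `GlobalBlobCache` class, the host names, the "deployed" and "testing" values), and it assumes the cache is filled lazily by whichever host serves the first request after a purge, which is exactly the assumption I'd like confirmed:

```python
class GlobalBlobCache:
    """Hypothetical stand-in for the shared message blob cache."""

    def __init__(self):
        self._store = {}

    def purge(self):
        # What refreshMessageBlobs is assumed to do: drop every cached blob.
        self._store.clear()

    def get_or_build(self, key, build):
        # Lazily repopulate: the first requester after a purge fills the entry.
        if key not in self._store:
            self._store[key] = build()
        return self._store[key]


def simulate(request_order):
    """Purge once (as scap pull did), then serve requests in the given order."""
    cache = GlobalBlobCache()
    cache.get_or_build("blob", lambda: "deployed")  # cache warm before the pull
    cache.purge()                                   # scap pull on mwdebug
    served = {}
    for host in request_order:
        # mwdebug carries the code under test; other hosts carry deployed code.
        version = "testing" if host == "mwdebug" else "deployed"
        served[host] = cache.get_or_build("blob", lambda v=version: v)
    return served


print(simulate(["appserver", "mwdebug"]))
```

In this model, when a normal appserver answers first, mwdebug is served the deployed blob, so the purge on scap pull did not help the test.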

I talked a little bit about this with @Krinkle on IRC -- adding him. Also adding @Catrope since he worked on this a bit. Can either of you help me understand this a bit better?

Change 523999 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[mediawiki/tools/scap@master] refreshMessageBlobs: don't run on scap pull

https://gerrit.wikimedia.org/r/523999

Another side effect is that reimaging hosts now fails: wmf-auto-reimage-host waits for a successful puppet run, and the initial puppet run fails because it tries to execute scap pull.

Dzahn triaged this task as High priority. (Jul 18 2019, 6:01 PM)
greg moved this task from Ready to Doing on the Release-Engineering-Team-TODO (201907) board.
greg moved this task from Needs triage to Debt on the Scap board.

Change 523999 merged by jenkins-bot:
[mediawiki/tools/scap@master] refreshMessageBlobs: don't run on scap pull

https://gerrit.wikimedia.org/r/523999

Mentioned in SAL (#wikimedia-operations) [2019-07-19T00:04:30Z] <mutante> install1002 - exported indices for new scap version - copied back from buster to stretch - upgraded scap version on mw2250 - scap pull now works and starts to rsync (T228482, T228328, T226948)

the above was after:

19:50 < mutante> !log built new scap version 3.11.1-1 on boron, copied to install1002, imported package with reprepro, copied from stretch to jessie and buster (T228482)

Dzahn lowered the priority of this task from High to Medium. (Jul 19 2019, 12:18 AM)

The new scap version fixes this issue on mw2250; scap pull works there again.

The new scap version just needs to be rolled out across the cluster now.

Mentioned in SAL (#wikimedia-operations) [2019-07-23T18:43:43Z] <mutante> rolling out scap 3.11.1-1 on mw canary servers (T228328)

Mentioned in SAL (#wikimedia-operations) [2019-07-23T18:45:36Z] <mutante> rolling out scap 3.11.1-1 on all mw codfw servers (T228328)

Mentioned in SAL (#wikimedia-operations) [2019-07-23T22:14:19Z] <mutante> continuing rollout of new scap version 3.11.1-1, starting with kafka-all followed by other cumin-alias groups (T228328)

Dzahn closed subtask T228482: Deploy scap 3.11.1-1 as Resolved.

Deployed globally in the subtask. This should be resolved now.