
Running "scap pull" on deploy1001 halts for 2 minutes, then reports a php-fpm error
Closed, Resolved (Public)

Description

krinkle at deploy1001.eqiad.wmnet in /srv/mediawiki-staging (master)
$ scap pull
00:15:54 Copying from deploy1001.eqiad.wmnet to deploy1001.eqiad.wmnet
00:15:54 Started rsync common
00:17:54 Finished rsync common (duration: 02m 00s)
00:17:54 Started scap-cdb-rebuild
00:17:55 Finished scap-cdb-rebuild (duration: 00m 00s)
00:17:55 Checking if php-fpm restart needed
00:17:55 Last output:
sudo: a password is required
00:17:55 php-fpm restart failed!

The reason for running this command is that, for a maintenance script to take staged changes into account, the changes must first be applied locally.

For example, running mwscript extensions/WikimediaMaintenance/dumpInterwiki.php.

On any other server, one typically uses scap pull for that. It worked here too, but it seems to block for 2 minutes and then report a php-fpm error.

I'm not sure what it is doing during these two minutes. Running scap pull on mwdebug1002 generally only takes a few seconds. I suspected that the sudo command (which one?) or the php-fpm failure is the cause of it, but the timestamps don't agree. Maybe the timestamps are off?
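To see where the time goes, the gap between consecutive log lines can be computed from the timestamps themselves. The following is a minimal sketch, not part of Scap: `step_gaps` is a hypothetical helper, the log text is copied from the output above, and it assumes the run does not cross midnight.

```python
from datetime import datetime

# Log lines copied from the scap pull output in the description.
LOG = """\
00:15:54 Started rsync common
00:17:54 Finished rsync common (duration: 02m 00s)
00:17:54 Started scap-cdb-rebuild
00:17:55 Finished scap-cdb-rebuild (duration: 00m 00s)
00:17:55 Checking if php-fpm restart needed
00:17:55 php-fpm restart failed!
"""


def step_gaps(log_text):
    """Return (seconds_since_previous_line, message) for each line after the first."""
    entries = []
    for line in log_text.splitlines():
        stamp, _, message = line.partition(" ")
        entries.append((datetime.strptime(stamp, "%H:%M:%S"), message))
    gaps = []
    for (prev_time, _), (time, message) in zip(entries, entries[1:]):
        gaps.append((int((time - prev_time).total_seconds()), message))
    return gaps


for seconds, message in step_gaps(LOG):
    print(f"{seconds:4d}s  {message}")
```

In this log, the whole 120-second gap falls between "Started rsync common" and "Finished rsync common"; the php-fpm check and its sudo failure happen within the same second.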

Event Timeline

thcipriani triaged this task as Medium priority. Mar 25 2020, 5:01 PM
thcipriani subscribed.

FWIW, the pausing and the error are two different pieces. That is, the pause is cdb rebuild IIRC. The error is sudo differences in beta vs prod.

I ran the command today to get another datapoint. It doesn't take a long time to run, but the sudo issue persists.

dancy@deploy1002:~$ scap pull
17:51:03 Copying from deploy1002.eqiad.wmnet:/srv/mediawiki-staging to deploy1002.eqiad.wmnet:/srv/mediawiki
17:51:03 Started rsync common
17:51:06 Finished rsync common (duration: 00m 02s)
17:51:06 Started scap-cdb-rebuild
17:51:19 Finished scap-cdb-rebuild (duration: 00m 13s)
17:51:20 Checking if php-fpm restart needed
17:51:20 Last output:
sudo: a password is required
17:51:20 php-fpm restart failed!

I think the issue is that on app servers, the mwdeploy user has this sudo permission:

(root) NOPASSWD: /usr/local/sbin/check-and-restart-php php7.2-fpm *

but on the deploy server, deployers have the following sudo permission:

(root) NOPASSWD: /bin/bash -c /usr/local/sbin/restart-php7.2-fpm

Scap is running

sudo -u root -n PHP="php7.2" -- /usr/local/sbin/check-and-restart-php php7.2-fpm 100
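The mismatch can be illustrated with a simplified model of how a sudoers command entry is compared against the command being run. This is only a sketch: `sudoers_allows` is a hypothetical helper, it approximates sudoers wildcard matching with Python's `fnmatch`, and it ignores everything else sudo checks (user and host specs, environment variables such as PHP="php7.2", and so on).

```python
from fnmatch import fnmatchcase

# Rules and command copied from the comment above.
APP_SERVER_RULE = "/usr/local/sbin/check-and-restart-php php7.2-fpm *"
DEPLOY_SERVER_RULE = "/bin/bash -c /usr/local/sbin/restart-php7.2-fpm"
SCAP_COMMAND = "/usr/local/sbin/check-and-restart-php php7.2-fpm 100"


def sudoers_allows(rule, command):
    """Simplified check: does a sudoers command entry cover this command line?"""
    rule_path, _, rule_args = rule.partition(" ")
    cmd_path, _, cmd_args = command.partition(" ")
    # The executable path must match the one named in the rule.
    if not fnmatchcase(cmd_path, rule_path):
        return False
    # A rule without arguments permits the command with any arguments.
    if not rule_args:
        return True
    # Otherwise the arguments must match the rule's (possibly wildcarded) arguments.
    return fnmatchcase(cmd_args, rule_args)


print("app server rule allows it:   ", sudoers_allows(APP_SERVER_RULE, SCAP_COMMAND))
print("deploy server rule allows it:", sudoers_allows(DEPLOY_SERVER_RULE, SCAP_COMMAND))
```

Under this model the app server rule covers the command Scap runs, while the deploy server rule names a different executable entirely, so `sudo -n` has no passwordless rule to apply and reports that a password is required.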

Change 763313 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[mediawiki/tools/scap@master] sudo_check_call: debug log the command to be run

https://gerrit.wikimedia.org/r/763313

Change 763313 merged by jenkins-bot:

[mediawiki/tools/scap@master] sudo_check_call: debug log the command to be run

https://gerrit.wikimedia.org/r/763313

@Krinkle: I'm thinking the resolution to this ticket may be "don't run scap pull on the deploy server". What do you think?

@dancy I'd be fine with that. Apart from accidental runs caused by mistaking one ssh terminal for another, the main use for scap pull there would be applying changes so that mwscript and the like pick them up. As I understand it, most if not all things now use /srv/mediawiki-staging when on a deployment server.

I would recommend that we make this actually fail in some way to reduce confusion and to signal that it shouldn't be done. If we need it to be kept in sync for now I suppose Scap can keep doing it internally but have the public command fail with a descriptive error perhaps?

Alternatively, we could aim to swap things around. This would mean we phase out the concept of a (differently named) staging directory on the deploy host, which currently exists alongside /srv/mediawiki on that same host. This would allow us to remove various conditionals, env variables, and other hacks that currently make this work by varying the MW location in various places specifically for the deploy host (with no doubt some edge cases we missed where stuff doesn't work).

Completion of T253547 is currently blocked on a solution to this. There, maintenance scripts started crashing on the deploy host because PHP loads code from both staging and non-staging, which makes require_once ineffective. We could spread the complexity further by e.g. overriding via "php -d" the appropriate PHP ini setting when Scap runs mwscript on the deploy host. But I'd rather start pushing in the opposite direction and instead limit, contain, and phase out this complexity.

For example, as an experiment we could move the non-staging directory out of the way, replace it with a temporary symlink, and delist the host from being its own sync target. If that works for a while without trouble, I can complete the CLI profiling work meanwhile.

Guessing this one is moot given the work on T329857: MediaWiki deploy servers should not be mediawiki installation targets—is that right?

That's right. Today if you run scap pull on a deploy server, it will print a warning and do nothing (but still exit with status 0).

$ scap pull
23:15:43 /srv/mediawiki is a symlink to /srv/mediawiki-staging.  Not pulling.
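For illustration, the kind of guard described above could look roughly like this. It is a sketch only and not Scap's actual code: `should_skip_pull` is a hypothetical name, and the demonstration uses a temporary directory rather than the real /srv paths.

```python
import os
import tempfile


def should_skip_pull(deploy_dir, stage_dir):
    """Return True (and warn) when deploy_dir is just a symlink to stage_dir."""
    if os.path.islink(deploy_dir) and (
        os.path.realpath(deploy_dir) == os.path.realpath(stage_dir)
    ):
        print(f"{deploy_dir} is a symlink to {stage_dir}.  Not pulling.")
        return True
    return False


# Demonstration with stand-ins for /srv/mediawiki and /srv/mediawiki-staging.
with tempfile.TemporaryDirectory() as root:
    stage = os.path.join(root, "mediawiki-staging")
    deploy = os.path.join(root, "mediawiki")
    os.mkdir(stage)
    os.symlink(stage, deploy)

    skipped = should_skip_pull(deploy, stage)     # symlinked: nothing to pull
    not_skipped = should_skip_pull(stage, stage)  # a real directory: no skip
```

A caller would return early with exit status 0 when the function reports a skip, which matches the behaviour described above.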
dancy claimed this task.