Running "scap pull" on deploy1001 halts for 2 minutes, then reports a php-fpm error
Closed, Resolved, Public

Authored By: Krinkle, Mar 5 2020, 12:22 AM

Description

krinkle at deploy1001.eqiad.wmnet in /srv/mediawiki-staging (master)
$ scap pull
00:15:54 Copying from deploy1001.eqiad.wmnet to deploy1001.eqiad.wmnet
00:15:54 Started rsync common
00:17:54 Finished rsync common (duration: 02m 00s)
00:17:54 Started scap-cdb-rebuild
00:17:55 Finished scap-cdb-rebuild (duration: 00m 00s)
00:17:55 Checking if php-fpm restart needed
00:17:55 Last output:
sudo: a password is required
00:17:55 php-fpm restart failed!

The reason for running this command is that, for a maintenance script to take staged changes into account, those changes must first be applied locally.

For example, running mwscript extensions/WikimediaMaintenance/dumpInterwiki.php.

For any other server, one typically uses scap pull for that. But while this worked, it seems to block for 2 minutes, and then report a php-fpm error.

I'm not sure what it is doing during these two minutes. Running scap pull on mwdebug1002 generally only takes a few seconds. I suspected that the sudo command (which one?) or the php-fpm failure is the cause of it, but the timestamps don't agree. Maybe the timestamps are off?
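
As a rough cross-check of the timestamps, the stage durations can be recomputed from the Started/Finished pairs in the log above. This is only an illustrative sketch: the function name stage_durations and the regular expression are assumptions made for this example and are not part of scap.

```python
import re
from datetime import datetime

# Excerpt of the scap pull output quoted in this task.
LOG = """\
00:15:54 Started rsync common
00:17:54 Finished rsync common (duration: 02m 00s)
00:17:54 Started scap-cdb-rebuild
00:17:55 Finished scap-cdb-rebuild (duration: 00m 00s)
"""

# Timestamp, event keyword, stage name, and an optional trailing "(duration: ...)".
LINE = re.compile(
    r"^(\d{2}:\d{2}:\d{2}) (Started|Finished) (\S.*?)(?: \(duration: .*\))?$"
)


def stage_durations(log):
    """Return {stage name: elapsed seconds} from Started/Finished pairs."""
    started = {}
    durations = {}
    for line in log.splitlines():
        match = LINE.match(line)
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%H:%M:%S")
        event, stage = match.group(2), match.group(3)
        if event == "Started":
            started[stage] = stamp
        elif stage in started:
            delta = stamp - started.pop(stage)
            # The log has no date, so wrap around midnight if needed.
            durations[stage] = delta.total_seconds() % 86400
    return durations


if __name__ == "__main__":
    for stage, seconds in stage_durations(LOG).items():
        print(f"{stage}: {seconds:.0f}s")
```

For the log above this attributes 120 seconds to rsync common and 1 second to scap-cdb-rebuild, which agrees with the durations scap printed itself. So the timestamps look consistent, and the wait falls inside the rsync step rather than the sudo or php-fpm step.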

Details

	Subject	Repo	Branch	Lines +/-
	sudo_check_call: debug log the command to be run	mediawiki/tools/scap	master	+1 -0

Related Objects

Mentioned In: T334420: Scap sync stops when searching in GNU screen, starts running hours later when resuming screen
T223287: Investigate scap cluster_ssh idling until pressing ENTER repeatedly
Mentioned Here: T329857: MediaWiki deploy servers should not be mediawiki installation targets
T253547: Command line profiling not working on production

Event Timeline

Krinkle created this task. Mar 5 2020, 12:22 AM

Restricted Application added a subscriber: Aklapper. Mar 5 2020, 12:22 AM

FWIW, the pausing and the error are two different pieces. That is, the pause is cdb rebuild IIRC. The error is sudo differences in beta vs prod.

Krinkle mentioned this in T223287: Investigate scap cluster_ssh idling until pressing ENTER repeatedly. Apr 8 2020, 9:52 PM

thcipriani edited projects, added Release-Engineering-Team (thcipriani-workboard-fiddling); removed Release-Engineering-Team-TODO. Apr 20 2021, 3:41 AM

thcipriani moved this task from thcipriani-workboard-fiddling to Seen (ARCHIVE) on the Release-Engineering-Team board. Apr 20 2021, 3:46 AM

thcipriani edited projects, added Release-Engineering-Team; removed Release-Engineering-Team (thcipriani-workboard-fiddling).

thcipriani edited projects, added Release-Engineering-Team (Seen); removed Release-Engineering-Team. Apr 20 2021, 3:23 PM

I ran the command today to get another datapoint. It doesn't take a long time to run, but the sudo issue persists.

dancy@deploy1002:~$ scap pull
17:51:03 Copying from deploy1002.eqiad.wmnet:/srv/mediawiki-staging to deploy1002.eqiad.wmnet:/srv/mediawiki
17:51:03 Started rsync common
17:51:06 Finished rsync common (duration: 00m 02s)
17:51:06 Started scap-cdb-rebuild
17:51:19 Finished scap-cdb-rebuild (duration: 00m 13s)
17:51:20 Checking if php-fpm restart needed
17:51:20 Last output:
sudo: a password is required
17:51:20 php-fpm restart failed!

I think the issue is that on app servers, the mwdeploy user has this sudo permission:

(root) NOPASSWD: /usr/local/sbin/check-and-restart-php php7.2-fpm *

but on the deploy server, deployers have the following sudo permission:

(root) NOPASSWD: /bin/bash -c /usr/local/sbin/restart-php7.2-fpm

Scap is running

sudo -u root -n PHP="php7.2" -- /usr/local/sbin/check-and-restart-php php7.2-fpm 100
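
To make the mismatch concrete, here is a simplified sketch that compares the command scap runs against the two rules quoted above. It approximates sudoers wildcard matching with Python's fnmatch on the whole command string; real sudoers matching treats the command path and its arguments separately, so this is an assumption-laden illustration, and the names allowed_without_password, APP_SERVER_RULES, and DEPLOY_SERVER_RULES are made up for this example.

```python
from fnmatch import fnmatchcase

# NOPASSWD command specs quoted in this task.
APP_SERVER_RULES = ["/usr/local/sbin/check-and-restart-php php7.2-fpm *"]
DEPLOY_SERVER_RULES = ["/bin/bash -c /usr/local/sbin/restart-php7.2-fpm"]

# The command (path plus arguments) that scap passes to sudo.
SCAP_COMMAND = "/usr/local/sbin/check-and-restart-php php7.2-fpm 100"


def allowed_without_password(command, rules):
    """Return True if the command matches any rule (simplified glob match)."""
    return any(fnmatchcase(command, rule) for rule in rules)


if __name__ == "__main__":
    print("app server:", allowed_without_password(SCAP_COMMAND, APP_SERVER_RULES))
    print("deploy server:", allowed_without_password(SCAP_COMMAND, DEPLOY_SERVER_RULES))
```

With these inputs the command matches the app server rule but not the deploy server rule, which would be consistent with sudo -n reporting "a password is required" on the deploy server.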

Change 763313 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[mediawiki/tools/scap@master] sudo_check_call: debug log the command to be run

https://gerrit.wikimedia.org/r/763313

gerritbot added a project: Patch-For-Review. Feb 16 2022, 6:32 PM

Change 763313 merged by jenkins-bot:

[mediawiki/tools/scap@master] sudo_check_call: debug log the command to be run

https://gerrit.wikimedia.org/r/763313

Maintenance_bot removed a project: Patch-For-Review. Feb 16 2022, 7:10 PM

@Krinkle: I'm thinking the resolution to this ticket may be "don't run scap pull on the deploy server". What do you think?

@dancy I'd be fine with that. Essentially, what scap pull would be for on a deploy server, apart from accidental runs when confused about which ssh terminal one is in, is applying changes so that mwscript and such pick them up. As I understand it, we now have most if not all things use /srv/mediawiki-staging when on a deployment server.

I would recommend that we make this actually fail in some way to reduce confusion and to signal that it shouldn't be done. If we need it to be kept in sync for now I suppose Scap can keep doing it internally but have the public command fail with a descriptive error perhaps?

Alternatively, we could aim to swap things around. This would mean we phase out the concept of a (differently named) staging directory on the deploy host, which currently exists there alongside /srv/mediawiki. This would allow us to remove various conditionals, env variables, and other hacks that currently make this work by varying the MW location in various places specifically for the deploy host (with no doubt some edge cases we missed where stuff doesn't work).

Completion of T253547 is currently blocked on a solution to this. There, maintenance scripts started crashing on the deploy host because PHP loads code from both staging and non-staging, which makes require_once ineffective. We could spread the complexity further by e.g. overriding the appropriate PHP ini setting via "php -d" when Scap runs mwscript on the deploy host. But I'd rather start pushing in the opposite direction and instead limit, contain, and phase out this complexity.

For example, as an experiment we could move the non-staging directory out of the way, replace it with a temporary symlink, and delist the host from being its own sync target. If that works for a while without trouble, I can complete the CLI profiling work in the meantime.

dancy updated the task description. Feb 27 2023, 5:35 PM

Krinkle mentioned this in T334420: Scap sync stops when searching in GNU screen, starts running hours later when resuming screen. Apr 10 2023, 6:33 PM

Guessing this one is moot given the work on T329857: MediaWiki deploy servers should not be mediawiki installation targets—is that right?

In T246959#8817716, @thcipriani wrote:

Guessing this one is moot given the work on T329857: MediaWiki deploy servers should not be mediawiki installation targets—is that right?

That's right. Today if you run scap pull on a deploy server, it will print a warning and do nothing (but still exit with status 0).

$ scap pull
23:15:43 /srv/mediawiki is a symlink to /srv/mediawiki-staging.  Not pulling.
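
The behaviour described above amounts to a guard along these lines. This is a hypothetical sketch, not scap's actual implementation: the function name should_pull and the demo directory layout are assumptions made for illustration.

```python
import os
import tempfile


def should_pull(deploy_dir, staging_dir):
    """Return False (with a warning) when deploy_dir is a symlink to staging_dir."""
    if os.path.islink(deploy_dir) and (
        os.path.realpath(deploy_dir) == os.path.realpath(staging_dir)
    ):
        print(f"{deploy_dir} is a symlink to {staging_dir}.  Not pulling.")
        return False
    return True


def demo():
    """Exercise the guard with a symlinked and a regular directory."""
    with tempfile.TemporaryDirectory() as root:
        staging = os.path.join(root, "mediawiki-staging")
        deploy = os.path.join(root, "mediawiki")
        regular = os.path.join(root, "mediawiki-regular")
        os.mkdir(staging)
        os.mkdir(regular)
        os.symlink(staging, deploy)
        return should_pull(deploy, staging), should_pull(regular, staging)


if __name__ == "__main__":
    print(demo())
```

A caller would skip the rsync and php-fpm steps whenever the guard returns False and still exit with status 0, matching the warning-only behaviour shown in the output above.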

dancy closed this task as Resolved. May 1 2023, 11:18 PM

dancy claimed this task.
