scap deploy --init on deployment server fails on first puppet run
Open, Needs Triage, Public

Description

From integration/docroot scap configuration for T256005:

root@deploy1001:/srv/deployment$ run-puppet-agent
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Info: Caching catalog for deploy1001.eqiad.wmnet
Info: Applying configuration version '(f13690ff17) Antoine Musso - Fix scap config for integration/docroot'
Error: Execution of '/usr/bin/scap deploy --init' returned 70: 14:10:06 Started setup [integration/docroot@708d3eb]
14:10:06 Deploying Rev: HEAD = 708d3eba6bf056e8bfb9ff516f8ee93108880cab
14:10:06 Finished setup [integration/docroot@708d3eb] (duration: 00m 00s)
14:10:06 Unhandled error:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/scap/cli.py", line 341, in run
    exit_status = app.main(app.extra_arguments)
  File "/usr/lib/python2.7/dist-packages/scap/deploy.py", line 721, in main
    git.tag_repo(self.deploy_info, location=self.context.root)
  File "/usr/lib/python2.7/dist-packages/scap/git.py", line 486, in tag_repo
    subprocess.check_call(cmd, shell=True)
  File "/usr/lib/python2.7/subprocess.py", line 186, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command '
    /usr/bin/git tag -fa \
    -m 'user trebuchet' \
    -m 'timestamp 2020-07-07T14:10:06.379026' -- \
    scap/sync/2020-07-07/0001 708d3eba6bf056e8bfb9ff516f8ee93108880cab
    ' returned non-zero exit status 128
14:10:06 deploy failed: <CalledProcessError> Command '
    /usr/bin/git tag -fa \
    -m 'user trebuchet' \
    -m 'timestamp 2020-07-07T14:10:06.379026' -- \
    scap/sync/2020-07-07/0001 708d3eba6bf056e8bfb9ff516f8ee93108880cab
    ' returned non-zero exit status 128

Error: /Stage[main]/Profile::Mediawiki::Deployment::Server/Scap::Source[integration/docroot]/Scap_source[integration/docroot]/ensure: change from 'absent' to 'present' failed: Execution of '/usr/bin/scap deploy --init' returned 70: 14:10:06 Started setup [integration/docroot@708d3eb]
14:10:06 Deploying Rev: HEAD = 708d3eba6bf056e8bfb9ff516f8ee93108880cab
14:10:06 Finished setup [integration/docroot@708d3eb] (duration: 00m 00s)
14:10:06 Unhandled error:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/scap/cli.py", line 341, in run
    exit_status = app.main(app.extra_arguments)
  File "/usr/lib/python2.7/dist-packages/scap/deploy.py", line 721, in main
    git.tag_repo(self.deploy_info, location=self.context.root)
  File "/usr/lib/python2.7/dist-packages/scap/git.py", line 486, in tag_repo
    subprocess.check_call(cmd, shell=True)
  File "/usr/lib/python2.7/subprocess.py", line 186, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command '
    /usr/bin/git tag -fa \
    -m 'user trebuchet' \
    -m 'timestamp 2020-07-07T14:10:06.379026' -- \
    scap/sync/2020-07-07/0001 708d3eba6bf056e8bfb9ff516f8ee93108880cab
    ' returned non-zero exit status 128
14:10:06 deploy failed: <CalledProcessError> Command '
    /usr/bin/git tag -fa \
    -m 'user trebuchet' \
    -m 'timestamp 2020-07-07T14:10:06.379026' -- \
    scap/sync/2020-07-07/0001 708d3eba6bf056e8bfb9ff516f8ee93108880cab
    ' returned non-zero exit status 128

Notice: Applied catalog in 39.96 seconds

Event Timeline

One of the issues is that we only know that git exited with return code 128 and lose the useful stderr output. It seems to be swallowed by Scap.

Related to this, although a slightly different issue: scap deploy --init also failed on the non-primary deployment server because scap is disabled there. I think scap synchronization (all methods) should be disabled because of the lock, but --init should probably be allowed to ignore the lock, to prevent a dependency loop.

Same issue here when trying to set up deploy1002, the successor of deploy1001: T265963#6660917

And the same fix that Jaime suggested above applies: scap sync should NOT be allowed, but scap deploy --init should be allowed. Then we wouldn't have puppet errors and could make sure everything is OK BEFORE having to switch the deployment server and make it the new active one.

One of the issues is that we only know that git exited with return code 128 and lose the useful stderr output. It seems to be swallowed by Scap.

That happens here, apparently:
rMSCA /scap/cli.py:247
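
For illustration only, here is a minimal Python 3 sketch of how a wrapper could keep git's stderr in the error message instead of only the exit status. It is not scap's actual code (the traceback above is from Python 2.7); the names check_call_with_stderr and CommandFailed are hypothetical.

```python
import subprocess
import sys


class CommandFailed(Exception):
    """Hypothetical error type: carries the exit status and captured stderr."""


def check_call_with_stderr(cmd):
    """Run cmd (a list of arguments) and raise CommandFailed with stderr on failure."""
    # Capture both streams so stderr is available for the error message.
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    if proc.returncode != 0:
        raise CommandFailed(
            "{!r} returned non-zero exit status {}: {}".format(
                cmd, proc.returncode, proc.stderr.strip()
            )
        )
    return proc.stdout


if __name__ == "__main__":
    # Stand-in for a failing "git tag" call: writes to stderr and exits 128.
    failing = [sys.executable, "-c",
               "import sys; sys.stderr.write('fatal: example'); sys.exit(128)"]
    try:
        check_call_with_stderr(failing)
    except CommandFailed as exc:
        print(exc)
```

With something along these lines, the Puppet error would include the line git printed before exiting with 128, rather than just the status code.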

there is a "scap-sync-master" command which can be run manually on new deployment servers, it takes care of /srv/mediawiki-staging and /srv/patches.

Additionally I added puppet code so that /srv/patches is handled like /srv/deployment with automatic rsync.

Finally you can run "scap pull" on a new host twice to fill up /srv/mediawiki and it should be fine (after maybe an error on first run).

hashar added a subscriber: Izno.

I have reverted the task dependency changes made by @Izno. This task is not a blocker for updating MediaWiki app servers to Buster; it is merely an actionable that needs to be solved when provisioning deployment servers.

I have this exact same problem _again_ when trying to create a new deployment server, and I find 3 or 4 tickets about it, but I still don't know how to solve it.

there is a "scap-sync-master" command which can be run manually on new deployment servers, it takes care of /srv/mediawiki-staging and /srv/patches.

bash: scap-sync-master: command not found

This is now blocking T306069. The current deployment server in devtools will be deleted in 2 weeks.

there is a "scap-sync-master" command which can be run manually on new deployment servers, it takes care of /srv/mediawiki-staging and /srv/patches.

bash: scap-sync-master: command not found

This should be scap sync-master with a space. sync-master is a subcommand.

Thank you but:

root@deploy-1004:/home/dzahn# scap sync-master
usage: scap [-h] <command> ...
scap: error: argument <command>: invalid choice: 'sync-master' (choose from 'apply-patches', 'backport', 'cdb-json-refresh', 'cdb-rebuild', 'clean', 'deploy', 'deploy-local', 'deploy-log', 'deploy-mediawiki', 'deploy-promote', 'fortune', 'lock', 'patch', 'prep', 'pull', 'pull-master', 'say', 'security-check', 'stage-train', 'sync-dir', 'sync-file', 'sync-l10n', 'sync-wikiversions', 'sync-world', 'test-progress', 'update-interwiki-cache', 'update-wikiversions', 'version', 'wikiversions-compile', 'wikiversions-inuse', 'wmf-beta-autoupdate')

In previous tickets I commented that it was somehow worked around with rsync.

After some investigation, this is failing due to a lock puppet puts in place on the passive servers:

modules/profile/manifests/mediawiki/deployment/server.pp
if $deploy_ensure == 'present' {
    # Lock the passive servers, leave untouched the active one.
    file { '/var/lock/scap-global-lock':
        ensure  => 'present',
        owner   => 'root',
        group   => 'root',
        content => "Not the active deployment server, use ${main_deployment_server}",
    }
}

^ that prevents scap deploy --init from running when we set up new deployment machines. All scap deploy --init does is write the file /srv/deployment/<repo>/DEPLOY_HEAD. I think there's a dependency relationship missing in the deployment server setup: the lock file shouldn't be created until after the /srv/deployment scap repos have finished being set up.
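
As a rough sketch of the alternative suggested earlier in this task (let --init through while still blocking syncs), the lock check could look something like the following. This is not how scap implements its lock; is_blocked_by_lock and its argument handling are hypothetical, and only the lock path is taken from the puppet snippet above.

```python
import os

# Lock path as written by the puppet manifest quoted above.
DEFAULT_LOCK = "/var/lock/scap-global-lock"


def is_blocked_by_lock(argv, lock_path=DEFAULT_LOCK):
    """Return True if the given scap arguments should be refused.

    argv is the argument list after "scap", e.g. ["deploy", "--init"].
    Hypothetical policy: everything is blocked while the lock file exists,
    except "deploy --init".
    """
    if not os.path.exists(lock_path):
        return False
    is_init = len(argv) >= 1 and argv[0] == "deploy" and "--init" in argv[1:]
    return not is_init
```

Under that policy a passive server would still refuse sync-world or a plain deploy, but puppet's first run of scap deploy --init could complete.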

More or less the same issue happened when adding a new repo yesterday ( https://gerrit.wikimedia.org/r/c/operations/puppet/+/860837 ). Puppet failed on scap deploy --init:

$ sudo run-puppet-agent
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Info: Caching catalog for deploy1002.eqiad.wmnet
Info: Applying configuration version '(10b34a50d5) RLazarus - jenkins: add remaining config for Scap3 deployment'
Error: Execution of '/usr/bin/scap deploy --init' returned 1: 
Error: /Stage[main]/Profile::Mediawiki::Deployment::Server/Scap::Source[releng/jenkins-deploy]/Scap_source[releng/jenkins-deploy]/ensure: change from 'absent' to 'present' failed: Execution of '/usr/bin/scap deploy --init' returned 1: 
Notice: Applied catalog in 44.38 seconds

/srv/deployment/releng/jenkins-deploy has been created on deploy1002. I am guessing a second Puppet run would have fixed the transient issue. It would be nice to try again with puppet agent --debug -tv to get more details about what failed, or maybe make the exec log its output.

Running the command manually:

deploy1002:/srv/deployment/releng/jenkins-deploy$ scap deploy --init > /dev/null
Incomplete setup: git_repo must be defined in the configuration

So that is a different issue, caused by the lack of a scap/scap.cfg in that repository. It would be nice to have modules/scap/lib/puppet/provider/scap_source/default.rb somehow report the error output, but that is an entirely different topic I guess.

T332623

We are still getting tin.eqiad.wmnet defined as initial git server.

This is a brand-new machine created merely days ago:

root@miscweb1003:/srv/deployment/iegreview# grep git_server iegreview-cache/.config 
git_server: tin.eqiad.wmnet

Of course it fails to deploy from tin, since tin has not existed for many years.

This issue has been reported multiple times over the years, but it still comes back every single time we create a new machine that has things deployed via scap on it.

Mentioned in SAL (#wikimedia-operations) [2023-03-20T19:48:58Z] <mutante> miscweb1003 - manually edit /srv/deployment/iegreview/iegreview-cache/.config and replace tin.eqiad.wmnet with deployment.eqiad.wmnet (which is an alias for deploy2002.codfw.wmnet) T257317 T332623 T331896

Mentioned in SAL (#wikimedia-operations) [2023-04-25T21:12:02Z] <mutante> once again running into T257317 when applying gerrit role to new hardware

Mentioned in SAL (#wikimedia-operations) [2023-04-25T21:14:07Z] <mutante> gerrit1003 - manually replacing deploy2002 with deploy1002 in /srv/deployment/gerrit/gerrit-cache/.config to fix initial scap deployment T257317 T326368

Mentioned in SAL (#wikimedia-releng) [2023-04-25T21:23:23Z] <mutante> gerrit1003 - sudo -u gerrit2 /usr/bin/scap deploy-local --repo gerrit/gerrit -D log_json:False (manually it works, but that's the same command that puppet is supposed to run !?) - T257317 T326368
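
To find hosts that need the manual .config edit logged above, a small check along these lines might help. It is only a sketch: it assumes .config contains simple "key: value" lines (as the grep output above suggests) and that the files live at <root>/<name>/<name>-cache/.config; find_stale_git_servers is a hypothetical helper, not part of scap.

```python
import glob
import os


def find_stale_git_servers(root, stale_host):
    """Return the .config paths under root whose git_server equals stale_host."""
    stale = []
    # Assumed layout, e.g. /srv/deployment/iegreview/iegreview-cache/.config
    pattern = os.path.join(root, "*", "*-cache", ".config")
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            for line in f:
                key, sep, value = line.partition(":")
                if sep and key.strip() == "git_server" and value.strip() == stale_host:
                    stale.append(path)
                    break
    return stale


if __name__ == "__main__":
    for path in find_stale_git_servers("/srv/deployment", "tin.eqiad.wmnet"):
        print("stale git_server in", path)
```

This only reports the files; replacing the value is left as a manual step, as in the SAL entries.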

I ran into this again when trying to create a deployment_server on bullseye to replace buster machines in devtools.

T363415#9762416 - fixed so far by manually running scap deploy --init -Dblock_deployments:False in each deploy dir under /srv/deployments.

This works except for the "gervert" repo, which somehow appears by default but can no longer find a gerrit server dsh group.
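
The manual workaround could be scripted roughly as follows. This is a sketch under assumptions: it treats any directory containing scap/scap.cfg as a deploy dir, and init_commands is a hypothetical helper. The scap arguments are the ones quoted in the workaround above.

```python
import os


def init_commands(root):
    """Return a sorted list of (directory, argv) for each deploy dir under root.

    A deploy dir is assumed to be any directory that contains scap/scap.cfg.
    """
    argv = ["scap", "deploy", "--init", "-Dblock_deployments:False"]
    cmds = []
    for dirpath, dirnames, _filenames in os.walk(root):
        if os.path.isfile(os.path.join(dirpath, "scap", "scap.cfg")):
            cmds.append((dirpath, list(argv)))
            dirnames[:] = []  # do not descend into a repo once found
        else:
            # Skip hidden directories such as .git
            dirnames[:] = [d for d in dirnames if not d.startswith(".")]
    return sorted(cmds)


if __name__ == "__main__":
    # Dry run: print what would be executed instead of running it.
    for directory, argv in init_commands("/srv/deployment"):
        print("cd {} && {}".format(directory, " ".join(argv)))
```

The dry run only prints the commands; to actually run them, each argv could be passed to subprocess.check_call with cwd set to the directory.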