
Scap3 promote stage not working
Closed, Resolved, Public

Description

Today I tried to deploy a new service, changeprop. Scap3 was able to fetch and check out the code, but the promote stage failed without reporting a clear reason.

mobrovac@tin:/srv/deployment/changeprop/deploy$ deploy -v --force
18:06:13 Started deploy_changeprop/deploy
18:06:13 Update server info
Entering 'src'
18:06:13 
== CANARY ==
:* scb1001.eqiad.wmnet
18:06:13 Running remote deploy cmd ['/usr/bin/deploy-local', '-v', '--repo', 'changeprop/deploy', '--force', '-g', 'canary', 'fetch']
18:06:13 Creating /srv/deployment/changeprop/deploy/.git/DEPLOY_HEAD
deploy_changeprop/deploy_fetch: 100% (ok: 1; fail: 0; left: 0)                  
18:06:15 Running remote deploy cmd ['/usr/bin/deploy-local', '-v', '--repo', 'changeprop/deploy', '--force', '-g', 'canary', 'config_deploy']
18:06:15 Creating /srv/deployment/changeprop/deploy/.git/DEPLOY_HEAD
deploy_changeprop/deploy_config_deploy: 100% (ok: 1; fail: 0; left: 0)          
18:06:15 Running remote deploy cmd ['/usr/bin/deploy-local', '-v', '--repo', 'changeprop/deploy', '--force', '-g', 'canary', 'promote']
18:06:15 Creating /srv/deployment/changeprop/deploy/.git/DEPLOY_HEAD
18:06:15 ['/usr/bin/deploy-local', '-v', '--repo', 'changeprop/deploy', '--force', '-g', 'canary', 'promote'] on scb1001.eqiad.wmnet returned [70]: 18:06:15 INFO     - Starting new HTTP connection (1): deployment.eqiad.wmnet

deploy_changeprop/deploy_promote: 100% (ok: 0; fail: 1; left: 0)                
18:06:15 1 targets had deploy errors
Stage 'promote' failed on group 'canary'. Perform rollback? [y]: n
18:06:27 Finished deploy_changeprop/deploy (duration: 00m 13s)

What I gather here is that deploy-local is trying to connect to deployment.eqiad.wmnet. However, the service's scap.cfg says:

[global]
git_repo: changeprop/deploy
git_deploy_dir: /srv/deployment
git_repo_user: deploy-service
ssh_user: deploy-service
server_groups: canary, default
canary_dsh_targets: changeprop-canary
dsh_targets: changeprop
git_submodules: True
service_name: changeprop
service_port: 7272
lock_file: /tmp/scap.changeprop.lock

[wmnet]
git_server: tin.eqiad.wmnet
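To see which git_server value a given host would end up with, the config can be resolved by hand. The following is a minimal sketch, not scap's own loader: it assumes that sections are overlaid on [global] from the least to the most specific domain suffix of the host name, and that the tool ships a built-in default for git_server. Both the DEFAULTS dict and the effective_config helper are hypothetical names introduced for illustration.

```python
import configparser

# Hypothetical built-in default, standing in for whatever the tool ships with.
DEFAULTS = {"git_server": "deployment.eqiad.wmnet"}

# Abbreviated copy of the scap.cfg quoted above.
SCAP_CFG = """\
[global]
git_repo: changeprop/deploy
service_name: changeprop
service_port: 7272

[wmnet]
git_server: tin.eqiad.wmnet
"""


def effective_config(cfg_text, fqdn, defaults=None):
    """Overlay [global] and any matching domain-suffix sections, in that order."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    labels = fqdn.split(".")
    # Least specific suffix first, e.g. wmnet, eqiad.wmnet, scb1001.eqiad.wmnet
    suffixes = [".".join(labels[i:]) for i in range(len(labels) - 1, -1, -1)]
    merged = dict(defaults or {})
    for section in ["global"] + suffixes:
        if parser.has_section(section):
            merged.update(parser.items(section))
    return merged


if __name__ == "__main__":
    cfg = effective_config(SCAP_CFG, "scb1001.eqiad.wmnet", DEFAULTS)
    print(cfg["git_server"])
```

Under these assumptions a host in .wmnet resolves git_server from the [wmnet] section, and a host outside it falls back to the default.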

Event Timeline

mobrovac created this task. Mar 25 2016, 6:12 PM
Restricted Application added a subscriber: Aklapper. Mar 25 2016, 6:12 PM
mobrovac triaged this task as High priority. Mar 25 2016, 11:53 PM

It turned out there was a bug in the deploy repository. This has since been fixed, but the above error persists. So, now we are in a situation where scb1001 has the correct code checked out and the good symlink is in place (the service is running there), but the same failure is still being reported, hindering the deployment to the other nodes.

I'm triaging this as High since it blocks T128463: New Service Request - Change Propagation.

For reference, the contents of the deploy's scap folder can be consulted here.

As a work-around, I have hacked the list of targets on tin so that it gets deployed to all of the hosts in one round:

diff --git a/scap/changeprop b/scap/changeprop
index f3d8b7b..0cd7f5f 100644
--- a/scap/changeprop
+++ b/scap/changeprop
@@ -1,3 +1,4 @@
+scb1001.eqiad.wmnet
 scb1002.eqiad.wmnet
 scb2001.codfw.wmnet
 scb2002.codfw.wmnet
diff --git a/scap/scap.cfg b/scap/scap.cfg
index 5b32308..0f9e3e2 100644
--- a/scap/scap.cfg
+++ b/scap/scap.cfg
@@ -3,8 +3,7 @@ git_repo: changeprop/deploy
 git_deploy_dir: /srv/deployment
 git_repo_user: deploy-service
 ssh_user: deploy-service
-server_groups: canary, default
-canary_dsh_targets: changeprop-canary
+server_groups: default
 dsh_targets: changeprop
 git_submodules: True
 service_name: changeprop

This did the trick. After manually restarting the service on all nodes, it is up.

I've even gotten a traceback this time:

mobrovac@tin:/srv/deployment/changeprop/deploy$ deploy --verbose --force
00:03:33 Started deploy_changeprop/deploy
00:03:33 Update server info
Entering 'src'
00:03:33 
== DEFAULT ==
:* scb1001.eqiad.wmnet
:* scb2002.codfw.wmnet
:* scb2001.codfw.wmnet
:* scb1002.eqiad.wmnet
00:03:33 Running remote deploy cmd ['/usr/bin/deploy-local', '-v', '--repo', 'changeprop/deploy', '--force', '-g', 'default', 'fetch']
00:03:33 Creating /srv/deployment/changeprop/deploy/.git/DEPLOY_HEAD
deploy_changeprop/deploy_fetch: 100% (ok: 4; fail: 0; left: 0)                  
00:03:37 Running remote deploy cmd ['/usr/bin/deploy-local', '-v', '--repo', 'changeprop/deploy', '--force', '-g', 'default', 'config_deploy']
00:03:37 Creating /srv/deployment/changeprop/deploy/.git/DEPLOY_HEAD
deploy_changeprop/deploy_config_deploy: 100% (ok: 4; fail: 0; left: 0)          
00:03:38 Running remote deploy cmd ['/usr/bin/deploy-local', '-v', '--repo', 'changeprop/deploy', '--force', '-g', 'default', 'promote']
00:03:38 Creating /srv/deployment/changeprop/deploy/.git/DEPLOY_HEAD
00:03:38 ['/usr/bin/deploy-local', '-v', '--repo', 'changeprop/deploy', '--force', '-g', 'default', 'promote'] on scb1002.eqiad.wmnet returned [70]: 00:03:38 INFO     - Starting new HTTP connection (1): deployment.eqiad.wmnet

00:03:38 ['/usr/bin/deploy-local', '-v', '--repo', 'changeprop/deploy', '--force', '-g', 'default', 'promote'] on scb1001.eqiad.wmnet returned [70]: 00:03:38 INFO     - Starting new HTTP connection (1): deployment.eqiad.wmnet

00:03:38 ['/usr/bin/deploy-local', '-v', '--repo', 'changeprop/deploy', '--force', '-g', 'default', 'promote'] on scb2002.codfw.wmnet returned [70]: 00:03:38 INFO     - Starting new HTTP connection (1): deployment.eqiad.wmnet
{"name": "deploy-local", "created": 1458950618.867726, "args": [], "msecs": 867.7260875701904, "filename": "cli.py", "levelno": 30, "msg": "Unhandled error:", "lineno": 211, "exc_text": "Traceback (most recent call last):\n  File \"/usr/lib/python2.7/dist-packages/scap/cli.py\", line 277, in run\n    exit_status = app.main(extra_args)\n  File \"/usr/lib/python2.7/dist-packages/scap/deploy.py\", line 79, in main\n    getattr(self, stage)()\n  File \"/usr/lib/python2.7/dist-packages/scap/deploy.py\", line 256, in promote\n    tasks.restart_service(service, user=self.context.user)\n  File \"/usr/lib/python2.7/dist-packages/scap/utils.py\", line 302, in context_wrapper\n    return func(*args, **kwargs)\n  File \"/usr/lib/python2.7/dist-packages/scap/tasks.py\", line 570, in restart_service\n    utils.sudo_check_call(user, cmd_format.format(service, 'restart'))\n  File \"/usr/lib/python2.7/dist-packages/scap/utils.py\", line 302, in context_wrapper\n    return func(*args, **kwargs)\n  File \"/usr/lib/python2.7/dist-packages/scap/utils.py\", line 399, in sudo_check_call\n    raise subprocess.CalledProcessError(proc.returncode, cmd)\nCalledProcessError: Command 'sudo /usr/sbin/service changeprop restart' returned non-zero exit status 1", "funcName": "_handle_exception", "relativeCreated": 210.8781337738037}

00:03:38 ['/usr/bin/deploy-local', '-v', '--repo', 'changeprop/deploy', '--force', '-g', 'default', 'promote'] on scb2001.codfw.wmnet returned [70]: 00:03:38 INFO     - Starting new HTTP connection (1): deployment.eqiad.wmnet

deploy_changeprop/deploy_promote: 100% (ok: 0; fail: 4; left: 0)                
00:03:38 4 targets had deploy errors
Stage 'promote' failed on group 'default'. Perform rollback? [y]: n
00:04:09 Finished deploy_changeprop/deploy (duration: 00m 36s)
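The traceback in the output above is buried in a single-line JSON log record, with newlines escaped inside the "exc_text" field. A small helper along these lines can make it readable; this is only a sketch, and the function names are made up for illustration. It relies on nothing beyond the "exc_text" key that is visible in the record.

```python
import json


def extract_traceback(log_line):
    """Return the traceback text embedded in a JSON log record, or None."""
    start = log_line.find("{")
    if start == -1:
        return None
    try:
        # raw_decode tolerates trailing text after the JSON object.
        record, _end = json.JSONDecoder().raw_decode(log_line, start)
    except ValueError:
        return None
    if not isinstance(record, dict):
        return None
    return record.get("exc_text")


def last_error_line(log_line):
    """Return only the final line of the traceback (the exception itself)."""
    text = extract_traceback(log_line)
    if not text:
        return None
    return text.strip().splitlines()[-1]


def print_tracebacks(lines):
    """Print every traceback found in an iterable of log lines."""
    for line in lines:
        text = extract_traceback(line)
        if text:
            print(text)
```

Calling print_tracebacks(sys.stdin) on the captured deploy output would print the decoded traceback with real line breaks.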
thcipriani added a comment (edited). Mar 26 2016, 1:25 AM

So if it's failing on promote it's got to be:

  1. Failure to swap a symlink
  2. Failure to restart a service (if one is defined)
  3. Failure of a check for a service port (if one is defined)
  4. Failure of any post-promote checks (if a checks.yaml is present)

Since fetch works and there isn't a checks.yaml, it's probably something to do with the service.
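The four failure points listed above can be pictured as an ordered sequence of steps in which the first failure aborts the stage. The sketch below is not scap's implementation: promote, PromoteError and the step names are hypothetical, and the steps are passed in as plain callables. It only illustrates how reporting the failing step by name would narrow down an error like this one.

```python
class PromoteError(Exception):
    """Raised when one step of the (hypothetical) promote sequence fails."""

    def __init__(self, step, cause):
        super().__init__("promote failed at step '%s': %s" % (step, cause))
        self.step = step
        self.cause = cause


def promote(swap_symlink, restart_service=None, check_port=None, run_checks=None):
    """Run the promote steps in order; optional steps are skipped when None."""
    steps = [
        ("symlink", swap_symlink),      # 1. swap the symlink
        ("restart", restart_service),   # 2. restart the service, if defined
        ("port", check_port),           # 3. check the service port, if defined
        ("checks", run_checks),         # 4. post-promote checks, if present
    ]
    for name, action in steps:
        if action is None:
            continue
        try:
            action()
        except Exception as exc:
            # Stop at the first failure and say which step it was.
            raise PromoteError(name, exc) from exc
```

With a restart callable that raises, the resulting PromoteError carries step == "restart", and the port check is never reached.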

Ah: Running deploy-log on deployment.eqiad.wmnet:/srv/deployment/changeprop/deploy shows that for whatever reason service restart permissions aren't working as expected:

00:03:38 [scb2001.codfw.wmnet] Last output:
We trust you have received the usual lecture from the local System
Administrator. It usually boils down to these three things:
#1) Respect the privacy of others.
#2) Think before you type.
#3) With great power comes great responsibility.
sudo: no tty present and no askpass program specified
00:03:38 [scb2001.codfw.wmnet] Unhandled error:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/scap/cli.py", line 277, in run
    exit_status = app.main(extra_args)
  File "/usr/lib/python2.7/dist-packages/scap/deploy.py", line 79, in main
    getattr(self, stage)()
  File "/usr/lib/python2.7/dist-packages/scap/deploy.py", line 256, in promote
    tasks.restart_service(service, user=self.context.user)
  File "/usr/lib/python2.7/dist-packages/scap/utils.py", line 302, in context_wrapper
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/scap/tasks.py", line 570, in restart_service
    utils.sudo_check_call(user, cmd_format.format(service, 'restart'))
  File "/usr/lib/python2.7/dist-packages/scap/utils.py", line 302, in context_wrapper
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/scap/utils.py", line 399, in sudo_check_call
    raise subprocess.CalledProcessError(proc.returncode, cmd)
CalledProcessError: Command 'sudo /usr/sbin/service changeprop restart' returned non-zero exit status 1
00:03:38 [scb2001.codfw.wmnet] deploy-local failed: <CalledProcessError> {u'cmd': u'sudo /usr/sbin/service changeprop restart', u'output': None, u'returncode': 1}
00:03:38 [tin] [u'/usr/bin/deploy-local', u'-v', u'--repo', u'changeprop/deploy', u'--force', u'-g', u'default', u'promote'] on scb2001.codfw.wmnet returned [70]: 00:03:38 INFO     - Starting new HTTP connection (1): deployment.eqiad.wmnet

00:03:38 [tin] 4 targets had deploy errors
00:04:09 [tin] Finished deploy_changeprop/deploy (duration: 00m 36s)
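The "no tty present and no askpass program specified" message means sudo wanted to prompt for a password, which suggests that no passwordless rule matched the command. One way to confirm this on a target without actually restarting anything is to ask sudo whether the command is permitted, using its -n (never prompt) and -l (list/check) options. The wrapper below is a sketch; the function names are hypothetical, and it has to be run as the deploy user for the answer to be meaningful.

```python
import subprocess


def sudo_probe_command(command):
    """Build a sudo invocation that only asks whether `command` is permitted."""
    # -n: never prompt for a password
    # -l <command>: check the command against sudoers instead of running it
    return ["sudo", "-n", "-l"] + list(command)


def can_run_without_password(command, runner=subprocess.call):
    """Return True if sudo reports the command as allowed without prompting."""
    return runner(sudo_probe_command(command)) == 0


if __name__ == "__main__":
    cmd = ["/usr/sbin/service", "changeprop", "restart"]
    print(can_run_without_password(cmd))
```

The runner parameter exists only so the check can be exercised without a real sudo setup.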

Is the box that is running changeprop used for any other scap::targets that use the deploy-service user?

I am trying to think of why the deploy-service user wouldn't be able to restart the changeprop service. It seems like the permissions should be set up correctly unless scap::target is used twice with the same user because of this line:
https://github.com/wikimedia/operations-puppet/blob/production/modules/scap/manifests/target.pp#L102

if !defined(Sudo::User["scap_${deploy_user}"]) {
    sudo::user { "scap_${deploy_user}":
        user       => $deploy_user,
        privileges => concat($privileges, $sudo_rules),
    }
}

Sudo privileges would only be set up once for a given user on a given node, so any later scap::target using the same user would not get its own rules applied.

I am trying to think of why the deploy-service user wouldn't be able to restart the changeprop service. It seems like the permissions should be set up correctly unless scap::target is used twice with the same user because of this line:

Indeed, that is the crux of the problem. scb[12]00[12] are used for a total of 6 services, 2 of which currently have deployment => 'scap3' in their Puppet manifests (Citoid and Changeprop). Eventually, all of them will be switched to Scap3.
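The effect of the `if !defined(...)` guard quoted earlier can be simulated in a few lines. This is a sketch in Python rather than Puppet: the rule strings are illustrative rather than copied from the real manifests, and the order in which the two services are declared is an assumption. The second function shows one possible alternative, collecting the rules of every declaration for the same user; it is not necessarily what the eventual patch does.

```python
def apply_targets(targets):
    """Simulate the guard: only the first declaration per deploy user is kept."""
    sudo_users = {}
    for target in targets:
        key = "scap_" + target["deploy_user"]
        if key not in sudo_users:
            sudo_users[key] = list(target["sudo_rules"])
    return sudo_users


def merge_targets(targets):
    """Alternative: accumulate the rules from every declaration for a user."""
    merged = {}
    for target in targets:
        rules = merged.setdefault("scap_" + target["deploy_user"], [])
        for rule in target["sudo_rules"]:
            if rule not in rules:
                rules.append(rule)
    return merged


# Illustrative rule strings; assumes citoid happens to be declared first.
TARGETS = [
    {"deploy_user": "deploy-service",
     "sudo_rules": ["ALL=(root) NOPASSWD: /usr/sbin/service citoid restart"]},
    {"deploy_user": "deploy-service",
     "sudo_rules": ["ALL=(root) NOPASSWD: /usr/sbin/service changeprop restart"]},
]

if __name__ == "__main__":
    print(apply_targets(TARGETS)["scap_deploy-service"])
    print(merge_targets(TARGETS)["scap_deploy-service"])
```

In the first-wins simulation the changeprop rule never reaches the sudo configuration, which matches the password prompt seen in the log.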

Thank you @thcipriani for taking the time to look into it on your day off, really appreciate it!

Change 279717 had a related patch set uploaded (by Mobrovac):
scap::target: Allow scap's user to restart all services on a node

https://gerrit.wikimedia.org/r/279717

mobrovac claimed this task. Mar 29 2016, 3:39 PM

Change 279717 merged by Faidon Liambotis:
scap::target: Allow scap's user to restart all services on a node

https://gerrit.wikimedia.org/r/279717

mobrovac closed this task as Resolved. Apr 1 2016, 10:06 AM
mobrovac removed a project: Patch-For-Review.
mobrovac removed a subscriber: gerritbot.