Failed to rollback scap3 deployment
Closed, Resolved · Public

Description

I just tried to deploy the kartotherian service using scap3. The canary target failed (the service kept restarting), so scap3, after hanging for a while, offered to roll back. I accepted (hit Enter), but the git repo on the canary target stayed at the same version as the deployment dir, so the service kept restarting.

yurik@maps-test2001:/srv/deployment/kartotherian/deploy$ git log
commit 2b93211b683c277f0614e1b3f17f48d12c9b170d

yurik@tin:/srv/deployment/kartotherian/deploy$ git log
commit 2b93211b683c277f0614e1b3f17f48d12c9b170d
yurik@tin:/srv/deployment/kartotherian/deploy$ scap deploy -v
00:23:06 Started Deploy: kartotherian/deploy
00:23:06 Deploying Rev: 2b93211b683c277f0614e1b3f17f48d12c9b170d
00:23:06 {'server_groups': 'canary, default', 'statsd_host': 'statsd.eqiad.wmnet', 'master_rsync': 'deployment.eqiad.wmnet', 'service_name': 'kartotherian', 'hhvm_pid_file': '/run/hhvm/hhvm.pid', 'dsh_targets': 'targets', 'deploy_dir': '/srv/mediawiki', 'git_server': 'tin.eqiad.wmnet', 'udp2log_host': 'fluorine.eqiad.wmnet', 'git_repo_user': 'deploy-service', 'perform_checks': True, 'git_fat': False, 'bin_dir': '/usr/bin', 'ssh_auth_sock': '/run/keyholder/proxy.sock', 'apache_pid_file': '/var/run/apache2/apache2.pid', 'pybal_interface': 'lo:LVS', 'ssh_user': 'deploy-service', 'canary_dsh_targets': 'target-canary', 'canary_threshold': 10.0, 'logstash_host': 'logstash1001.eqiad.wmnet:9200', 'nrpe_dir': '/etc/nagios/nrpe.d', 'wmf_realm': 'production', 'lock_file': '/tmp/scap.kartotherian.lock', 'tcpircbot_host': 'neon.wikimedia.org', 'service_timeout': 120.0, 'git_deploy_dir': '/srv/deployment', 'patch_path': None, 'dsh_proxies': 'scap-proxies', 'git_scheme': 'http', 'git_repo': 'kartotherian/deploy', 'datacenter': 'eqiad', 'udp2log_port': '8420', 'canary_wait_time': 20, 'config_deploy': False, 'statsd_port': '8125', 'service_port': '6533', 'dsh_masters': 'scap-masters', 'dsh_api_canaries': 'mediawiki-api-canaries', 'git_submodules': True, 'dsh_app_canaries': 'mediawiki-appserver-canaries', 'git_upstream_submodules': False, 'stage_dir': '/srv/mediawiki-staging', 'tcpircbot_port': '9200', 'log_json': False}
00:23:06 Update DEPLOY_HEAD
00:23:06 Creating /srv/deployment/kartotherian/deploy/.git/DEPLOY_HEAD
00:23:06 Update server info
Entering 'src'
00:23:06 
== CANARY ==
:* maps-test2001.codfw.wmnet
00:23:06 Running remote deploy cmd ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'kartotherian/deploy', '-g', 'canary', 'fetch']
kartotherian/deploy: fetch stage(s): 100% (ok: 1; fail: 0; left: 0)             
00:23:15 Running remote deploy cmd ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'kartotherian/deploy', '-g', 'canary', 'config_deploy']
kartotherian/deploy: config_deploy stage(s): 100% (ok: 1; fail: 0; left: 0)     
00:23:16 Running remote deploy cmd ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'kartotherian/deploy', '-g', 'canary', 'promote']
00:25:18 ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'kartotherian/deploy', '-g', 'canary', 'promote'] on maps-test2001.codfw.wmnet returned [70]: 
kartotherian/deploy: promote and restart_service stage(s): 100% (ok: 0; fail: 1; left: 0)
00:25:18 1 targets had deploy errors
Stage 'promote' failed on group 'canary'. Perform rollback? [y]: 
00:25:36 Running remote deploy cmd ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'kartotherian/deploy', '-g', 'canary', 'rollback']
kartotherian/deploy: rollback stage(s): 100% (ok: 1; fail: 0; left: 0)          
00:25:37 Finished Deploy: kartotherian/deploy (duration: 02m 30s)

Revisions and Commits

rMSCA Scap
Restricted Differential Revision

Event Timeline

Restricted Application added a subscriber: Aklapper.
thcipriani subscribed.

So it looks like scap tried to restart the service on the canary host and then tried to reach port 6533. The timeout for this check is 120 seconds, so that was probably the hang.
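For illustration, the kind of port check described above can be sketched as follows. This is not scap's actual check_port; it is a minimal Python 3 approximation that assumes a plain TCP connect against localhost, a 3-second retry interval (matching the "Waiting 3.00s" log lines), and errno 107 (ENOTCONN on Linux) for the failure. The host and interval parameters are hypothetical.

```python
import errno
import socket
import time


def check_port(port, timeout, interval=3.0, host="127.0.0.1"):
    """Poll host:port until it accepts a TCP connection or `timeout` elapses.

    Returns True once a connection succeeds; raises OSError otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: the service is not up yet.
            pass
        if time.monotonic() >= deadline:
            raise OSError(
                errno.ENOTCONN,
                "Port {} not up within {:.2f}s".format(port, timeout),
            )
        time.sleep(interval)
```

With timeout=120 and a service that never binds its port, a loop like this blocks for roughly two minutes before raising, which is consistent with the gap between 00:23:16 and 00:25:18 in the deploy output.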

The deploy-log for this deploy looks like:

$ scap deploy-log -f scap/log/scap-sync-2016-08-09-0001-3-g2b93211.log -v
...
00:24:42 [maps-test2001.codfw.wmnet] Port 6533 not up. Waiting 3.00s
00:24:45 [maps-test2001.codfw.wmnet] Port 6533 not up. Waiting 3.00s
00:24:48 [maps-test2001.codfw.wmnet] Port 6533 not up. Waiting 3.00s
00:24:51 [maps-test2001.codfw.wmnet] Port 6533 not up. Waiting 3.00s
00:24:54 [maps-test2001.codfw.wmnet] Port 6533 not up. Waiting 3.00s
00:24:57 [maps-test2001.codfw.wmnet] Port 6533 not up. Waiting 3.00s
00:25:00 [maps-test2001.codfw.wmnet] Port 6533 not up. Waiting 3.00s
00:25:03 [maps-test2001.codfw.wmnet] Port 6533 not up. Waiting 3.00s
00:25:06 [maps-test2001.codfw.wmnet] Port 6533 not up. Waiting 3.00s
00:25:09 [maps-test2001.codfw.wmnet] Port 6533 not up. Waiting 3.00s
00:25:12 [maps-test2001.codfw.wmnet] Port 6533 not up. Waiting 3.00s
00:25:15 [maps-test2001.codfw.wmnet] Port 6533 not up. Waiting 3.00s
00:25:18 [maps-test2001.codfw.wmnet] Port 6533 not up. Waiting 3.00s
00:25:18 [maps-test2001.codfw.wmnet] Unhandled error:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/scap/cli.py", line 255, in run
    exit_status = app.main(app.extra_arguments)
  File "/usr/lib/python2.7/dist-packages/scap/deploy.py", line 88, in main
    getattr(self, stage)()
  File "/usr/lib/python2.7/dist-packages/scap/deploy.py", line 285, in restart_service
    tasks.check_port(int(port), timeout=service_timeout)
  File "/usr/lib/python2.7/dist-packages/scap/utils.py", line 314, in context_wrapper
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/scap/tasks.py", line 745, in check_port
    'Port {} not up within {:.2f}s'.format(port, timeout)
OSError: [Errno 107] Port 6533 not up within 120.00s
00:25:18 [maps-test2001.codfw.wmnet] deploy-local failed: <OSError> {}
00:25:18 [tin] [u'/usr/bin/scap', u'deploy-local', u'-v', u'--repo', u'kartotherian/deploy', u'-g', u'canary', u'promote'] on maps-test2001.codfw.wmnet returned [70]: 
00:25:18 [tin] 1 targets had deploy errors
00:25:36 [tin] Running remote deploy cmd ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'kartotherian/deploy', '-g', 'canary', 'rollback']
00:25:36 [maps-test2001.codfw.wmnet] No rollback necessary. Skipping
00:25:37 [tin] Finished Deploy: kartotherian/deploy (duration: 02m 30s)

Everything looks normal except for the "No rollback necessary." message. That should only happen if scap can't find the file that links to the revision pointed to by .in-progress on the remote (https://github.com/wikimedia/scap/blob/master/scap/deploy.py#L291-L294).

This means that something went wrong with /srv/deployment/kartotherian/deploy-cache/.in-progress
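One way to inspect a marker like this on a target is sketched below. It is a hypothetical diagnostic, not part of scap, and it assumes the marker is stored as a symlink to a revision directory, which this task does not confirm. The describe_marker helper is a name introduced only for this example.

```python
import os


def describe_marker(path):
    """Report the state of a marker file such as deploy-cache/.in-progress.

    Assumption: the marker is expected to be a symlink to a revision directory.
    """
    if not os.path.lexists(path):
        # Nothing at all at this path, not even a broken link.
        return "missing"
    if not os.path.islink(path):
        return "present, not a symlink"
    target = os.readlink(path)
    if not os.path.exists(path):
        # os.path.exists follows the link, so False here means a broken link.
        return "dangling symlink -> {}".format(target)
    return "symlink -> {}".format(target)
```

Running describe_marker("/srv/deployment/kartotherian/deploy-cache/.in-progress") on the canary would show whether the marker was missing or pointed somewhere unexpected at rollback time.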

thcipriani added a revision: Restricted Differential Revision. Aug 12 2016, 9:11 PM

Landed commit, still unreleased. Leaving open until released.

thcipriani claimed this task.

Should be fixed in latest release.

Thank you for filing the task. Bad bug, glad it's gone.