Deploy failed on wtp2017.codfw.wmnet
Closed, Resolved (Public)

Description

In T149012, we saw that deploying to wtp2017.codfw.wmnet failed:

20:29:29 ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'parsoid/deploy', '-g', 'default4', 'promote', '--refresh-config'] on wtp2017.codfw.wmnet returned [1]:

Looking at the logs on deployment.eqiad.wmnet in /srv/deployment/parsoid/deploy/scap/log,

./scap-sync-2016-10-24-0001.log:{"name": "target.wtp2017.codfw.wmnet.checks", "created": 1477340969.825173, "args": [], "msecs": 825.1729011535645, "filename": "checks.py", "levelno": 30, "msg": "Check 'repool' failed: Pooling wtp2017.codfw.wmnet from service=parsoid...\nERROR:conftool:Error when trying to set/pooled=yes on service=parsoid,name=wtp2017.codfw.wmnet\nERROR:conftool:Failure writing to the kvstore: Backend error: Raft Internal Error : etcdserver: request timed out\n", "host": "wtp2017.codfw.wmnet", "lineno": 70, "exc_text": null, "funcName": "handle_failure", "relativeCreated": 35622.64394760132}

Not sure what to make of that. Is it transient? What should I do if I encounter something like that again? In this case, I just removed that target and redeployed, which doesn't seem great.

Can I try scap deploy -l wtp2017.codfw.wmnet now? (Assuming that does what I think it does.)

Sorry for the naiveté.

Event Timeline

scap deploy-log -f scap/log/scap-sync-2016-10-24-0001.log

Will give you better error output. It looks like the repooling check failed on wtp2017:

20:29:29 [wtp2017.codfw.wmnet] Check 'repool' failed: Pooling wtp2017.codfw.wmnet from service=parsoid...
ERROR:conftool:Error when trying to set/pooled=yes on service=parsoid,name=wtp2017.codfw.wmnet
ERROR:conftool:Failure writing to the kvstore: Backend error: Raft Internal Error : etcdserver: request timed out

20:29:29 [mira] [u'/usr/bin/scap', u'deploy-local', u'-v', u'--repo', u'parsoid/deploy', u'-g', u'default4', u'promote', u'--refresh-config'] on wtp2017.codfw.wmnet returned [1]: 
20:29:29 [mira] 1 targets had deploy errors

I'm not sure how to troubleshoot that. It's failing on the command pool service=parsoid, if that's helpful (it's defined in scap/checks.yaml).

To deploy to only wtp2017, first add wtp2017 back to the targets list (I saw it was removed), then use: scap deploy -v -l wtp2017.codfw.wmnet

You may need to use the --force flag since the revision was deployed (unless it was rolled back), so: scap deploy -v --force -l wtp2017.codfw.wmnet

Feel free to ping me in IRC.

Hmm, so this looks like the new symptom of T145518

mobrovac added a project: User-mobrovac.

The problem here is the depooling / repooling scripts used during the deploy. As part of T145518: Depool / repool scripts execute successfully even when the host has not been (r|d)epooled, we have developed more robust scripts, but these cannot be used currently because they do not exist in Beta, so using them would make all BetaCluster deploys fail. Hence this task is effectively blocked by T149668: Smart-merge checks for different environments, which ought to resolve the problem. In the meantime, I will upload an appropriate patch for the deploy repo, but let's not merge it until the blocker has been resolved.
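To illustrate the failure mode from T145518 (a script exiting successfully although the host was never actually (r|d)epooled), here is a minimal sketch of a verify-after-write loop. It is not the code from the patch: `set_state` and `get_state` are hypothetical callables standing in for whatever writes and reads the pooled state, and the retry count and delay are arbitrary assumptions.

```python
import time


def set_pooled_and_verify(set_state, get_state, desired, attempts=3, delay=1.0):
    """Return True only once get_state() reports the desired state.

    set_state and get_state are hypothetical callables supplied by the
    caller; they are placeholders, not a real conftool or scap API.
    """
    for attempt in range(attempts):
        try:
            set_state(desired)
        except Exception:
            # A write error (for example a backend timeout) is not fatal
            # by itself; fall through to the verification step.
            pass
        try:
            if get_state() == desired:
                return True
        except Exception:
            # A failed read counts as "not verified"; try again.
            pass
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

A check built this way would report failure (for example by exiting non-zero when the function returns False) instead of succeeding silently, which is the behaviour the deploy checks need.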

Change 319039 had a related patch set uploaded (by Mobrovac):
Use the new pooling and depooling scripts

https://gerrit.wikimedia.org/r/319039

@mobrovac Can I dirty my local tree with those changes (command: depool-parsoid) when deploying tomorrow? Getting through a deploy cleanly hasn't been possible.

Yup, @Arlolra, that should be just fine. Just don't check out or cherry-pick that commit, as it will make Scap go crazy.

Well, anecdotally, that seemed to help. Unfortunately, it still didn't get through cleanly, and in my haste I failed to note the issue. It seems like scap-sync-2016-11-02-0001.log was overwritten when I reran scap deploy? (or, at least searching for "fail" isn't turning up anything)
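As an alternative to searching the log for "fail", the records can be filtered by level. This is a minimal sketch, assuming every line in the sync log is a JSON object shaped like the one quoted in the description (with "levelno", "host" and "msg" fields); the function name is made up for illustration, and the threshold of 30 is taken from the "levelno" of that failed-check record.

```python
import json

WARNING = 30  # matches "levelno": 30 in the failed-check record above


def failed_records(lines, min_level=WARNING):
    """Yield (host, msg) for each JSON log record at or above min_level."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except ValueError:
            continue  # skip lines that are not valid JSON
        if not isinstance(record, dict):
            continue
        if record.get("levelno", 0) >= min_level:
            yield record.get("host", "?"), record.get("msg", "")


# Example use (the path is the log file mentioned in this comment):
# with open("scap/log/scap-sync-2016-11-02-0001.log") as fh:
#     for host, msg in failed_records(fh):
#         print(host, msg)
```

If this prints nothing for a log that should contain a failure, that would support the suspicion that the file was overwritten by the second run.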

Change 319039 merged by Mobrovac:
Use the new pooling and depooling scripts

https://gerrit.wikimedia.org/r/319039

mobrovac edited projects, added Services (done); removed Patch-For-Review.

The new scripts will now officially be used, resolving.