Deploy failed on wtp2017.codfw.wmnet
Closed, Resolved · Public

Assigned To

Authored By

	Arlolra
	Oct 25 2016, 8:29 PM

Description

In T149012, we saw that deploying to wtp2017.codfw.wmnet failed:

20:29:29 ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'parsoid/deploy', '-g', 'default4', 'promote', '--refresh-config'] on wtp2017.codfw.wmnet returned [1]:

Looking at the logs on deployment.eqiad.wmnet in /srv/deployment/parsoid/deploy/scap/log,

./scap-sync-2016-10-24-0001.log:{"name": "target.wtp2017.codfw.wmnet.checks", "created": 1477340969.825173, "args": [], "msecs": 825.1729011535645, "filename": "checks.py", "levelno": 30, "msg": "Check 'repool' failed: Pooling wtp2017.codfw.wmnet from service=parsoid...\nERROR:conftool:Error when trying to set/pooled=yes on service=parsoid,name=wtp2017.codfw.wmnet\nERROR:conftool:Failure writing to the kvstore: Backend error: Raft Internal Error : etcdserver: request timed out\n", "host": "wtp2017.codfw.wmnet", "lineno": 70, "exc_text": null, "funcName": "handle_failure", "relativeCreated": 35622.64394760132}
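For reference, a raw log of this shape can also be filtered by hand. The following is only a minimal sketch, assuming each line of the log file is a single JSON object carrying the `levelno`, `host`, and `msg` fields seen in the entry above; the function name `failed_checks` is hypothetical and not part of scap.

```python
import json


def failed_checks(lines, min_level=30):
    """Yield (host, msg) for JSON log records that report a failure.

    Assumes one JSON object per line with "levelno", "host" and "msg"
    keys, as in the log entry quoted above. 30 is Python's WARNING level.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except ValueError:
            continue  # skip anything that is not valid JSON
        if not isinstance(record, dict):
            continue
        if record.get("levelno", 0) >= min_level and "failed" in record.get("msg", ""):
            yield record.get("host"), record["msg"]
```

Passing an open file object for the log (for example the one under scap/log) as `lines` would print-ready the host and message of each failed check; whether every log file follows this layout is an assumption.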

Not sure what to make of that. Is it transient? What should I do if I encounter something like that again? In this case, I just removed that target and redeployed, which doesn't seem great.

Can I try scap deploy -l wtp2017.codfw.wmnet now? (Assuming that does what I think it does.)

Sorry for the naiveté.

Details

	Subject	Repo	Branch	Lines +/-
	Use the new pooling and depooling scripts	mediawiki/services/parsoid/deploy	master	+5 -2


Related Objects

		Status	Subtype	Assigned	Task
		Resolved		• mobrovac	T149115 Deploy failed on wtp2017.codfw.wmnet
		Resolved		None	T149668 Smart-merge checks for different environments

Event Timeline

Arlolra created this task. Oct 25 2016, 8:29 PM

Restricted Application added a subscriber: Aklapper. Oct 25 2016, 8:29 PM

Arlolra added a subscriber: • mobrovac. Oct 25 2016, 8:30 PM

scap deploy-log -f scap/log/scap-sync-2016-10-24-0001.log

Will give you better error output. It looks like the repooling check failed on wtp2017:

20:29:29 [wtp2017.codfw.wmnet] Check 'repool' failed: Pooling wtp2017.codfw.wmnet from service=parsoid...
ERROR:conftool:Error when trying to set/pooled=yes on service=parsoid,name=wtp2017.codfw.wmnet
ERROR:conftool:Failure writing to the kvstore: Backend error: Raft Internal Error : etcdserver: request timed out

20:29:29 [mira] [u'/usr/bin/scap', u'deploy-local', u'-v', u'--repo', u'parsoid/deploy', u'-g', u'default4', u'promote', u'--refresh-config'] on wtp2017.codfw.wmnet returned [1]: 
20:29:29 [mira] 1 targets had deploy errors

I'm not sure how to troubleshoot that. It's failing on the command pool service=parsoid, if that's helpful (it's defined in scap/checks.yaml).

To deploy to only wtp2017, first: add wtp2017 back to the targets list (I saw it was removed) and then use: scap deploy -v -l wtp2017.codfw.wmnet

You may need to use the --force flag since the revision was deployed (unless it was rolled back), so: scap deploy -v --force -l wtp2017.codfw.wmnet

Feel free to ping me in IRC.

Hmm, so this looks like the new symptom of T145518

Arlolra mentioned this in T145518: Depool / repool scripts execute successfully even when the host has not been (r|d)epooled. Oct 25 2016, 9:50 PM

Arlolra edited projects, added SRE; removed Scap.

• mobrovac added a subtask: T149668: Smart-merge checks for different environments. Nov 1 2016, 7:45 AM

The problem here is the depooling / repooling scripts used during the deploy. As part of T145518: Depool / repool scripts execute successfully even when the host has not been (r|d)epooled, we have developed more robust scripts. These cannot be used yet because they do not exist in Beta, so using them would make all BetaCluster deploys fail. This task is therefore effectively blocked by T149668: Smart-merge checks for different environments, which ought to resolve the problem. In the meantime, I will upload an appropriate patch for the deploy repo, but let's not merge it until the blocker has been resolved.
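As an illustration only: one generic way a wrapper can tolerate a transient backend timeout like the etcd one above is to retry the command a few times before reporting failure. This sketch is not the content of the scripts developed in T145518; the name `run_with_retries` and its parameters are hypothetical, and the command to run is whatever the caller passes in.

```python
import subprocess
import time


def run_with_retries(cmd, attempts=3, delay=2.0,
                     runner=subprocess.run, sleep=time.sleep):
    """Run cmd until it exits 0 or attempts run out; return the last exit code.

    cmd is an argument list as accepted by subprocess.run. The runner and
    sleep parameters exist only so the function can be exercised without
    spawning processes or waiting.
    """
    returncode = 1  # reported if attempts is zero or negative
    for attempt in range(1, attempts + 1):
        returncode = runner(cmd).returncode
        if returncode == 0:
            break
        if attempt < attempts:
            sleep(delay)  # pause before the next try
    return returncode
```

A non-zero return value would still need to fail the check, so that a host which was never repooled is not reported as a success.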

• mobrovac triaged this task as High priority. Nov 1 2016, 7:51 AM

Change 319039 had a related patch set uploaded (by Mobrovac):
Use the new pooling and depooling scripts

https://gerrit.wikimedia.org/r/319039

gerritbot added a project: Patch-For-Review. Nov 1 2016, 7:51 AM

@mobrovac Can I dirty my local tree with those changes (command: depool-parsoid) when deploying tomorrow? Getting through a deploy cleanly hasn't been possible.

Yup, @Arlolra, that should be just fine. Just don't check out or cherry-pick that commit, as it will make Scap go crazy.

Well, anecdotally, that seemed to help. Unfortunately, it still didn't get through cleanly, and in my haste I failed to note the issue. It seems like scap-sync-2016-11-02-0001.log was overwritten when I reran scap deploy? (or, at least searching for "fail" isn't turning up anything)

thcipriani closed subtask T149668: Smart-merge checks for different environments as Resolved. Nov 10 2016, 8:45 PM

Change 319039 merged by Mobrovac:
Use the new pooling and depooling scripts

https://gerrit.wikimedia.org/r/319039

The new scripts will now officially be used, resolving.
