
Check 'depool' exceeded 30.0s timeout
Closed, Resolved (Public)

Description

I had a hard time deploying today. The first attempt failed with:

21:36:25 [wtp2003.codfw.wmnet] Check 'depool' exceeded 30.0s timeout
21:36:25 [wtp2020.codfw.wmnet] Check 'depool' exceeded 30.0s timeout
21:36:25 [tin] [u'/usr/bin/scap', u'deploy-local', u'-v', u'--repo', u'parsoid/deploy', u'--force', u'-g', u'default', u'promote', u'--refresh-config'] on wtp2003.codfw.wmnet returned [2]: 
21:36:25 [tin] [u'/usr/bin/scap', u'deploy-local', u'-v', u'--repo', u'parsoid/deploy', u'--force', u'-g', u'default', u'promote', u'--refresh-config'] on wtp2020.codfw.wmnet returned [2]: 
21:36:42 [tin] 2 targets had deploy errors
21:36:42 [tin] 2 targets failed
21:36:42 [tin] 2 of 40 default6 targets failed, exceeding limit

Two subsequent attempts failed at different points, but for the same reason. After that, I deployed to the remaining targets with scap deploy --force -l <node>.

Should we maybe increase the 30s timeout, or allow for a higher failure limit?

Event Timeline

thcipriani triaged this task as Medium priority. (Mar 7 2017, 2:04 PM)
thcipriani moved this task from Needs triage to Services improvements on the Scap board.
thcipriani subscribed.

Should we maybe increase the 30s timeout, or allow for a higher failure limit?

Scap supports both; either one takes only a small config change.

Command checks accept a timeout option (see the example at https://doc.wikimedia.org/mw-tools-scap/scap3/quickstart/setup.html#command-checks). In this instance, if depool is taking longer than 30 seconds, you can add timeout: 60 (or higher) to that check in parsoid/deploy/scap/checks.yaml:

/srv/deployment/parsoid/deploy/scap/checks.yaml
checks:
  depool:
    type: command
    stage: promote
    command: depool-parsoid
    timeout: 60
  repool:
    type: command
    stage: restart_service
    command: pool-parsoid
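
For readers wondering what the timeout actually guards against, here is a minimal Python sketch of a command check with a time budget. It is not scap's implementation: run_check and its message strings are hypothetical names that only loosely mirror the log lines above, and the slow command is a stand-in for something like depool-parsoid.

```python
import subprocess
import sys


def run_check(command, timeout=30.0):
    """Run a check command; return (ok, message).

    Illustrative only: scap's real check runner may behave differently.
    """
    try:
        result = subprocess.run(
            command, timeout=timeout, capture_output=True, text=True
        )
    except subprocess.TimeoutExpired:
        # The child is killed once the time budget is used up.
        return False, "exceeded {:.1f}s timeout".format(timeout)
    if result.returncode != 0:
        return False, "returned [{}]".format(result.returncode)
    return True, "ok"


if __name__ == "__main__":
    # Stand-in for a slow depool command: sleeps 2s against a 1s budget.
    slow = [sys.executable, "-c", "import time; time.sleep(2)"]
    print(run_check(slow, timeout=1.0))
```

Under this model, raising the timeout only helps if the command eventually finishes; a check that hangs indefinitely would still fail, just later.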

You can raise failure_limit or [group]_failure_limit in your scap.cfg, as documented here: https://doc.wikimedia.org/mw-tools-scap/scap3/repo_config.html#configuring-a-git-repo

So if you wanted to allow 5% of hosts to fail (2/40):

scap.cfg
[global]
git_repo: parsoid/deploy
git_deploy_dir: /srv/deployment
git_repo_user: deploy-service
ssh_user: deploy-service
server_groups: canary, default
canary_dsh_targets: target-canary
dsh_targets: targets
group_size: 7
git_submodules: True
service_name: parsoid
service_port: 8000
lock_file: /tmp/scap.parsoid.lock
config_deploy: True
failure_limit: 5%

[wmnet]
git_server: tin.eqiad.wmnet

[deployment-prep.eqiad.wmflabs]
environment: beta
git_server: deployment-tin.deployment-prep.eqiad.wmflabs
server_groups: default
dsh_targets: betacluster

The failure limit can also be given as an absolute number of hosts, and it can be set per group.
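
To make the percentage-versus-absolute distinction concrete, here is a rough Python sketch of how such a limit could be evaluated. It is an assumption, not scap's code: parse_limit and within_limit are hypothetical helpers, and scap's actual rounding and comparison may differ.

```python
def parse_limit(limit, total):
    """Convert a failure limit to a host count.

    Accepts a percentage string such as "5%" or an absolute number
    such as 2 or "2". Hypothetical helper, for illustration only.
    """
    text = str(limit).strip()
    if text.endswith("%"):
        return float(text[:-1]) * total / 100.0
    return float(text)


def within_limit(failed, total, limit):
    """Return True if the number of failed hosts does not exceed the limit.

    Assumes a failure count equal to the limit is still tolerated.
    """
    return failed <= parse_limit(limit, total)


if __name__ == "__main__":
    # 5% of 40 hosts is 2 hosts, matching the 2/40 example above.
    print(parse_limit("5%", 40))
    print(within_limit(2, 40, "5%"), within_limit(3, 40, "5%"))
```

A percentage scales with the size of the target list, while an absolute number stays fixed as hosts are added or removed, which is worth keeping in mind when choosing between them.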

Let me know if I can help further or review config changes.

@thcipriani Thanks for the pointers. We'll discuss what's appropriate as a team. Please have a look at D588 though, since I got:

21:40:51 [wtp1014.eqiad.wmnet] Check 'repool' exceeded 30.0s timeout
21:40:51 [tin] [u'/usr/bin/scap', u'deploy-local', u'-v', u'--repo', u'parsoid/deploy', u'--force', u'-g', u'default', u'promote', u'--refresh-config'] on wtp1014.eqiad.wmnet returned [2]: 
21:40:51 [tin] 1 targets had deploy errors
21:40:51 [tin] 1 targets failed
21:40:51 [tin] 1 of 40 default4 targets failed, exceeding limit

Change 341591 had a related patch set uploaded (by Arlolra):
[mediawiki/services/parsoid/deploy] T159387: Bump up check timeouts and ensure failure_limit

https://gerrit.wikimedia.org/r/341591

Change 341591 merged by jenkins-bot:
[mediawiki/services/parsoid/deploy] T159387: Bump up check timeouts and ensure failure_limit

https://gerrit.wikimedia.org/r/341591

Arlolra claimed this task.

Deploying went much more smoothly today. I assume someone will look at D588 eventually.