Check 'depool' failed while deploying
Closed, Resolved (Public)

Description

20:14:18 [wtp1002.eqiad.wmnet] Check 'depool' failed: WARNING:etcd.client:etcd response did not contain a cluster ID
ERROR:conftool:Error when trying to set/pooled=no on service=parsoid,name=wtp1002.eqiad.wmnet
ERROR:conftool:Failure writing to the kvstore: Backend error: The request requires user authentication : Insufficient credentials

See the paste: P6018

This is probably related to T172333 where keyholder_key: deploy_service is added.

T171506 has a similar error but ssh_user: deploy-service is already set here.

Event Timeline

ssastry added a project: SRE.
ssastry subscribed.

This is blocking deployments right now.

From the description: "This is probably related to T172333 where keyholder_key: deploy_service is added."

I have my doubts about this. The keyholder_key configuration value in scap refers to the SSH key used to log in to the target machines. If something were wrong with the keyholder_key value, you would be unable to log in to the remote machines at all. Since all the steps that run before the depool check succeed, you are clearly able to log in to the remote machines.

The specific error message in the paste output:

20:14:18 [wtp1002.eqiad.wmnet] Check 'depool' failed: WARNING:etcd.client:etcd response did not contain a cluster ID
ERROR:conftool:Error when trying to set/pooled=no on service=parsoid,name=wtp1002.eqiad.wmnet
ERROR:conftool:Failure writing to the kvstore: Backend error: The request requires user authentication : Insufficient credentials

I suspect something in either puppet or conftool has changed that prevents the deploy-service user from running the depool script (https://github.com/wikimedia/puppet/blob/production/modules/conftool/templates/depool.erb), although I'm not sure what has changed or where.
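The reasoning above (SSH login works, but the write to the kvstore is rejected) can be sketched as a small log classifier. This is only an illustration: `classify_failure` and the layer names are hypothetical and not part of scap or conftool, and the patterns are derived solely from the three lines quoted in this task.

```python
import re

# Hypothetical helper, not part of scap or conftool. The patterns below are
# taken only from the log lines quoted in this task.
CHECK_PATTERN = re.compile(r"\[(?P<host>[^\]]+)\] Check '(?P<check>[^']+)' failed")
AUTH_PATTERN = re.compile(r"Backend error: .*Insufficient credentials")


def classify_failure(output: str) -> dict:
    """Guess which layer a failed check died in, from its output text."""
    match = CHECK_PATTERN.search(output)
    result = {
        "host": match.group("host") if match else None,
        "check": match.group("check") if match else None,
        # A check that ran on the remote host implies the SSH login worked.
        "ssh_login_ok": match is not None,
        "layer": "unknown",
    }
    if AUTH_PATTERN.search(output):
        # The kvstore backend rejected the write for lack of credentials.
        result["layer"] = "kvstore-auth"
    return result


SAMPLE = (
    "20:14:18 [wtp1002.eqiad.wmnet] Check 'depool' failed: "
    "WARNING:etcd.client:etcd response did not contain a cluster ID\n"
    "ERROR:conftool:Error when trying to set/pooled=no on "
    "service=parsoid,name=wtp1002.eqiad.wmnet\n"
    "ERROR:conftool:Failure writing to the kvstore: Backend error: "
    "The request requires user authentication : Insufficient credentials\n"
)

report = classify_failure(SAMPLE)
print(report)
```

On the output pasted here, this points at the credentials used for the kvstore write rather than at the SSH key, which is consistent with keyholder_key not being the culprit.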

Change 378847 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] scap::conftool: fix home directory

https://gerrit.wikimedia.org/r/378847

This was caused by https://gerrit.wikimedia.org/r/#/c/365891/, yet another case of a labs-specific fix breaking production.

My current change should fix the situation, but we need to talk about the process that got that patch merged.
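For readers wondering how a home directory change could surface as "Insufficient credentials", here is a minimal sketch. It assumes, and this task does not confirm it, that the client looks for a per-user credentials file under the user's home directory; the `.etcdrc` filename and the `find_credentials` helper are illustrative assumptions, not taken from the patch.

```python
import os
import tempfile


def find_credentials(home, filename=".etcdrc"):
    """Return the path of a per-user credentials file under `home`, or None.

    Both the lookup and the default filename are assumptions for this sketch.
    """
    candidate = os.path.join(home, filename)
    return candidate if os.path.isfile(candidate) else None


with tempfile.TemporaryDirectory() as root:
    # Two hypothetical home directories for the same user.
    old_home = os.path.join(root, "old-home")
    new_home = os.path.join(root, "new-home")
    os.makedirs(old_home)
    os.makedirs(new_home)

    # The credentials file exists only under the original home directory.
    with open(os.path.join(old_home, ".etcdrc"), "w") as handle:
        handle.write("username: example\n")

    found_before = find_credentials(old_home)
    found_after = find_credentials(new_home)

print("before the home change:", found_before is not None)
print("after the home change:", found_after is not None)
```

Under that assumption, pointing the user at a different home directory leaves the credentials file behind, so requests go out unauthenticated even though nothing about the SSH setup has changed.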

Change 378847 merged by Giuseppe Lavagetto:
[operations/puppet@production] scap::conftool: fix home directory

https://gerrit.wikimedia.org/r/378847

mobrovac edited projects, added Services (watching); removed Patch-For-Review.

Confirmed to have fixed deployments on SCB, resolving. Thank you @Joe for the quick fix!