I have tried to run a deployment from deploy2002.codfw.wmnet for a repository that was recently introduced:
$ ssh deploy2002.codfw.wmnet
deploy2002$ cd /srv/deployment/releng/jenkins-deploy
deploy2002$ scap deploy --environment releasing -f --limit=releases2002.codfw.wmnet
20:30:36 Started deploy [releng/jenkins-deploy@0e465ac] (releasing)
20:30:36 Deploying Rev: HEAD = 0e465ac67daad0db80ece24145ae566379ad17e3
20:30:36 Started deploy [releng/jenkins-deploy@0e465ac] (releasing): (no justification provided)
20:30:36 == DEFAULT ==
:* releases2002.codfw.wmnet
20:30:36 ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'releng/jenkins-deploy', '--force', '-g', 'default', 'fetch', '--refresh-config'] (ran as deploy-jenkins@releases2002.codfw.wmnet) returned [255]: sign_and_send_pubkey: signing failed: agent refused operation
deploy-jenkins@releases2002.codfw.wmnet: Permission denied (publickey,keyboard-interactive).
20:30:36 connection to releases2002.codfw.wmnet failed and future stages will not be attempted for this target
20:30:36 releng/jenkins-deploy: fetch stage(s): 100% (in-flight: 0; ok: 0; fail: 1; left: 0)
20:30:36 1 targets had deploy errors
20:30:36 1 targets failed
20:30:36 1 of 1 default targets failed, exceeding limit
I then tried a manual ssh authentication, explicitly setting the identity file (otherwise ssh tries every key held in keyholder and eventually gets rejected after too many attempts):
deploy2002$ SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh -i /etc/keyholder.d/deploy_jenkins -l deploy-jenkins releases1002.eqiad.wmnet hostname -f
sign_and_send_pubkey: signing failed: agent refused operation
Received disconnect from 2620:0:861:107:10:64:48:17 port 22:2: Too many authentication failures
Disconnected from 2620:0:861:107:10:64:48:17 port 22
Compare with deploy1002:
$ SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh -i /etc/keyholder.d/deploy_jenkins -l deploy-jenkins releases1002.eqiad.wmnet hostname -f
releases1002.eqiad.wmnet
$
The agent does have the key armed:
deploy2002$ /usr/local/sbin/keyholder status
keyholder-agent: active
keyholder-proxy: active
...
- 256 xxx /etc/keyholder.d/deploy_jenkins (ED25519)
I believe the issue is that the proxy has not been restarted and thus does not take into account the group permission that has been applied since it last started:
deploy2002$ systemctl status keyholder-proxy.service
Active: active (running) since Thu 2022-09-29 14:30:03 UTC; 5 months 8 days ago
Whereas on the old primary it was restarted two days ago:
Active: active (running) since Mon 2023-03-06 09:50:45 UTC; 2 days ago
Thus I guess we need to restart the keyholder-proxy.service on deploy2002.
The releng/jenkins-deploy deployment repository has been added recently.
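For what it is worth, the "proxy started before the permission was applied" theory could be checked mechanically. The following is only a rough sketch: the function names `proxy_is_stale` and `check_keyholder_proxy` are made up for illustration, the file passed as argument is assumed to be whichever file carries the keyholder group permissions (I have not named it here on purpose), and it assumes GNU `date`/`stat` and a systemd recent enough to support `systemctl show --value`.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helpers, not part of keyholder or scap.

# Return success when the config changed after the service last started.
# Both arguments are Unix epoch seconds.
proxy_is_stale() {
    local started_at="$1" config_changed_at="$2"
    [ "$config_changed_at" -gt "$started_at" ]
}

# Compare the start time of keyholder-proxy.service with the mtime of a
# permissions file given as $1 (the path is an assumption left to the caller).
check_keyholder_proxy() {
    local auth_config="$1"
    local started changed
    # ActiveEnterTimestamp is printed like "Thu 2022-09-29 14:30:03 UTC";
    # GNU date is assumed to parse that format.
    started=$(date -d "$(systemctl show -p ActiveEnterTimestamp --value keyholder-proxy.service)" +%s)
    changed=$(stat -c %Y "$auth_config")
    if proxy_is_stale "$started" "$changed"; then
        echo "keyholder-proxy started before $auth_config changed: restart needed"
    else
        echo "keyholder-proxy is newer than $auth_config"
    fi
}
```

Running `check_keyholder_proxy <permissions file>` on deploy2002 and on deploy1002 should then give different answers if the guess above is right. File mtime is only a proxy for "when the permission was applied", so a "newer" result would not fully rule the theory out.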