
scap cannot ssh with keyholder on deploy2002
Closed, Resolved, Public

Description

I have tried to run a deployment from deploy2002.codfw.wmnet for a repository that was recently introduced:

$ ssh deploy2002.codfw.wmnet
deploy2002$ cd /srv/deployment/releng/jenkins-deploy
deploy2002$ scap deploy --environment releasing -f --limit=releases2002.codfw.wmnet
20:30:36 Started deploy [releng/jenkins-deploy@0e465ac] (releasing)
20:30:36 Deploying Rev: HEAD = 0e465ac67daad0db80ece24145ae566379ad17e3
20:30:36 Started deploy [releng/jenkins-deploy@0e465ac] (releasing): (no justification provided)
20:30:36 
== DEFAULT ==
:* releases2002.codfw.wmnet
20:30:36 ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'releng/jenkins-deploy', '--force', '-g', 'default', 'fetch', '--refresh-config'] (ran as deploy-jenkins@releases2002.codfw.wmnet) returned [255]: sign_and_send_pubkey: signing failed: agent refused operation
deploy-jenkins@releases2002.codfw.wmnet: Permission denied (publickey,keyboard-interactive).

20:30:36 connection to releases2002.codfw.wmnet failed and future stages will not be attempted for this target
20:30:36 releng/jenkins-deploy: fetch stage(s): 100% (in-flight: 0; ok: 0; fail: 1; left: 0) 
20:30:36 1 targets had deploy errors
20:30:36 1 targets failed
20:30:36 1 of 1 default targets failed, exceeding limit

I then tried a manual ssh authentication, explicitly setting the identity file (otherwise ssh tries every key held in keyholder and eventually gets rejected after too many attempts):

deploy2002$ SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh -i /etc/keyholder.d/deploy_jenkins -l deploy-jenkins releases1002.eqiad.wmnet hostname -f
sign_and_send_pubkey: signing failed: agent refused operation
Received disconnect from 2620:0:861:107:10:64:48:17 port 22:2: Too many authentication failures
Disconnected from 2620:0:861:107:10:64:48:17 port 22

Compare with deploy1002:

$ SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh -i /etc/keyholder.d/deploy_jenkins -l deploy-jenkins releases1002.eqiad.wmnet hostname -f
releases1002.eqiad.wmnet
$

The agent does have the key armed:

deploy2002$ /usr/local/sbin/keyholder status
keyholder-agent: active
keyholder-proxy: active
...
- 256 xxx /etc/keyholder.d/deploy_jenkins (ED25519)

I believe the issue is that the proxy has not been restarted, and thus does not take into account the group permission that has been applied since it last started:

deploy2002$ systemctl status keyholder-proxy.service
   Active: active (running) since Thu 2022-09-29 14:30:03 UTC; 5 months 8 days ago

Whereas on the old primary it was restarted two days ago:

Active: active (running) since Mon 2023-03-06 09:50:45 UTC; 2 days ago

Thus I guess we need to restart the keyholder-proxy.service on deploy2002.
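
The comparison above (service start time versus the time of the config change) can be sketched as a small shell helper. This is only an illustration, not part of keyholder or scap: the function name `started_before_change` and the `CONFIG_FILE` variable are hypothetical, and the task does not name which permission file was changed.

```shell
#!/bin/sh
# Sketch only: report whether a long-running service was started before a
# config change, in which case it may still be running with the old config.
# Both arguments are Unix epoch seconds.
# started_before_change is a hypothetical helper, not a keyholder command.
started_before_change() {
    service_start=$1
    config_mtime=$2
    # True (exit 0) only when the service start predates the config change.
    [ "$service_start" -lt "$config_mtime" ]
}

# Possible usage on the deployment host (assumes GNU date and stat;
# CONFIG_FILE is a placeholder for whichever permission file changed):
#   start=$(date -d "$(systemctl show -p ActiveEnterTimestamp --value keyholder-proxy.service)" +%s)
#   mtime=$(stat -c %Y "$CONFIG_FILE")
#   started_before_change "$start" "$mtime" && echo "proxy predates config change"
```

If the helper reports that the proxy predates the change, restarting keyholder-proxy.service is the candidate fix, as proposed above.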

The releng/jenkins-deploy deployment repository has been added recently.

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2023-03-08T20:41:42Z] <mutante> deploy2002 - systemctl restart keyholder-proxy.service to fix T331568 - after this SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh -i /etc/keyholder.d/deploy_jenkins -l deploy-jenkins releases1002.eqiad.wmnet works

Dzahn claimed this task.
Dzahn added a subscriber: Dzahn.

Should be fixed by restarting the proxy as you suggested. The test below works:

[deploy2002:~] $ SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh -i /etc/keyholder.d/deploy_jenkins -l deploy-jenkins releases1002.eqiad.wmnet
Linux releases1002 4.19.0-22-amd64 #1 SMP Debian 4.19.260-1 (2022-09-29) x86_64
Debian GNU/Linux 10 (buster)
Netbox Status: active
releases1002 is a Wikimedia Software Releases Server (releases)

Feel free to reopen if not, of course.

hashar added a subscriber: Clement_Goubert.

I have confirmed it works. I am reopening so that the Datacenter-Switchover documentation gets updated to note that keyholder-proxy.service has to be restarted to take into account the latest config. I guess I can pair on that with @claime :)

Awesome, thank you!