
scap deploy-promote fails on git push
Closed, Resolved | Public | BUG REPORT

Description

Running scap deploy-promote on deploy1002 fails on attempting a git push of the wikiversions patch:

Trace:

19:19:07 Unhandled error: 
Traceback (most recent call last):
  File "/home/brennen/scap-checkout/scap/cli.py", line 418, in run                                                                               
    exit_status = app.main(app.extra_arguments)
  File "/home/brennen/scap-checkout/scap/deploy_promote.py", line 103, in main                                                                   
    self._update_versions()                                      
  File "/home/brennen/scap-checkout/scap/deploy_promote.py", line 150, in _update_versions                                                       
    self._create_version_update_patch()
  File "/home/brennen/scap-checkout/scap/deploy_promote.py", line 157, in _create_version_update_patch                                           
    self._push_patch()                                               
  File "/home/brennen/scap-checkout/scap/deploy_promote.py", line 185, in _push_patch                                                            
    gitcmd("push", "origin", "HEAD:%s" % self._get_git_push_dest())  
  File "/home/brennen/scap-checkout/scap/runcmd.py", line 91, in gitcmd
    return _runcmd(["git", subcommand] + list(args), **kwargs)       
  File "/home/brennen/scap-checkout/scap/runcmd.py", line 78, in _runcmd
    raise FailedCommand(argv, p.returncode, stdout, stderr)          
scap.runcmd.FailedCommand: Command 'git push origin HEAD:refs/for/master%topic=1.39.0-wmf.4,l=Code-Review+2' failed with exit code 128;          
stdout:                                                  
                             
stderr:                                                   
Received disconnect from 2620:0:861:2:208:80:154:137 port 29418:2: Too many authentication failures: 7                                           
Disconnected from 2620:0:861:2:208:80:154:137 port 29418
fatal: Could not read from remote repository.               
                                                  
Please make sure you have the correct access rights
and the repository exists.                                                            
                                                      
19:19:07 deploy-promote failed: <FailedCommand> Command 'git push origin HEAD:refs/for/master%topic=1.39.0-wmf.4,l=Code-Review+2' failed with exit code 128;                                        
stdout:                                                         
                    
stderr:        
Received disconnect from 2620:0:861:2:208:80:154:137 port 29418:2: Too many authentication failures: 7                                           
Disconnected from 2620:0:861:2:208:80:154:137 port 29418                                
fatal: Could not read from remote repository.      
                                                                                    
Please make sure you have the correct access rights
and the repository exists.
                                  
19:19:07 brennen@deploy1002 /srv/mediawiki-staging (master u+1) $ which scap                                                                     
/home/brennen/scap-checkout/bin/scap

Event Timeline

brennen moved this task from Backlog to Radar on the User-brennen board.

@jnuche and I encountered the issue yesterday when he tried scap deploy-promote, and we did some more debugging this morning. I think I have found an explanation.

TLDR: scap deploy-promote pushes to Gerrit using the deployment server keyholder, which does not have our user ssh keys loaded. We should teach scap to push using our own key, or have a key added to keyholder for pushing to Gerrit.

Context

/srv/mediawiki-staging is a clone of operations/mediawiki-config created by scap prep.

The remote is set to https://gerrit.wikimedia.org/r/operations/mediawiki-config

To promote wikis we craft a patch for wikiversions.json and push it to Gerrit as our own user. We do not want to push over HTTPS, since that would require each of us to keep a clear text API token in our home directory. Instead we require the push to happen over ssh using a local ssh key protected by a passphrase. It is not ideal, but it at least prevents clear text credentials.

Since the repo remote is set to HTTPS, we use a git configuration to rewrite the URL when pushing:

$HOME/.gitconfig
[url "ssh://<SHELL USERNAME>@gerrit.wikimedia.org:29418"]
	pushInsteadOf = https://gerrit.wikimedia.org/r
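
To make the effect of that setting concrete, here is a minimal Python sketch that emulates the prefix rewrite git applies for pushInsteadOf (when several prefixes match, git uses the longest one). The function name rewrite_push_url and the username "jdoe" are placeholders for illustration, not part of scap or git:

```python
def rewrite_push_url(url, rules):
    """Emulate git's url.<base>.pushInsteadOf: the longest matching prefix wins."""
    best_base, best_prefix = None, ""
    for base, prefix in rules.items():
        if url.startswith(prefix) and len(prefix) > len(best_prefix):
            best_base, best_prefix = base, prefix
    if best_base is None:
        # No rule matches: git pushes to the URL unchanged
        return url
    return best_base + url[len(best_prefix):]


# "jdoe" stands in for <SHELL USERNAME>
rules = {"ssh://jdoe@gerrit.wikimedia.org:29418": "https://gerrit.wikimedia.org/r"}
fetch_url = "https://gerrit.wikimedia.org/r/operations/mediawiki-config"
print(rewrite_push_url(fetch_url, rules))
# ssh://jdoe@gerrit.wikimedia.org:29418/operations/mediawiki-config
```

Fetches keep using the HTTPS URL; only pushes are redirected to the ssh endpoint.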

Issue

Running scap deploy-promote invokes git push which definitely pushes over ssh. That is confirmed by:

  1. the error message which has the Gerrit ssh port 29418:
Received disconnect from 2620:0:861:2:208:80:154:137 port 29418:2: Too many authentication failures: 7
  2. @jnuche running scap deploy-promote with GIT_TRACE=1, which causes git to output to stderr the low level command used to send the patch. Something such as:
ssh -p 29418 jnuche@gerrit.wikimedia.org 'git receive-pack operations/mediawiki-config'

Running the command directly from the terminal does work. ssh is most probably relying on the user-set SSH_AUTH_SOCK environment variable, which points to a socket created when running eval $(ssh-agent).

Explanation

When running a scap CLI command, we set up environment variables for any command being run.

scap/cli.py
    def _setup_environ(self):
        """Setup shell environment."""
        auth_sock = self.config.get("ssh_auth_sock")
        php_version = self.config.get("php_version")
        if php_version is not None:
            os.environ["PHP"] = php_version
        if auth_sock is not None and self.arguments.shared_authsock:
            os.environ["SSH_AUTH_SOCK"] = auth_sock

Since they are set in os.environ, those variables are inherited by every subprocess.Popen call unless the call passes an explicit environment. Thus when scap deploy-promote invokes git push, the SSH_AUTH_SOCK environment variable is passed along.
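
The inheritance can be checked in isolation. The following is a small standalone sketch, not scap code: it sets SSH_AUTH_SOCK in os.environ the way _setup_environ does, then compares what a child process sees with and without an explicit env. The helper child_sock and the path /tmp/user-agent.sock are made up for the demonstration:

```python
import os
import subprocess
import sys

# Child command that prints the SSH_AUTH_SOCK it sees
SHOW_SOCK = [sys.executable, "-c", "import os; print(os.environ.get('SSH_AUTH_SOCK', ''))"]


def child_sock(env=None):
    """Return the SSH_AUTH_SOCK value visible to a child process."""
    result = subprocess.run(SHOW_SOCK, env=env, capture_output=True, text=True, check=True)
    return result.stdout.strip()


# Same effect as _setup_environ: overwrite the process-wide value
os.environ["SSH_AUTH_SOCK"] = "/run/keyholder/proxy.sock"
inherited = child_sock()  # no env= given, so the child inherits os.environ

# An explicit env= replaces the inherited environment for that child only
user_env = dict(os.environ, SSH_AUTH_SOCK="/tmp/user-agent.sock")  # hypothetical path
overridden = child_sock(env=user_env)

print(inherited)   # /run/keyholder/proxy.sock
print(overridden)  # /tmp/user-agent.sock
```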

What is SSH_AUTH_SOCK set to, you might ask? From the code above, its value comes from the scap configuration ssh_auth_sock, which is set in Puppet:

modules/scap/templates/scap.cfg.erb:38:ssh_auth_sock: /run/keyholder/proxy.sock

scap deploy-promote thus invokes git, which invokes ssh set up to use the deployment server keyholder. None of the keys held in keyholder are attached to our personal user accounts in our Gerrit preferences. Since none of the keys match, Gerrit rejects the ssh connection:

grep jnuche /var/log/gerrit/sshd_log
[2022-04-06T08:30:32.167Z] 5376e704 [SSHD] jnuche - AUTH FAILURE FROM 2620:0:861:103:10:64:32:28 no-matching-key

But ssh auth works when running from the user terminal (outside of scap):

[2022-04-06T08:32:13.372Z] 7a02e0c6 [sshd-SshDaemon[333cedbd](port=22)-nio2-thread-3] jnuche a/9885 LOGIN FROM 2620:0:861:103:10:64:32:28

Solutions

SSH_AUTH_SOCK can only be set to a single path (unlike PATH), so we cannot inject the user ssh-agent socket on top of the keyholder socket. What we could do though:

short term

A) In _setup_environ above, when the user already has an SSH_AUTH_SOCK set, copy it to USER_SSH_AUTH_SOCK and use that when pushing to Gerrit:

scap/deploy_promote.py
      def _push_patch(self):
-        gitcmd("push", "origin", "HEAD:%s" % self._get_git_push_dest())
+        user_env = {}
+        user_env.update(os.environ)
+        # To push to Gerrit, use deployer agent rather than keyholder
+        user_env['SSH_AUTH_SOCK'] = os.environ['USER_SSH_AUTH_SOCK']
+        # Pass the modified environment (assumes gitcmd forwards env to the subprocess)
+        gitcmd("push", "origin", "HEAD:%s" % self._get_git_push_dest(), env=user_env)

When the user has no SSH_AUTH_SOCK set, we should not set it for git push; ssh would then use whatever key is in $HOME/.ssh and prompt for the passphrase (I think).
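
Both cases (user agent present or absent) could be handled along these lines. This is only a sketch of the idea, not the actual scap change: the helpers remember_user_sock and build_push_env are hypothetical names, and the dictionaries stand in for os.environ so the logic can be exercised without touching the real environment:

```python
def remember_user_sock(environ, keyholder_sock):
    """Save the caller's agent socket before pointing SSH_AUTH_SOCK at keyholder."""
    if "SSH_AUTH_SOCK" in environ:
        environ["USER_SSH_AUTH_SOCK"] = environ["SSH_AUTH_SOCK"]
    environ["SSH_AUTH_SOCK"] = keyholder_sock


def build_push_env(environ):
    """Return a copy of environ suitable for a git push as the deployer."""
    env = dict(environ)
    user_sock = env.pop("USER_SSH_AUTH_SOCK", None)
    if user_sock:
        # Use the deployer's own agent rather than keyholder
        env["SSH_AUTH_SOCK"] = user_sock
    else:
        # No user agent: drop the keyholder socket so ssh falls back to keys in ~/.ssh
        env.pop("SSH_AUTH_SOCK", None)
    return env


# Example socket path is made up
with_agent = {"SSH_AUTH_SOCK": "/tmp/ssh-abc/agent.123"}
remember_user_sock(with_agent, "/run/keyholder/proxy.sock")
print(build_push_env(with_agent)["SSH_AUTH_SOCK"])  # /tmp/ssh-abc/agent.123

without_agent = {}
remember_user_sock(without_agent, "/run/keyholder/proxy.sock")
print("SSH_AUTH_SOCK" in build_push_env(without_agent))  # False
```

Building a copy keeps the keyholder socket in place for every other command scap runs; only the push to Gerrit sees the user's agent.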

long term

Introduce a service (bot) user in Gerrit which is intended to push the wikiversions.json patch. Maybe we could reuse the account used by the train branch bot.

Generate an ssh key pair which is held in the Puppet/SRE repository holding secrets.

Load that key into the deployment server keyholder.

Change scap to git push for review as the train branch bot user. It will need the Gerrit permission Forge Author Identity since the commit author is ourselves rather than the train bot. The git push command in scap/deploy_promote.py could use the insteadOf trick:

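gitcmd(
  "-c",
  # No inner quotes: the argument reaches git without going through a shell
  "url.ssh://<SHELL USERNAME>@gerrit.wikimedia.org:29418.pushInsteadOf=https://gerrit.wikimedia.org/r",
  "push",
  "origin",
  "HEAD:%s" % self._get_git_push_dest(),
)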

Or alternatively when scap prep clones operations/mediawiki-config we could set the push url:

scap/plugins/prep.py
                          self._clone_or_update_repo(os.path.join(SOURCE_URL, "operations/mediawiki-config"),
                                                   self.config["operations_mediawiki_config_branch"],
                                                   self.config["stage_dir"],
                                                   logger,
                                                   )
+                         gitcmd("remote", "set-url", "--push", "origin", "ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config")

Which has the advantage that if one has to push from their terminal, the push url is correct. That is typically the case when doing a rollback since we do:

git revert HEAD
scap <whatever>
git push origin HEAD:refs/for/master

Change 777812 had a related patch set uploaded (by Jaime Nuche; author: Jaime Nuche):

[mediawiki/tools/scap@master] deploy-promote: create wikiversions update patch using user ssh-agent

https://gerrit.wikimedia.org/r/777812

Change 777812 merged by jenkins-bot:

[mediawiki/tools/scap@master] deploy-promote: create wikiversions update patch using user ssh-agent

https://gerrit.wikimedia.org/r/777812

And that was a quick fix. Well done!

thcipriani triaged this task as Medium priority.