
SCAP fails with Permission denied (publickey)
Closed, Resolved (Public)

Description

(Not an unbreak now because the code to be deployed is not critical)

While deploying a MySQL configuration change prompted by a hardware failure (T103230), the sync failed on all proxies and nodes:

root@tin:/srv/mediawiki-staging$ sync-file wmf-config/db-eqiad.php "Depool db1028, return ES servers back from maintenance"
...
09:28:41 ['/srv/deployment/scap/scap/bin/sync-common', '--no-update-l10n', '--include', 'wmf-config', '--include', 'wmf-config/db-eqiad.php', 'mw1010.eqiad.wmnet', 'mw1033.eqiad.wmnet', 'mw1070.eqiad.wmnet', 'mw1097.eqiad.wmnet', 'mw1216.eqiad.wmnet', 'mw1161.eqiad.wmnet', 'mw1201.eqiad.wmnet', 'mw2001.codfw.wmnet', 'mw2041.codfw.wmnet', 'mw2080.codfw.wmnet', 'mw2119.codfw.wmnet', 'mw2187.codfw.wmnet'] on mw2136.codfw.wmnet returned [255]: Permission denied (publickey).

sync-common: 100% (ok: 0; fail: 466; left: 0)                                   
09:28:41 466 apaches had sync errors
09:28:41 Finished sync-apaches (duration: 00m 03s)
09:28:41 Synchronized wmf-config/db-eqiad.php: Depool db1028, return ES servers back from maintenance (duration: 00m 03s)
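Note that the final "Synchronized" line is printed even though all 466 hosts failed, so the summary line is the one to check. A minimal sketch for pulling the failure count out of saved output, assuming the summary format shown above stays the same (`scap_fail_count` is a hypothetical helper, not part of scap):

```shell
#!/bin/sh
# Hypothetical helper: read scap output on stdin and print the "fail" count
# from the last summary line of the form
#   sync-common: 100% (ok: 0; fail: 466; left: 0)
scap_fail_count() {
    sed -n 's/.*fail: \([0-9][0-9]*\);.*/\1/p' | tail -n 1
}

# Example using the summary line from this task:
echo 'sync-common: 100% (ok: 0; fail: 466; left: 0)' | scap_fail_count
```

A non-zero count here means the deployment did not reach those hosts, whatever the last log line says.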

This was a hardware maintenance deployment, and I can live without it because MediaWiki detects the failed server and depools it automatically, but in other cases this would be an unbreak now.

I remember bblack talking about deployment key changes, probably related.
Related: T110791 T110793

Event Timeline

jcrespo raised the priority of this task to High.
jcrespo updated the task description.
jcrespo subscribed.

Keyholder issue?

krenair@tin:~$ SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh mwdeploy@mw2001
Permission denied (publickey).

Yeah, Icinga has been showing this for tin's keyholder service for the past 14 hours: CRITICAL: Keyholder is not armed. Run 'keyholder arm' to arm it.
This needs ops to fix.

It seems someone with root has to arm it with:

sudo -u keyholder env SSH_AUTH_SOCK=/run/keyholder/agent.sock ssh-add /etc/keyholder.d/mwdeploy_rsa
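To confirm whether the agent actually holds a key before or after arming, `ssh-add -l` can be pointed at the same socket. A rough sketch, assuming the socket path from the command above and the exit statuses documented in ssh-add(1); `describe_agent_status` is a hypothetical helper, not part of keyholder:

```shell
#!/bin/sh
# Hypothetical helper: translate the exit status of `ssh-add -l` into a message.
# Per ssh-add(1): 0 = success (identities listed), 1 = command failed
# (for -l, the agent has no identities), 2 = agent could not be contacted.
describe_agent_status() {
    case "$1" in
        0) echo "armed" ;;
        1) echo "not armed" ;;
        2) echo "agent unreachable" ;;
        *) echo "unknown status $1" ;;
    esac
}

# Usage on the deployment host (assumes the same user and socket as the
# arm command above; not run here because it needs root):
#   sudo -u keyholder env SSH_AUTH_SOCK=/run/keyholder/agent.sock ssh-add -l >/dev/null 2>&1
#   describe_agent_status "$?"
```

With the agent in the state described in this task, the expected result would be "not armed".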

Change 234955 had a related patch set uploaded (by Ori.livneh):
ssh-agent-proxy: break out of select loop once client is done

https://gerrit.wikimedia.org/r/234955

Change 234955 merged by Ori.livneh:
ssh-agent-proxy: break out of select loop once client is done

https://gerrit.wikimedia.org/r/234955

ori claimed this task.

@jcrespo, I'll update the docs.