Beta cluster hasn’t updated since ca. 2019-10-15 21:00 UTC
Closed, Resolved, Public

Description

beta-scap-eqiad is “Waiting for next available executor on ‘deployment-deploy01’” (but none of the available executors have the required “BetaClusterBastion” label), and its last build finished 19 hours ago. beta-mediawiki-config-update-eqiad is blocked on beta-scap-eqiad, which shows up as some long-delayed builds in Zuul:

Screenshot_2019-10-16 Zuul Status - Integration.png (426×390 px, 34 KB)

Event Timeline

hashar claimed this task.
hashar subscribed.

Fixed it on the spot. I canceled the queued builds in Jenkins, which eventually unblocked whatever deadlock had occurred.

Mentioned in SAL (#wikimedia-releng) [2019-10-16T17:44:30Z] <James_F> Marking deployment-deplog01 offline temporarily for T235674

Jobs populating correctly, but failing with:

17:49:35 sudo -u mwdeploy -n -- /usr/bin/scap cdb-rebuild on deployment-mediawiki-09.deployment-prep.eqiad.wmflabs returned [255]: Permission denied (publickey).

Jdforrester-WMF reassigned this task from hashar to Krenair.
Jdforrester-WMF added a subscriber: Krenair.

Fixed by @Krenair re-doing the keyholder configuration.

elukey subscribed.

I am currently failing to deploy to deployment-eventlog05:

elukey@deployment-deploy01:/srv/deployment/eventlogging/analytics$ scap deploy -e beta
16:09:01 Started deploy [eventlogging/analytics@a69acbe] (beta)
16:09:01 Deploying Rev: HEAD = a69acbe2b155ee6cdd0056db81ae7685430e5072
16:09:01 Started deploy [eventlogging/analytics@a69acbe] (beta): (no justification provided)
16:09:01
== DEFAULT ==
:* deployment-eventlog05.deployment-prep.eqiad.wmflabs
16:09:02 ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'eventlogging/analytics', '-g', 'default', 'fetch', '--refresh-config'] on deployment-eventlog05.deployment-prep.eqiad.wmflabs returned [255]: Permission denied (publickey).

16:09:02 connection to deployment-eventlog05.deployment-prep.eqiad.wmflabs failed and future stages will not be attempted for this target
eventlogging/analytics: fetch stage(s): 100% (ok: 0; fail: 1; left: 0)
16:09:02 1 targets had deploy errors
16:09:02 1 targets failed
16:09:02 1 of 1 default targets failed, exceeding limit
Rollback all deployed groups? [Y/n]: n
16:09:04 Finished deploy [eventlogging/analytics@a69acbe] (beta): (no justification provided) (duration: 00m 02s)
16:09:04 Finished deploy [eventlogging/analytics@a69acbe] (beta) (duration: 00m 02s)

And a plain ssh using the keyholder SSH_AUTH_SOCK returns the same:

elukey@deployment-deploy01:~$ SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh -l analytics_deploy  deployment-eventlog05.deployment-prep.eqiad.wmflabs
Permission denied (publickey).

Anything that I can do to unblock?

Sounds like Keyholder is misbehaving again:

/etc/keyholder.d/eventlogging is not an acceptable key. Is it an RSA or ED25519 key with passphrase?
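The message above hinges on whether the private key carries a passphrase. A minimal sketch for checking that from the shell, assuming a stock OpenSSH ssh-keygen; key_is_encrypted is a hypothetical helper name and is not part of keyholder:

```shell
#!/bin/sh
# Sketch: report whether a private key file is passphrase-protected.
# key_is_encrypted is a hypothetical helper, not a keyholder command.
key_is_encrypted() {
    # An unreadable or missing file cannot be classified.
    [ -r "$1" ] || return 2
    # "ssh-keygen -y" derives the public key from the private key.
    # With an empty passphrase (-P "") it only succeeds for an unencrypted key.
    if ssh-keygen -y -P "" -f "$1" </dev/null >/dev/null 2>&1; then
        return 1    # loaded with an empty passphrase: not encrypted
    fi
    return 0        # could not be loaded without a passphrase
}

# Classify every key file passed on the command line.
for key in "$@"; do
    if key_is_encrypted "$key"; then
        echo "$key: passphrase-protected"
    else
        echo "$key: unencrypted or unreadable"
    fi
done
```

Run it as root with the key path as the argument, for example /etc/keyholder.d/eventlogging. Note that ssh-keygen also fails for malformed keys or overly open file permissions, so a "passphrase-protected" result is only a hint.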

Mentioned in SAL (#wikimedia-releng) [2019-10-17T20:48:03Z] <hauskater> Arm eventlogging key via keyholder for beta cluster following the instructions on Wikitech for T235674

@elukey It looks like the above worked:

maurelio@deployment-deploy01:/srv/deployment/eventlogging/analytics$ scap deploy -e beta
20:52:15 Started deploy [eventlogging/analytics@a69acbe] (beta)
20:52:15 Deploying Rev: HEAD = a69acbe2b155ee6cdd0056db81ae7685430e5072
20:52:15 Started deploy [eventlogging/analytics@a69acbe] (beta): (no justification provided)
20:52:15
== DEFAULT ==
:* deployment-eventlog05.deployment-prep.eqiad.wmflabs
eventlogging/analytics: fetch stage(s): 100% (ok: 1; fail: 0; left: 0)
eventlogging/analytics: config_deploy stage(s): 100% (ok: 1; fail: 0; left: 0)
eventlogging/analytics: promote stage(s): 100% (ok: 1; fail: 0; left: 0)
20:52:18
== DEFAULT ==
:* deployment-eventlog05.deployment-prep.eqiad.wmflabs
eventlogging/analytics: finalize stage(s): 100% (ok: 1; fail: 0; left: 0)
20:52:19 Finished deploy [eventlogging/analytics@a69acbe] (beta): (no justification provided) (duration: 00m 03s)
20:52:19 Finished deploy [eventlogging/analytics@a69acbe] (beta) (duration: 00m 03s)

Summary of findings and actions re. eventlogging (mwdeploy was fixed by @Krenair yesterday):

I'd first like to thank @Dzahn for helping me through the process. This was caused by the /etc/keyholder.d/eventlogging private key not having a passphrase set; keyholder arm seems to refuse keys that are not passphrase-protected. The solution was to run sudo ssh-keygen -p -f eventlogging and set a passphrase for the key, then run sudo keyholder arm and, when prompted by the keyholder service, enter the passphrase for eventlogging. That made keyholder happy, as seen in the output above (scap worked this time).

However, puppet reverted the addition of the passphrase to the /etc/keyholder.d/eventlogging key, so @Dzahn suggested that I add one in the labs/private repo so it does not get erased, and thus we don't have to repeat the same process whenever the keyholder service is restarted.

I must note that this same problem exists for all keys in /etc/keyholder.d, so I wonder whether we should be doing this for all keys listed there, or just for this keypair. Pinging @thcipriani about this, given that I've seen SAL entries from him dealing with this kind of thing in the past.
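The manual fix described above can be sketched on a scratch key. This is an illustration only: the temporary directory, the key type and the passphrase are placeholders, and the keyholder arm step is left as a comment because it only exists on the deployment host:

```shell
#!/bin/sh
# Sketch of the manual fix, run against a throwaway key.
set -eu

workdir=$(mktemp -d)
key="$workdir/eventlogging"          # stand-in for /etc/keyholder.d/eventlogging
new_pass='example-passphrase'        # placeholder, not a real credential

# Stand-in for the unencrypted key that had been deployed.
ssh-keygen -q -t ed25519 -N "" -f "$key"

# Add a passphrase in place: -P is the old (empty) passphrase, -N the new one.
ssh-keygen -q -p -P "" -N "$new_pass" -f "$key"

# On the deployment host the next step was:
#   sudo keyholder arm
# which prompts for the passphrase (not run in this sketch).

# Verify: the key must no longer load with an empty passphrase.
if ssh-keygen -y -P "" -f "$key" </dev/null >/dev/null 2>&1; then
    status="unencrypted"
else
    status="encrypted"
fi
echo "$status"

rm -rf "$workdir"
```

As noted in the comment above, puppet overwrites a key changed this way on its next run, so this only helps until the managed copy is updated.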

IIRC keyholder has a feature flag for whether or not to require passwords for ssh keys. Maybe that flag became unset for beta somehow?

I found the thing I half remembered:

thcipriani@deployment-deploy01:~$ sudo cat /etc/keyholder-auth.d/keyholder.conf 
REQUIRE_ENCRYPTED_KEYS='yes'
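To compare this setting across hosts, a small check along these lines may help. It is a sketch: requires_encrypted_keys is a hypothetical helper, and it assumes the file uses plain KEY='value' lines as shown above and that 'yes' means the flag is on. How keyholder itself parses the file is not shown in this task.

```shell
#!/bin/sh
# Sketch: report whether REQUIRE_ENCRYPTED_KEYS is set to 'yes' in a config file.
# requires_encrypted_keys is a hypothetical helper, not a keyholder command.
requires_encrypted_keys() {
    conf=${1:-/etc/keyholder-auth.d/keyholder.conf}
    [ -r "$conf" ] || return 2
    # Take the last assignment in the file and strip optional quotes.
    value=$(sed -n "s/^REQUIRE_ENCRYPTED_KEYS=['\"]\{0,1\}\([^'\"]*\)['\"]\{0,1\}[[:space:]]*$/\1/p" "$conf" | tail -n 1)
    [ "$value" = "yes" ]
}
```

The function returns 0 when the flag is 'yes', 1 when it is anything else or absent, and 2 when the file cannot be read (on the deployment host it needs root, as in the sudo cat above).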

Just need to figure out what bit of puppet controls that.

None of the proper keys are unencrypted IIRC. I think the allowing-unencrypted-keys thing is a distraction from the real problem.

Change 544064 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[operations/puppet@production] beta: keyholder: don't require encrypted keys

https://gerrit.wikimedia.org/r/544064

Change 544064 abandoned by Thcipriani:
beta: keyholder: don't require encrypted keys

https://gerrit.wikimedia.org/r/544064

Is this fixed? Are the keys we use encrypted now?

I don't think we figured out *why* exactly the unencrypted snakeoil keys got deployed everywhere, but we sorted out the problem that this task was for.