
Renew certs for mcrouter on all application servers.
Closed, Resolved; Public

Description

Quite similar to T221346 (Renew certs for mcrouter on all application servers): mw and wtp hosts in eqiad/codfw have cert expiration warnings, a mixture of:

MCROUTERCERTVERIFICATION WARNING - days_left_to_ca_expiration is 60 (outside range @~:60)
MCROUTERCERTVERIFICATION WARNING - days_left_to_client_cert_expiration is 60 (outside range @~:60)
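For reference, a minimal sketch of how a threshold like `@~:60` reads, assuming the check follows the standard Nagios plugin range syntax (`@` means "alert when the value is inside the range", `~` means negative infinity). The function names `parse_range` and `alerts` are made up for illustration and are not part of the actual check.

```python
import math


def parse_range(spec):
    """Parse a Nagios-style range such as '@~:60' into (inside, low, high)."""
    inside = spec.startswith("@")  # '@' inverts the check: alert when inside
    if inside:
        spec = spec[1:]
    if ":" in spec:
        low_s, high_s = spec.split(":", 1)
    else:
        low_s, high_s = "0", spec  # a bare number N means the range 0..N
    low = -math.inf if low_s == "~" else float(low_s) if low_s else 0.0
    high = float(high_s) if high_s else math.inf
    return inside, low, high


def alerts(value, spec):
    """Return True if `value` should raise an alert for the range `spec`."""
    inside, low, high = parse_range(spec)
    in_range = low <= value <= high  # endpoints are inclusive
    return in_range if inside else not in_range


print(alerts(60, "@~:60"))  # 60 days left is inside the range
print(alerts(61, "@~:60"))  # 61 days left is outside the range
```

Under that reading, the warning fires once 60 days or fewer remain, which matches the values reported above.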

Event Timeline

The procedure to do this is described here:

https://wikitech.wikimedia.org/wiki/Mcrouter#Renew_CA_and_certificates

It would be nice to turn this into a script, so we don't need to paste commands from the wiki and the process is a bit more refined.

For bonus points we could think of a way to make this rollover automated.

Mentioned in SAL (#wikimedia-operations) [2020-04-20T18:39:21Z] <rzl> disabling puppet on all mcrouter hosts for cert renewal T248093

Mentioned in SAL (#wikimedia-operations) [2020-04-20T19:40:36Z] <rzl> mcrouter certs renewed on puppetmaster1001; puppet re-enabled on mcrouter hosts and will update certs naturally over the next 30m T248093

The renewal script works as expected, but the procedure as written caused problems because not every mcrouter host is listed in /etc/cergen/mcrouter.manifests.d/mediawiki-hosts.certs.yaml on puppetmaster1001. cergen didn't re-create certs for the missing hosts, so their certs were deleted without replacement and puppet failed on those hosts. I reverted the cert renewal, so it still needs to be done.

Those failed hosts were all new since the last time we did this. It looks like when we provision a new appserver, we generate mcrouter certs but don't add them to that config file. I'm looking into whether that's a failure of procedure or automation, and then I'll retry the cert renewal either tomorrow or next week.

In the meantime, the script is good -- I think in general it's not a bug that it was willing to remove certs, since we expect that to happen after we decommission a mcrouter host. (We could have the script prompt for confirmation in that kind of situation, but since the goal is to have it run unattended, that's not really a solution.)
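One possible unattended alternative to a confirmation prompt is a pre-flight check that aborts when a cert would be removed for a host nobody expects to lose one. The sketch below is not the renewal script: the function name, the exception, the idea of passing in a list of decommissioned hosts, and the hostnames are all hypothetical, and it assumes the host lists have already been extracted from the cert directory and the cergen manifest (whose format is not modelled here).

```python
class UnexpectedCertRemoval(Exception):
    """Raised when certs would be deleted for hosts not known to be decommissioned."""


def plan_cert_removals(hosts_with_certs, manifest_hosts, decommissioned_hosts=()):
    """Return the sorted hosts whose certs would be removed, or raise.

    A host loses its cert when it currently has one but is absent from the
    manifest. That is fine for decommissioned hosts; anything else is treated
    as a missing manifest entry and aborts the run instead of prompting.
    """
    to_remove = set(hosts_with_certs) - set(manifest_hosts)
    unexpected = to_remove - set(decommissioned_hosts)
    if unexpected:
        raise UnexpectedCertRemoval(
            "hosts have certs but are missing from the manifest: "
            + ", ".join(sorted(unexpected))
        )
    return sorted(to_remove)


# Hypothetical hostnames, for illustration only.
print(plan_cert_removals(
    hosts_with_certs=["mw-old", "mw-a", "mw-b"],
    manifest_hosts=["mw-a", "mw-b"],
    decommissioned_hosts=["mw-old"],
))
```

Raising an exception (and so exiting non-zero) keeps the run non-interactive while still stopping before any cert is deleted without a replacement.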

Mentioned in SAL (#wikimedia-operations) [2020-04-21T18:19:48Z] <rzl> disabling puppet on all mcrouter hosts for cert renewal T248093

Mentioned in SAL (#wikimedia-operations) [2020-04-21T19:09:04Z] <rzl> mcrouter certs renewed on puppetmaster1001 (again); puppet re-enabled on mcrouter hosts and will update certs naturally over the next 30m T248093

Certs renewed! I still need to merge the script for next time, and maybe set it up to run periodically unattended, but I'm resolving this.