Breaking this out from the initial experiment on carbon, since it is really a separate sub-task.
It seems like a good idea to move away from the bulkier, more dangerous official client and use acme-tiny instead: https://github.com/diafygi/acme-tiny
After reading its code and playing with how this would puppetize here, I've come up with this half-a-plan, and some open questions:
Half a plan
- Create puppetization somewhere like sslcert::letsencrypt::simple, which is intended to handle the simple case of a public HTTPS server like carbon.wikimedia.org, where we don't have to deal with complexities like loadbalancers, proxies, multiple servers, lack of direct access to the internet, lack of self-verification, or lack of an existing webroot that can serve files.
- Package/backport acme-tiny and install via puppet (it's so tiny we could even just throw the file in puppet for a first draft).
- Create a wrapper script for creating the cert/privkey, which basically steps through:
  - Generate an account key
  - Generate an SSL privkey
  - Generate a CSR, using CN/SAN from parameters
  - Invoke acme-tiny, writing the challenge files to the webroot path given in the parameters
  - Install the cert (including sorting out the intermediate chain)
- Have puppet run this if the intended private key file doesn't exist, to create it for the first time
- Puppetize a similar but simpler script for auto-renewal via cron. Note that the acme-tiny docs suggest saving the original CSR, and that acme-tiny does not check expiry, so in that form it's not the kind of thing you'd cron once a day. We could wrap it with an expiry checker and run that once a day, though.
- Puppetize monitoring the expiry of the cert, so that we'll get an alert if it gets down to, say, 15 days remaining without a successful renewal.
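The expiry wrapper and the monitoring check from the last two items could share one small piece of logic. This is a minimal sketch, not a finished script: the function names are hypothetical, the 30-day renewal threshold is an assumption derived from 90-day certs renewed roughly every 60 days, and the 15-day alert threshold is the figure suggested above. It parses the `notAfter=` line that `openssl x509 -enddate -noout` prints (month names are parsed with the process locale, so this assumes a C/English locale).

```python
import subprocess
from datetime import datetime, timedelta, timezone

RENEW_BELOW_DAYS = 30  # assumption: 90-day certs, renewed about every 60 days
ALERT_BELOW_DAYS = 15  # alert threshold suggested in the plan


def parse_not_after(openssl_output):
    """Parse a line such as 'notAfter=Mar 10 12:00:00 2016 GMT' into an aware datetime."""
    value = openssl_output.strip().split("=", 1)[1]
    stamp, _tz = value.rsplit(" ", 1)  # openssl always prints GMT here
    parsed = datetime.strptime(stamp, "%b %d %H:%M:%S %Y")
    return parsed.replace(tzinfo=timezone.utc)


def cert_not_after(cert_path):
    """Ask openssl for the expiry date of the cert at cert_path."""
    result = subprocess.run(
        ["openssl", "x509", "-enddate", "-noout", "-in", cert_path],
        check=True, capture_output=True, text=True,
    )
    return parse_not_after(result.stdout)


def expires_within(not_after, now, days):
    """True if the cert expires less than `days` days after `now`."""
    return not_after - now < timedelta(days=days)


def should_renew(not_after, now):
    return expires_within(not_after, now, RENEW_BELOW_DAYS)


def should_alert(not_after, now):
    return expires_within(not_after, now, ALERT_BELOW_DAYS)
```

A daily cron job would call `should_renew(cert_not_after(path), datetime.now(timezone.utc))` and only invoke acme-tiny when it returns true; the monitoring check would use `should_alert` the same way.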
Open Questions
- Account Keys and Security:
If we dynamically generate the Account key when necessary on cert creation above, it's local-only. If the server dies, we lose the account key and the ability to revoke the issued cert. As long as there's no compromise, it doesn't matter: if e.g. a server's disks fail, and we re-install it, and post-reinstall request a brand-new cert using a brand-new account key, everything works fine anyway.
The next logical thought, though, is that it might matter if someone steals the SSL private key off the box and also deletes our only copy of the Account key, so that we can't revoke the cert for the stolen SSL private key.
However, when you think about it, even if we had the Account key backed up securely elsewhere, and didn't keep a live copy on the server (which would be tricky for renewals?), we'd still be facing the same problem. Once they've rooted our server, they can just as easily create a brand-new account key themselves and go through ACME registration to get a cert issued for their own new private key on our hostname. They can do that today even if we're not using LE here at all ourselves. LE's mere existence creates this risk: there's always going to be the chance that a smart attacker who roots one of our public boxes can obtain certificates, valid for 90 days, for any hostnames mapped to it, and we might have to deal with the issue manually.
So my net take is that the tradeoff favors not backing up anything, and not bothering to delete/hide an account key from local root either. It's far simpler, and all that other stuff isn't buying us any solid improvement in real security. I'm just not really sure that I've fully thought that through. Input welcome!
- Chicken-and-egg when no existing valid cert is in place:
The big gaping functional hole in the half-plan is this: if the nginx (for example) configuration is puppetized to listen on port 443 using the output path we're going to place the LE cert/key in, but we haven't generated the key for the first time yet, nginx won't start up, and thus we're not serving the webroot over port 80 either for the challenge.
One way to resolve this would be to generate a self-signed cert and install that first to get the server up and running, then replace it and reload the config after generation of the real cert. Basically it would break up the cert generation process into this sort of dependency chain: generate the local files (account key, ssl key, CSR) -> generate a temporary self-signed cert -> configure/start the webserver -> fetch the real signed cert with acme-tiny -> reload the server.
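To make that chain concrete, here is a rough sketch that only builds the commands in order, without running them. The file names under `ssl_dir`, the `service nginx` invocations, the 7-day lifetime of the placeholder cert, and the location of `acme_tiny.py` are all assumptions for illustration; the acme-tiny flags are the ones from its README. It handles a single hostname via CN and leaves SANs out.

```python
from pathlib import PurePosixPath


def bootstrap_steps(ssl_dir, hostname, acme_dir, acme_tiny="acme_tiny.py"):
    """Return (label, command) pairs in the order the chain has to run."""
    base = PurePosixPath(ssl_dir)
    account_key = str(base / "account.key")  # hypothetical file names
    domain_key = str(base / "domain.key")
    csr = str(base / "domain.csr")
    cert = str(base / "cert.pem")
    return [
        ("account key", ["openssl", "genrsa", "-out", account_key, "4096"]),
        ("ssl key", ["openssl", "genrsa", "-out", domain_key, "4096"]),
        ("csr", ["openssl", "req", "-new", "-sha256", "-key", domain_key,
                 "-subj", "/CN=" + hostname, "-out", csr]),
        # Short-lived placeholder so the webserver config can load at all.
        ("self-signed cert", ["openssl", "x509", "-req", "-days", "7",
                              "-in", csr, "-signkey", domain_key,
                              "-out", cert]),
        ("start webserver", ["service", "nginx", "start"]),
        # acme-tiny prints the signed cert on stdout; the caller has to
        # capture that output and write it over the placeholder cert.
        ("fetch signed cert", ["python", acme_tiny,
                               "--account-key", account_key,
                               "--csr", csr, "--acme-dir", acme_dir]),
        ("reload webserver", ["service", "nginx", "reload"]),
    ]
```

In puppet terms, each of the first four steps would presumably be an exec guarded by the existence of its output file, with the webserver service depending on the placeholder cert.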
Another way would be to have the cert generation process/script actually launch a minimal server config (e.g. nginx -c /tmp/acme-server.conf) just for ACME validation. We'd have to puppetize this such that if the signed cert is missing, we stop the real webserver (or make dependencies work such that it was never started in the first place), do the ACME fetch with the tiny temporary server to generate the SSL files, and then start the real server.
The upside of this is that it would extend the utility of this sslcert::letsencrypt::simple puppetization to services which don't have a simple apache/nginx config with a static files webroot like carbon, and let it work for other random HTTP service daemons like a java process directly on port 80. The downside is that it requires stopping the main service for the duration of the renewal challenge (very brief, and it should only happen once every 60 days).
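As a rough illustration of how small that temporary server could be, here is a sketch that swaps the `nginx -c /tmp/acme-server.conf` idea for Python's standard `http.server`, which is my substitution and not something acme-tiny provides. It serves only files under `/.well-known/acme-challenge/` from the directory acme-tiny writes to and returns 404 for everything else. The class and function names are hypothetical.

```python
import functools
import http.server
import re

# Only plain token names under the ACME challenge prefix are served.
TOKEN_PATH = re.compile(r"^/\.well-known/acme-challenge/([A-Za-z0-9_-]+)$")


class ChallengeHandler(http.server.SimpleHTTPRequestHandler):
    """Serve challenge files from one directory and nothing else."""

    def send_head(self):
        # send_head backs both GET and HEAD, so the filter covers both.
        match = TOKEN_PATH.match(self.path)
        if match is None:
            self.send_error(404)
            return None
        # Map the URL onto a file directly inside the challenge directory.
        self.path = "/" + match.group(1)
        return super().send_head()


def make_challenge_server(challenge_dir, host="", port=80):
    """Build (but do not start) an HTTP server for ACME validation."""
    handler = functools.partial(ChallengeHandler, directory=challenge_dir)
    return http.server.HTTPServer((host, port), handler)
```

The wrapper script would stop the real service, call `make_challenge_server(acme_dir).serve_forever()` in a background thread while acme-tiny runs with `--acme-dir` pointing at the same directory, then call `shutdown()` and start the real service again.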