Sort out letsencrypt puppetization for simple public hosts
Closed, Resolved (Public)

Description

Breaking out from the initial experiment on carbon, as this is really a separate sub-task.

It seems to be a good idea to get away from the bulkier, more dangerous official client and use https://github.com/diafygi/acme-tiny instead.

After reading its code and playing with how this would puppetize here, I've come up with this half-a-plan, and some open questions:

Half a plan

  1. Create puppetization somewhere like sslcert::letsencrypt::simple, which is intended to handle the simple case of a public HTTPS server like carbon.wikimedia.org, where we don't have to deal with complexities like loadbalancers, proxies, multiple servers, lack of direct access to the internet, lack of self-verification, or lack of an existing webroot that can serve files.
  2. Package/backport acme-tiny and install via puppet (it's so tiny we could even just throw the file in puppet for a first draft).
  3. Create a wrapper script for creating the cert/privkey, which basically steps through:
    • Generate an account key
    • Generate an SSL privkey
    • Generate a CSR, using CN/SAN from parameters
    • Invoke acme-tiny, writing the challenge files to the appropriate webroot given in the parameters
    • Install the cert (including dealing with chain issues)
  4. Have puppet run this if the intended private key file doesn't exist, to create it for the first time
  5. Puppetize a similar simpler script for auto-renewal via cron. Note acme-tiny docs suggest saving the original CSR, and acme-tiny does not check expiry, so it's not the kind of thing you'd cron once a day in that form. We could wrap this with an expiry checker and do it once a day, though.
  6. Puppetize monitoring the expiry of the cert, so that we'll get an alert if it gets down to, say, 15 days remaining without a successful renewal.
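The wrapper steps in item 3 could be sketched roughly as follows. This is only an illustration under assumptions, not the script that was eventually written: the file names, the `ACME_TINY` path and both function names are hypothetical, and the `-addext` option assumes OpenSSL 1.1.1 or newer (older versions need a config file for SANs). The `--account-key`, `--csr` and `--acme-dir` options are the ones acme-tiny documents. Installing the cert and sorting out the chain is left out.

```python
import os
import subprocess
from pathlib import Path

ACME_TINY = "/usr/local/bin/acme_tiny.py"  # hypothetical install path


def issuance_commands(cert_dir, names, challenge_dir):
    """Return (argv, stdout_path) pairs for one issuance, in order.

    stdout_path is the file that should receive the command's standard
    output, or None when the command writes its own output file.
    """
    d = Path(cert_dir)
    account = str(d / "account.key")
    key = str(d / "domain.key")
    csr = str(d / "domain.csr")
    san = ",".join("DNS:" + n for n in names)
    return [
        # 1. Account key, used only to talk to the ACME server.
        (["openssl", "genrsa", "-out", account, "4096"], None),
        # 2. Private key for the TLS cert itself.
        (["openssl", "genrsa", "-out", key, "4096"], None),
        # 3. CSR: first name as CN, every name as a SAN.
        (["openssl", "req", "-new", "-sha256", "-key", key,
          "-subj", "/CN=" + names[0],
          "-addext", "subjectAltName=" + san,
          "-out", csr], None),
        # 4. acme-tiny prints the signed cert on stdout.
        (["python3", ACME_TINY, "--account-key", account, "--csr", csr,
          "--acme-dir", challenge_dir], str(d / "cert.pem")),
    ]


def run_issuance(cert_dir, names, challenge_dir):
    """Run the commands; a failure raises and leaves old files in place."""
    for argv, stdout_path in issuance_commands(cert_dir, names, challenge_dir):
        if stdout_path is None:
            subprocess.run(argv, check=True)
            continue
        tmp_path = stdout_path + ".new"
        with open(tmp_path, "wb") as out:
            subprocess.run(argv, check=True, stdout=out)
        # Only replace the live cert once acme-tiny has succeeded.
        os.replace(tmp_path, stdout_path)
```

Keeping the command construction separate from execution makes it easy to check what would be run (for example from a puppet dry run) without touching the network.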

Open Questions

  1. Account Keys and Security:

If we dynamically generate the Account key when necessary on cert creation above, it's local-only. If the server dies, we lose the account key and the ability to revoke the issued cert. As long as there's no compromise, it doesn't matter: if e.g. a server's disks fail, and we re-install it, and post-reinstall request a brand-new key using a brand-new account key, everything works fine anyways.

The next logical thought, though, is that it might matter if someone steals the SSL private key off the box and also deletes our only copy of the Account key, so that we can't revoke their use of the stolen SSL private key.

However, when you think about it, even if we had the Account key backed up securely elsewhere, and didn't keep a live copy on the server (which would be tricky for renewals?), we'd still be facing the same problem. Once they've rooted our server, they can just as easily create a brand-new account key themselves and go through ACME registration to obtain their own new cert (with their own private key) for our hostname. They can do that today even if we're not using LE here at all ourselves. LE's mere existence creates this risk, and there's always going to be the chance that a smart attacker who roots one of our public boxes can create certificates for any hostnames mapped to it that are valid for 90 days, and we might have to deal with the issue manually.

So my net take on this is the tradeoff is in favor of not backing up anything, and not bothering trying to delete/hide an account key from local root either. It's far simpler, and all that other stuff isn't buying us any solid improvement in real security. I'm just not really sure that I've fully thought that through. Input welcome!

  2. Chicken-and-egg when no existing valid cert is in place:

The big gaping functional hole in the half-plan is this: if the nginx (for example) configuration is puppetized to listen on port 443 using the output path we're going to place the LE cert/key in, but we haven't generated the key for the first time yet, nginx won't start up, and thus we're not serving the webroot over port 80 either for the challenge.

One way to resolve this would be to generate a self-signed cert and install that first to get the server up and running, then replace it and reload the config after generation of the real cert. Basically it would break up the cert generation process into this sort of dependency chain: generate the local files (account key, ssl key, CSR) -> generate a temporary self-signed cert -> configure/start the webserver -> fetch the real signed cert with acme-tiny -> reload the server.
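As a rough sketch of the "temporary self-signed cert" step in that chain, the CSR and private key that already exist can be fed to `openssl x509 -req -signkey`. The function name and the example paths are made up for illustration, and the 180-day lifetime is borrowed from the implementation notes later in this task.

```python
def self_signed_command(csr_path, key_path, cert_path, days=180):
    """Build the argv that self-signs an existing CSR with its own key.

    The result is only a placeholder so the webserver can start; it is
    replaced once acme-tiny has fetched the real cert.
    """
    return [
        "openssl", "x509", "-req",
        "-in", csr_path,        # the CSR generated earlier
        "-signkey", key_path,   # sign with the cert's own private key
        "-days", str(days),
        "-out", cert_path,
    ]
```

For example, `subprocess.run(self_signed_command("/etc/wle/demo/domain.csr", "/etc/wle/demo/domain.key", "/etc/wle/demo/cert.pem"), check=True)` would write the placeholder before the webserver is first started.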

Another way would be to have the cert generation process/script actually launch a minimal server config (e.g. nginx -c /tmp/acme-server.conf) just for ACME validation. We'd have to puppetize this such that if the signed cert is missing, we stop the real webserver (or make dependencies work such that it was never started in the first place), do the ACME fetch with the tiny temporary server to generate the SSL files, and then start the real server.

The upside of this is that it would extend the utility of this sslcert::letsencrypt::simple puppetization to services which don't have a simple apache/nginx config with a static files webroot like carbon, and let it work for other random HTTP service daemons like a java process directly on port 80. The downside is that it requires stopping the main service for the duration of the renewal challenge (very brief, and should happen about once every 60 days).
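The text above suggests a minimal nginx config for the temporary server. Purely to illustrate how little that server has to do, here is a sketch using Python's standard `http.server` instead (a swapped-in technique, not what the task proposes). The class and function names are hypothetical. `webroot` is assumed to be the directory that contains `.well-known/acme-challenge/`, i.e. the parent of the directory handed to acme-tiny as `--acme-dir`.

```python
import functools
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer

CHALLENGE_PREFIX = "/.well-known/acme-challenge/"


class ChallengeOnlyHandler(SimpleHTTPRequestHandler):
    """Serve files under CHALLENGE_PREFIX and answer 404 to anything else."""

    def do_GET(self):
        if not self.path.startswith(CHALLENGE_PREFIX):
            self.send_error(404, "Not an ACME challenge path")
            return
        super().do_GET()

    def do_HEAD(self):
        if not self.path.startswith(CHALLENGE_PREFIX):
            self.send_error(404, "Not an ACME challenge path")
            return
        super().do_HEAD()

    def log_message(self, format, *args):
        pass  # keep the short-lived server quiet


def make_challenge_server(webroot, host="0.0.0.0", port=80):
    """Create (but do not start) the temporary challenge server."""
    handler = functools.partial(ChallengeOnlyHandler, directory=webroot)
    return ThreadingHTTPServer((host, port), handler)
```

A wrapper would stop the real service, call `serve_forever()` on this server in a background thread, run acme-tiny, then call `shutdown()` and start the real service again. Binding port 80 needs root or an equivalent capability.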


Event Timeline

Some notes on my ideas on how to do this for tool labs are at T122403: tool labs: provide custom domain proxy?, but that's a more complex scenario than a simple single webserver.

Edited description - having to stop the existing service is a problem for renewals, we still have a challenge to do there. Also, we could support nginx/apaches that don't have a simple static webroot to use (in their normal config) by exporting a config snippet for the /.well-known/ stuff. We'll probably need a few subclasses here for dealing with these cases in general:

  1. Simple server with existing static docroot (easiest)
  2. Simple nginx or apache server without a static docroot to use, can export a config snippet to map /.well-known/ stuff for challenge
  3. Other generic port 80 service - just need to know the service name, and stop/start it around issue/renewal process using a minimal nginx/apache challenge config (tiny outage every 60 days)

Rough notes from thinking about implementation more:

# id = uniq id for this cert, e.g. puppet $title
# names = foo.wm.o[,bar.wm.o[,baz...]]
# mode = integrated | standalone
#     * integrated is for apache/nginx, where puppetization can inject config
#       fragment for /.well-known/...
#     * standalone for any other webservice (e.g. java), causes a small
#       downtime on each cert renewal (every 60d or so)
# service = nginx | apache | java-foo | ...

# data files in /etc/wle/<id>/

# wle-setup.py <id> <names>
# ------------------------------
# is_sane {
#     Sanity-check files in this id's dir:
#     acct_key, priv_key, csr, cert
#     ^ all exist, right newer than left
#     cert unexpired
# }
#
# if(not is_sane()) {
#    wipe the files
#    gen acct_key
#    gen priv_key
#    gen csr
#    gen self-signed cert, 180d
# }

# wle-sign.py <id> <names> <mode> <service>
# ------------------------------
# if(cert expires < 31d || cert self-signed) {
#    if mode == integrated:
#       start service if not running
#       acme-tiny to create real cert
#       reload service
#    if mode == standalone:
#       stop service
#       configure->start micro-acme-http-server
#       acme-tiny to create real cert
#       stop micro-acme-http-server
#       start service
# }

# puppetization:
# exec wle-setup.py before Service[webserver]
# exec wle-sign.py after Service[webserver]
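The decision logic in the wle-sign.py notes above could look something like this. It is a sketch under assumptions rather than the merged script: the function names are invented, the step lists are plain descriptions instead of real service calls, and the expiry is assumed to come from the `notAfter=` line that `openssl x509 -enddate -noout` prints (treated as a naive UTC datetime).

```python
from datetime import datetime, timedelta

RENEW_BEFORE = timedelta(days=31)  # threshold taken from the notes above


def parse_not_after(line):
    """Parse a line such as 'notAfter=Jul 19 11:19:00 2016 GMT'."""
    value = line.strip().split("=", 1)[1]
    return datetime.strptime(value, "%b %d %H:%M:%S %Y GMT")


def needs_signing(not_after, self_signed, now):
    """True if the cert is a self-signed placeholder or close to expiry."""
    return self_signed or (not_after - now) < RENEW_BEFORE


def signing_steps(mode, service):
    """Return the ordered steps for one signing run, as descriptions."""
    if mode == "integrated":
        return [
            "start %s if not running" % service,
            "run acme-tiny",
            "reload %s" % service,
        ]
    if mode == "standalone":
        return [
            "stop %s" % service,
            "start temporary challenge server",
            "run acme-tiny",
            "stop temporary challenge server",
            "start %s" % service,
        ]
    raise ValueError("unknown mode: %r" % (mode,))
```

Because `needs_signing` is cheap and has no side effects, a wrapper built this way could be run from cron daily and would only contact the ACME server when a cert is actually due.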

https://github.com/aloyr/acme-tiny-automator

"automates deployment of letsencrypt certs using acme-tiny library

This relies on the acme-tiny library, does not need to be run as root, does not install anything on the server, other than the acme-tiny.py file."

It's similar in scope to what's going on in my paste above, but it's still missing a few bits we'll need for the webserver config chicken/egg thing, even for the simplest cases. I imagine the scenario it was written for expects an OS-default nginx/apache config to be in place before the script executes for the first time, with the custom config (w/ SSL, and static webroot/.well-known bits) applied after that first run.

Change 283761 had a related patch set uploaded (by Dzahn):
create sslcert::letsencrypt::simple, install acme-tiny

https://gerrit.wikimedia.org/r/283761

Package/backport acme-tiny and install via puppet (it's so tiny we could even just throw the file in puppet for a first draft).

yep, https://gerrit.wikimedia.org/r/#/c/283761/1

Change 283763 had a related patch set uploaded (by Dzahn):
add role for hosts with LE certs, add on carbon

https://gerrit.wikimedia.org/r/283763

Change 283988 had a related patch set uploaded (by BBlack):
basic acme-setup script acme::init

https://gerrit.wikimedia.org/r/283988

akosiaris triaged this task as Medium priority. Apr 20 2016, 11:19 AM

Change 283763 abandoned by Dzahn:
add role for hosts with LE certs, add on carbon

https://gerrit.wikimedia.org/r/283763

Change 283761 merged by Dzahn:
create letsencrypt module, install acme-tiny

https://gerrit.wikimedia.org/r/283761

Change 283988 merged by BBlack:
letsencrypt module guts acme-setup script

https://gerrit.wikimedia.org/r/283988

Change 285416 had a related patch set uploaded (by BBlack):
LE: include rather than require sslcert

https://gerrit.wikimedia.org/r/285416

Change 285416 merged by BBlack:
LE: include rather than require sslcert

https://gerrit.wikimedia.org/r/285416

Change 285419 had a related patch set uploaded (by BBlack):
LE: fix "creates" path on first exec

https://gerrit.wikimedia.org/r/285419

Change 285419 merged by BBlack:
LE: fix "creates" path on first exec

https://gerrit.wikimedia.org/r/285419

Status Update: letsencrypt::cert::integrated seems to work as expected, and is managing 3x LE certs on carbon with automatic provisioning and renewal (no humans are harmed in the making of these simple outputs of math functions). Leaving this ticket open a little longer until we template + test the same on an apache (rather than nginx) integrated/public host example (should be reasonably easy), and test multi-hostname SANs in a real example too.

Change 285440 had a related patch set uploaded (by BBlack):
LE: add apache config/example

https://gerrit.wikimedia.org/r/285440

Change 285441 had a related patch set uploaded (by BBlack):
ganglia: use LE cert

https://gerrit.wikimedia.org/r/285441

Change 285442 had a related patch set uploaded (by BBlack):
ganglia: remove old cert absent line

https://gerrit.wikimedia.org/r/285442

Change 285440 merged by BBlack:
LE: add apache config/example

https://gerrit.wikimedia.org/r/285440

Change 285572 had a related patch set uploaded (by BBlack):
rt.wm.o: use LE cert

https://gerrit.wikimedia.org/r/285572

Change 285573 had a related patch set uploaded (by BBlack):
rt.wm.o: remove old cert definition

https://gerrit.wikimedia.org/r/285573

Change 285441 abandoned by BBlack:
ganglia: use LE cert

Reason:
using rt for LE apache test instead

https://gerrit.wikimedia.org/r/285441

Change 285442 abandoned by BBlack:
ganglia: remove old cert absent line

Reason:
using rt for LE apache test instead

https://gerrit.wikimedia.org/r/285442

Converted rt.wm.o, so now we have 1x apache + 1x nginx converted. Next I'm going to switch ubuntu+mirrors (both on carbon) to a single shared SAN cert, to test the SAN bits out a bit.

Change 285573 merged by BBlack:
rt.wm.o: remove old cert definition

https://gerrit.wikimedia.org/r/285573

BBlack claimed this task.

SAN test worked as well. We'll likely have more refinement and bugfixes to deal with later when we start spreading the usage of this, but it's good enough for now!