Sort out letsencrypt puppetization for simple public hosts
Closed, Resolved, Public

Authored By

	BBlack
	Apr 15 2016, 6:13 PM

Description

Breaking out from the initial experiment on carbon, as this is really a separate sub-task.

It seems to be a good idea to get away from the bulkier, more dangerous official client and use https://github.com/diafygi/acme-tiny instead.

After reading its code and playing with how this would puppetize here, I've come up with this half-a-plan, and some open questions:

Half a plan

Create puppetization somewhere like sslcert::letsencrypt::simple, which is intended to handle the simple case of a public HTTPS server like carbon.wikimedia.org, where we don't have to deal with complexities like loadbalancers, proxies, multiple servers, lack of direct access to the internet, lack of self-verification, or lack of an existing webroot that can serve files.
Package/backport acme-tiny and install via puppet (it's so tiny we could even just throw the file in puppet for a first draft).
Create a wrapper script for creating the cert/privkey, which basically steps through:
- Generate an account key
- Generate an SSL privkey
- Generate a CSR, using CN/SAN from parameters
- invoke acme-tiny, outputting to appropriate webroot from parameters for challenge stuff
- install the cert (including chain issues)
Have puppet run this if the intended private key file doesn't exist, to create it for the first time
Puppetize a similar, simpler script for auto-renewal via cron. Note that the acme-tiny docs suggest saving the original CSR, and acme-tiny does not check expiry, so it's not the kind of thing you'd cron once a day in that form. We could wrap this with an expiry checker and do it once a day, though.
Puppetize monitoring the expiry of the cert, so that we'll get an alert if it gets down to, say, 15 days remaining without a successful renewal.
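
The wrapper steps in the plan above could be sketched as an ordered list of commands. This is only a rough sketch, not the script that was eventually merged: the helper name build_issue_commands, the file names and the directory layout are all made up for illustration. The openssl and acme_tiny.py invocations follow the acme-tiny README. SAN handling and the "install the cert (including chain issues)" step are left out.

```python
import os


def build_issue_commands(cert_dir, common_name, challenge_dir,
                         acme_tiny="acme_tiny.py"):
    """Return (argv, stdout_path) pairs for a first-time issuance.

    Hypothetical helper: file names and layout are assumptions.
    stdout_path is the file the command's stdout should be redirected
    to; None means the command writes its own output file.
    """
    acct_key = os.path.join(cert_dir, "account.key")
    priv_key = os.path.join(cert_dir, "private.key")
    csr = os.path.join(cert_dir, "request.csr")
    cert = os.path.join(cert_dir, "signed.crt")
    return [
        # Generate an account key.
        (["openssl", "genrsa", "-out", acct_key, "4096"], None),
        # Generate an SSL private key.
        (["openssl", "genrsa", "-out", priv_key, "4096"], None),
        # Generate a CSR (CN only in this sketch, no SANs).
        (["openssl", "req", "-new", "-sha256", "-key", priv_key,
          "-subj", "/CN=" + common_name, "-out", csr], None),
        # acme-tiny writes challenge files into challenge_dir and
        # prints the signed cert on stdout.
        (["python", acme_tiny, "--account-key", acct_key,
          "--csr", csr, "--acme-dir", challenge_dir], cert),
    ]
```

Each pair could then be fed to something like subprocess.run(argv, check=True, stdout=...), opening stdout_path for writing when it is not None.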

Open Questions

Account Keys and Security:

If we dynamically generate the Account key when necessary on cert creation above, it's local-only. If the server dies, we lose the account key and the ability to revoke the issued cert. As long as there's no compromise, it doesn't matter: if e.g. a server's disks fail, and we re-install it, and post-reinstall request a brand-new key using a brand-new account key, everything works fine anyways.

The next logical thought, though, is that it might matter if someone steals the SSL private key off the box and also deletes our only copy of the Account key, so that we can't revoke their use of the stolen SSL private key.

However, when you think about it, even if we had the Account key backed up securely elsewhere, and didn't keep a live copy on the server (which would be tricky for renewals?), we'd still be facing the same problem. Once they've rooted our server, they can just as easily create a brand-new account key themselves and go through ACME registration to get a new cert, with their own new private key, for our hostname. They can do that today even if we're not using LE here at all ourselves. LE's mere existence creates this risk, and there's always going to be the chance that a smart attacker who roots one of our public boxes can create certificates for any hostnames mapped to it that are valid for 90 days, and we might have to deal with the issue manually.

So my net take on this is the tradeoff is in favor of not backing up anything, and not bothering trying to delete/hide an account key from local root either. It's far simpler, and all that other stuff isn't buying us any solid improvement in real security. I'm just not really sure that I've fully thought that through. Input welcome!

Chicken-and-egg when no existing valid cert is in place:

The big gaping functional hole in the half-plan is this: if the nginx (for example) configuration is puppetized to listen on port 443 using the output path we're going to place the LE cert/key in, but we haven't generated the key for the first time yet, nginx won't start up, and thus we're not serving the webroot over port 80 either for the challenge.

One way to resolve this would be to generate a self-signed cert and install that first to get the server up and running, then replace it and reload the config after generation of the real cert. Basically it would break up the cert generation process into this sort of dependency chain: generate the local files (account key, ssl key, CSR) -> generate a temporary self-signed cert -> configure/start the webserver -> fetch the real signed cert with acme-tiny -> reload the server.
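
For illustration, the temporary self-signed step in that chain could look roughly like the following. The helper name self_signed_command, the 30-day default lifetime and the paths are assumptions made up for this sketch; the openssl req -x509 flags themselves are standard.

```python
# Order of operations, as described in the paragraph above.
BOOTSTRAP_ORDER = (
    "generate local files (account key, SSL key, CSR)",
    "generate temporary self-signed cert",
    "configure/start the webserver",
    "fetch the real signed cert with acme-tiny",
    "reload the server",
)


def self_signed_command(priv_key, common_name, out_path, days=30):
    """Build an openssl command for a throwaway self-signed cert.

    Hypothetical helper: it reuses the already-generated private key,
    so the webserver config can point at the final key path from the
    start and only the cert file gets swapped later.
    """
    return [
        "openssl", "req", "-x509", "-new", "-sha256",
        "-key", priv_key,
        "-subj", "/CN=" + common_name,
        "-days", str(days),
        "-out", out_path,
    ]
```

Reusing the real private key for the placeholder cert means only one file changes when the signed cert arrives, which keeps the final step a plain reload.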

Another way would be to have the cert generation process/script actually launch a minimal server config (e.g. nginx -c /tmp/acme-server.conf) just for ACME validation. We'd have to puppetize this such that if the signed cert is missing, we stop the real webserver (or make dependencies work such that it was never started in the first place), do the ACME fetch with the tiny temporary server to generate the SSL files, and then start the real server.

The upside of this is that it would extend the utility of this sslcert::letsencrypt::simple puppetization to services which don't have a simple apache/nginx config with a static files webroot like carbon, and let it work for other random HTTP service daemons like a java process directly on port 80. The downside is that it requires stopping the main service for the duration of the renewal challenge (very brief, should be once every 60 days).
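
A minimal temporary config for the "nginx -c /tmp/acme-server.conf" idea could be rendered like this. It is an untested sketch: the function name render_acme_nginx_conf, the pid file path and the challenge directory are assumptions, and a real deployment would likely also want explicit log paths.

```python
def render_acme_nginx_conf(challenge_dir, pid_file="/run/acme-nginx.pid"):
    """Render a tiny nginx config that only serves ACME challenges.

    Hypothetical helper; the paths are placeholders. A separate pid
    file keeps this instance from clashing with the real service.
    """
    # alias needs a trailing slash to pair with the location prefix.
    challenge_dir = challenge_dir.rstrip("/") + "/"
    template = (
        "pid {pid};\n"
        "events {{}}\n"
        "http {{\n"
        "    server {{\n"
        "        listen 80;\n"
        "        location /.well-known/acme-challenge/ {{\n"
        "            alias {challenge_dir};\n"
        "        }}\n"
        "        location / {{\n"
        "            return 404;\n"
        "        }}\n"
        "    }}\n"
        "}}\n"
    )
    return template.format(pid=pid_file, challenge_dir=challenge_dir)
```

The rendered text would be written to the temporary config path before stopping the real service, and the temporary nginx stopped again once acme-tiny has finished.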

Details

Subject	Repo	Branch	Lines +/-
rt.wm.o: remove old cert definition	operations/puppet	production	+0 -3
rt.wm.o: use LE cert	operations/puppet	production	+13 -5
ganglia: remove old cert absent line	operations/puppet	production	+0 -3
ganglia: use LE cert	operations/puppet	production	+13 -4
LE: add apache config/example	operations/puppet	production	+19 -1
LE: fix "creates" path on first exec	operations/puppet	production	+1 -1
LE: include rather than require sslcert	operations/puppet	production	+1 -1
letsencrypt module guts + acme-setup script	operations/puppet	production	+705 -0
create letsencrypt module, install acme-tiny	operations/puppet	production	+236 -0
add role for hosts with LE certs, add on carbon	operations/puppet	production	+6 -1

Related Objects

Status	Assigned	Task
Resolved	• ema	T108827 Investigate TCP Fast Open for tlsproxy
Declined	None	T107236 Switch port 80 to nginx on primary clusters
Open	None	T101048 Policy decisions for new (and current) DNS domains registered to the WMF
Resolved	BBlack	T104681 HTTPS Plans (tracking / high-level info)
Resolved	BBlack	T102824 Clean up DNS/redirects for TLS
Resolved	ArielGlenn	T107575 download.wiki[mp]edia.org are using an invalid certificate
Resolved	Chmarkine	T110511 sitemap.wikimedia.org uses invalid SSL certificate
Resolved	BBlack	T104244 Preload HSTS
Resolved	BBlack	T102814 Decom old multiple-subdomain wikis in wikipedia.org
Resolved	BBlack	T104942 TLS and .wap/.mobile multi-level subdomains of wikipedia.org
Resolved	BBlack	T102815 Decom www.$lang hostnames/redirects
Resolved	BBlack	T102826 Fix/decom multiple-subdomain wikis in wikimedia.org
Resolved	BBlack	T102827 Decide what to do with *.donate.wikimedia.org subdomain + TLS
Resolved	• CCogdill_WMF	T130414 delete links.email.donate.wikimedia.org (and all other email.donate.*?) from DNS
Declined	BBlack	T111967 Preload HSTS for select hostnames within wikimedia.org
Duplicate	BBlack	T111998 investigate/remove hostname login.m.wikimedia.org
Resolved	BBlack	T40516 Enable HSTS on Wikimedia sites
Resolved	None	T37313 SSL cert invalid for bugzilla.wikipedia.org redirect
Declined	Krinkle	T38126 *.mobile.wikipedia.org domains are using invalid SSL certificate
Resolved	Jgreen	T88199 Enable HSTS on https://payments.wikimedia.org
Resolved	Chmarkine	T90527 Enable HSTS and point rel=canonical to HTTPS for all Russian Wikimedia projects
Resolved	BBlack	T132521 Enforce HTTPS+HSTS on remaining one-off sites in wikimedia.org that don't use standard cache cluster termination
Resolved	Krenair	T133360 Fix wikitech-static TLS config
Resolved	Jgreen	T137161 Fix nits in HTTPS/HSTS configs in externally-hosted fundraising domains
Resolved	RobH	T170140 revoke benefactorevents.wikimedia.org SSL certificate
Duplicate	None	T137915 stream.wikimedia.org doesn't redirect to HTTPS
Resolved	BBlack	T105905 Switch blog to HTTPS-only
Invalid	None	T64488 Wikimedia blog has unsecured elements on https
Resolved	BBlack	T103919 let all services on misc-web enforce http->https redirects
Resolved	Dzahn	T103773 check if services behind misc-web enforce http->https redirect or not
Resolved	• ezachte	T93702 Fix the mixed content issue on Wikimedia Statistics
Resolved	BBlack	T132459 HTTPS redirects for config-master.wikimedia.org
Resolved	BBlack	T132460 HTTPS redirects for git.wikimedia.org
Resolved	BBlack	T132461 HTTPS redirects for graphite.wikimedia.org
Resolved	BBlack	T132462 HTTPS redirects for parsoid-tests.wikimedia.org
Resolved	BBlack	T132463 HTTPS redirects for datasets.wikimedia.org
Resolved	BBlack	T132464 HTTPS redirects for transparency.wikimedia.org
Resolved	BBlack	T132465 HTTPS redirects for stats.wikimedia.org
Resolved	Dzahn	T132543 enable HSTS on *.planet.wikimedia.org
Resolved	BBlack	T132452 HSTS preload for wmfusercontent.org
Resolved	BBlack	T132685 Preload STS for wikimedia.org
Resolved	BBlack	T34796 status.wikimedia.org has no (valid) HTTPS
Resolved	BBlack	T132450 enable https for (ubuntu|apt|mirrors).wikimedia.org
Resolved	Vgutierrez	T214253 en.wikipedia.com [sic] serves an invalid certificate
Resolved	Vgutierrez	T190244 en-wp.org certificate error
Resolved	Vgutierrez	T133548 Create a secure redirect service for large count of non-canonical / junk domains
Resolved	Krenair	T133167 Get a real (letsencrypt) cert for labtestwikitech.wikimedia.org
Resolved	None	T133717 Letsencrypt all the prod things we can - planning
Resolved	BBlack	T132812 Sort out letsencrypt puppetization for simple public hosts

Event Timeline

BBlack created this task. Apr 15 2016, 6:13 PM

Restricted Application removed a project: Patch-For-Review. Apr 15 2016, 6:13 PM

Some notes on my ideas on how to do this for tool labs are at T122403: tool labs: provide custom domain proxy?, but that's a more complex scenario than a simple single webserver.

Krenair mentioned this in T50501: beta: Get SSL certificates for *.{projects}.beta.wmflabs.org. Apr 15 2016, 6:39 PM

BBlack updated the task description. Apr 15 2016, 6:47 PM

Edited description - having to stop the existing service is a problem for renewals, we still have a challenge to do there. Also, we could support nginx/apaches that don't have a simple static webroot to use (in their normal config) by exporting a config snippet for the /.well-known/ stuff. We'll probably need a few subclasses here for dealing with these cases in general:

Simple server with existing static docroot (easiest)
Simple nginx or apache server without a static docroot to use, can export a config snippet to map /.well-known/ stuff for challenge
Other generic port 80 service - just need to know the service name, and stop/start it around issue/renewal process using a minimal nginx/apache challenge config (tiny outage every 60 days)
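
For the second case, the exported config snippet might look something like what this sketch generates. The function name challenge_snippet and the challenge directory are hypothetical, and the apache variant assumes Apache 2.4 access-control syntax ("Require all granted").

```python
def challenge_snippet(server, challenge_dir):
    """Return a config fragment mapping ACME challenge URLs to a dir.

    Hypothetical helper for illustration; the fragment is meant to be
    included inside an existing port-80 server/vhost block.
    """
    directory = challenge_dir.rstrip("/") + "/"
    if server == "nginx":
        return (
            "location /.well-known/acme-challenge/ {\n"
            "    alias %s;\n"
            "}\n" % directory
        )
    if server == "apache":
        # Assumes Apache 2.4 ("Require all granted").
        return (
            "Alias /.well-known/acme-challenge/ %s\n"
            "<Directory %s>\n"
            "    Require all granted\n"
            "</Directory>\n" % (directory, directory)
        )
    raise ValueError("unsupported server type: %r" % (server,))
```

The third case (generic port 80 service) has no config to inject into, which is why it needs the stop/start approach instead.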

scfc subscribed. Apr 15 2016, 7:07 PM

Rough notes from thinking about implementation more:

# id = uniq id for this cert, e.g. puppet $title
# names = foo.wm.o[,bar.wm.o[,baz...]]
# mode = integrated | standalone
#     * integrated is for apache/nginx, where puppetization can inject config
#       fragment for /.well-known/...
#     * standalone for any other webservice (e.g. java), causes a small
#       downtime on each cert renewal (every 60d or so)
# service = nginx | apache | java-foo | ...

# data files in /etc/wle/<id>/

# wle-setup.py <id> <names>
# ------------------------------
# is_sane {
#     Sanity-check files in this id's dir:
#     acct_key, priv_key, csr, cert
#     ^ all exist, right newer than left
#     cert unexpired
# }
#
# if(not is_sane()) {
#    wipe the files
#    gen acct_key
#    gen priv_key
#    gen csr
#    gen self-signed cert, 180d
# }

# wle-sign.py <id> <names> <mode> <service>
# ------------------------------
# if(cert expires < 31d || cert self-signed) {
#    if mode == integrated:
#       start service if not running
#       acme-tiny to create real cert
#       reload service
#    if mode == standalone:
#       stop service
#       configure->start micro-acme-http-server
#       acme-tiny to create real cert
#       stop micro-acme-http-server
#       start service
# }

# puppetization:
# exec wle-setup.py before Service[webserver]
# exec wle-sign.py after Service[webserver]
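
The two checks in the notes above (the file ordering part of is_sane, and the "expires < 31d or self-signed" test in wle-sign.py) could be sketched in Python as below. The function names are made up for this sketch, and extracting the expiry date and the self-signed flag from the cert is assumed to happen elsewhere (e.g. via openssl x509); the "cert unexpired" part of is_sane is not covered here.

```python
import os
from datetime import datetime, timedelta

# Left-to-right order from the notes: each file should be at least as
# new as the one before it.
FILE_ORDER = ("acct_key", "priv_key", "csr", "cert")


def files_in_order(cert_dir):
    """True if all four files exist and mtimes are non-decreasing."""
    paths = [os.path.join(cert_dir, name) for name in FILE_ORDER]
    if not all(os.path.isfile(p) for p in paths):
        return False
    mtimes = [os.path.getmtime(p) for p in paths]
    return all(a <= b for a, b in zip(mtimes, mtimes[1:]))


def needs_signing(not_after, self_signed, now,
                  threshold=timedelta(days=31)):
    """True if the cert is self-signed or expires within threshold.

    not_after and now are datetime objects in the same timezone
    convention; how not_after is read from the cert is out of scope.
    """
    return self_signed or (not_after - now) < threshold
```

With a 90-day LE cert and a 31-day threshold, a daily cron of this check works out to a renewal roughly every 60 days, matching the "every 60d or so" in the notes.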

https://github.com/aloyr/acme-tiny-automator

"automates deployment of letsencrypt certs using acme-tiny library

This relies on the acme-tiny library, does not need to be run as root, does not install anything on the server, other than the acme-tiny.py file."

It's similar in scope to what's going on in my paste above, but it's still missing a few bits we'll need on the webserver config chicken/egg thing even for the simplest cases. I imagine the scenario it was written for was to expect an OS-default nginx/apache config before the script executes for the first time, and then apply custom (w/ SSL, and static webroot/.well-known bits) config after it runs the first time.

Change 283761 had a related patch set uploaded (by Dzahn):
create sslcert::letsencrypt::simple, install acme-tiny

https://gerrit.wikimedia.org/r/283761

gerritbot added a project: Patch-For-Review. Apr 16 2016, 12:06 AM

Package/backport acme-tiny and install via puppet (it's so tiny we could even just throw the file in puppet for a first draft).

yep, https://gerrit.wikimedia.org/r/#/c/283761/1

Change 283763 had a related patch set uploaded (by Dzahn):
add role for hosts with LE certs, add on carbon

https://gerrit.wikimedia.org/r/283763

JanZerebecki subscribed. Apr 16 2016, 8:15 AM

Change 283988 had a related patch set uploaded (by BBlack):
basic acme-setup script acme::init

https://gerrit.wikimedia.org/r/283988

RobH subscribed. Apr 18 2016, 4:27 PM

akosiaris triaged this task as Medium priority. Apr 20 2016, 11:19 AM

BBlack added a parent task: T133167: Get a real (letsencrypt) cert for labtestwikitech.wikimedia.org. Apr 20 2016, 4:39 PM

Change 283763 abandoned by Dzahn:
add role for hosts with LE certs, add on carbon

https://gerrit.wikimedia.org/r/283763

BBlack added a parent task: T133548: Create a secure redirect service for large count of non-canonical / junk domains. Apr 25 2016, 3:23 PM

Change 283761 merged by Dzahn:
create letsencrypt module, install acme-tiny

https://gerrit.wikimedia.org/r/283761

Change 283988 merged by BBlack:
letsencrypt module guts + acme-setup script

https://gerrit.wikimedia.org/r/283988

Change 285416 had a related patch set uploaded (by BBlack):
LE: include rather than require sslcert

https://gerrit.wikimedia.org/r/285416

Change 285416 merged by BBlack:
LE: include rather than require sslcert

https://gerrit.wikimedia.org/r/285416

Change 285419 had a related patch set uploaded (by BBlack):
LE: fix "creates" path on first exec

https://gerrit.wikimedia.org/r/285419

Change 285419 merged by BBlack:
LE: fix "creates" path on first exec

https://gerrit.wikimedia.org/r/285419

Status Update: letsencrypt::cert::integrated seems to work as expected, and is managing 3x LE certs on carbon with automatic provisioning and renewal (no humans are harmed in the making of these simple outputs of math functions). Leaving this ticket open a little longer until we template + test the same on an apache (rather than nginx) integrated/public host example (should be reasonably easy), and test multi-hostname SANs in a real example too.

BBlack added a parent task: T133717: Letsencrypt all the prod things we can - planning. Apr 26 2016, 5:55 PM

Change 285440 had a related patch set uploaded (by BBlack):
LE: add apache config/example

https://gerrit.wikimedia.org/r/285440

Change 285441 had a related patch set uploaded (by BBlack):
ganglia: use LE cert

https://gerrit.wikimedia.org/r/285441

Change 285442 had a related patch set uploaded (by BBlack):
ganglia: remove old cert absent line

https://gerrit.wikimedia.org/r/285442

Change 285440 merged by BBlack:
LE: add apache config/example

https://gerrit.wikimedia.org/r/285440

Change 285572 had a related patch set uploaded (by BBlack):
rt.wm.o: use LE cert

https://gerrit.wikimedia.org/r/285572

Change 285573 had a related patch set uploaded (by BBlack):
rt.wm.o: remove old cert definition

https://gerrit.wikimedia.org/r/285573

Change 285441 abandoned by BBlack:
ganglia: use LE cert

Reason:
using rt for LE apache test instead

https://gerrit.wikimedia.org/r/285441

Change 285442 abandoned by BBlack:
ganglia: remove old cert absent line

Reason:
using rt for LE apache test instead

https://gerrit.wikimedia.org/r/285442

Change 285572 merged by BBlack:
rt.wm.o: use LE cert

https://gerrit.wikimedia.org/r/285572

Converted rt.wm.o, so now we have 1x apache + 1x nginx converted. Next I'm going to switch ubuntu+mirrors (both on carbon) to a single shared SAN cert, to test the SAN bits out a bit.

Change 285573 merged by BBlack:
rt.wm.o: remove old cert definition

https://gerrit.wikimedia.org/r/285573

SAN test worked as well. We'll likely have more refinement and bugfixes to deal with later when we start spreading the usage of this, but it's good enough for now!

Dzahn awarded a token. Apr 27 2016, 5:28 AM

rt-letsencrypt-iceweasel.png (210×469 px, 16 KB)

Dzahn added a subscriber: Mschon. Apr 27 2016, 5:34 AM
