
Create and deploy a centralized letsencrypt service
Closed, Resolved · Public

Description

[This is aggregated up from conversations w/ @Krenair + @faidon at the 2018 hackathon]

This is in support of a few different needs for better-automated management and deployment of LetsEncrypt certs, especially to multiple public listener hosts for the same services. Slightly-edited notes from @Krenair:

  • Central LE-service hosts (1 per core DC, manual failover) that are responsible for managing the certs: talking to Let's Encrypt, authenticating cert requests, distribution of certs/privkeys to consuming service endpoint hosts, timely automatic renewal, etc. These will know which servers should have access to which private keys.
  • For key distribution, these would run some HTTPS server that checks that client certificates are valid, signed by the puppetmaster CA, etc., and that the requesting host is authorized for the requested private key.
  • For challenge authentication, we can support one or both of:
    • HTTP Challenge: consuming service endpoints would proxy /.well-known/acme-challenge to the central LE host so it can answer challenges directly.
    • DNS Challenge: we'll write a plugin for gdnsd (and possibly some supporting scripts, depending on design), which will allow the central LE service to push challenge responses to our authdns servers. The most-basic design would be the plugin implementing dynamic TXT records with data pulled from a local file (which it watches on mtime / inotify), and a script which polls for new challenge-responses from the central LE server (or something more-complex that triggers pushes). We should try to design the pieces of this for generic re-use/integration.
    • The HTTP service is easier to initially implement, but requires the HTTP challenge-routing hacks at all endpoints and can't do wildcards. The DNS variant doesn't need the routing hacks and can do wildcards, but has more implementation work to do.
  • The central LE service is probably a daemon written in Python, which we'll open-source and try to make generic enough to be useful to other organizations. The client-authenticating HTTPS service that distributes keys/certs to endpoints should support two APIs (a rough sketch of its authorization check follows below this list):
    • A generic standard file-fetching API, e.g. GET https://foo/certs/asdfxyz/{public|private}.pem .
    • An emulation of the puppet fileserver protocol, so that it's easy to puppetize these pulls like normal "file" resources in puppet, with a distinct server hostname.
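
As a rough illustration of the authorization model for that distribution service (a minimal sketch only, not a settled design: the /certs/<name>/<part>.pem path mirrors the generic API above, while the Flask framing and the idea that a TLS-terminating front end verifies the client certificate against the puppetmaster CA and passes the subject DN in an X-Client-DN header are assumptions):

import yaml
from flask import Flask, abort, request, send_file

app = Flask(__name__)

# Config in the <certlabel> schema described below (CN / SNI / AuthorizedClients).
with open('/etc/certcentral/config.yaml') as f:
    config = yaml.safe_load(f)

@app.route('/certs/<certname>/<part>.pem')
def get_cert_part(certname, part):
    if part not in ('public', 'private') or certname not in config:
        abort(404)
    # Only hosts listed as AuthorizedClients for this cert may fetch its material;
    # the verified client DN is assumed to be injected by the TLS front end.
    client_dn = request.headers.get('X-Client-DN', '')
    if client_dn not in config[certname]['AuthorizedClients']:
        abort(403)
    return send_file('/etc/certcentral/live_certs/{}.{}.pem'.format(certname, part))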

The main configuration of the python daemon probably looks something like this:

<certlabel>:
    CN: <name1>
    SNI: [ <name1>, <name2>, ... ]
    AuthorizedClients: [ <hostname1>, <hostname2>, ... ]

Example config for some known cases:

icinga:
    CN: icinga.wikimedia.org
    SNI:
        - icinga.wikimedia.org
    AuthorizedClients:
        - einsteinium.wikimedia.org
        - tegmen.wikimedia.org

secredir:
    CN: www.wikipedia.com
    SNI:
        - border-wikipedia.de
        - en-wp.com
        - en-wp.org
        - indiawikipedia.com
        - mediawiki.com
        - wikipedia.com
    AuthorizedClients:
        - secredir1001.eqiad.wmnet
        - secredir2001.codfw.wmnet

wikibase:
    CN: www.wikiba.se
    SNI:
        - *.wikiba.se
        - wikiba.se
    AuthorizedClients:
        - cp5001.eqsin.wmnet
        - cp5002.eqsin.wmnet
        - cp5003.eqsin.wmnet
        - cp5004.eqsin.wmnet
        - cp5005.eqsin.wmnet
        - cp5006.eqsin.wmnet
        - cp4027.ulsfo.wmnet
        [...]

In puppet terms, there would be a class/resource which defines such a certificate:

<somewhere that gets applied to tegmen and einsteinium>
letsencrypt::cert { 'icinga':
    CN => 'icinga.wikimedia.org',
    SNI => [ 'icinga.wikimedia.org' ],
}

The definition of letsencrypt::cert would entail creating file resources which pull from the le-service's puppet fileserver emulation to source private keys and signed public certs. Separately, some sort of letsencrypt::server class would collect the list of hosts which have applied each of the defined certs, in order to generate the central configuration file above with a correct list of authorized clients.

Event Timeline

BBlack triaged this task as Medium priority.May 18 2018, 3:55 PM
BBlack created this task.
Restricted Application added a project: Operations. May 18 2018, 3:55 PM
Restricted Application added a subscriber: Aklapper.
Vgutierrez moved this task from Triage to TLS on the Traffic board.May 18 2018, 4:41 PM
Krenair added a comment.EditedMay 19 2018, 1:38 PM

Some of my work on this is being impeded by T195059: Cannot add or update records under DNS zones in Horizon

Krenair added a comment.EditedMay 19 2018, 2:46 PM

https://krenair.hopto.org is running on a labs machine (specifically deployment-secureredirproto.deployment-prep.eqiad.wmflabs) - not under wmflabs.org or alexmonk.uk for various reasons.
There's a central LE service on there which uses acme_tiny to request the cert from LE, and some scripts are used to pull down the public and private parts (from the central service) to a place nginx will read them from. nginx is set up to proxy /.well-known/acme-challenge through to the central service. I've also got a system of fake puppet certs in there that are used to authenticate privileged requests going to the central service.

Some after-thoughts on design issues and such (I haven't looked at any code!):

  • We should look hard for a good abstract ACME library that already exists for python to help with that part.
  • In general there won't be a terribly high level of concurrency in this daemon. We can picture reasonable maximums of ~hundreds of certs (let's assume 1K certs for argument's sake) and renewals that happen on ~2-month clocks. That works out to a long-term average of taking action for one cert renewal every ~1.5h (1,000 certs spread over ~60 days ≈ 1,440 hours, i.e. one renewal per ~1.4h). The real world is likely to be well under that maximum... So we shouldn't expend much complexity budget on concurrency frameworks and such, given the high idleness of the service.
  • The reference earlier to emulating the puppet fileserver protocol means something like this JSON HTTP interface: https://puppet.com/docs/puppet/5.5/http_api/http_file_content.html

Also:

  • We should assume by default we want all certificates to be dual-issued as ECDSA+RSA variants and served to clients in both forms (I think this basically means doing the same work twice over with different private key types; see the sketch after this list).
  • We should look at what attributes we can/should set in the request to get any optional goodies by default, like embedding SCTs for transparency.
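
To make the dual-issuance point concrete, the key/CSR side might look roughly like this (a sketch only; the choice of the python cryptography library, the P-256 curve, the 2048-bit RSA size and the example names are assumptions, not decisions):

from cryptography import x509
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import ec, rsa
from cryptography.x509.oid import NameOID

def make_csr(private_key, cn, sans):
    # Same CN/SAN set regardless of key type; only the signing key differs.
    return (
        x509.CertificateSigningRequestBuilder()
        .subject_name(x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, cn)]))
        .add_extension(x509.SubjectAlternativeName([x509.DNSName(s) for s in sans]),
                       critical=False)
        .sign(private_key, hashes.SHA256())
    )

cn, sans = 'icinga.wikimedia.org', ['icinga.wikimedia.org']
keys = {
    'ec': ec.generate_private_key(ec.SECP256R1()),
    'rsa': rsa.generate_private_key(public_exponent=65537, key_size=2048),
}
# Two CSRs -> two ACME orders -> two certificates served side by side.
csrs = {ktype: make_csr(key, cn, sans) for ktype, key in keys.items()}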

Also, naming/bikeshedding: we should name and implement this as a generic ACME tool rather than something LE-specific, and just make LE the default issuing CA.

Krenair added a subscriber: akosiaris.EditedMay 20 2018, 10:19 PM

I ran into a UWSGI / Python 3 segfault today, with my central service script running under UWSGI and calling acme_tiny - see P7141.
@akosiaris has offered to take a look; I should also have a go at using gunicorn.

Based on talking to damjan in #uwsgi, this might come down to a mismatch between the OpenSSL versions linked by the different binaries - at least five different binaries are listed in that backtrace. Interestingly, UWSGI doesn't run python as a subprocess but instead appears to embed it via the CPython API.

Some of the DNS challenge stuff we'll look at later might benefit from what I put together for T182927 - as our current acme_tiny does HTTP challenge only
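
For reference, the DNS-challenge design from the task description implies some kind of sync script on the authdns side; a very rough sketch follows (the /challenges endpoint, its response shape, the output format consumed by the gdnsd plugin, and the polling interval are all hypothetical):

import json
import os
import time
import urllib.request

CENTRAL = 'https://certcentral.example.org/challenges'  # hypothetical endpoint
OUTFILE = '/var/lib/gdnsd/acme-challenges.txt'           # file the gdnsd plugin would watch

def sync_once():
    with urllib.request.urlopen(CENTRAL, timeout=10) as resp:
        # Assumed shape: {"_acme-challenge.example.org": ["token1", ...], ...}
        challenges = json.load(resp)
    tmp = OUTFILE + '.tmp'
    with open(tmp, 'w') as f:
        for name, tokens in sorted(challenges.items()):
            for token in tokens:
                f.write('{} "{}"\n'.format(name, token))
    os.rename(tmp, OUTFILE)  # atomic replace so the plugin never sees a partial file

if __name__ == '__main__':
    while True:
        sync_once()
        time.sleep(30)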

So if we implement this API, how are we going to point clients at it? Seeing as it won't be a puppetmaster...

> So if we implement this API, how are we going to point clients at it? Seeing as it won't be a puppetmaster...

For a generic integration that doesn't assume puppet, you'd probably have some kind of key/cert-deployment infrastructure on the terminating nodes that knows to fetch e.g. https://acmesvc.example.org/certs/asdfxyz/private.pem to pick up new updates on renewals, etc.
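
For illustration, such a fetcher could be as small as the following (a sketch under the assumption that the service authenticates hosts with their puppet-issued TLS client certificates, as described earlier; all file paths and the example URL are placeholders):

import os
import requests

BASE = 'https://acmesvc.example.org/certs/asdfxyz'
CLIENT_CERT = ('/var/lib/puppet/ssl/certs/myhost.pem',          # placeholder paths
               '/var/lib/puppet/ssl/private_keys/myhost.pem')
PUPPET_CA = '/var/lib/puppet/ssl/certs/ca.pem'

for part, dest, mode in [('public', '/etc/acme/asdfxyz.crt', 0o444),
                         ('private', '/etc/acme/asdfxyz.key', 0o400)]:
    # The service checks the client certificate and only serves key material
    # this host is authorized for.
    r = requests.get('{}/{}.pem'.format(BASE, part),
                     cert=CLIENT_CERT, verify=PUPPET_CA, timeout=10)
    r.raise_for_status()
    with open(dest, 'wb') as f:
        f.write(r.content)
    os.chmod(dest, mode)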

For those using puppet (us, anyways!), we'd have the puppetization handle using the fileserver API. In the puppet-level example at the top, when tegmen includes this snippet in its puppetization...

<somewhere that gets applied to tegmen and einsteinium>
letsencrypt::cert { 'icinga':
    CN => 'icinga.wikimedia.org',
    SNI => [ 'icinga.wikimedia.org' ],
}

... within the puppet manifest definition of letsencrypt::cert, it would result in creating a resource like:

file { '/etc/somewhere/private/icinga/private.pem':
    [...]
    source => 'puppet://acme1001.eqiad.wmnet/acmedata/icinga/private.pem'
}

... which, combined with an appropriate puppet fileserver.conf entry for acmedata, results in the client puppet agent fetching the data by executing an HTTPS GET something like:

GET https://acme1001.eqiad.wmnet/puppet/v3/file_content/acmedata/icinga/private.pem?environment=prod

(plus all the other bits of the protocol so it can e.g. check the content hash and not transfer files that haven't changed, thus not triggering dependency events like restarting affected services, etc.).

Krenair added a comment.EditedMay 23 2018, 8:00 PM

Ah, so the trick is that where we normally use puppet:/// in a file source, between the second and third slashes there can be a hostname of the machine to get the file from. Very interesting. Will need to check how fileserver.conf interacts with that.

Krenair added a comment.EditedMay 24 2018, 11:33 PM

So I've got it serving files to a puppet client successfully. Client just has this:

file { '/etc/centralcerts/testing.public.pem':
    owner => 'root',
    group => 'root',
    mode  => '0644',
    source => 'puppet://deployment-certcentral.deployment-prep.eqiad.wmflabs/acmedata/testing/public.pem'
}

file { '/etc/centralcerts/testing.private.pem':
    owner => 'root',
    group => 'root',
    mode  => '0600',
    source => 'puppet://deployment-certcentral.deployment-prep.eqiad.wmflabs/acmedata/testing/private.pem'
}

No fileserver.conf entries or anything. Meanwhile the server jumps through some extra hoops, like having to serve the file_metadata API as well as file_content, which needs Content-Type: text/pson to trick the client into working for some reason (a rough sketch of this server side follows the irb output below):

$ irb
irb(main):001:0> require 'puppet'
=> true
irb(main):002:0> require 'puppet/network/format_support'
=> false
irb(main):003:0> Puppet::Network::FormatHandler.format_to_canonical_name('application/json')
ArgumentError: No format match the given format name or mime-type (application/json)
 from /usr/lib/ruby/vendor_ruby/puppet/network/format_handler.rb:65:in `format_to_canonical_name'
 from (irb):3
 from /usr/bin/irb:11:in `<main>'
irb(main):004:0> Puppet::Network::FormatHandler.format_to_canonical_name('text/pson')
=> :pson

Edit: going with yaml.dump and Content-Type: text/yaml instead:

irb(main):015:0> Puppet::Network::FormatHandler.formats
=> [:msgpack, :yaml, :s, :binary, :pson, :dot, :console]

Edit: Turns out this API is not supposed to support JSON responses for the puppet version we use, just PSON. But YAML works anyway.
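
Putting those hoops together, the server side of the fileserver emulation ends up looking roughly like this (a sketch only; the metadata field names and values below are assumptions from memory rather than checked against the puppet agent, and input validation is omitted):

import hashlib
import os

import yaml
from flask import Flask, Response, send_file

app = Flask(__name__)
DATADIR = '/etc/certcentral/live_certs'  # placeholder layout: <DATADIR>/<certname>/<fname>

@app.route('/puppet/v3/file_content/acmedata/<certname>/<fname>')
def file_content(certname, fname):
    # No input validation here; a real implementation must restrict certname/fname.
    return send_file(os.path.join(DATADIR, certname, fname))

@app.route('/puppet/v3/file_metadata/acmedata/<certname>/<fname>')
def file_metadata(certname, fname):
    path = os.path.join(DATADIR, certname, fname)
    with open(path, 'rb') as f:
        digest = hashlib.md5(f.read()).hexdigest()
    metadata = {
        # Assumed field set -- verify against what the puppet agent actually expects.
        'path': path,
        'relative_path': None,
        'links': 'manage',
        'owner': 0,
        'group': 0,
        'mode': 0o400,
        'type': 'file',
        'destination': None,
        'checksum': {'type': 'md5', 'value': '{md5}' + digest},
    }
    return Response(yaml.dump(metadata), mimetype='text/yaml')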

Anyway, as part of my initial code I made the "oh, it's not issued yet, let's use a self-signed cert so we can get a web server up and start responding to challenges" code on the client end - obviously this is a problem for using the puppet file protocol. Is it okay to have the central API just send a self-signed cert until it has a proper one? How should we handle this case?

> Separately, some sort of letsencrypt::server class would collect the list of hosts which have applied each of the defined certs, in order to generate the central configuration file above with a correct list of authorized clients.

This is looking tricky as deployment-prep appears to have reverted back to lacking puppetdb while I wasn't looking.

Random upstream problem I noticed while browsing: https://tickets.puppetlabs.com/browse/PUP-8890

> Anyway, as part of my initial code I made the "oh, it's not issued yet, let's use a self-signed cert so we can get a web server up and start responding to challenges" code on the client end - obviously this is a problem for using the puppet file protocol. Is it okay to have the central API just send a self-signed cert until it has a proper one? How should we handle this case?

I think eventually we'll switch to DNS-based challenges and it won't be a big issue. Puppetization on the client hosts will just fail until the real cert is ready. When the domain is initially configured in puppet and that config reaches the certcentral host, it can get the real cert issued without any involvement from the client host(s), and then as soon as it's ready the client hosts' puppetization will function correctly and download it.

The routed-http-challenge variant is trickier with this kind of setup, if we want to support that option. There's a real chicken-and-egg problem where the certcentral host doesn't know whether the client host (or in multi-host cases, all client hosts) has yet pulled any (self-signed?) cert and/or configured its HTTP server for routing the challenges. The best design I can think of so far would be to try to check that state via self-test of the challenge routing. Basically this means a couple of important functional rules for the HTTP challenge mode:

  1. Anytime the certcentral stuff lacks any existing good certificate (non-self-signed, unexpired) for a probably-newly-configured cert, it should first immediately generate a self-signed placeholder to serve up to end clients, before anything else. This avoids the chicken-and-egg where the end clients would continuously fail puppetization and/or fail challenge routing due to an expired/bad cert.
  2. Anytime the certcentral stuff wants to do an ACME challenge, it should first poll the /.well-known ACME URI routing through all the authorized clients with some reasonable timeout (if the routing works, it routes back to itself, so it's easy to validate with a simple check). If any of them fail, abort the checks, delay, and retry a little later until they all work (we're probably waiting on apache/nginx puppetization of the routing itself and/or the pull of the initial self-signed cert to make it work). Once the routing validation succeeds on the whole list of clients, then do a real challenge against the ACME provider expecting success. (Arguably, we don't need 100% success here, so that we can handle cases where one or a few configured end hosts are dead due to hardware fault or whatever. Maybe make the success threshold configurable as a base value and per-cert?) A small sketch of such a routing self-test follows below.
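
A routing self-test along those lines might look something like this (a minimal sketch; the local challenge directory, the probe-token scheme and the all-or-nothing success threshold are assumptions):

import os
import secrets
import requests

CHALLENGE_DIR = '/etc/certcentral/http_challenges'  # served locally by the central host

def challenge_routing_ok(cert_cn, authorized_clients, timeout=5):
    # Drop a probe token where the central host serves its own ACME challenges,
    # then fetch it via each authorized client so the request loops back through
    # that client's /.well-known/acme-challenge proxying.
    token = 'routing-selftest-' + secrets.token_hex(8)
    probe = os.path.join(CHALLENGE_DIR, token)
    with open(probe, 'w') as f:
        f.write(token)
    try:
        for host in authorized_clients:
            try:
                # Talk to the client host directly, but send the cert's name as the
                # Host header so its vhost/proxy rules for the ACME path are exercised.
                url = 'http://{}/.well-known/acme-challenge/{}'.format(host, token)
                r = requests.get(url, headers={'Host': cert_cn}, timeout=timeout)
                if r.status_code != 200 or r.text != token:
                    return False
            except requests.RequestException:
                return False
        return True  # per the discussion, this threshold could be made configurable
    finally:
        os.remove(probe)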

> Random upstream problem I noticed while browsing: https://tickets.puppetlabs.com/browse/PUP-8890

... which they've now moved into a different project in JIRA that isn't viewable. meh.

Krenair added a comment.EditedMay 25 2018, 6:29 PM

I'm going to find out what's going on with PuppetDB in T187736 (looks like Giuseppe was involved, so I'm asking him); in the meantime my patch for it looked like this (completely untested, and it's been a long time since I've messed with exported resources):
(I know this breaks all sorts of puppet style conventions and things, it's very much a WIP)

commit 717a2ebd1e8cfed6e9d00a6f964fdb84fd6fbd21
Author: Root <root@deployment-puppetmaster02.deployment-prep.eqiad.wmflabs>
Date:   Fri May 25 00:48:08 2018 +0000

    Let's Encrypt central service with exported resources

diff --git a/modules/role/files/secureredir/central.py b/modules/role/files/secureredir/central.py
index bec4bef0e2..b312ffa830 100644
--- a/modules/role/files/secureredir/central.py
+++ b/modules/role/files/secureredir/central.py
@@ -3,11 +3,21 @@ sys.path.append('/usr/local/sbin/') # TODO: stab stab stab
 import acme_tiny
 
 config = None
+authorised_hosts = None
 def sighup_handler(*args):
     global config
+    global authorised_hosts
+
     with open('/etc/certcentral/config.yaml') as f:
         config = yaml.safe_load(f)
 
+    temp_authorised_hosts = collections.defaultdict(list)
+    for fname in os.listdir('/etc/certcentral/conf.d'):
+        with open('/etc/certcentral/conf.d/{}'.format(fname)) as f:
+            d = yaml.safe_load(f)
+            temp_authorised_hosts[d['certname']].append(d['host'])
+    authorised_hosts = temp_authorised_hosts
+
 signal.signal(signal.SIGHUP, sighup_handler)
 sighup_handler()
 
@@ -89,7 +99,7 @@ def get_certs(certname=None, part=None, api=None):
     if certname not in config:
         return 'no such certname', 404
 
-    if client_dn not in config[certname]['AuthorizedClients']:
+    if client_dn not in authorised_hosts[certname]:
         return 'gtfo', 403
 
     fpath = '/etc/certcentral/live_certs/{}.{}'.format(certname, part)
diff --git a/modules/role/manifests/secureredir/central.pp b/modules/role/manifests/secureredir/central.pp
index 3273a1750b..23e69aaa45 100644
--- a/modules/role/manifests/secureredir/central.pp
+++ b/modules/role/manifests/secureredir/central.pp
@@ -28,6 +28,15 @@ class role::secureredir::central {
         source => 'puppet:///modules/role/secureredir/nginx.conf'
     }
 
+    file { '/etc/certcentral/conf.d':
+        owner  => 'root',
+        group  => 'root',
+        mode   => '0600',
+        ensure => directory
+    }
+
+    Role::Secureredir::Cert_authorisedhost <<||>>
+
     file { '/etc/certcentral/config.yaml':
         owner => 'www-data',
         group => 'www-data',
@@ -35,10 +44,10 @@ class role::secureredir::central {
         content => ordered_yaml({
             testing => {
                 'CN' => 'krenair.hopto.org',
-                'SNI' => ['krenair.hopto.org'],
-                'AuthorizedClients' => [
-                    'deployment-certcentral-testclient.deployment-prep.eqiad.wmflabs'
-                ]
+                'SNI' => ['krenair.hopto.org'] #,
+#                'AuthorizedClients' => [
+#                    'deployment-certcentral-testclient.deployment-prep.eqiad.wmflabs'
+#                ]
             }
         })
     }
diff --git a/modules/role/manifests/secureredir/cert.pp b/modules/role/manifests/secureredir/cert.pp
index f13d2599a5..ed8ad2f82d 100644
--- a/modules/role/manifests/secureredir/cert.pp
+++ b/modules/role/manifests/secureredir/cert.pp
@@ -1,6 +1,11 @@
 define role::secureredir::cert {
     include ::role::secureredir::cert_prereqs
 
+    @@role::secureredir::cert_authorisedhost { "${title}__${::fqdn}":
+        certname => $title,
+        hostname => $::fqdn
+    }
+
     file { "/etc/centralcerts/${title}.public.pem":
         owner  => 'root',
         group  => 'root',
diff --git a/modules/role/manifests/secureredir/cert_authorisedhost.pp b/modules/role/manifests/secureredir/cert_authorisedhost.pp
new file mode 100644
index 0000000000..1ff4bf045a
--- /dev/null
+++ b/modules/role/manifests/secureredir/cert_authorisedhost.pp
@@ -0,0 +1,14 @@
+define role::secureredir::cert_authorisedhost(
+    $certname = undef,
+    $hostname = undef
+) {
+    file { "/etc/certcentral/conf.d/authorisedhost_${title}.yaml":
+        owner => 'root',
+        group => 'root',
+        mode  => '0444',
+        content => ordered_yaml({
+            'hostname' => $hostname,
+            'certname' => $certname
+        })
+    }
+}
Krenair added a comment.EditedMay 26 2018, 1:29 PM

Getting that working was almost suspiciously easy, for something puppet-related...
Edit: Yep, spoke too soon.
Edit 2: Most problems were caused by my untested code; looks good now. Am going to want to officially add PuppetDB support to role::puppetmaster::standalone, rather than just live-hacking it in.

Change 435631 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] Allow PuppetDB use on standalone puppetmasters

https://gerrit.wikimedia.org/r/435631

Change 437057 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] Tighten Puppet DB access control - check client certificates

https://gerrit.wikimedia.org/r/437057

I looked at the puppetmaster apache config and noticed this line:

# If Apache complains about invalid signatures on the CRL, you can try disabling
# CRL checking by commenting the next line, but this is not recommended.
SSLCARevocationPath     /var/lib/puppet/server/ssl/crl

The nginx equivalent would be ssl_crl, I think.
Other services that verify against the puppet CA don't check this (and right now probably don't have a copy of the CRL anyway). Shouldn't they?

Change 437640 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] Tighten Puppet DB access control - check client certificates

https://gerrit.wikimedia.org/r/437640

Change 437057 merged by Alexandros Kosiaris:
[operations/puppet@production] Prep to tighten PuppetDB access control - log client certificate details

https://gerrit.wikimedia.org/r/437057

Change 441991 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] [WIP] Central certificates service

https://gerrit.wikimedia.org/r/441991

Vvjjkkii renamed this task from Create and deploy a centralized letsencrypt service to prcaaaaaaa.Jul 1 2018, 1:09 AM
Vvjjkkii removed Krenair as the assignee of this task.
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed subscribers: gerritbot, Aklapper.
Krenair renamed this task from prcaaaaaaa to Create and deploy a centralized letsencrypt service.Jul 1 2018, 2:18 AM
Krenair claimed this task.
Krenair lowered the priority of this task from High to Medium.
Krenair updated the task description. (Show Details)
Krenair added subscribers: GerritBot, Aklapper.

Per @Vgutierrez I have started developing this in a separate repository, operations/software/certcentral.git

Change 435631 merged by Bstorm:
[operations/puppet@production] Allow PuppetDB use on standalone puppetmasters

https://gerrit.wikimedia.org/r/435631

Ahm, afaict this is very different and I am likely very ignorant here... buuuuut just in case you don't know about cergen, it has abstractions for CAs and can handle certificate generation via CSRs. I don't know much about Letsencrypt or ACME, but from https://letsencrypt.org/how-it-works/ it looks like at least the certificate issuing part is familiar.

It looks like that has some code involving interacting with X509 certs, but not ACME APIs or the Puppet fileserver API. It seems to have something to do with signing certs trusted by the private puppet CA.

Yeah, it's mostly for new certificate generation from CAs. Puppet CA is optional; a Letsencrypt (or ACME?) Certificate Signer class could be implemented. Not saying we should! Just saying it should be considered! I really don't have a lot of knowledge about ACME here, and the use cases of cergen vs this case might be far enough apart that it's not worth combining them.

> Just saying it should be considered!

Maybe if this had been suggested 3-5 months ago this could be considered. To me it sounds like the only thing in common anyway is the fact that both handle X509 certs, but I haven't looked particularly deeply into it.

Krenair moved this task from Backlog to Goals/tracking on the Acme-chief board.Oct 13 2018, 5:01 PM

Change 467968 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[labs/private@master] secret: Add authdns-certcentral dummy SSH key

https://gerrit.wikimedia.org/r/467968

Change 467968 merged by Vgutierrez:
[labs/private@master] secret: Add authdns-certcentral dummy SSH key

https://gerrit.wikimedia.org/r/467968

Change 441991 merged by Vgutierrez:
[operations/puppet@production] Central certificates service

https://gerrit.wikimedia.org/r/441991

Change 467997 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] certcentral: Add sslcert::dhparam requirement

https://gerrit.wikimedia.org/r/467997

Change 467997 merged by Vgutierrez:
[operations/puppet@production] certcentral: Add sslcert::dhparam requirement

https://gerrit.wikimedia.org/r/467997

Change 459809 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] Certcentral-authdns integration

https://gerrit.wikimedia.org/r/459809

Change 459809 merged by Vgutierrez:
[operations/puppet@production] Certcentral-authdns integration

https://gerrit.wikimedia.org/r/459809

@Vgutierrez: are we done with this task?

Krenair closed this task as Resolved.Nov 22 2018, 11:53 PM

I'm just boldly marking this as resolved but feel free to revert if you disagree