Deploy a scalable service for ACME (LetsEncrypt) certificate management
Closed, Resolved (Public)

Description

  • Core DC redundancy
    • Wildcards + SANs
    • Multiple client hosts for each cert
  • Coding + Puppetization: PoC drafted during Wikimedia Hackathon 2018. Efforts tracked in T194962
  • Deploy in both DCs
  • Live use for one prod cert

Event Timeline

Change 455153 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/software/certcentral@master] Provide support in the API for different certificate save modes

https://gerrit.wikimedia.org/r/455153

Change 455159 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/software/certcentral@master] [WIP] Validate challenges before pushing them to the ACME directory

https://gerrit.wikimedia.org/r/455159

Change 451867 merged by jenkins-bot:
[operations/software/certcentral@master] Refactor certcentral.certificate_management()

https://gerrit.wikimedia.org/r/451867

Change 453124 merged by jenkins-bot:
[operations/software/certcentral@master] Implement different Certificate.save() modes

https://gerrit.wikimedia.org/r/453124

Change 454045 merged by jenkins-bot:
[operations/software/certcentral@master] Certcentral integration tests

https://gerrit.wikimedia.org/r/454045

Change 454794 merged by jenkins-bot:
[operations/software/certcentral@master] Deliver certificates in every save mode

https://gerrit.wikimedia.org/r/454794

Change 454845 merged by jenkins-bot:
[operations/software/certcentral@master] Implement DNS01 challenge support

https://gerrit.wikimedia.org/r/454845

Change 455153 merged by jenkins-bot:
[operations/software/certcentral@master] Provide support in the API for different certificate save modes

https://gerrit.wikimedia.org/r/455153

Change 456110 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/software/certcentral@master] ACMERequests: Remove orders/challenges after a non-recoverable error

https://gerrit.wikimedia.org/r/456110

Change 456644 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/software/certcentral@master] Provide logging

https://gerrit.wikimedia.org/r/456644

I'm CCing Moritz in case he has any advice or other ideas on the below. Basically, what we've got is one backend process doing maintenance tasks (generating private keys, creating an initial self-signed cert to get things running, arranging for the ACME-provided cert, handling renewals, etc.) and one API UWSGI process, which hands the right private keys and certificates out to authorised client hosts. The API sits behind nginx, which does TLS termination and checks that clients' certificates are signed by the Puppet CA.

One of the things that @Vgutierrez and I spoke about briefly over IRC recently was a certcentral system user. It's become clear to me that this will be a problem (my tests so far had the backend running as www-data): the backend process can run as the new certcentral user without trouble, but the private key files it generates are then not readable by the API (still running as www-data). So I think we have a few options (none of which really appeal to me), in no particular order:

  • Set the group of the files to be www-data, chmod the files 640.
  • Put www-data in our certcentral system group, chmod the files 640.
  • Change modules/uwsgi/templates/initscripts/uwsgi.systemd.erb to permit UWSGI to run as a custom user (pass in as param, default to www-data), and make it run as certcentral in our case. The files can be chmod 600. Might have to check how the nginx<->uwsgi socket works in this case.
  • Maybe we're just being paranoid and we can make the backend run as www-data and drop the idea of a certcentral system user entirely. The files can be chmod 600.
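To make the first two options concrete, here is a minimal sketch of what "set the group, chmod 640" could look like from the backend's side. The function names are made up for illustration and are not part of certcentral; in a real deployment the group ID would presumably come from something like `grp.getgrnam("www-data").gr_gid`.

```python
import os
import stat


def restrict_key_file(path: str, gid: int) -> None:
    """Make a private key readable by its owner and one group only.

    Hypothetical helper, not certcentral code.
    """
    # Keep the current owner (-1) and hand group ownership to `gid`.
    os.chown(path, -1, gid)
    # 0o640: owner read/write, group read, no access for others.
    os.chmod(path, 0o640)


def group_can_read(path: str) -> bool:
    """Return True if the group can read the file and others have no access."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return bool(mode & stat.S_IRGRP) and not (mode & stat.S_IRWXO)
```

With the second option (adding www-data to the certcentral group) the same `chmod` applies, and only the group ID passed in changes.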

With the last two options, it does of course mean that anyone who managed to compromise the certcentral API (a client with a cert signed by our Puppet CA, or nginx being evil) could break the system by overwriting important files, or replace our private keys/certs with compromised ones. At that point they could already read the existing private keys, though, so I can't think why they'd bother.

I think the only other thing running as www-data should be nginx, which we already have to fully trust, given that it:

  • is trusted to tell the certcentral API process the client host's identity for authorisation purposes (it handles TLS termination so has to tell us what CN the client's puppet cert had), so could trivially extract the private keys if it was evil
  • will normally ultimately be handling our private keys on client hosts anyway, with maybe a few exceptions (apache for a few misc things, exim for MXes, maybe something obscure somewhere I'm not aware of). We already have very big problems if nginx has a vulnerability at this level.
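The trust placed in nginx is easier to see with a rough sketch of the authorisation step on the API side. Everything here is an assumption for illustration: the `AUTHORISED_HOSTS` mapping, the function names, and the idea that nginx hands the API the client's subject DN as a plain string. It is not certcentral's actual code.

```python
from typing import Dict, Optional, Set

# Hypothetical configuration: certificate name -> client hosts allowed to fetch it.
AUTHORISED_HOSTS: Dict[str, Set[str]] = {
    "example-cert": {"host1.example.org", "host2.example.org"},
}


def extract_cn(subject_dn: str) -> Optional[str]:
    """Pull the CN out of a comma-separated subject DN such as 'CN=host,O=org'.

    Simplified: does not handle escaped commas inside attribute values.
    """
    for part in subject_dn.split(","):
        key, _, value = part.strip().partition("=")
        if key.upper() == "CN" and value:
            return value
    return None


def is_authorised(subject_dn: str, cert_name: str) -> bool:
    """Allow access only if the client's CN is listed for this certificate."""
    cn = extract_cn(subject_dn)
    return cn is not None and cn in AUTHORISED_HOSTS.get(cert_name, set())
```

Since the API only ever sees the string that nginx passes along, a malicious nginx could claim any CN it liked, which is why it has to be fully trusted.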

Change 457378 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/software/certcentral@master] Rename certcentral_api to just api

https://gerrit.wikimedia.org/r/457378

With the two-user approach (certcentral / www-data) we simply stop nginx from writing to /etc/certcentral. We should also consider that certcentral will need permission to spawn the DNS zone update script, and I don't see any reason to let nginx do that as well. IMHO the two-user approach best serves the principle of least privilege.

That makes sense, so we're preferring one of these:

  • Set the group of the files to be www-data, chmod the files 640.
  • Put www-data in our certcentral system group, chmod the files 640.

Change 455159 merged by jenkins-bot:
[operations/software/certcentral@master] Validate challenges before pushing them to the ACME directory

https://gerrit.wikimedia.org/r/455159

Change 456110 merged by jenkins-bot:
[operations/software/certcentral@master] ACMERequests: Remove orders/challenges after a non-recoverable error

https://gerrit.wikimedia.org/r/456110

Change 456644 merged by Vgutierrez:
[operations/software/certcentral@master] Provide logging

https://gerrit.wikimedia.org/r/456644

Change 457485 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/software/certcentral@master] README: provide configuration file examples

https://gerrit.wikimedia.org/r/457485

Change 457378 merged by jenkins-bot:
[operations/software/certcentral@master] Rename certcentral_api to just api

https://gerrit.wikimedia.org/r/457378

Change 457485 merged by jenkins-bot:
[operations/software/certcentral@master] README: provide configuration file examples

https://gerrit.wikimedia.org/r/457485

Got a bunch of patches open that need review, in no particular order:

Other stuff that should happen:

Change 458554 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/software/certcentral@debian] Debian packaging

https://gerrit.wikimedia.org/r/458554

Change 458554 merged by Ema:
[operations/software/certcentral@debian] Debian packaging

https://gerrit.wikimedia.org/r/458554

Change 465636 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/dns@master] Add discovery alias for certcentral

https://gerrit.wikimedia.org/r/465636

Change 465636 abandoned by Vgutierrez:
Add discovery alias for certcentral

https://gerrit.wikimedia.org/r/465636

Mentioned in SAL (#wikimedia-operations) [2018-10-10T15:59:52Z] <vgutierrez> Uploaded certcentral 0.1 to apt.wikimedia.org (stretch) - T199711

Change 468315 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] certcentral: Add first domain for testing in prod

https://gerrit.wikimedia.org/r/468315

Change 468315 merged by Vgutierrez:
[operations/puppet@production] certcentral: Add first domain for testing in prod

https://gerrit.wikimedia.org/r/468315

@Vgutierrez: I'm thinking we should close this and open a new task about improving our certcentral setup to the point where we could talk about using it for bigger things than the current list in the parent task.
We do now have live use for two prod certs (netbox and librenms), which is more than this task description asks for.

I've rearranged the structure of these tasks to be more logical. This task has no more open subtasks, and all the items in the task description are checked off. Work continues on some of the parent tasks :)