⚓ T199711 Deploy a scalable service for ACME (LetsEncrypt) certificate management

Subject	Repo	Branch	Lines +/-
certcentral: Add first domain for testing in prod	operations/puppet	production	+7 -1
Add discovery alias for certcentral	operations/dns	master	+1 -0
Debian packaging	operations/software/certcentral	debian	+225 -0
README: provide configuration file examples	operations/software/certcentral	master	+21 -0
Rename certcentral_api to just api	operations/software/certcentral	master	+4 -4
Provide logging	operations/software/certcentral	master	+71 -12
ACMERequests: Remove orders/challenges after a non-recoverable error	operations/software/certcentral	master	+9 -2
Validate challenges before pushing them to the ACME directory	operations/software/certcentral	master	+372 -45
Implement DNS01 challenge support	operations/software/certcentral	master	+184 -24
Provide support in the API for different certificate save modes	operations/software/certcentral	master	+8 -6
Deliver certificates in every save mode	operations/software/certcentral	master	+64 -12
Certcentral integration tests	operations/software/certcentral	master	+454 -139
Refactor certcentral.certificate_management()	operations/software/certcentral	master	+648 -89
Implement different Certificate.save() modes	operations/software/certcentral	master	+129 -6

Status	Assigned	Task
Invalid	None	T108946 [Epic] Improve the development infrastructure
Declined	None	T99531 [Task] move wikiba.se webhosting to wikimedia cluster
Resolved	• MasinAlDujailiWMDE	T155359 wikiba.se should use HTTPS
Resolved	Vgutierrez	T207050 Migrate most standard public TLS certificates to CertCentral issuance
Resolved	None	T199711 Deploy a scalable service for ACME (LetsEncrypt) certificate management
Resolved	Krenair	T194962 Create and deploy a centralized letsencrypt service
Resolved	BBlack	T194965 gdnsd plugin support for ACME DNS challenges
Resolved	MarcoAurelio	T198541 Set up CI for new repo operations/software/certcentral.git
Resolved	Krenair	T153577 Make standalone puppetmasters optionally use PuppetDB
Resolved	scfc	T154104 role::puppetmaster::puppetdb depends on Ganglia and cannot be used in Labs
Resolved	Vgutierrez	T207476 Create production LE accounts
Resolved	Vgutierrez	T199717 Pick up a suitable ACME library for certcentral
Resolved	Vgutierrez	T200405 Provide a CI container with pebble
Resolved	Vgutierrez	T203422 certcentral: phantom test failure around challenge success
Resolved	Vgutierrez	T203678 certcentral: Make configurable the cmd executed to perform a DNS zone update
Resolved	Vgutierrez	T206308 Create VMs for certcentral hosts
Resolved	Vgutierrez	T206461 Provide a Let's Encrypt ACME v2 staging environment account

Change 455153 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/software/certcentral@master] Provide support in the API for different certificate save modes

https://gerrit.wikimedia.org/r/455153

Vgutierrez mentioned this in rOSCC82a4830f37ca: Provide support in the API for different certificate save modes. Aug 24 2018, 12:23 PM

Change 455159 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/software/certcentral@master] [WIP] Validate challenges before pushing them to the ACME directory

https://gerrit.wikimedia.org/r/455159

Vgutierrez mentioned this in rOSCC7454520f06a4: [WIP] Validate challenges before pushing them to the ACME directory. Aug 24 2018, 1:11 PM

Vgutierrez mentioned this in rOSCC378c55628cfe: [WIP] Validate challenges before pushing them to the ACME directory. Aug 24 2018, 1:20 PM

Change 451867 merged by jenkins-bot:
[operations/software/certcentral@master] Refactor certcentral.certificate_management()

https://gerrit.wikimedia.org/r/451867

Change 453124 merged by jenkins-bot:
[operations/software/certcentral@master] Implement different Certificate.save() modes

https://gerrit.wikimedia.org/r/453124

Vgutierrez mentioned this in rOSCC6376af4e3894: Certcentral integration tests. Aug 27 2018, 8:20 AM

Vgutierrez mentioned this in rOSCCfba3e5217001: Deliver certificates in every save mode.

Vgutierrez mentioned this in rOSCC048b9599eca1: Implement DNS01 challenge support.

Vgutierrez mentioned this in rOSCC5ed02e7c42e4: Provide support in the API for different certificate save modes.

Vgutierrez mentioned this in rOSCC4c2c0fcdb604: [WIP] Validate challenges before pushing them to the ACME directory.

Vgutierrez mentioned this in rOSCC1da32e53e02b: Implement DNS01 challenge support. Aug 27 2018, 8:34 AM

Vgutierrez mentioned this in rOSCC8ad8b31be27b: Provide support in the API for different certificate save modes.

Vgutierrez mentioned this in rOSCCbebddae5e0d2: [WIP] Validate challenges before pushing them to the ACME directory.

Vgutierrez mentioned this in rOSCC51ce6249bef4: Deliver certificates in every save mode. Aug 27 2018, 4:52 PM

Vgutierrez mentioned this in rOSCCf2c066a7da71: Implement DNS01 challenge support.

Vgutierrez mentioned this in rOSCC29232c0261f6: Provide support in the API for different certificate save modes.

Vgutierrez mentioned this in rOSCCee4810c71cae: [WIP] Validate challenges before pushing them to the ACME directory.

Change 454045 merged by jenkins-bot:
[operations/software/certcentral@master] Certcentral integration tests

https://gerrit.wikimedia.org/r/454045

Change 454794 merged by jenkins-bot:
[operations/software/certcentral@master] Deliver certificates in every save mode

https://gerrit.wikimedia.org/r/454794

Change 454845 merged by jenkins-bot:
[operations/software/certcentral@master] Implement DNS01 challenge support

https://gerrit.wikimedia.org/r/454845

Change 455153 merged by jenkins-bot:
[operations/software/certcentral@master] Provide support in the API for different certificate save modes

https://gerrit.wikimedia.org/r/455153

Vgutierrez mentioned this in rOSCCdaf9f85006e5: [WIP] Validate challenges before pushing them to the ACME directory. Aug 28 2018, 4:06 PM

Vgutierrez mentioned this in rOSCC7fef1c91425d: [WIP] Validate challenges before pushing them to the ACME directory. Aug 29 2018, 9:13 AM

Vgutierrez mentioned this in rOSCC2b0b271b63f2: [WIP] Validate challenges before pushing them to the ACME directory. Aug 29 2018, 9:21 AM

Vgutierrez mentioned this in rOSCCf82e6f854aac: Validate challenges before pushing them to the ACME directory. Aug 29 2018, 9:53 AM

Change 456110 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/software/certcentral@master] ACMERequests: Remove orders/challenges after a non-recoverable error

https://gerrit.wikimedia.org/r/456110

Vgutierrez mentioned this in rOSCC3afdab92b028: ACMERequests: Remove orders/challenges after a non-recoverable error. Aug 29 2018, 10:23 AM

Vgutierrez mentioned this in rOSCC38a90ce6d15b: ACMERequests: Remove orders/challenges after a non-recoverable error.

Change 456644 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/software/certcentral@master] Provide logging

https://gerrit.wikimedia.org/r/456644

Vgutierrez mentioned this in rOSCCeae9ebbf0807: Validate challenges before pushing them to the ACME directory. Aug 31 2018, 2:45 PM

Vgutierrez mentioned this in rOSCC6dd65ac6ed33: ACMERequests: Remove orders/challenges after a non-recoverable error.

Vgutierrez mentioned this in rOSCC61b53444623e: Provide logging.

Vgutierrez mentioned this in rOSCC97437593ca2d: Validate challenges before pushing them to the ACME directory. Aug 31 2018, 2:48 PM

Vgutierrez mentioned this in rOSCCfdfab3218e72: ACMERequests: Remove orders/challenges after a non-recoverable error.

Vgutierrez mentioned this in rOSCCafef4e57ca06: Provide logging.

Vgutierrez mentioned this in rOSCC394912e750d8: Provide logging.

Krenair mentioned this in rOSCCc15e687e746e: Validate challenges before pushing them to the ACME directory. Aug 31 2018, 3:35 PM

Krenair mentioned this in rOSCC7a10207241fb: ACMERequests: Remove orders/challenges after a non-recoverable error.

Krenair mentioned this in rOSCC94f019a1811f: Provide logging.

I'm CCing Moritz in case he has any advice or other ideas on the below. Basically, what we've got is one backend process doing maintenance tasks (generating private keys, issuing an initial self-signed cert to get things running, arranging for an ACME-provided cert, handling renewals, etc.) and one uWSGI API process (which hands the right private keys and certificates out to authorised client hosts) that sits behind nginx (which does TLS termination and checks that clients' certificates are signed by the Puppet CA).
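
As a rough illustration of the backend's maintenance tasks listed above, here is a minimal Python sketch of how such a process might pick the next step for a single certificate. The `Action` names, the `next_action` helper, and the 30-day renewal window are assumptions made up for this example; they are not certcentral's actual code or defaults.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class Action(Enum):
    """Hypothetical maintenance steps, named after the tasks in the comment."""
    ISSUE_SELF_SIGNED = "issue_self_signed"
    REQUEST_ACME_CERT = "request_acme_cert"
    RENEW = "renew"
    NOTHING = "nothing"


def next_action(has_key_and_cert: bool,
                is_self_signed: bool,
                not_after: Optional[datetime],
                now: datetime,
                renew_before: timedelta = timedelta(days=30)) -> Action:
    """Decide what the backend should do next for one certificate.

    renew_before is an assumed threshold, not a documented default.
    """
    if not has_key_and_cert:
        # Nothing on disk yet: generate a key and a placeholder cert
        # so that client hosts have something to start with.
        return Action.ISSUE_SELF_SIGNED
    if is_self_signed:
        # Placeholder in place: arrange for the ACME-provided cert.
        return Action.REQUEST_ACME_CERT
    if not_after is not None and not_after - now <= renew_before:
        # ACME cert is close to expiry: renew it.
        return Action.RENEW
    return Action.NOTHING
```

The API process would only ever read the results of these steps, which is what makes the file ownership question below matter.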

One of the things that @Vgutierrez and I spoke about briefly over IRC recently was a certcentral system user. It's become clear to me that this will be a problem (my tests so far had the backend running as www-data): the backend process can be set to run as the new certcentral user without trouble, but then the private key files it generates are not readable by the API (still running as www-data). So I think we have a few options (none of which really appeals to me), in no particular order:

- Set the group of the files to be www-data, chmod the files 640.
- Put www-data in our certcentral system group, chmod the files 640.
- Change modules/uwsgi/templates/initscripts/uwsgi.systemd.erb to permit uWSGI to run as a custom user (passed in as a param, defaulting to www-data), and make it run as certcentral in our case. The files can be chmod 600. We might have to check how the nginx<->uwsgi socket works in this case.
- Maybe we're just being paranoid, and we can make the backend run as www-data and drop the idea of a certcentral system user entirely. The files can be chmod 600.
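
For the first two options, the backend would have to write each private key with group read access for whichever group the API can use. A minimal Python sketch of that, assuming a hypothetical `write_private_key` helper (not part of certcentral) and a numeric group ID supplied by the caller:

```python
import os


def write_private_key(path: str, pem: bytes, gid: int, mode: int = 0o640) -> None:
    """Write a key file, then set its group and final permissions.

    Hypothetical helper for illustration. The file is created 0600 so it
    is never readable by group or others before ownership is adjusted.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "wb") as key_file:
        key_file.write(pem)
        # -1 leaves the owning user unchanged; only the group is set.
        os.fchown(key_file.fileno(), -1, gid)
        # fchmod is not affected by the umask, so the result is exactly `mode`.
        os.fchmod(key_file.fileno(), mode)
```

The two options differ only in which group is passed in: the www-data group for the first, the certcentral group (with www-data added as a member) for the second. The group ID could be looked up with `grp.getgrnam(name).gr_gid`.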

With the last two options, it does of course mean that if a client (one that had a cert signed by our Puppet CA, or nginx itself being evil) managed to compromise the certcentral API, they could break the system by overwriting important files, or replace our private keys/certs with compromised ones. At that point they could already read the existing private keys, though, so I can't think why they'd bother.

I think the only other thing running as www-data should be nginx, which we already have to fully trust, given that it:

- is trusted to tell the certcentral API process the client host's identity for authorisation purposes (it handles TLS termination, so it has to tell us what CN the client's Puppet cert had), and so could trivially extract the private keys if it were evil
- will normally end up handling our private keys on client hosts anyway, with maybe a few exceptions (apache for a few misc things, exim for MXes, maybe something obscure somewhere that I'm not aware of). We already have very big problems if nginx has a vulnerability at this level.
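
To make the trust relationship concrete: the API has no way to verify the client's identity itself and simply acts on the subject that nginx reports. A minimal Python sketch of such a check, assuming the subject arrives as a DN string (nginx exposes one as `$ssl_client_s_dn`; how certcentral actually receives it is not stated in this thread). The helper names, the ACL shape, and the hostnames are hypothetical.

```python
from typing import Mapping, Optional, Set


def cn_from_dn(subject_dn: str) -> Optional[str]:
    """Extract the CN from a DN such as 'CN=host.example.org,O=Example'.

    Simplified for illustration: escaped commas inside attribute
    values are not handled.
    """
    for part in subject_dn.split(","):
        key, sep, value = part.strip().partition("=")
        if sep and key.upper() == "CN":
            return value
    return None


def is_authorised(subject_dn: str, cert_name: str,
                  acl: Mapping[str, Set[str]]) -> bool:
    """Allow access only if the reported CN is listed for this certificate."""
    cn = cn_from_dn(subject_dn)
    return cn is not None and cn in acl.get(cert_name, set())


# Hypothetical ACL: certificate name -> hosts allowed to fetch it.
acl = {"netbox": {"host1.example.org"}}
allowed = is_authorised("CN=host1.example.org", "netbox", acl)
denied = is_authorised("CN=host2.example.org", "netbox", acl)
```

Whatever string nginx passes in decides the outcome, which is why a malicious nginx could obtain any key regardless of file permissions.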

Change 457378 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/software/certcentral@master] Rename certcentral_api to just api

https://gerrit.wikimedia.org/r/457378

Vgutierrez mentioned this in rOSCCc7cc20351408: Rename certcentral_api to just api. Sep 3 2018, 8:57 AM

With the two-user approach (certcentral / www-data) we simply stop nginx from writing to /etc/certcentral. We should also consider that certcentral will need permission to spawn the DNS zone update script, and I don't see any reason to let nginx do that as well. IMHO the two-user approach best serves the principle of least privilege.
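
For context on the zone update script, this is roughly what spawning a configurable command from the backend could look like in Python. The `run_zone_update` helper and its timeout are assumptions for illustration, not certcentral's implementation; the command itself would come from configuration.

```python
import subprocess
from typing import Sequence


def run_zone_update(cmd: Sequence[str], timeout: float = 60.0) -> bool:
    """Run the configured DNS zone update command; True on exit status 0.

    Hypothetical helper. The command is passed as an argument list, so
    no shell is involved and nothing in it is subject to shell expansion.
    """
    try:
        result = subprocess.run(list(cmd), timeout=timeout, check=False)
    except (OSError, subprocess.TimeoutExpired):
        # Command missing, not executable, or it ran for too long.
        return False
    return result.returncode == 0
```

Only the user running the backend needs to be able to execute that command, which fits the two-user split: www-data never has to.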

That makes sense, so we're preferring one of these:

- Set the group of the files to be www-data, chmod the files 640.
- Put www-data in our certcentral system group, chmod the files 640.

Change 455159 merged by jenkins-bot:
[operations/software/certcentral@master] Validate challenges before pushing them to the ACME directory

https://gerrit.wikimedia.org/r/455159

Change 456110 merged by jenkins-bot:
[operations/software/certcentral@master] ACMERequests: Remove orders/challenges after a non-recoverable error

https://gerrit.wikimedia.org/r/456110

Change 456644 merged by Vgutierrez:
[operations/software/certcentral@master] Provide logging

https://gerrit.wikimedia.org/r/456644

Change 457485 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/software/certcentral@master] README: provide configuration file examples

https://gerrit.wikimedia.org/r/457485

Vgutierrez mentioned this in rOSCCe30e154ce48b: README: provide configuration file examples. Sep 3 2018, 3:28 PM

Krenair mentioned this in rOSCC235ed3c0afd3: Rename certcentral_api to just api. Sep 4 2018, 3:41 PM

Krenair mentioned this in rOSCC1eb72c156ab5: README: provide configuration file examples.

Krenair closed subtask T203422: certcentral: phantom test failure around challenge success as Resolved. Sep 4 2018, 4:02 PM

Vgutierrez mentioned this in rOSCCc51fce4ef16f: Rename certcentral_api to just api. Sep 6 2018, 8:32 AM

Vgutierrez mentioned this in rOSCCd76406d7cd6a: README: provide configuration file examples. Sep 6 2018, 8:38 AM

Krenair mentioned this in rOSCC321e035962d8: Rename certcentral_api to just api. Sep 6 2018, 5:04 PM

Change 457378 merged by jenkins-bot:
[operations/software/certcentral@master] Rename certcentral_api to just api

https://gerrit.wikimedia.org/r/457378

Change 457485 merged by jenkins-bot:
[operations/software/certcentral@master] README: provide configuration file examples

https://gerrit.wikimedia.org/r/457485

Krenair mentioned this in rOSCC069f995a7675: README: provide configuration file examples. Sep 6 2018, 5:08 PM

Krenair closed subtask T203678: certcentral: Make configurable the cmd executed to perform a DNS zone update as Resolved. Sep 7 2018, 4:34 PM

Dzahn added a subtask: T155359: wikiba.se should use HTTPS. Sep 10 2018, 8:07 PM

Krenair mentioned this in T155359: wikiba.se should use HTTPS. Sep 10 2018, 8:40 PM

Krenair mentioned this in T204013: Horizon Designate dashboard not allowing creation of NS records. Sep 11 2018, 1:54 AM

Krenair added a parent task: T204994: Integrate certspotter with certcentral to avoid certspotter notifying us on legitimate certs generated by our certcentral boxes. Sep 20 2018, 6:02 PM

Krenair mentioned this in T204997: certcentral: delay deployment of renewed certs to wait out skewed client clocks. Sep 20 2018, 6:11 PM

Krenair added a parent task: T204997: certcentral: delay deployment of renewed certs to wait out skewed client clocks.

Change 458554 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/software/certcentral@debian] Debian packaging

https://gerrit.wikimedia.org/r/458554

Krenair mentioned this in rOSCCe5f3fcf9d8c1: Debian packaging. Oct 2 2018, 11:18 AM

Krenair mentioned this in rOSCCc5467887d462: Debian packaging. Oct 2 2018, 11:31 AM

• ema mentioned this in rOSCCd0f154d41fdd: Debian packaging. Oct 2 2018, 11:31 AM

Krenair mentioned this in rOSCC498757aec30c: Debian packaging. Oct 2 2018, 11:35 AM

Krenair mentioned this in rOSCCd9a25054a9b5: Debian packaging.

Krenair mentioned this in rOSCC716a477f7b87: Debian packaging. Oct 2 2018, 12:06 PM

Change 458554 merged by Ema:
[operations/software/certcentral@debian] Debian packaging

https://gerrit.wikimedia.org/r/458554

Vgutierrez closed subtask T206308: Create VMs for certcentral hosts as Resolved. Oct 8 2018, 11:13 AM

Change 465636 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/dns@master] Add discovery alias for certcentral

https://gerrit.wikimedia.org/r/465636

Change 465636 abandoned by Vgutierrez:
Add discovery alias for certcentral

https://gerrit.wikimedia.org/r/465636

Mentioned in SAL (#wikimedia-operations) [2018-10-10T15:59:52Z] <vgutierrez> Uploaded certcentral 0.1 to apt.wikimedia.org (stretch) - T199711

Krenair added a project: Acme-chief. Oct 13 2018, 4:56 PM

Krenair moved this task from Backlog to Goals/tracking on the Acme-chief board. Oct 13 2018, 5:01 PM

BBlack added a parent task: T207050: Migrate most standard public TLS certificates to CertCentral issuance. Oct 15 2018, 4:18 PM

Change 468315 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] certcentral: Add first domain for testing in prod

https://gerrit.wikimedia.org/r/468315

Change 468315 merged by Vgutierrez:
[operations/puppet@production] certcentral: Add first domain for testing in prod

https://gerrit.wikimedia.org/r/468315

Krenair closed subtask T206461: Provide a Let's Encrypt ACME v2 staging environment account as Resolved. Nov 5 2018, 3:20 PM

@Vgutierrez: I'm thinking we should close this and open a new task about improving our certcentral setup to the point where we could talk about using it for bigger things than the current list in the parent task.
We do now have live use for two prod certs (netbox and librenms), which is more than this task description asks for.

Krenair updated the task description. Nov 22 2018, 11:47 PM

Krenair removed a subtask: T155359: wikiba.se should use HTTPS. Nov 22 2018, 11:51 PM

Krenair added a parent task: T155359: wikiba.se should use HTTPS.

Krenair closed subtask T194962: Create and deploy a centralized letsencrypt service as Resolved.

I've rearranged the structure of these tasks to be logical. This task has no more open subtasks, and all the items in the task description are checked off. Work continues on some of the parent tasks :)

Krenair removed a parent task: T204997: certcentral: delay deployment of renewed certs to wait out skewed client clocks. Nov 22 2018, 11:56 PM

Krenair removed a parent task: T204994: Integrate certspotter with certcentral to avoid certspotter notifying us on legitimate certs generated by our certcentral boxes.

Deploy a scalable service for ACME (LetsEncrypt) certificate management
Closed, Resolved (Public)

	Vgutierrez
	Jul 16 2018, 3:33 PM