- Core DC redundancy
- Wildcards + SANs
- Multiple client hosts for each cert
- Coding + Puppetization: PoC drafted during Wikimedia Hackathon 2018. Efforts tracked in T194962
- Deploy in both DCs
- Live use for one prod cert
Event Timeline
Change 455153 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/software/certcentral@master] Provide support in the API for different certificate save modes
Change 455159 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/software/certcentral@master] [WIP] Validate challenges before pushing them to the ACME directory
Change 451867 merged by jenkins-bot:
[operations/software/certcentral@master] Refactor certcentral.certificate_management()
Change 453124 merged by jenkins-bot:
[operations/software/certcentral@master] Implement different Certificate.save() modes
Change 454045 merged by jenkins-bot:
[operations/software/certcentral@master] Certcentral integration tests
Change 454794 merged by jenkins-bot:
[operations/software/certcentral@master] Deliver certificates in every save mode
Change 454845 merged by jenkins-bot:
[operations/software/certcentral@master] Implement DNS01 challenge support
Change 455153 merged by jenkins-bot:
[operations/software/certcentral@master] Provide support in the API for different certificate save modes
Change 456110 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/software/certcentral@master] ACMERequests: Remove orders/challenges after a non-recoverable error
Change 456644 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/software/certcentral@master] Provide logging
I'm CCing Moritz in case he has any advice or other ideas on the below. Basically, what we've got is one backend process doing maintenance tasks (generating private keys, creating an initial self-signed cert to get stuff running, arranging for the ACME-provided cert, handling renewals, etc.) and one API UWSGI process (which hands the right private keys and certificates out to authorised client hosts) that sits behind nginx (which does TLS termination and checks that clients' certificates are signed by the Puppet CA).
One of the things that @Vgutierrez and I spoke about briefly over IRC recently was a certcentral system user. It's become clear to me that this will be a problem (my tests so far had the backend running as www-data): the backend process can be set to run as the new certcentral user no problem, but then the private key files it generates are not readable by the API (still running as www-data). So I think we have a few options (none of which really appeal to me), in no particular order:
- Set the group of the files to be www-data, chmod the files 640.
- Put www-data in our certcentral system group, chmod the files 640.
- Change modules/uwsgi/templates/initscripts/uwsgi.systemd.erb to permit UWSGI to run as a custom user (pass in as param, default to www-data), and make it run as certcentral in our case. The files can be chmod 600. Might have to check how the nginx<->uwsgi socket works in this case.
- Maybe we're just being paranoid and we can make the backend run as www-data and drop the idea of a certcentral system user entirely. The files can be chmod 600.
With the last two options, of course, a client that managed to compromise the certcentral API (one holding a cert signed by our Puppet CA, or nginx itself being evil) could break the system by overwriting important files, or replace our private keys/certs with compromised ones. At that point, though, the attacker can already read the existing private keys, so I can't think why they'd bother.
I think the only other thing running as www-data should be nginx which we already have to fully trust, given that it:
- is trusted to tell the certcentral API process the client host's identity for authorisation purposes (it handles TLS termination so has to tell us what CN the client's puppet cert had), so could trivially extract the private keys if it was evil
- will normally ultimately be handling our private keys on client hosts anyway, with maybe a few exceptions (apache for a few misc things, exim for MXes, maybe something obscure somewhere I'm not aware of). We already have very big problems if nginx has a vulnerability at this level.
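The trust relationship in the first point can be made concrete with a small sketch. nginx verifies the client certificate and passes the CN on; the API then only has to check that CN against the hosts allowed to fetch a given certificate. The mapping, host names and function name below are hypothetical, not certcentral's actual config or code:

```python
# Hypothetical mapping of certificate name -> client hosts allowed to fetch it.
AUTHORISED_HOSTS = {
    "example-cert": {"host1.example.org", "host2.example.org"},
}


def is_authorised(client_cn, cert_name, authorised_hosts=AUTHORISED_HOSTS):
    """Return True if the host named by client_cn may fetch cert_name.

    client_cn is whatever nginx reports as the CN of the client's
    certificate; the API has no way to verify it independently.
    """
    if not client_cn:
        # nginx passed no CN, e.g. the client presented no certificate.
        return False
    return client_cn in authorised_hosts.get(cert_name, set())
```

Because the API takes `client_cn` on faith, anything that can set it (here nginx) can obtain any key, which is the point being made above.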
Change 457378 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/software/certcentral@master] Rename certcentral_api to just api
With the two-user approach (certcentral / www-data) we simply stop nginx from writing to /etc/certcentral. We should also consider that certcentral will need permission to spawn the DNS zone update script, and I don't see any reason to let nginx do that as well. IMHO the two-user approach best serves the principle of least privilege.
That makes sense, so we're preferring one of these:
- Set the group of the files to be www-data, chmod the files 640.
- Put www-data in our certcentral system group, chmod the files 640.
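Both preferred options amount to the backend writing key files as mode 640, with a group through which the API user can read them. A minimal sketch of what that could look like in the backend, assuming Python; `write_private_key` and its `group` parameter are hypothetical names for illustration, not certcentral's actual code:

```python
import os
import shutil


def write_private_key(path, pem_bytes, group=None):
    """Write key material readable by owner and group only (mode 640)."""
    # Create the file with restrictive permissions from the start.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o640)
    with os.fdopen(fd, "wb") as key_file:
        # fchmod makes the final mode independent of the process umask
        # and also tightens a file that already existed.
        os.fchmod(key_file.fileno(), 0o640)
        key_file.write(pem_bytes)
    if group is not None:
        # Only succeeds if the calling user is root or a member of `group`.
        shutil.chown(path, group=group)
```

For the first option the backend would pass something like `group="www-data"`, which requires the backend user to be root or a member of that group. For the second it could leave `group` unset, assuming the certcentral user's primary group is the certcentral group that www-data has been added to.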
Change 455159 merged by jenkins-bot:
[operations/software/certcentral@master] Validate challenges before pushing them to the ACME directory
Change 456110 merged by jenkins-bot:
[operations/software/certcentral@master] ACMERequests: Remove orders/challenges after a non-recoverable error
Change 456644 merged by Vgutierrez:
[operations/software/certcentral@master] Provide logging
Change 457485 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/software/certcentral@master] README: provide configuration file examples
Change 457378 merged by jenkins-bot:
[operations/software/certcentral@master] Rename certcentral_api to just api
Change 457485 merged by jenkins-bot:
[operations/software/certcentral@master] README: provide configuration file examples
Got a bunch of patches open that need review, in no particular order:
- This one is needed to implement the outcome of the above comments around file and user permissions: https://gerrit.wikimedia.org/r/458933
- This one is needed to prevent puppet failing while only initial (i.e. non-trusted) certs exist: https://gerrit.wikimedia.org/r/458939
- This one is needed to tell our script which gdnsd servers to push to: https://gerrit.wikimedia.org/r/459581
- This one is needed to fix cases where ACME challenge values can begin with a hyphen: https://gerrit.wikimedia.org/r/459662
- This one is needed to do Debian packaging: https://gerrit.wikimedia.org/r/458554
- This one is needed to reload config in the API at runtime: https://gerrit.wikimedia.org/r/459785
- This one is needed to detect config changes to existing certificates: https://gerrit.wikimedia.org/r/460382
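For the hyphen case in the list above: the ticket does not say how the linked patch handles it. One common way to hit and avoid this kind of problem, assuming the challenge values are passed as command-line arguments to a helper script, is sketched below with Python's `argparse`; the parser and argument names are hypothetical:

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Hypothetical DNS sync helper")
    parser.add_argument("challenges", nargs="+",
                        help="ACME challenge values to publish")
    return parser


def build_args(challenge_values):
    """Prefix the values with '--' so none of them is parsed as an option."""
    return ["--"] + list(challenge_values)


# A value such as "-abc123" is kept as a positional argument, because
# argparse treats everything after "--" as positional.
args = build_parser().parse_args(build_args(["-abc123", "def456"]))
print(args.challenges)
```

Without the `--` separator, argparse would try to interpret `-abc123` as an option and reject the command line.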
Other stuff that should happen:
- Brandon doing the gdnsd update we need (T194965)
- Basic puppetisation of the service: https://gerrit.wikimedia.org/r/441991
- Puppetisation of the DNS integration: https://gerrit.wikimedia.org/r/459809
- VM request: internal IP; must be reachable by the puppetmasters on some to-be-confirmed port; must be able to SSH to authdns servers at all sites; must be able to talk out to LE via proxy
Change 458554 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/software/certcentral@debian] Debian packaging
Change 458554 merged by Ema:
[operations/software/certcentral@debian] Debian packaging
Change 465636 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/dns@master] Add discovery alias for certcentral
Mentioned in SAL (#wikimedia-operations) [2018-10-10T15:59:52Z] <vgutierrez> Uploaded certcentral 0.1 to apt.wikimedia.org (stretch) - T199711
Change 468315 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] certcentral: Add first domain for testing in prod
Change 468315 merged by Vgutierrez:
[operations/puppet@production] certcentral: Add first domain for testing in prod
@Vgutierrez: I'm thinking we should close this and open a new task about improving our certcentral setup to the point where we could talk about using it for bigger things than the current list in the parent task.
We do now have live use for two prod certs (netbox and librenms), which is more than this task description asks for.
I've rearranged these tasks into a more logical structure. This one has no more open subtasks, and all the items in the task description are checked off. Work continues on some of the parent tasks :)