Umbrella and high-level description task
Objectives
The main goal of this proposal is to stop storing sensitive information (in this case, TLS keys) permanently on the cache hosts' storage media. This would make unauthorized retrieval of sensitive data much harder, or virtually impossible, when the host is powered off (e.g. during maintenance, transportation, or, for virtual hosts, a plain shutdown).
While TLS keys and cache hosts are just one class of data and servers, the idea could be expanded to other sensitive data and other host classes. For now, though, we can focus on this specific objective to define a solution that's better than the current situation.
Constraints
While discussing this topic with some members of the Traffic Team, we agreed that any proposal should avoid the need for manual (human) intervention to complete the boot process and have the host ready to be pooled.
This means that (for example) we can't ask someone to manually fetch/unlock/decrypt TLS keys on the target host at each boot.
Implementation
After discussing various proposals with the Traffic Team, I've settled on an easy roadmap that should provide most of the benefit while remaining improvable in the future with more sophisticated techniques. The plan is to use tmpfs storage (managed by systemd) and let Puppet download TLS material as it already does, while defining some additional dependencies to ensure the order is respected. This allows us to implement the change in a relatively short time and reuse most of the existing components (acme_cert Puppet module, systemd-tmpfiles).
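As an illustration of the tmpfs requirement, here is a minimal Python sketch of how one could verify that a TLS directory really sits on volatile storage. It parses text in the format of /proc/mounts and reports the filesystem type of the mount that contains a given path. The function name `filesystem_type` and the path `/run/tls` are hypothetical and not taken from the actual Puppet code or host layout.

```python
import os


def filesystem_type(path, mounts_text):
    """Return the filesystem type of the mount containing `path`.

    `mounts_text` is text in /proc/mounts format:
    "<device> <mount point> <fs type> <options> <dump> <pass>".
    Mount points containing escaped characters (e.g. \\040 for a space)
    are not decoded in this sketch.
    """
    path = os.path.normpath(path)
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = os.path.normpath(fields[1]), fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            # Prefer the longest mount point; on ties the later entry wins,
            # because a later mount shadows an earlier one.
            if len(mount_point) >= len(best_mount):
                best_mount, best_type = mount_point, fs_type
    return best_type
```

On a real host the input could be read with `open("/proc/mounts").read()`, and a result other than `"tmpfs"` for the key directory would indicate that keys could end up on persistent media.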
The steps needed to consider a cache host fully operative
- The first Puppet run on boot is executed, to ensure that:
- tmpfiles directories for TLS keys are created and appropriate permissions are set
- TLS keys and certificates are downloaded into the volatile directories (both the acme_cert ones and the DigiCert ones) (depends, in Puppet, on the previous step)
- Puppet tries to start the HAProxy service if all previous steps succeeded. The HAProxy service also checks, as a prerequisite, that the TLS material is valid (not expired).
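The validity check that HAProxy runs as a prerequisite could look roughly like the Python sketch below. It is not the script tracked in the task list, only an illustration under stated assumptions: the `openssl` binary is available on the host, each file passed in starts with a PEM certificate, and all function names are hypothetical. It relies on `openssl x509 -noout -dates`, which prints `notBefore=` and `notAfter=` lines, and on the standard library's `ssl.cert_time_to_seconds` to parse them.

```python
import ssl
import subprocess
import sys
import time


def parse_dates(openssl_output):
    """Parse 'notBefore=...' and 'notAfter=...' lines into epoch seconds."""
    dates = {}
    for line in openssl_output.splitlines():
        key, sep, value = line.partition("=")
        if sep and key in ("notBefore", "notAfter"):
            dates[key] = ssl.cert_time_to_seconds(value.strip())
    return dates


def is_currently_valid(openssl_output, now=None):
    """Return True if `now` falls inside the certificate validity window."""
    now = time.time() if now is None else now
    dates = parse_dates(openssl_output)
    if "notBefore" not in dates or "notAfter" not in dates:
        return False
    return dates["notBefore"] <= now < dates["notAfter"]


def check_certificate(path):
    """Ask openssl for the validity dates of the certificate at `path`."""
    result = subprocess.run(
        ["openssl", "x509", "-noout", "-dates", "-in", path],
        capture_output=True, text=True, check=False,
    )
    return result.returncode == 0 and is_currently_valid(result.stdout)


def main(argv):
    """Return 0 if every certificate given on the command line is valid."""
    failed = [path for path in argv[1:] if not check_certificate(path)]
    for path in failed:
        print(f"missing, unreadable or not currently valid: {path}", file=sys.stderr)
    return 1 if failed else 0
```

Used from an ExecStartPre line, a small wrapper would call `sys.exit(main(sys.argv))` so that a non-zero exit status prevents the service from starting. A missing file is also reported as a failure, which covers the case where the volatile directory is still empty after a reboot.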
List of tasks
- Define new type of systemd unit for puppet (path) (T387799)
- Create tmpfiles.d configuration for TLS certificates (T387826)
- Allow the acme_cert and sslcert Puppet modules to download certificates into different locations (T387929)
- Write an ExecStartPre script to check that TLS material is currently valid (T388147)
- Edit HAProxy configuration and acme_cert/sslcert to use certificates from volatile storage
- Start on a single host (cp4047, which is already depooled and silenced) and test
- Deploy on 2 hosts (upload|text) serving live traffic (cp7001|cp7009)
- Deploy on whole DC (magru)
- Deploy everywhere
- magru
- ulsfo
- eqsin
- codfw
- drmrs
- eqiad
- esams
- Shred old certificates