
Private TLS material (TLS keys) should be stored in volatile storage only
Closed, Resolved, Public

Description

Umbrella task with a high-level description

Objectives

The main goal of this proposal is to stop storing sensitive information (in this case, TLS keys) permanently on the storage media of cache hosts. This would make unauthorized retrieval of sensitive data much harder, or virtually impossible, when the host is powered off (e.g. during maintenance, transportation or, for virtual hosts, simply while shut down).

TLS keys and cache hosts are just one class of data and one class of servers; the idea could be extended to other sensitive data and other host classes. For now, though, we can focus on this specific objective and define a solution that improves on the current situation.

Constraints

While discussing this topic with some members of the Traffic Team, we agreed that any proposal should avoid the need for manual (human) intervention to complete the boot process and have the host ready to be pooled.

This means that (for example) we can't ask someone to manually fetch/unlock/decrypt TLS keys on the target host at each boot.

Implementation

After discussing various proposals with the Traffic Team, I've settled on an easy roadmap that should provide most of the benefit while remaining improvable in the future with more sophisticated techniques: use tmpfs storage (managed by systemd) and let Puppet download the TLS material as it already does, with some additional dependencies defined so that the required ordering is respected. This allows us to implement the change in a relatively short time and reuse most of the existing components (acme_cert Puppet module, systemd-tmpfiles).
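As a rough illustration of the systemd-tmpfiles side of this plan, the sketch below renders "d" (create directory) entries in tmpfiles.d syntax. The path /run/tls-private, the modes and the ssl-cert group are assumptions made for the example and are not taken from the actual patches; the class and function names are hypothetical as well.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TmpfilesDir:
    """One 'd' (create directory) entry in tmpfiles.d syntax."""
    path: str            # hypothetical location on a tmpfs such as /run
    mode: str = "0700"   # assumed permissions, not from the task
    user: str = "root"
    group: str = "root"

    def render(self) -> str:
        # Columns: Type Path Mode User Group Age.
        # "-" in the Age column disables age-based cleanup.
        return f"d {self.path} {self.mode} {self.user} {self.group} -"


def render_tmpfiles_conf(entries):
    """Join the entries into the body of a tmpfiles.d configuration file."""
    return "\n".join(entry.render() for entry in entries) + "\n"
```

Calling `render_tmpfiles_conf([TmpfilesDir("/run/tls-private")])` yields a single line that systemd-tmpfiles could apply at boot, so the directory exists (empty) before Puppet downloads anything into it.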

Steps needed before a cache host is considered fully operational:

  • The first Puppet run on boot is executed, to ensure that
    • the tmpfiles directories for TLS keys are created and the appropriate permissions are set
    • TLS keys and certificates are downloaded into the volatile directories (acme_cert module and DigiCert ones too); in Puppet, this depends on the previous step
  • Puppet tries to start the HAProxy service if all previous steps succeeded. The HAProxy service also checks, as a prerequisite, that the TLS material is valid (not expired).
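The validity prerequisite in the last step could be sketched roughly as follows. This is not the script from T388147, only a minimal illustration: it assumes certificates are stored as `*.crt` files in one directory, that the `openssl` command-line tool is installed, and that a non-zero exit code is enough to block the service start. All function names are hypothetical.

```python
import ssl
import subprocess
import time
from pathlib import Path


def parse_not_after(enddate_line: str) -> float:
    """Convert 'notAfter=Jan  5 09:34:43 2018 GMT' to seconds since the epoch."""
    _, _, value = enddate_line.strip().partition("=")
    return ssl.cert_time_to_seconds(value)


def is_valid(enddate_line: str, now: float) -> bool:
    """True when the certificate has not yet expired at time `now`."""
    return parse_not_after(enddate_line) > now


def read_enddate(cert_path: Path) -> str:
    # `openssl x509 -enddate -noout` prints a single notAfter=... line.
    result = subprocess.run(
        ["openssl", "x509", "-enddate", "-noout", "-in", str(cert_path)],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def check_directory(cert_dir: Path, now: float) -> list:
    """Return a list of problems; an empty list means the service may start."""
    certs = sorted(cert_dir.glob("*.crt"))  # assumed naming convention
    if not certs:
        return [f"no certificates found in {cert_dir}"]
    problems = []
    for cert in certs:
        try:
            if not is_valid(read_enddate(cert), now):
                problems.append(f"{cert} is expired")
        except (subprocess.CalledProcessError, ValueError, OSError) as exc:
            problems.append(f"{cert} could not be checked: {exc}")
    return problems


def main(argv) -> int:
    """Exit status for an ExecStartPre= hook: 0 allows the start, 1 blocks it."""
    problems = check_directory(Path(argv[1]), time.time())
    for problem in problems:
        print(problem)
    return 1 if problems else 0
```

A wrapper would end with `sys.exit(main(sys.argv))` and receive the certificate directory as its only argument. Treating an empty directory as a failure means that, right after a reboot and before the first Puppet run, the service refuses to start instead of starting without certificates.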

List of tasks

  • Define new type of systemd unit for puppet (path) (T387799)
  • Create tmpfiles.d configuration for TLS certificates (T387826)
  • Allow the acme_cert and sslcert Puppet modules to download certificates into different locations (T387929)
  • Write an ExecStartPre script to check that TLS material is currently valid (T388147)
  • Edit HAProxy configuration and acme_cert/sslcert to use certificates from volatile storage
    • Start on single host (cp4047 that is already depooled and silenced) and test
    • Deploy on 2 hosts (upload|text) serving live traffic (cp7001|cp7009)
    • Deploy on whole DC (magru)
    • Deploy everywhere
      • magru
      • ulsfo
      • eqsin
      • codfw
      • drmrs
      • eqiad
      • esams
    • Shred old certificates

Event Timeline


Change #1112773 abandoned by Fabfur:

[operations/puppet@production] acmecerts: new param to use tmpfs storage for certificates

Reason:

please don't consider that

https://gerrit.wikimedia.org/r/1112773

Fabfur changed the task status from Open to In Progress. Mar 19 2025, 4:08 PM
Fabfur triaged this task as Medium priority.

Change #1129223 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] WIP: using tmpfs directory for private tls material

https://gerrit.wikimedia.org/r/1129223

Fabfur renamed this task from "Allow acmecerts to deploy certificates in tmpfs storage" to "Private TLS material (TLS keys) should be stored in volatile storage only". Mar 21 2025, 2:39 PM

Change #1129223 merged by Fabfur:

[operations/puppet@production] haproxy: using tmpfs directory for private tls material

https://gerrit.wikimedia.org/r/1129223

This change has been applied to cp4047 (currently depooled and silenced due to T387238). Everything went fine and our checks indicate that the behavior is as expected (in particular, after the reboot the haproxy service fails to start because the ExecStartPre script fails when it doesn't find any certificate).
After the first Puppet run, the certificates are correctly downloaded into the new location and haproxy starts as usual.
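One additional check that could complement the ones above is confirming that the new location really lives on volatile storage. The sketch below is an assumption-laden illustration, not one of the checks actually run on cp4047: it parses text in the /proc/mounts layout and reports the filesystem type of the mount that contains a path. The function names and the example path /run/tls-private are hypothetical.

```python
import os


def mount_fstype(path: str, mounts_text: str) -> str:
    """Return the filesystem type of the mount point that contains `path`.

    `mounts_text` uses the /proc/mounts layout: device, mount point, type, ...
    """
    path = os.path.normpath(path)
    best_point, best_type = "", ""
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        point, fstype = fields[1], fields[2]
        prefix = point.rstrip("/") + "/"
        contains = path == point or path.startswith(prefix)
        # The longest matching mount point is the one that holds the path;
        # ">=" lets a later mount on the same point override an earlier one.
        if contains and len(point) >= len(best_point):
            best_point, best_type = point, fstype
    return best_type


def is_volatile(path: str, mounts_text: str) -> bool:
    """True when the path sits on a RAM-backed filesystem."""
    return mount_fstype(path, mounts_text) in {"tmpfs", "ramfs"}
```

On a live host the second argument would be the contents of /proc/mounts. The function does not resolve symlinks, so a caller would likely want to pass `os.path.realpath(path)` when the certificate location is reached through a symlink.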

Change #1131052 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] haproxy: use volatile storage for 2 hosts on magru

https://gerrit.wikimedia.org/r/1131052

Change #1131052 merged by Fabfur:

[operations/puppet@production] haproxy: use volatile storage for 2 hosts on magru

https://gerrit.wikimedia.org/r/1131052

Mentioned in SAL (#wikimedia-operations) [2025-03-27T08:41:15Z] <fabfur> repooling cp7001 and cp7009 with new TLS certificate path (T384227)

Change #1131705 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] hiera: enable TLS on volatile storage in magru

https://gerrit.wikimedia.org/r/1131705

Change #1131705 merged by Fabfur:

[operations/puppet@production] hiera: enable TLS on volatile storage in magru

https://gerrit.wikimedia.org/r/1131705

Mentioned in SAL (#wikimedia-operations) [2025-04-01T14:50:02Z] <fabfur> depooled cp7001 to test secure removal of unused certificates (T384227)

Mentioned in SAL (#wikimedia-operations) [2025-04-01T17:23:12Z] <fabfur> repool cp7001, no certs removed (T384227)

Mentioned in SAL (#wikimedia-operations) [2025-04-02T07:29:22Z] <fabfur> depool cp7001 to fix stale ocsp alert (T384227)

Mentioned in SAL (#wikimedia-operations) [2025-04-02T11:44:24Z] <fabfur> securely erase certificates from A:cp-magru and provide symlink for acmecerts (T384227)

Change #1133405 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] hiera: enable TLS on volatile storage in ulsfo

https://gerrit.wikimedia.org/r/1133405

Fabfur updated the task description.

Mentioned in SAL (#wikimedia-operations) [2025-04-03T07:31:18Z] <fabfur> applying patch to use TLS on tmpfs on A:cp-ulsfo (T384227)

Change #1133405 merged by Fabfur:

[operations/puppet@production] hiera: enable TLS on volatile storage in ulsfo

https://gerrit.wikimedia.org/r/1133405

Mentioned in SAL (#wikimedia-operations) [2025-04-03T08:53:19Z] <fabfur> secure deleting certificates in /etc/ssl/private from A:cp-magru (T384227)

Mentioned in SAL (#wikimedia-operations) [2025-04-03T09:03:24Z] <fabfur> secure deleting certificates in /etc/ssl/private from A:cp-ulsfo (T384227)

Change #1133850 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] hiera: enable TLS on volatile storage in eqsin

https://gerrit.wikimedia.org/r/1133850

Change #1133850 merged by Fabfur:

[operations/puppet@production] hiera: enable TLS on volatile storage in eqsin

https://gerrit.wikimedia.org/r/1133850

Change #1133897 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] hiera: enable TLS on volatile storage in codfw

https://gerrit.wikimedia.org/r/1133897

Change #1133897 merged by Fabfur:

[operations/puppet@production] hiera: enable TLS on volatile storage in codfw

https://gerrit.wikimedia.org/r/1133897

Change #1134630 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] hiera: enable TLS on volatile storage in drmrs

https://gerrit.wikimedia.org/r/1134630

Change #1134630 merged by Fabfur:

[operations/puppet@production] hiera: enable TLS on volatile storage in drmrs

https://gerrit.wikimedia.org/r/1134630

Change #1134648 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] hiera: enable TLS on volatile storage in eqiad

https://gerrit.wikimedia.org/r/1134648

Change #1134648 merged by Fabfur:

[operations/puppet@production] hiera: enable TLS on volatile storage in eqiad

https://gerrit.wikimedia.org/r/1134648

Change #1134658 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] hiera: enable TLS on volatile storage in esams

https://gerrit.wikimedia.org/r/1134658

Change #1134658 abandoned by Fabfur:

[operations/puppet@production] hiera: enable TLS on volatile storage in esams

Reason:

Splitting into 2 different patches

https://gerrit.wikimedia.org/r/1134658

Change #1134689 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] hiera: enable TLS on volatile storage in esams

https://gerrit.wikimedia.org/r/1134689

Change #1134689 merged by Fabfur:

[operations/puppet@production] hiera: enable TLS on volatile storage in esams

https://gerrit.wikimedia.org/r/1134689

Change #1134698 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] hiera: cleanup TLS on volatile storage custom files

https://gerrit.wikimedia.org/r/1134698

Change #1134698 merged by Fabfur:

[operations/puppet@production] hiera: cleanup TLS on volatile storage custom files

https://gerrit.wikimedia.org/r/1134698

Fabfur updated the task description.
Fabfur updated the task description.