
TLS certificates for network devices
Closed, Resolved · Public

Description

We're going to need TLS certificates to interact with network devices in the short term (eg. this quarter, for testing) to medium term (next 2 quarters, for production).

This will most likely be through gNMI (RPC, eg. configuration or telemetry) and/or RESTCONF (REST, eg. configuration).

Some notes from a chat with @jbond:

  • certs generated by our regular CA
    • expire after 4 weeks (not having a renew mechanism is fine for testing but not for prod)
    • wouldn't be suitable for client certificate authentication. Not strictly needed as username/password is required anyway.
  • One option could be to use an intermediary CA dedicated to network devices (~1h of work)
    • This could solve the two limitations above (if we want to solve them)
    • Does this bring other advantages? Are there any drawbacks?
  • For automatic renewal, some ideas:
    1. Use Puppet on the SONiC (Debian-based) switches to manage the auto-renewal scripts - not compatible with Junos
    2. Use a Docker image on the SONiC switches (they run Docker) to handle renewal - not compatible with Junos
    3. Manage the script ourselves on both platforms (eg. Homer or a dedicated script)
    4. Use a cookbook that connects to the devices over SSH and does the work (would need to run periodically)

Event Timeline

ayounsi created this task.
Restricted Application added a subscriber: Aklapper.

The logic we use in Puppet is mostly the same as this script, which would be a good template to use for a cookbook.

I'd consider client auth a "stretch goal" for now, nice to have but not sure we want to have all that extra complexity.

In terms of an intermediate CA just for network - it's an option but I would worry about how we deal with the security / key management aspects of it. Using our existing CA means we can leverage existing processes there which might be better. We need a way to update the certs on the box either way, just more frequently if using ours.

Overall it's definitely a headache. In terms of Puppet on devices, I think, with the right image, you can run Puppet on the Junipers, but it's not something I'd rush into. I'd probably lean towards your option 4. Least elegant, but easiest to do? Happy to explore any of the options here; it's not something I have a great deal of experience with.

I would worry about how we deal with the security / key management aspects of it.

Just to expand on this a bit: the reason why there may be a need for an additional intermediate CA would be if we wanted to use client auth or if the 4 week certificate lifetime is a problem. Both of these would mean that we need to create a new intermediate, as they are both policies of the signer. However, this would all be using the same central PKI infrastructure; the only real downside is that we would need to create a new intermediate, see https://wikitech.wikimedia.org/wiki/PKI/CA_Operations#Adding_a_new_intermediate

We need a way to update the certs on the box either way, just more frequently if using ours.

+1 extending the lifetime is just delaying the issue and increasing the possibility it's forgotten or missed

+1 extending the lifetime is just delaying the issue and increasing the possibility it's forgotten or missed

Yes and no. It depends on how much we can automate it with reasonable efforts.

If we have a short lifetime (eg. every few weeks/months) we need to have fully automated tooling to renew it. And that is the preferred "in a perfect world" option.

If we have something we need to renew every year (or less often), with alerting, a manual action can be tolerated (eg. running a cookbook).

So far the cookbook option is preferred, especially as it could in theory be fully automated (even with just a cron).

However, after a chat with @Volans, we're not sure about the approach of implementing P46511 using a cookbook, and we were wondering whether the cookbook should instead only be used to deploy the cert, with the generation side of it handled by another system.

We had a chat about this.

The first iteration will be a manual cookbook that takes a host as parameter.
The cookbook will connect to the device and check whether there is already a CSR/key/cert configured. In that case it will check the expiration date.
If they're present (and optionally close to expiring), the cookbook will generate a new certificate and update it.
If they're not present, the cookbook will generate a new CSR/key/cert (from a dedicated intermediary CA, with a 1 year expiration time) and install them on the device.
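
The decision described above could be sketched roughly as follows. This is only an illustration, not the cookbook's code: the `Action` enum, the `decide` function and the 28-day `RENEW_BEFORE` threshold are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional

# Assumed threshold for "close to expiring"; the real value may differ.
RENEW_BEFORE = timedelta(days=28)


class Action(Enum):
    """Hypothetical names for the three outcomes described above."""

    INITIAL_SETUP = "generate CSR/key/cert and install them"
    RENEW = "generate a new certificate and update it"
    NOTHING = "nothing to do"


def decide(
    not_after: Optional[datetime],
    now: datetime,
    renew_before: timedelta = RENEW_BEFORE,
) -> Action:
    """Pick an action from the expiry date of the installed certificate.

    not_after is None when no certificate was found on the device.
    """
    if not_after is None:
        return Action.INITIAL_SETUP
    if not_after - now < renew_before:
        return Action.RENEW
    return Action.NOTHING


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(decide(None, now))
    print(decide(now + timedelta(days=10), now))
    print(decide(now + timedelta(days=300), now))
```

Passing `now` in explicitly keeps the function easy to test and avoids mixing naive and timezone-aware datetimes.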

One limitation identified so far is the lack of SCP support in Spicerack; however, possible workarounds exist.

Monitoring for the expiration date will be done using our current Prometheus blackbox check (the management routers' ACLs will need to permit that flow).

The follow-up iteration, if this is widely used in prod, will be to use a shorter expiration time and automate the renewal.
The current idea is to set up a systemd timer that regularly runs the cookbook with an "all" (or similar) parameter, to do the actions listed previously on all the devices.

One identified limitation here is the lack of a "silent" run in Spicerack (tracked in T324655: Spicerack: don't IRC log start/stop of cookbook)

Change 933094 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/cookbooks@master] [WIP] Manage TLS on network devices

https://gerrit.wikimedia.org/r/933094

Change 937510 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Add python3-cryptography to cookbooks

https://gerrit.wikimedia.org/r/937510

Change 938218 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] pki: add network devices CA

https://gerrit.wikimedia.org/r/938218

Change 937510 merged by Ayounsi:

[operations/puppet@production] Add python3-cryptography to cookbooks

https://gerrit.wikimedia.org/r/937510

Change 938218 merged by Jbond:

[operations/puppet@production] pki: add network devices CA

https://gerrit.wikimedia.org/r/938218

Change 933094 merged by jenkins-bot:

[operations/cookbooks@master] Manage TLS on network devices

https://gerrit.wikimedia.org/r/933094

SONiC refresh needed verbose
ayounsi@cumin1001:~$ sudo cookbook -v sre.network.tls lsw1-e8-eqiad
START - Cookbook sre.network.tls for network device lsw1-e8-eqiad
lsw1-e8-eqiad: 🕔 Certificate expires in less than 28 days, 0:00:00. refresh needed.
----- OUTPUT of 'cat ~/csr.pem' -----
-----BEGIN CERTIFICATE REQUEST-----
[redacted]
-----END CERTIFICATE REQUEST-----
================
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'cat ~/csr.pem'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
lsw1-e8-eqiad: 🔏 cfssl called with operation: sign.
lsw1-e8-eqiad: ⚙️ Deploy needed.
----- OUTPUT of 'echo '-----BEGIN...host/default.crt' -----
-----BEGIN CERTIFICATE-----
[redacted]
-----END CERTIFICATE-----

================
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'echo '-----BEGIN...host/default.crt'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
----- OUTPUT of 'sudo service telemetry restart' -----
================
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'sudo service telemetry restart'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
lsw1-e8-eqiad: 👍 All done.
END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-e8-eqiad
SONiC no refresh needed verbose
ayounsi@cumin1001:~$ sudo cookbook -v sre.network.tls lsw1-e8-eqiad
START - Cookbook sre.network.tls for network device lsw1-e8-eqiad
lsw1-e8-eqiad: 👍 Nothing to do.
END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-e8-eqiad

However, with a different SONiC device:

ayounsi@cumin1001:~$ sudo cookbook -v sre.network.tls lsw1-f8-eqiad
START - Cookbook sre.network.tls for network device lsw1-f8-eqiad
Exception raised while executing cookbook sre.network.tls:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 212, in run
    raw_ret = runner.run()
  File "/srv/deployment/spicerack/cookbooks/sre/network/tls.py", line 106, in run
    if self.need_initial_setup(cert):
  File "/srv/deployment/spicerack/cookbooks/sre/network/tls.py", line 138, in need_initial_setup
    cert_name = cert_x509.subject.get_attributes_for_oid(NameOID.COMMON_NAME)[0].value
IndexError: list index out of range
END (FAIL) - Cookbook sre.network.tls (exit_code=99) for network device lsw1-f8-eqiad

I'll have a look.
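
The traceback suggests that `get_attributes_for_oid(NameOID.COMMON_NAME)` returned an empty list, i.e. the certificate found on that device has no Common Name in its subject. A minimal defensive sketch, assuming that is the cause; `first_attribute_value` is a hypothetical helper, not something taken from the cookbook:

```python
from typing import Any, Optional


def first_attribute_value(subject: Any, oid: Any) -> Optional[str]:
    """Return the value of the first subject attribute matching oid.

    Returns None instead of raising IndexError when the subject has no
    such attribute (for example a certificate without a Common Name).
    `subject` is expected to behave like cryptography's x509.Name, and
    `oid` would be NameOID.COMMON_NAME in that case.
    """
    attributes = subject.get_attributes_for_oid(oid)
    if not attributes:
        return None
    return attributes[0].value
```

With the `cryptography` package this would be called as `first_attribute_value(cert_x509.subject, NameOID.COMMON_NAME)`. How a `None` result should be handled (for example by treating the device as needing initial setup) is a separate decision that this sketch does not make.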

Change 939261 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/cookbooks@master] sre.network.tls: fix edge case

https://gerrit.wikimedia.org/r/939261

Change 939261 merged by jenkins-bot:

[operations/cookbooks@master] sre.network.tls: fix edge case

https://gerrit.wikimedia.org/r/939261

Change 952851 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/cookbooks@master] sre.network.tls: use different ports on junos/sonic

https://gerrit.wikimedia.org/r/952851

Change 952851 merged by jenkins-bot:

[operations/cookbooks@master] sre.network.tls: use different ports on junos/sonic

https://gerrit.wikimedia.org/r/952851

Change 953510 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/cookbooks@master] sre.network.tls: use fqdn's hostname to store cert in config

https://gerrit.wikimedia.org/r/953510

Change 953510 merged by jenkins-bot:

[operations/cookbooks@master] sre.network.tls: use fqdn's hostname to store cert in config

https://gerrit.wikimedia.org/r/953510

ayounsi claimed this task.

This is now working in prod.