Puppet hosts with their cert revoked can still run puppet
Closed, Resolved (Public)

Description

As discovered in T177374 (decom wtp1001-wtp1024), hosts that are formally decom'd in the puppet CA (i.e. puppet cert list --all doesn't show the cert) can still run puppet successfully. In the wtp case this reactivated the hosts in puppetdb, which in turn made them show up in monitoring.

In practical terms this means enforcing a check of the puppet CA's CRL whenever hosts talk to the puppet master(s).

Event Timeline

This comment was removed by herron.

Removed the previous comment after realizing those symlinks of course point to the same file!

It looks like SSLCARevocationCheck defaults to none and is currently unset in our config. So we should try setting this to chain. I'll work on testing this.
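
A quick way to confirm what a frontend config currently asks for is to look for the directive in the vhost file. This is only a rough sketch: revocation_mode is a hypothetical helper name, the path in the usage line is a placeholder, and the function ignores Include files and per-vhost scoping. It reports none when the directive is absent, which matches the mod_ssl default noted above.

```shell
#!/bin/sh
# Hypothetical helper: print the SSLCARevocationCheck mode set in an Apache
# config file, or "none" (the mod_ssl default) if the directive is absent.
revocation_mode() {
    # awk strips leading whitespace, so indented directives match; commented
    # lines do not, because their first field starts with "#".
    awk 'tolower($1) == "sslcarevocationcheck" { mode = $2 }
         END { print (mode == "" ? "none" : mode) }' "$1"
}
```

Usage, with a placeholder path: revocation_mode /path/to/puppetmaster-vhost.conf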

Mentioned in SAL (#wikimedia-operations) [2018-01-16T17:45:52Z] <herron> disabled puppet agents troubleshooting T184444

Tried setting SSLCARevocationCheck chain on puppetmaster1001 with agents disabled across the fleet. It resulted in a failed run with the below error.

Error: Could not retrieve catalog from remote server: SSL_connect returned=1 errno=0 state=unknown state: tlsv1 alert unknown ca
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run

After some further testing this combination of SSLCARevocation settings appears to work.

SSLCARevocationFile     /var/lib/puppet/server/ssl/ca/ca_crl.pem
SSLCARevocationCheck    leaf

Puppet runs complete successfully before revoking the cert. After revoking it, the run fails as follows (output from puppet agent -t):

Warning: Unable to fetch my node definition, but the agent run will continue:
Warning: SSL_connect returned=1 errno=0 state=unknown state: sslv3 alert certificate revoked
Info: Retrieving pluginfacts
Error: /File[/var/lib/puppet/facts.d]: Failed to generate additional resources using 'eval_generate': SSL_connect returned=1 errno=0 state=unknown state: sslv3 alert certificate revoked
Error: /File[/var/lib/puppet/facts.d]: Could not evaluate: Could not retrieve file metadata for puppet:///pluginfacts: SSL_connect returned=1 errno=0 state=unknown state: sslv3 alert certificate revoked
Info: Retrieving plugin
Error: /File[/var/lib/puppet/lib]: Failed to generate additional resources using 'eval_generate': SSL_connect returned=1 errno=0 state=unknown state: sslv3 alert certificate revoked
Error: /File[/var/lib/puppet/lib]: Could not evaluate: Could not retrieve file metadata for puppet:///plugins: SSL_connect returned=1 errno=0 state=unknown state: sslv3 alert certificate revoked
Info: Loading facts
Error: Could not retrieve catalog from remote server: SSL_connect returned=1 errno=0 state=unknown state: sslv3 alert certificate revoked
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run
Error: Could not send report: SSL_connect returned=1 errno=0 state=unknown state: sslv3 alert certificate revoked

Oddly, the same config with SSLCARevocationCheck chain still allowed agents with revoked certificates to run. Not yet sure why that would be the case.

I'll prep a patch for review.

Change 404587 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] puppetmaster::ssl: fix crl file suffix

https://gerrit.wikimedia.org/r/404587

Did some further testing and have a fix that better integrates with our puppetization and works with SSLCARevocationCheck chain

It looks like the tlsv1 alert unknown ca encountered earlier was caused by a bug in puppetmaster::ssl where a symlink to the crl is created with file suffix .0 instead of .r0

After placing a symlink of /var/lib/puppet/server/ssl/crl/c5aaad6f.r0 -> /var/lib/puppet/server/ssl/ca/ca_crl.pem, enabling SSLCARevocationCheck chain in apache causes puppet agents with revoked certificates to fail as expected.
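
The manual fix above can be sketched as a small shell function. This is an illustration under assumptions, not the content of the puppetmaster::ssl patch: link_crl is a hypothetical name, and it assumes apache looks the CRL up in a hashed directory, where CRL links carry the .r0 suffix (.0 is the suffix used for CA certificates).

```shell
#!/bin/sh
# Hypothetical helper: create the <hash>.r0 symlink for a CRL inside a
# hashed CRL directory and print the resulting link path.
link_crl() {
    crl_file=$1
    crl_dir=$2
    # openssl prints the issuer-name hash of the CRL (c5aaad6f in the
    # comment above).
    crl_hash=$(openssl crl -hash -noout -in "$crl_file") || return 1
    # Use ".r0" for CRLs; a ".0" link is not picked up as a CRL.
    ln -sf "$crl_file" "$crl_dir/$crl_hash.r0" || return 1
    printf '%s\n' "$crl_dir/$crl_hash.r0"
}
```

Usage with the paths from this task: link_crl /var/lib/puppet/server/ssl/ca/ca_crl.pem /var/lib/puppet/server/ssl/crl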

So, the first patch is a fix for puppetmaster::ssl (to persist what has been fixed manually on puppetmaster1001) and I'll continue working tomorrow on a patch to conditionally set SSLCARevocationCheck in the apache frontend template.

Change 404689 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] add support for SSLCARevocationCheck setting in puppetmaster frontend

https://gerrit.wikimedia.org/r/404689

Change 404587 merged by Herron:
[operations/puppet@production] puppetmaster::ssl: fix crl file suffix

https://gerrit.wikimedia.org/r/404587

Change 404689 merged by Herron:
[operations/puppet@production] add support for SSLCARevocationCheck setting in puppetmaster frontend

https://gerrit.wikimedia.org/r/404689

CRL is now being checked by the puppet master apache frontends.

There were a few issues encountered after deployment

  1. puppetmaster1001 certificate had been revoked - a fresh certificate for puppetmaster1001 was generated and signed today
  2. similar to #1, a few nodes were also running with revoked certs - these certs were regenerated and signed
  3. some nodes are running with a valid signed certificate that somehow is not present on the puppet master (do not appear in puppet cert list, or /var/lib/puppet/server/ssl/ca/signed on puppetmaster1001).

#3 is complicated by the fact that some hosts are exposing their puppet certificate/key to be used by other applications.
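
For issues 1 and 2, one way to check whether a given certificate is listed in the CRL is to compare its serial number against the CRL's revoked entries. The following is a minimal sketch, not the procedure that was actually used here: cert_in_crl is a hypothetical name and the file arguments are whatever cert and CRL paths apply on the host being checked.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) if the certificate's serial number
# appears among the revoked entries of the CRL, fail otherwise.
cert_in_crl() {
    cert_file=$1
    crl_file=$2
    # "openssl x509 -serial" prints a line of the form serial=<HEX>
    serial=$(openssl x509 -noout -serial -in "$cert_file" | cut -d= -f2)
    [ -n "$serial" ] || return 2
    # The text dump of a CRL lists each revoked cert as "Serial Number: <HEX>"
    openssl crl -noout -text -in "$crl_file" |
        grep -qi "Serial Number: *$serial\$"
}
```

Usage: cert_in_crl <agent cert>.pem /var/lib/puppet/server/ssl/ca/ca_crl.pem && echo revoked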

A full list of nodes and their current status is tracked at https://etherpad.wikimedia.org/p/volans-tmp3

Resolving this task as agents with revoked certs are no longer able to run puppet.

Follow-up task for issue 3 above (hosts with a valid signed certificate present on the agent but not the master) is T185239

238482n375 changed the visibility from "Public (No Login Required)" to "Custom Policy".
This comment was removed by Volans.
Volans changed the visibility from "Custom Policy" to "Public (No Login Required)". Jun 15 2018, 8:32 AM