Update varnishkafka client certificate for authenticating to kafka-jumbo
Closed, Resolved · Public · 1 Estimated Story Points

Description

We have received a warning that the puppet certificate for varnishkafka is soon to expire.

This is confirmed by checking the certificate file that is present on all caching proxy servers, e.g. cp1075.

btullis@cp1075:/etc/varnishkafka/ssl$ cat varnishkafka.crt.pem | openssl x509 -noout -dates
notBefore=Dec 13 15:55:06 2017 GMT
notAfter=Dec 13 15:55:06 2022 GMT
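
To see how much time is left, the `notAfter` line can be turned into a number of days. This is a minimal sketch, assuming GNU `date` (for `-d`); the function name `days_until_expiry` is made up for illustration and is not part of any existing tooling.

```shell
# Hypothetical helper: whole days between a reference time and a certificate's notAfter.
# Assumes GNU date, which can parse the timestamp format printed by openssl.
#   $1: a "notAfter=..." line, as printed by `openssl x509 -noout -enddate`
#   $2: optional reference time (anything `date -d` accepts); defaults to now
days_until_expiry() {
    local not_after="${1#notAfter=}"   # strip the "notAfter=" prefix if present
    local ref="${2:-now}"
    local end_s ref_s
    end_s=$(date -u -d "$not_after" +%s) || return 1
    ref_s=$(date -u -d "$ref" +%s) || return 1
    echo $(( (end_s - ref_s) / 86400 ))
}
```

On a cp host this could be called as `days_until_expiry "$(openssl x509 -noout -enddate -in /etc/varnishkafka/ssl/varnishkafka.crt.pem)"`. For a plain yes/no check, `openssl x509 -checkend <seconds>` is an alternative that needs no date parsing.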

This certificate will need to be renewed, redeployed, and the varnishkafka service restarted on all cp* hosts.
Failure to do so before the expiry date will result in data loss in the webrequest stream.

The renewal process is similar to that described here: https://wikitech.wikimedia.org/wiki/Kafka/Administration#Kafka_Certificates
However, since it is a *client* certificate (where that client is varnishkafka) the process to make it live is somewhat different.

Event Timeline

BTullis triaged this task as High priority. Nov 24 2022, 2:44 PM
BTullis moved this task from Backlog to Shared Data Infra on the Data-Engineering-Planning board.

Checking the cert status on one of the cp hosts.

stevemunene@cp1077:~$ cat /etc/varnishkafka/ssl/varnishkafka.crt.pem | openssl x509 -noout -dates
notBefore=Dec 13 15:55:06 2017 GMT
notAfter=Dec 13 15:55:06 2022 GMT

Verify existing server certificates on puppet master

stevemunene@puppetmaster1001:~$ cergen -c 'varnishkafka*' --base-path=/srv/private/modules/secret/secrets/certificates /srv/private/modules/secret/secrets/certificates/certificate.manifests.d

Status of certificates ['varnishkafka']

Certificate(varnishkafka, authorities=[PuppetCA(puppetmaster1001.eqiad.wmnet_8140)]):
	/srv/private/modules/secret/secrets/certificates/varnishkafka/varnishkafka.key.private.pem: PRESENT (mtime: 2019-10-11T12:49:56.630059)
	/srv/private/modules/secret/secrets/certificates/varnishkafka/varnishkafka.key.public.pem: PRESENT (mtime: 2019-10-11T12:49:56.630059)
	/srv/private/modules/secret/secrets/certificates/varnishkafka/varnishkafka.crt.pem: PRESENT (mtime: 2019-10-11T12:49:56.630059)
	/srv/private/modules/secret/secrets/certificates/varnishkafka/ca.crt.pem: PRESENT (mtime: 2019-12-10T14:26:42.572697)
	/srv/private/modules/secret/secrets/certificates/varnishkafka/varnishkafka.keystore.p12: PRESENT (mtime: 2019-10-11T12:49:56.630059)
	/srv/private/modules/secret/secrets/certificates/varnishkafka/varnishkafka.keystore.jks: PRESENT (mtime: 2019-10-11T12:49:56.630059)
	/srv/private/modules/secret/secrets/certificates/varnishkafka/truststore.jks: PRESENT (mtime: 2019-10-11T12:49:56.630059)

@Ottomata not sure how much time we have. In theory, all Jumbo brokers will need to be able to accept the new PKI cert when validating vk's client cert, so it may require some prior testing in Kafka test (I haven't checked this use case when I tried PKI there). I am 100% on board with moving Jumbo to PKI, and it should be relatively easy, but the main question is whether we want to make big changes around the December holidays. Jumbo is also used for banner impression checks IIRC, and they usually stop us from doing anything during this time of the year :(

Okay, let's just regen the new certs using cergen for now then.

Yeah, I agree with @elukey. It's definitely a good case, but we only have until next Tuesday before this certificate expires and it feels a bit risky to try to use the PKI in that time. (Even though it will probably just work.)

@Stevemunene here's my draft action plan for how I would go about this upgrade.
I would do something like:

  • Make sure that the Traffic team is aware of this plan
  • Make sure that you're comfortable checking the status of all varnishkafka instances on each cp server - logs and systemctl status etc.
  • When you're ready...
    • Make sure to announce your intentions on #wikimedia-traffic and #wikimedia-operations to make sure no-one objects
    • disable puppet on all cp* servers
    • merge your change to the secrets repository
    • re-enable puppet and run it on a single cp host
    • verify that the certificate/keypair on disk is updated
    • check to see whether it has restarted any varnishkafka instances
    • if not, restart these varnishkafka instances
    • check that the new certificate is in use by the varnishkafka instances - hopefully it's in the logs of either the varnishkafka client or in the broker logs
  • When you're happy with that change...
    • Re-enable and run puppet on all cp* servers
    • If necessary, perform a rolling restart of all varnishkafka instances to make sure that they all pick up the new certificate.
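
The canary-first part of the plan above could be sketched as a dry run that only prints the commands. This is an illustration, not a tested runbook: the function name is made up, the canary host is whatever you pass in, and `run-puppet-agent` is assumed to be available on the hosts alongside the `disable-puppet` wrapper.

```shell
# Dry-run sketch: prints the commands for the canary-first rollout and runs nothing.
# Assumptions: `run-puppet-agent` exists on the cp hosts; the A:cp alias selects all of them.
#   $1: canary host (cumin query), $2: reason string for disabling puppet
print_rollout_plan() {
    local canary="$1"
    local reason="$2"
    cat <<EOF
sudo cumin A:cp "disable-puppet '${reason}'"
# ... merge the change to the secrets repository ...
sudo cumin '${canary}' "puppet agent --enable"
sudo cumin '${canary}' "run-puppet-agent"
sudo cumin '${canary}' "openssl x509 -noout -dates -in /etc/varnishkafka/ssl/varnishkafka.crt.pem"
sudo cumin '${canary}' "systemctl status 'varnishkafka-*.service'"
# ... restart the canary's varnishkafka instances if puppet did not, then check the logs ...
sudo cumin A:cp "puppet agent --enable"
# ... run puppet everywhere, then do a rolling restart if needed ...
EOF
}
```

For example, `print_rollout_plan cp1075.eqiad.wmnet 'renewing varnishkafka certificates - T323771'` prints the sequence for review before anything is executed by hand.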

We can check the puppet code in advance to see if we think that the varnishkafka instances will restart...

My suspicion is that the varnishkafka instance will automatically restart when the certificate is updated:
https://github.com/wikimedia/puppet/blob/production/modules/varnishkafka/manifests/instance.pp#L145

...but I'd still prefer to test it on a single host first.

disabling puppet temporarily on cp hosts
stevemunene@cumin1001:~$ sudo cumin A:cp "disable-puppet 'renewing varnishkafka certificates - T323771 - ${USER}'"
96 hosts will be targeted:
cp[2027-2042].codfw.wmnet,cp[6001-6016].drmrs.wmnet,cp[1075-1090].eqiad.wmnet,cp[5017-5032].eqsin.wmnet,cp[3050-3065].esams.wmnet,cp[4037-4052].ulsfo.wmnet
Ok to proceed on 96 hosts? Enter the number of affected hosts to confirm or "q" to quit 96

NO OUTPUT

PASS |████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (96/96) [01:23<00:00, 1.15hosts/s]
FAIL | | 0% (0/96) [01:23<?, ?hosts/s]
100.0% (96/96) success ratio (>= 100.0% threshold) for command: 'disable-puppet '...1 - stevemunene''.
100.0% (96/96) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
stevemunene@cumin1001:~$

verify certs

root@puppetmaster1001:~# puppet cert list varnishkafka
Warning: `puppet cert` is deprecated and will be removed in a future release.
   (location: /usr/lib/ruby/vendor_ruby/puppet/application.rb:370:in `run')
+ "varnishkafka" (SHA256) 37:57:F9:68:D0:1F:EF:A8:51:82:59:8B:B0:69:52:B2:14:F2:28:C2:42:78:70:FD:84:96:81:6F:74:8F:03:40

clean and destroy varnishkafka

root@puppetmaster1001:~# puppet cert clean varnishkafka
Warning: `puppet cert` is deprecated and will be removed in a future release.
   (location: /usr/lib/ruby/vendor_ruby/puppet/application.rb:370:in `run')
Notice: Revoked certificate with serial 3429
Notice: Removing file Puppet::SSL::Certificate varnishkafka at '/var/lib/puppet/server/ssl/ca/signed/varnishkafka.pem'
Notice: Removing file Puppet::SSL::Certificate varnishkafka at '/var/lib/puppet/server/ssl/certs/varnishkafka.pem'
root@puppetmaster1001:~# puppet cert destroy varnishkafka
Warning: `puppet cert` is deprecated and will be removed in a future release.
   (location: /usr/lib/ruby/vendor_ruby/puppet/application.rb:370:in `run')
Notice: Revoked certificate with serial 3429
root@puppetmaster1001:~#

Generate the certificates

root@puppetmaster1001:~# cergen --generate --force -c 'varnishkafka' --base-path=/srv/private/modules/secret/secrets/certificates /srv/private/modules/secret/secrets/certificates/certificate.manifests.d
2022-12-08 09:43:17,994 INFO     cergen                                   Generating certificates ['varnishkafka'] with force=True
2022-12-08 09:43:17,994 INFO     Certificate(varnishkafka)                Generating all files, force=True...
2022-12-08 09:43:17,996 INFO     Certificate(varnishkafka)                Generating certificate file
/usr/lib/python3/dist-packages/urllib3/connection.py:362: SubjectAltNameWarning: Certificate for puppetmaster1001.eqiad.wmnet has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning
/usr/lib/python3/dist-packages/urllib3/connection.py:362: SubjectAltNameWarning: Certificate for puppetmaster1001.eqiad.wmnet has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning
/usr/lib/python3/dist-packages/urllib3/connection.py:362: SubjectAltNameWarning: Certificate for puppetmaster1001.eqiad.wmnet has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning
2022-12-08 09:43:19,587 INFO     Certificate(varnishkafka)                Generating CA certificate file
2022-12-08 09:43:19,587 INFO     Certificate(varnishkafka)                Generating PKCS12 keystore file
2022-12-08 09:43:19,948 INFO     Certificate(varnishkafka)                Generating Java keystore file
2022-12-08 09:43:20,933 INFO     Certificate(varnishkafka)                Importing PuppetCA(puppetmaster1001.eqiad.wmnet_8140) cert into Java keystore
2022-12-08 09:43:21,922 INFO     Certificate(varnishkafka)                Generating Java truststore file with CA certificate PuppetCA(puppetmaster1001.eqiad.wmnet_8140)

Status of certificates ['varnishkafka']

Certificate(varnishkafka, authorities=[PuppetCA(puppetmaster1001.eqiad.wmnet_8140)]):
	/srv/private/modules/secret/secrets/certificates/varnishkafka/varnishkafka.key.private.pem: PRESENT (mtime: 2022-12-08T09:43:17.991139)
	/srv/private/modules/secret/secrets/certificates/varnishkafka/varnishkafka.key.public.pem: PRESENT (mtime: 2022-12-08T09:43:17.991139)
	/srv/private/modules/secret/secrets/certificates/varnishkafka/varnishkafka.crt.pem: PRESENT (mtime: 2022-12-08T09:43:19.583137)
	/srv/private/modules/secret/secrets/certificates/varnishkafka/ca.crt.pem: PRESENT (mtime: 2022-12-08T09:43:19.583137)
	/srv/private/modules/secret/secrets/certificates/varnishkafka/varnishkafka.keystore.p12: PRESENT (mtime: 2022-12-08T09:43:19.599137)
	/srv/private/modules/secret/secrets/certificates/varnishkafka/varnishkafka.keystore.jks: PRESENT (mtime: 2022-12-08T09:43:21.387135)
	/srv/private/modules/secret/secrets/certificates/varnishkafka/truststore.jks: PRESENT (mtime: 2022-12-08T09:43:22.267134)


root@puppetmaster1001:~#

Commit the changes made and test running puppet.

The certificate has been updated on disk, but the service was not restarted automatically, so the varnishkafka services will need to be restarted manually.

stevemunene@cp1075:~$  cat /etc/varnishkafka/ssl/varnishkafka.crt.pem | openssl x509 -noout -dates
notBefore=Dec  7 09:43:19 2022 GMT
notAfter=Dec  7 09:43:19 2027 GMT
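
One way to tell whether an instance still needs a restart is to compare when the unit last started with when the certificate file last changed. The helper below is a hypothetical sketch, assuming GNU `date` and `stat`, a systemd new enough to support `systemctl show --value`, and that varnishkafka only reads the certificate at startup.

```shell
# Hypothetical check: succeeds (exit 0) if the unit started before the certificate changed,
# i.e. the running process cannot have loaded the new file.
#   $1: unit start time (epoch seconds), $2: certificate mtime (epoch seconds)
started_before_cert_change() {
    local unit_start_epoch="$1"
    local cert_mtime_epoch="$2"
    [ "$unit_start_epoch" -lt "$cert_mtime_epoch" ]
}

# Possible way to obtain the two inputs on a cp host (not executed here):
#   start=$(date -d "$(systemctl show -p ActiveEnterTimestamp --value varnishkafka-webrequest.service)" +%s)
#   mtime=$(stat -c %Y /etc/varnishkafka/ssl/varnishkafka.crt.pem)
#   started_before_cert_change "$start" "$mtime" && echo "restart needed"
```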

Mentioned in SAL (#wikimedia-operations) [2022-12-08T09:56:22Z] <steve_munene> restarting varnishkafka-webrequest.service on host cp1075 T323771

Successfully restarted varnishkafka-eventlogging.service, varnishkafka-statsv.service and varnishkafka-webrequest.service on cp1075, and verified SSL.

Re-enabling puppet on all cp hosts

stevemunene@cumin1001:~$ sudo cumin A:cp "puppet agent --enable"
96 hosts will be targeted:
cp[2027-2042].codfw.wmnet,cp[6001-6016].drmrs.wmnet,cp[1075-1090].eqiad.wmnet,cp[5017-5032].eqsin.wmnet,cp[3050-3065].esams.wmnet,cp[4037-4052].ulsfo.wmnet
Ok to proceed on 96 hosts? Enter the number of affected hosts to confirm or "q" to quit 96
===== NO OUTPUT =====                                                                                                                                                  
PASS |███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (96/96) [00:05<00:00, 16.65hosts/s]
FAIL |                                                                                                                                |   0% (0/96) [00:05<?, ?hosts/s]
100.0% (96/96) success ratio (>= 100.0% threshold) for command: 'puppet agent --enable'.
100.0% (96/96) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
stevemunene@cumin1001:~$

All cp hosts have updated certs

stevemunene@cumin1001:~$ sudo cumin A:cp "cat /etc/varnishkafka/ssl/varnishkafka.crt.pem | openssl x509 -noout -dates"
96 hosts will be targeted:
cp[2027-2042].codfw.wmnet,cp[6001-6016].drmrs.wmnet,cp[1075-1090].eqiad.wmnet,cp[5017-5032].eqsin.wmnet,cp[3050-3065].esams.wmnet,cp[4037-4052].ulsfo.wmnet
Ok to proceed on 96 hosts? Enter the number of affected hosts to confirm or "q" to quit 96
===== NODE GROUP =====                                                                                                                                                 
(96) cp[2027-2042].codfw.wmnet,cp[6001-6016].drmrs.wmnet,cp[1075-1090].eqiad.wmnet,cp[5017-5032].eqsin.wmnet,cp[3050-3065].esams.wmnet,cp[4037-4052].ulsfo.wmnet       
----- OUTPUT of 'cat /etc/varnish...09 -noout -dates' -----                                                                                                            
notBefore=Dec  7 09:43:19 2022 GMT                                                                                                                                     
notAfter=Dec  7 09:43:19 2027 GMT                                                                                                                                      
================                                                                                                                                                       
PASS |███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (96/96) [00:03<00:00, 28.96hosts/s]
FAIL |                                                                                                                                |   0% (0/96) [00:03<?, ?hosts/s]
100.0% (96/96) success ratio (>= 100.0% threshold) for command: 'cat /etc/varnish...09 -noout -dates'.
100.0% (96/96) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
stevemunene@cumin1001:~$

Mentioned in SAL (#wikimedia-operations) [2022-12-08T10:43:57Z] <steve_munene> batch restarting varnishkafka-eventlogging.service in batches of 3 30 seconds in between T323771

Mentioned in SAL (#wikimedia-operations) [2022-12-08T10:56:37Z] <steve_munene> batch restarting varnishkafka-statsv.service in batches of 3 30 seconds in between T323771

Mentioned in SAL (#wikimedia-traffic) [2022-12-08T10:56:46Z] <steve_munene> batch restarting varnishkafka-statsv.service in batches of 3 30 seconds in between T323771

Batch restarting varnishkafka-eventlogging.service in batches of 3, with 30 seconds in between, to pick up the new certs.

stevemunene@cumin1001:~$ sudo cumin -b 3 -s 30 P:cache::kafka::eventlogging "systemctl restart varnishkafka-eventlogging.service"
48 hosts will be targeted:
cp[2027,2029,2031,2033,2035,2037,2039,2041].codfw.wmnet,cp[6009-6016].drmrs.wmnet,cp[1075,1077,1079,1081,1083,1085,1087,1089].eqiad.wmnet,cp[5017-5024].eqsin.wmnet,cp[3050,3052,3054,3056,3058,3060,3062,3064].esams.wmnet,cp[4037-4044].ulsfo.wmnet
Ok to proceed on 48 hosts? Enter the number of affected hosts to confirm or "q" to quit 48                                                                                                                                               
PASS |███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (48/48) [08:29<00:00, 10.62s/hosts]
FAIL |                                                                                                                                |   0% (0/48) [08:29<?, ?hosts/s]
100.0% (48/48) success ratio (>= 100.0% threshold) for command: 'systemctl restar...tlogging.service'.
100.0% (48/48) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
stevemunene@cumin1001:~$

Batch restarting varnishkafka-statsv.service in batches of 3, with 30 seconds in between

stevemunene@cumin1001:~$ sudo cumin -b 3 -s 30 P:cache::kafka::statsv "systemctl restart varnishkafka-statsv.service"
48 hosts will be targeted:
cp[2027,2029,2031,2033,2035,2037,2039,2041].codfw.wmnet,cp[6009-6016].drmrs.wmnet,cp[1075,1077,1079,1081,1083,1085,1087,1089].eqiad.wmnet,cp[5017-5024].eqsin.wmnet,cp[3050,3052,3054,3056,3058,3060,3062,3064].esams.wmnet,cp[4037-4044].ulsfo.wmnet
Ok to proceed on 48 hosts? Enter the number of affected hosts to confirm or "q" to quit 48                                                                                                                                                
PASS |███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (48/48) [08:23<00:00, 10.49s/hosts]
FAIL |                                                                                                                                |   0% (0/48) [08:23<?, ?hosts/s]
100.0% (48/48) success ratio (>= 100.0% threshold) for command: 'systemctl restar...a-statsv.service'.
100.0% (48/48) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

Mentioned in SAL (#wikimedia-operations) [2022-12-08T11:09:53Z] <steve_munene> batch restarting varnishkafka-webrequest.service in batches of 3 30 seconds in between T323771

Mentioned in SAL (#wikimedia-traffic) [2022-12-08T11:10:18Z] <steve_munene> batch restarting varnishkafka-webrequest.service in batches of 3 30 seconds in between T323771

Batch restarting varnishkafka-webrequest.service in batches of 3, with 30 seconds in between

stevemunene@cumin1001:~$ sudo cumin -b 3 -s 30 P:cache::kafka::webrequest "systemctl restart varnishkafka-webrequest.service"
96 hosts will be targeted:
cp[2027-2042].codfw.wmnet,cp[6001-6016].drmrs.wmnet,cp[1075-1090].eqiad.wmnet,cp[5017-5032].eqsin.wmnet,cp[3050-3065].esams.wmnet,cp[4037-4052].ulsfo.wmnet
Ok to proceed on 96 hosts? Enter the number of affected hosts to confirm or "q" to quit 96
===== NO OUTPUT =====                                                                                                                                                  
PASS |███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (96/96) [17:17<00:00, 10.81s/hosts]
FAIL |                                                                                                                                |   0% (0/96) [17:17<?, ?hosts/s]
100.0% (96/96) success ratio (>= 100.0% threshold) for command: 'systemctl restar...brequest.service'.
100.0% (96/96) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
stevemunene@cumin1001:~$
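
The runs above progressed at roughly 10.5 to 10.8 seconds per host, which is close to the batch sleep divided by the batch size (30 / 3 = 10). As a rough planning aid, that ratio can be used to estimate how long a rolling restart will take. This is a back-of-the-envelope sketch with a made-up function name; it ignores the time the command itself takes and is an approximation, not a description of how cumin schedules hosts.

```shell
# Rough lower-bound estimate, in seconds, for a rolling restart with `cumin -b <batch> -s <sleep>`.
# Assumes throughput of about (sleep / batch) seconds per host, as observed in the runs above.
#   $1: number of hosts, $2: batch size (-b), $3: batch sleep in seconds (-s)
estimate_restart_seconds() {
    local hosts="$1"
    local batch="$2"
    local sleep_s="$3"
    echo $(( hosts * sleep_s / batch ))
}
```

For 48 hosts this gives 480 seconds (8:00, against 8:23 to 8:29 observed), and for 96 hosts 960 seconds (16:00, against 17:17 observed).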

All varnishkafka services successfully restarted to use the new certificates.