Move varnishkafka to PKI
Closed, ResolvedPublic

Description

This is similar to T337248: it would be great for the varnishkafka instances on cache nodes to use PKI-based TLS certificates when connecting to Kafka brokers. At the moment we use a cergen certificate with CN: varnishkafka, which the Kafka brokers use as the username when evaluating ACLs (for example, only varnishkafka can produce messages to webrequest topics).
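
To illustrate why the new certificate has to keep the same CN, here is a minimal sketch of an ACL check keyed on the certificate's common name. The ACL table, the topic name webrequest_example and both helper functions are hypothetical and only model the idea; the brokers' real principal mapping and ACL storage are not shown in this task.

```python
from typing import Optional

# Hypothetical ACL table: topic name -> set of usernames allowed to produce.
EXAMPLE_ACLS = {"webrequest_example": {"varnishkafka"}}


def common_name(subject_dn: str) -> Optional[str]:
    """Return the CN value from a comma-separated subject DN, or None.

    Naive on purpose: it does not handle escaped commas inside values.
    """
    for part in subject_dn.split(","):
        key, _, value = part.strip().partition("=")
        if key.upper() == "CN":
            return value
    return None


def may_produce(subject_dn: str, topic: str, acls) -> bool:
    """True if the CN of the client certificate is allowed on the topic."""
    user = common_name(subject_dn)
    return user is not None and user in acls.get(topic, set())
```

Under this model a PKI certificate issued with a different CN would be rejected for the same topic, which is why the migration keeps CN: varnishkafka.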

I filed 3 code reviews starting from https://gerrit.wikimedia.org/r/c/operations/puppet/+/924506, to do the following:

  1. Create a catch-all systemd unit called varnishkafka-all that restarts all the varnishkafka instances present on a cache node.
  2. Add the possibility to provision a PKI TLS certificate with CN:varnishkafka on cache nodes.
  3. Apply the settings to cp4037 (depool the node, send some test traffic, check the error logs, verify that messages are landing in Kafka, etc.).

If the above works we could then apply the setting to all cache nodes incrementally. Lemme know your thoughts!
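
For the "check error logs" part of step 3, a rough sketch of a filter for TLS handshake failures follows. It assumes journal lines in which librdkafka's message is pipe-delimited (level, timestamp, facility, client name, message); the function names are made up for illustration and this is not an existing tool.

```python
import re

# %<level>|<timestamp>|<facility>|<client name>| <message>
RDKAFKA_LINE = re.compile(
    r"%(?P<level>\d)\|(?P<ts>\d+\.\d+)\|(?P<facility>[^|]+)\|"
    r"(?P<client>[^|]+)\|\s*(?P<message>.*)"
)


def parse_rdkafka_line(line):
    """Return the parsed fields of a librdkafka log line, or None."""
    match = RDKAFKA_LINE.search(line)
    if match is None:
        return None
    return {
        "level": int(match["level"]),
        "facility": match["facility"],
        "client": match["client"],
        "message": match["message"],
    }


def handshake_failures(lines):
    """Keep only FAIL entries that mention an SSL handshake failure."""
    parsed = (parse_rdkafka_line(line) for line in lines)
    return [
        entry for entry in parsed
        if entry is not None
        and entry["facility"] == "FAIL"
        and "SSL handshake failed" in entry["message"]
    ]
```

Feeding it the output of journalctl for the varnishkafka units on the depooled node would give a quick yes/no signal before repooling.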

Event Timeline

Change 924506 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] varnishkafka: add catch all systemd unit

https://gerrit.wikimedia.org/r/924506

Change 924507 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::cache::kafka: add support for PKI

https://gerrit.wikimedia.org/r/924507

Change 924509 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Move cp4037's varnishkafka instances to PKI

https://gerrit.wikimedia.org/r/924509

Change 924506 merged by Elukey:

[operations/puppet@production] varnishkafka: add catch all systemd unit

https://gerrit.wikimedia.org/r/924506

Mentioned in SAL (#wikimedia-operations) [2023-06-07T15:23:09Z] <elukey> all varnishkafka instances on caching nodes are getting restarted due to https://gerrit.wikimedia.org/r/c/operations/puppet/+/928087 - T337825

Mentioned in SAL (#wikimedia-analytics) [2023-06-07T15:23:12Z] <elukey> all varnishkafka instances on caching nodes are getting restarted due to https://gerrit.wikimedia.org/r/c/operations/puppet/+/928087 - T337825

The new varnishkafka-all unit is being rolled out across all cp nodes.

Next steps:

Change 924507 merged by Elukey:

[operations/puppet@production] profile::cache::kafka: add support for PKI

https://gerrit.wikimedia.org/r/924507

Change 924509 merged by Elukey:

[operations/puppet@production] Move cp4037's varnishkafka instances to PKI

https://gerrit.wikimedia.org/r/924509

Change 928846 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::cache::kafka::certificate: use root instead of the kafka user

https://gerrit.wikimedia.org/r/928846

Change 928846 merged by Elukey:

[operations/puppet@production] profile::cache::kafka::certificate: use root instead of the kafka user

https://gerrit.wikimedia.org/r/928846

Jun 09 14:05:42 cp4037 varnishkafka[3568251]: %3|1686319542.526|FAIL|varnishkafka#producer-1| [thrd:ssl://kafka-jumbo1009.eqiad.wmnet:9093/bootstrap]: ssl://kafka-jumbo1009.eqiad.wmnet:9093/bootstrap: SSL handshake failed: ../ssl/statem/statem_clnt.c:393: error:141A10F4:SSL routines:ossl_statem_client_read_transition:unexpected message: client SSL authentication might be required (see ssl.key.location and ssl.certificate.location and consult the broker logs for more information) (after 221ms in state CONNECT, 8 identical error(s) suppressed)

The above happened on cp4037 (depooled). After reading https://github.com/confluentinc/librdkafka/wiki/Using-SSL-with-librdkafka#configure-librdkafka-clientvar I suspect that librdkafka wants a .pem file, or a key file, not a keystore.
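
One quick way to test that suspicion is to look at the first bytes of whatever file ssl.certificate.location and ssl.key.location point at. The sketch below is only a heuristic, and the function name is made up for illustration: PEM files start with a "-----BEGIN" armor line, Java keystores (JKS) start with the magic bytes FE ED FE ED, and DER-encoded files such as PKCS#12 bundles start with an ASN.1 SEQUENCE tag (0x30).

```python
def guess_key_material_format(data: bytes) -> str:
    """Guess the container format of TLS key material from its first bytes."""
    if data.lstrip().startswith(b"-----BEGIN "):
        return "pem"
    if data[:4] == b"\xfe\xed\xfe\xed":
        return "jks"
    if data[:1] == b"\x30":
        # Any DER structure starts like this, PKCS#12 included.
        return "der-or-pkcs12"
    return "unknown"
```

If the file handed to librdkafka reports anything other than "pem", the client configuration is the first thing to fix.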

Change 928854 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::cache::kafka::certificate: fix client PKI config

https://gerrit.wikimedia.org/r/928854

Change 928862 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Move cp4037's varnishkafka instances to PKI

https://gerrit.wikimedia.org/r/928862

Change 928854 merged by Elukey:

[operations/puppet@production] profile::cache::kafka::certificate: fix client PKI config

https://gerrit.wikimedia.org/r/928854

Change 928862 merged by Elukey:

[operations/puppet@production] Move cp4037's varnishkafka instances to PKI

https://gerrit.wikimedia.org/r/928862

I've replicated a successful mTLS handshake with openssl s_client using the following command:

vgutierrez@cp4037:~$ sudo openssl s_client -connect kafka-jumbo1001.eqiad.wmnet:9093  -cert /etc/varnishkafka/ssl/kafka__varnishkafka_kafka_11.pem -key /etc/varnishkafka/ssl/kafka__varnishkafka_kafka_11-key.pem -curves prime256v1 -cipher ECDHE-ECDSA-AES256-GCM-SHA384 -cert_chain /etc/varnishkafka/ssl/kafka__varnishkafka_kafka_11.chain.pem -sigalgs ECDSA+SHA256

The problem here is that librdkafka has no equivalent of -cert_chain: it attempts to load the chain from the same PEM file that contains the certificate:

if (rk->rk_conf.ssl.cert_location) {
        rd_kafka_dbg(rk, SECURITY, "SSL",
                     "Loading public key from file %s",
                     rk->rk_conf.ssl.cert_location);

        r = SSL_CTX_use_certificate_chain_file(
            ctx, rk->rk_conf.ssl.cert_location);

        if (r != 1) {
                rd_snprintf(errstr, errstr_size,
                            "ssl.certificate.location failed: ");
                return -1;
        }
}

SSL_CTX_use_certificate_chain_file() documentation says:

SSL_CTX_use_certificate_chain_file() loads a certificate chain from file into ctx. The certificates must be in PEM format and must be sorted starting with the subject's certificate (actual client or server certificate), followed by intermediate CA certificates if applicable, and ending at the highest level (root) CA

Note the requirement "must be sorted starting with the subject's certificate": kafka__varnishkafka_kafka_11.chained.pem starts with the intermediate CA certificate rather than the subject's certificate, which is not the order OpenSSL expects.
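
The ordering requirement can be checked, and a correctly ordered bundle produced, with a few lines of text processing. This is a sketch under stated assumptions, not the fix that was merged: the function names are hypothetical, certificates are compared as PEM text rather than parsed as X.509, and the relative order of the CA certificates is kept as found instead of being verified.

```python
import re

PEM_CERT = re.compile(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----", re.DOTALL
)


def split_certs(pem_text):
    """Return the certificate blocks of a PEM bundle, in file order."""
    return [block.strip() for block in PEM_CERT.findall(pem_text)]


def _normalized(block):
    """Drop all whitespace so line wrapping does not affect comparison."""
    return "".join(block.split())


def is_leaf_first(bundle_text, leaf_text):
    """True if the first certificate in the bundle is the subject's own."""
    bundle, leaf = split_certs(bundle_text), split_certs(leaf_text)
    return bool(bundle) and bool(leaf) and (
        _normalized(bundle[0]) == _normalized(leaf[0])
    )


def leaf_first_bundle(leaf_text, chain_text):
    """Build a bundle with the subject's certificate first, CAs after it."""
    leaves = split_certs(leaf_text)
    if not leaves:
        raise ValueError("no certificate found in leaf_text")
    leaf = leaves[0]
    others = [
        cert for cert in split_certs(chain_text)
        if _normalized(cert) != _normalized(leaf)
    ]
    return "\n".join([leaf] + others) + "\n"
```

A file produced this way is in the order SSL_CTX_use_certificate_chain_file() expects, so it can be passed to ssl.certificate.location on its own.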

Change 929619 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Move varnishkafka instances on cp4037 to PKI

https://gerrit.wikimedia.org/r/929619

Change 929619 merged by Elukey:

[operations/puppet@production] Move varnishkafka instances on cp4037 to PKI

https://gerrit.wikimedia.org/r/929619

Mentioned in SAL (#wikimedia-operations) [2023-06-13T07:10:08Z] <elukey> move varnishkafka instances on cp4037 to PKI TLS certs - T337825

All vk instances on cp4037 are now running with the PKI certificates. Next steps:

  1. Monitor cp4037 to verify that nothing explodes.
  2. Extend the change to ulsfo and monitor.
  3. Extend the change to all other DCs and monitor.
  4. Clean up the old cert (Puppet master, Puppet CA, nodes, etc.).

Change 929963 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::cache::{text,upload}: move ulsfo varnishkafkas to PKI

https://gerrit.wikimedia.org/r/929963

Change 929963 merged by Elukey:

[operations/puppet@production] role::cache::{text,upload}: move ulsfo varnishkafkas to PKI

https://gerrit.wikimedia.org/r/929963

Mentioned in SAL (#wikimedia-operations) [2023-06-15T09:05:40Z] <elukey> move varnishkafka instances in ulsfo to PKI - T337825

Next steps:

  • Roll out the changes to eqsin, and monitor.
  • Roll out the changes to codfw, and monitor.
  • Roll out the changes to eqiad, and monitor.
  • Roll out the changes to esams, and monitor.
  • Clean up the old cert (Puppet master, Puppet CA, nodes, etc.).

Change 930633 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::cache::{text,upload}: move vk instances to PKI in eqsin

https://gerrit.wikimedia.org/r/930633

Change 930633 merged by Elukey:

[operations/puppet@production] role::cache::{text,upload}: move vk instances to PKI in eqsin

https://gerrit.wikimedia.org/r/930633

Mentioned in SAL (#wikimedia-analytics) [2023-06-19T14:04:43Z] <elukey> move varnishkafka instances in eqsin to PKI - T337825

Change 931498 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::cache::{text,upload}: move vk instances to PKI

https://gerrit.wikimedia.org/r/931498

Change 931498 merged by Elukey:

[operations/puppet@production] role::cache::{text,upload}: move vk codfw instances to PKI

https://gerrit.wikimedia.org/r/931498

Mentioned in SAL (#wikimedia-analytics) [2023-06-21T12:51:39Z] <elukey> move varnishkafka instances in codfw to PKI - T337825

Change 932217 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Move eqiad varnishkafka instances to PKI

https://gerrit.wikimedia.org/r/932217

Change 932218 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Move esams varnishkafka instances to PKI

https://gerrit.wikimedia.org/r/932218

Change 932219 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Move drmrs Varnishkafka instances to PKI

https://gerrit.wikimedia.org/r/932219

Change 932217 merged by Elukey:

[operations/puppet@production] Move eqiad varnishkafka instances to PKI

https://gerrit.wikimedia.org/r/932217

Mentioned in SAL (#wikimedia-analytics) [2023-06-22T13:16:57Z] <elukey> move varnishkafka instances in eqiad to PKI - T337825

Change 932219 merged by Elukey:

[operations/puppet@production] Move drmrs Varnishkafka instances to PKI

https://gerrit.wikimedia.org/r/932219

Mentioned in SAL (#wikimedia-analytics) [2023-06-23T12:40:38Z] <elukey> move varnishkafka drmrs instances to pki - T337825

Change 932218 merged by Elukey:

[operations/puppet@production] Move esams varnishkafka instances to PKI

https://gerrit.wikimedia.org/r/932218

Mentioned in SAL (#wikimedia-analytics) [2023-06-26T14:06:01Z] <elukey> move varnishkafka instances in esams to pki - T337825

All varnishkafkas on PKI!

Remaining steps:

  • clean up the old certificate from puppet private and puppet CA.

Mentioned in SAL (#wikimedia-operations) [2023-06-27T08:38:42Z] <elukey> revoked puppet cert for 'varnishkafka' and cleaned up its cergen's files in puppet private - T337825

elukey claimed this task.

Change #1030052 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Simplify profile::cache::kafka::certificate to only support PKI/cfssl

https://gerrit.wikimedia.org/r/1030052