Switch to encrypted kafka for coal/navtiming/statsv
Closed, Resolved (Public)

Description

While auditing cross-datacenter traffic in T286038, I came across a few plaintext Kafka connections, including webperf2001 talking to Kafka main/jumbo:

root@webperf2001:~# ps fwwwaux | grep -i 9092
nobody   24456  3.2  0.9 172636 38748 ?        Ssl  Jul29 1584:59 python3 /srv/deployment/performance/coal/run_coal.py --brokers kafka-jumbo1001.eqiad.wmnet:9092,kafka-jumbo1002.eqiad.wmnet:9092,kafka-jumbo1003.eqiad.wmnet:9092,kafka-jumbo1004.eqiad.wmnet:9092,kafka-jumbo1005.eqiad.wmnet:9092,kafka-jumbo1006.eqiad.wmnet:9092,kafka-jumbo1007.eqiad.wmnet:9092,kafka-jumbo1008.eqiad.wmnet:9092,kafka-jumbo1009.eqiad.wmnet:9092 --consumer-group coal_codfw --schema NavigationTiming --schema SaveTiming --schema PaintTiming --graphite-host graphite-in.eqiad.wmnet --graphite-port 2003 --graphite-prefix coal
root     21191  0.0  0.0  12780   980 pts/0    S+   08:57   0:00                      \_ grep -i 9092
nobody   22495 36.1  9.3 451380 377276 ?       Ssl  Aug19 6609:33 /usr/bin/python3 /srv/deployment/performance/navtiming/run_navtiming.py --brokers kafka-jumbo1001.eqiad.wmnet:9092,kafka-jumbo1002.eqiad.wmnet:9092,kafka-jumbo1003.eqiad.wmnet:9092,kafka-jumbo1004.eqiad.wmnet:9092,kafka-jumbo1005.eqiad.wmnet:9092,kafka-jumbo1006.eqiad.wmnet:9092,kafka-jumbo1007.eqiad.wmnet:9092,kafka-jumbo1008.eqiad.wmnet:9092,kafka-jumbo1009.eqiad.wmnet:9092 --consumer-group navtiming --statsd-host statsd.eqiad.wmnet --statsd-port 8125
nobody   21162  1.1  0.5 153132 23240 ?        Ssl  08:57   0:00 /usr/bin/python3 /srv/deployment/statsv/statsv/statsv.py --brokers kafka-main2001.codfw.wmnet:9092,kafka-main2002.codfw.wmnet:9092,kafka-main2003.codfw.wmnet:9092,kafka-main2004.codfw.wmnet:9092,kafka-main2005.codfw.wmnet:9092 --statsd 127.0.0.1:9125 --topics statsv
nobody   21163  0.0  0.4  72228 16256 ?        S    08:57   0:00  \_ /usr/bin/python3 /srv/deployment/statsv/statsv/statsv.py --brokers kafka-main2001.codfw.wmnet:9092,kafka-main2002.codfw.wmnet:9092,kafka-main2003.codfw.wmnet:9092,kafka-main2004.codfw.wmnet:9092,kafka-main2005.codfw.wmnet:9092 --statsd 127.0.0.1:9125 --topics statsv
nobody   21164  0.0  0.4  72228 16700 ?        S    08:57   0:00  \_ /usr/bin/python3 /srv/deployment/statsv/statsv/statsv.py --brokers kafka-main2001.codfw.wmnet:9092,kafka-main2002.codfw.wmnet:9092,kafka-main2003.codfw.wmnet:9092,kafka-main2004.codfw.wmnet:9092,kafka-main2005.codfw.wmnet:9092 --statsd 127.0.0.1:9125 --topics statsv

We should switch these connections to encrypted Kafka (and we can then set proper ACLs too).
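
For illustration only, here is a minimal sketch of what the client-side change could look like. It assumes the services build their consumer with the kafka-python library, that the brokers expose a TLS listener on port 9093, and that a CA bundle is available on the host; none of these are confirmed in this task. The helper name `tls_consumer_kwargs`, the CA path, and the topic name are hypothetical.

```python
def tls_consumer_kwargs(brokers, group_id, ca_file,
                        plaintext_port=9092, tls_port=9093):
    """Build keyword arguments for a TLS-enabled kafka-python KafkaConsumer.

    `brokers` is a comma-separated string in the same form as the
    --brokers flag shown above. The port numbers are assumptions.
    """
    tls_brokers = []
    for broker in brokers.split(","):
        host, _, port = broker.strip().rpartition(":")
        if not host:
            # No port was given, so rpartition put the hostname in `port`.
            host, port = port, str(tls_port)
        elif int(port) == plaintext_port:
            # Swap the plaintext port for the (assumed) TLS port.
            port = str(tls_port)
        tls_brokers.append(f"{host}:{port}")
    return {
        "bootstrap_servers": tls_brokers,
        "group_id": group_id,
        "security_protocol": "SSL",  # kafka-python's name for TLS
        "ssl_cafile": ca_file,       # CA bundle used to verify the brokers
    }


kwargs = tls_consumer_kwargs(
    "kafka-jumbo1001.eqiad.wmnet:9092,kafka-jumbo1002.eqiad.wmnet:9092",
    group_id="navtiming",
    ca_file="/etc/ssl/certs/ca-certificates.crt",  # hypothetical path
)
print(kwargs["bootstrap_servers"])

# With kafka-python installed, the result would be passed along like this
# (the topic name is a placeholder):
#   from kafka import KafkaConsumer
#   consumer = KafkaConsumer("some_topic", **kwargs)
```

Keeping the port rewrite in one helper means the existing `--brokers` value can stay in the same format while the port is switched in a single place.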

Event Timeline

fgiunchedi triaged this task as Medium priority. (Sep 2 2021, 8:07 AM)

Change 721044 had a related patch set uploaded (by Dave Pifke; author: Dave Pifke):

[analytics/statsv@master] Add TLS support

https://gerrit.wikimedia.org/r/721044

Change 721047 had a related patch set uploaded (by Dave Pifke; author: Dave Pifke):

[operations/puppet@production] statsv: add TLS support

https://gerrit.wikimedia.org/r/721047

Change 721567 had a related patch set uploaded (by Dave Pifke; author: Dave Pifke):

[performance/navtiming@master] Add Kafka TLS support

https://gerrit.wikimedia.org/r/721567

Sigh. TLS isn't enabled for jumbo Kafka in the deployment-prep cluster (unlike jumbo Kafka in production).

It's really frustrating that there's no effort to keep deployment-prep and production in sync, as it makes things like this really difficult to test.

I've enabled the TLS listener in deployment-prep and confirmed the Navtiming patch works. Next up: Coal.

Change 722948 had a related patch set uploaded (by Dave Pifke; author: Dave Pifke):

[performance/coal@master] Add Kafka TLS support

https://gerrit.wikimedia.org/r/722948

Change 721567 merged by jenkins-bot:

[performance/navtiming@master] Add Kafka TLS support

https://gerrit.wikimedia.org/r/721567

Change 722948 merged by jenkins-bot:

[performance/coal@master] Add Kafka TLS support

https://gerrit.wikimedia.org/r/722948

Mentioned in SAL (#wikimedia-operations) [2021-09-30T23:39:12Z] <dpifke@deploy1002> Started deploy [performance/navtiming@29264fb]: Deploy Navtiming with Kafka TLS support (not yet enabled) T290131

Mentioned in SAL (#wikimedia-operations) [2021-09-30T23:39:19Z] <dpifke@deploy1002> Finished deploy [performance/navtiming@29264fb]: Deploy Navtiming with Kafka TLS support (not yet enabled) T290131 (duration: 00m 05s)

Mentioned in SAL (#wikimedia-operations) [2021-09-30T23:40:22Z] <dpifke@deploy1002> Started deploy [performance/coal@1be49f8]: Deploy Coal with Kafka TLS support (not yet enabled) T290131

Mentioned in SAL (#wikimedia-operations) [2021-09-30T23:41:29Z] <dpifke@deploy1002> Finished deploy [performance/coal@1be49f8]: Deploy Coal with Kafka TLS support (not yet enabled) T290131 (duration: 01m 07s)

Change 721044 merged by Dave Pifke:

[analytics/statsv@master] Add TLS support

https://gerrit.wikimedia.org/r/721044

Mentioned in SAL (#wikimedia-operations) [2021-09-30T23:48:01Z] <dpifke@deploy1002> Started deploy [statsv/statsv@afeff42]: Deploy statsv with Kafka TLS support (not yet enabled) T290131

Mentioned in SAL (#wikimedia-operations) [2021-09-30T23:48:08Z] <dpifke@deploy1002> Finished deploy [statsv/statsv@afeff42]: Deploy statsv with Kafka TLS support (not yet enabled) T290131 (duration: 00m 05s)

Change 721047 merged by RLazarus:

[operations/puppet@production] webperf: connect to Kafka using TLS

https://gerrit.wikimedia.org/r/721047

Confirmed all three services are now talking TLS to Kafka on webperf1001 and webperf2001.
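
One rough way to repeat the check from the description is to scan process command lines for brokers still configured on the plaintext port. The sketch below is hypothetical and not the method used here: the function name is made up, and it assumes 9092 is the plaintext port, as in the `ps` output above. It only inspects configuration, so it does not prove that a TLS handshake actually succeeds.

```python
import re

# Matches the value of a --brokers flag as it appears in the ps output above.
BROKERS_RE = re.compile(r"--brokers\s+(\S+)")


def plaintext_brokers(cmdline, plaintext_port=9092):
    """Return the brokers in `cmdline` that still use the plaintext port."""
    match = BROKERS_RE.search(cmdline)
    if match is None:
        return []
    return [
        broker
        for broker in match.group(1).split(",")
        if broker.rsplit(":", 1)[-1] == str(plaintext_port)
    ]


# Sample command line (shortened and made up for illustration).
sample = (
    "/usr/bin/python3 statsv.py --brokers "
    "kafka-main2001.codfw.wmnet:9092,kafka-main2002.codfw.wmnet:9093 "
    "--topics statsv"
)
print(plaintext_brokers(sample))  # → ['kafka-main2001.codfw.wmnet:9092']
```

An empty list for every service's command line would be consistent with the switch being complete, though inspecting the live connections remains the stronger check.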