Move webrequest varnishkafka and consumers to Kafka jumbo cluster.
Closed, Resolved | Public | 5 Estimated Story Points

Description

Task for ongoing work. See also T182993.

For webrequest_text:

  • FRack kafkatee, coordinate with Jeff Green for webrequest_text switchover
  • Restart Druid Tranquility Banner Impressions Spark Job consuming from jumbo

Cleanup: https://gerrit.wikimedia.org/r/#/c/416761/

  • Remove webrequest-analytics kafkatee instance from rhenium and oxygen and in puppet.
  • Remove webrequest-analytics camus job from analytics1003 and in puppet.

Event Timeline

Ottomata triaged this task as Medium priority. Jan 17 2018, 8:42 PM
Ottomata created this task.

@Jgreen FYI, we'll need to coordinate this soon :)

No problem. I tried a little puppet spelunking but it's hard to see what we'll need to adjust. I'm guessing we're talking about a new pool of kafka hosts and also enabling TLS?

I'm guessing we're talking about a new pool of kafka hosts

Yup! Mostly just changing settings and bouncing the kafkatee instances, but we'll have to coordinate it. If yall use any offset storage features of kafkatee, we'll have to wipe those and start with new offsets.

and also enabling TLS?

Maybe! Are all the kafkatee consumers in eqiad? If so, we probably don't need to enable TLS. But we could!

Sounds good, first step on the frack side is to whitelist the new hosts at the firewalls, can you point me to the list and I'll add a phabricator task?

and also enabling TLS?

Maybe! Are all the kafkatee consumers in eqiad? If so, we probably don't need to enable TLS. But we could!

Our active consumer is in eqiad, standby is in codfw, so I think we should strive for this. I think we're running a version that supports TLS, so the tricky part will be getting the SSL certs and CA set up.

first step on the frack side is to whitelist the new hosts at the firewalls, can you point me to the list and I'll add a phabricator task?

kafka-jumbo100[1-6].eqiad.wmnet
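
For the firewall whitelist, the bracket shorthand above can be expanded into explicit hostnames. A minimal Python sketch, assuming the shorthand means hosts 1001 through 1006; the helper name `expand_host_range` is made up for illustration and only handles a single numeric range:

```python
import re


def expand_host_range(pattern: str) -> list[str]:
    """Expand one '[a-b]' numeric range in a hostname pattern (hypothetical helper)."""
    match = re.search(r"\[(\d+)-(\d+)\]", pattern)
    if match is None:
        return [pattern]  # nothing to expand
    start, end = int(match.group(1)), int(match.group(2))
    prefix, suffix = pattern[:match.start()], pattern[match.end():]
    return [f"{prefix}{n}{suffix}" for n in range(start, end + 1)]


hosts = expand_host_range("kafka-jumbo100[1-6].eqiad.wmnet")
for host in hosts:
    print(host)
```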

I think we're running a version that supports TLS

Great! Yeah we should get you the latest 0.11 librdkafka then. There's both stretch and jessie (in our jessie backports) versions available. It uses a more recent libssl 1.1.

the tricky part will be getting the SSL certs and CA set up.

The Kafka brokers are only configured with the Puppet CA cert as the CA. So your client cert will need to be signed by the ops Puppet CA. (Unless we decided to add another CA for FRack?). I've (hopefully) made this much easier! https://wikitech.wikimedia.org/wiki/Cergen

Frack puppet is independent with its own CA. As long as we can create certs for the two frack hosts with the production CA, I can work on configuring kafkatee to use them.

Another question, does kafka use a different port for TLS service?

Yes, :9093.

As long as we can create certs for the two frack hosts with the production CA

Ya can do, although I'd just create one cert just for FRack kafkatee usage. We'll be using one cert for all varnishkafkas.

The librdkafka / kafkatee settings would then look something like:

kafka.metadata.broker.list = kafka-jumbo1001.eqiad.wmnet:9093,kafka-jumbo1002.eqiad.wmnet:9093,kafka-jumbo1003.eqiad.wmnet:9093,kafka-jumbo1004.eqiad.wmnet:9093,kafka-jumbo1005.eqiad.wmnet:9093,kafka-jumbo1006.eqiad.wmnet:9093
kafka.security.protocol=SSL
kafka.ssl.ca.location=/path/to/puppet/ca_certificate.crt.pem
kafka.ssl.key.password=xxxxxxxxx
kafka.ssl.key.location=/path/to/private.key.pem
kafka.ssl.certificate.location=/path/to/public/certificate.crt.pem
kafka.ssl.cipher.suites=ECDHE-ECDSA-AES256-GCM-SHA384
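
A rough way to sanity-check settings like these before bouncing kafkatee. This Python sketch assumes the file uses `key = value` lines with `#` comments, and that `kafka.`-prefixed properties are the ones handed to librdkafka; the function names and the shortened sample are hypothetical. It only checks that every broker in the list uses the TLS port when `security.protocol` is `SSL`:

```python
def parse_kafka_props(text: str) -> dict[str, str]:
    """Collect 'kafka.'-prefixed 'key = value' lines, with the prefix stripped."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key.startswith("kafka."):
            props[key[len("kafka."):]] = value
    return props


def brokers_not_on_port(props: dict[str, str], port: int) -> list[str]:
    """Return brokers from metadata.broker.list that do not end in ':<port>'."""
    brokers = [b.strip() for b in props.get("metadata.broker.list", "").split(",")]
    return [b for b in brokers if b and not b.endswith(f":{port}")]


# Shortened sample based on the settings above (two brokers instead of six).
SAMPLE = """\
kafka.metadata.broker.list = kafka-jumbo1001.eqiad.wmnet:9093,kafka-jumbo1002.eqiad.wmnet:9093
kafka.security.protocol=SSL
kafka.ssl.ca.location=/path/to/puppet/ca_certificate.crt.pem
"""

props = parse_kafka_props(SAMPLE)
wrong = []
if props.get("security.protocol") == "SSL":
    wrong = brokers_not_on_port(props, 9093)
print("brokers not on the TLS port:", wrong)
```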
Ottomata updated the task description.

Change 409027 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::cache::misc: add a ad-hoc varnishkafka instance to test TLS

https://gerrit.wikimedia.org/r/409027

Change 409027 merged by Elukey:
[operations/puppet@production] role::cache::misc: add a ad-hoc varnishkafka instance to test TLS

https://gerrit.wikimedia.org/r/409027

Change 409085 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::cache::kafka::webrequest::jumbo: fix default kafka cluster

https://gerrit.wikimedia.org/r/409085

Change 409085 merged by Elukey:
[operations/puppet@production] profile::cache::kafka::webrequest::jumbo: fix default kafka cluster

https://gerrit.wikimedia.org/r/409085

From IRC:

13:36  <elukey> I wanted to check latencies between cp hosts and kafka brokers, for analytics and jumbo
13:36  <elukey> so I picked up the rtt librdkafka metric
13:36  <elukey> it is a graphite metric reporting min/max/avg
13:36  <elukey> so I plotted three graphs for plaintext (port 9092) and three for TLS (port 9093)
13:37  <elukey> avg(min), avg(avg), avg(max)
13:37  <elukey> https://grafana.wikimedia.org/dashboard/db/varnishkafka
13:37  <elukey> it is not super perfect but it is good to see the diff between the webrequest instance and the webrequest-duplicate-jumbo one
13:39  <elukey> as far as I can see all looks good
13:39  <elukey> I thought to use percentiles but it was a bit confusing imho, I wanted one (aggregated) metric for each broker
13:40  <elukey> anyhow, so far the test looks very good :)
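
The aggregation described in the log (one avg(min), avg(avg), avg(max) per broker) could be sketched as below. The numbers are invented and the data layout is an assumption: each tuple stands for one cache host's (min, avg, max) rtt reading towards that broker, in milliseconds. The real graphs were built from graphite metrics, not from a script like this.

```python
from statistics import mean

# Hypothetical readings: broker -> one (min, avg, max) rtt tuple per cache host.
samples = {
    "kafka-jumbo1001": [(1.0, 2.0, 5.0), (1.2, 2.4, 7.0)],
    "kafka-jumbo1002": [(0.8, 1.6, 4.0), (1.0, 2.0, 6.0)],
}


def aggregate_rtt(readings):
    """Average each of min/avg/max across hosts, giving one value per series."""
    mins, avgs, maxes = zip(*readings)
    return {"avg(min)": mean(mins), "avg(avg)": mean(avgs), "avg(max)": mean(maxes)}


per_broker = {broker: aggregate_rtt(r) for broker, r in samples.items()}
for broker, agg in per_broker.items():
    print(broker, agg)
```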
Jgreen closed subtask Restricted Task as Resolved. Feb 15 2018, 8:03 PM

Yargh, @elukey. kafkatee. Gonna be weird on oxygen, since we can't make kafkatee consume from multiple kafka clusters at once. If we want to keep kafkatee working there in the same way it is now, we'll have to modify kafkatee puppet to support multiple instances. I still am not sure this will work, as I dunno what will happen if we pipe output to the same file from multiple instances.
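
On the shared-output-file question, a small experiment along these lines could show whether lines stay intact when several writers append to one file. This Python sketch uses threads as stand-ins for kafkatee instances and assumes each line is written with a single `write()` on a descriptor opened with `O_APPEND` on a local POSIX filesystem. It says nothing about how kafkatee itself opens or buffers its output files, so treat it as a starting point rather than an answer:

```python
import os
import tempfile
import threading


def append_lines(path: str, tag: str, count: int) -> None:
    """Append `count` lines to `path`, one os.write() call per line."""
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        for i in range(count):
            os.write(fd, f"{tag} line {i}\n".encode())
    finally:
        os.close(fd)


def run_experiment(writers: int = 2, count: int = 1000) -> list[str]:
    """Run several concurrent appenders against one file and return its lines."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "webrequest.out")
        threads = [
            threading.Thread(target=append_lines, args=(path, f"instance{n}", count))
            for n in range(writers)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        with open(path) as f:
            return f.read().splitlines()


lines = run_experiment()
print(len(lines), "lines written")
```

If the writers buffer output and flush partial lines, interleaving inside a line becomes possible, which is the case worth testing against the real instances.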

I've started working on this...

Change 413237 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Set two webrequest kafkatee instances consuming from analytics and jumbo

https://gerrit.wikimedia.org/r/413237

Change 413237 merged by Ottomata:
[operations/puppet@production] Set two webrequest kafkatee instances consuming from analytics and jumbo

https://gerrit.wikimedia.org/r/413237

Ottomata updated the task description.

Change 413243 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Produce webrequest_misc logs to Kafka jumbo instead of Kafka analytics

https://gerrit.wikimedia.org/r/413243

Ottomata updated the task description.

Change 413370 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::cache::canary|misc: remove testing vk instance

https://gerrit.wikimedia.org/r/413370

Mentioned in SAL (#wikimedia-operations) [2018-02-22T14:41:15Z] <ottomata> beginning migration of webrequest_misc from Kafka analytics to jumbo: T185136

Change 413243 merged by Ottomata:
[operations/puppet@production] Produce webrequest_misc logs to Kafka jumbo instead of Kafka analytics

https://gerrit.wikimedia.org/r/413243

Change 413370 merged by Elukey:
[operations/puppet@production] role::cache::canary|misc: remove testing vk instance

https://gerrit.wikimedia.org/r/413370

Ottomata changed the point value for this task from 13 to 5.
Ottomata updated the task description.
Ottomata updated the task description.
Ottomata updated the task description.

Change 415016 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Migrate webrequest upload varnishkafka to Kafka jumbo

https://gerrit.wikimedia.org/r/415016

Change 415016 merged by Ottomata:
[operations/puppet@production] Migrate webrequest upload varnishkafka to Kafka jumbo

https://gerrit.wikimedia.org/r/415016

@Jgreen, I think we are ready for webrequest_text. Can we find a time to do this together Tuesday March 6?

Sure, that works.

Cool, we'll start working together on this at around 9:30am EST on Tuesday.

Change 415636 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Parameterize kafka_cluster_name for streams_check job

https://gerrit.wikimedia.org/r/415636

Change 416683 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Migrate webrequest text varnishkafka to Kafka jumbo

https://gerrit.wikimedia.org/r/416683

Change 415636 merged by Ottomata:
[operations/puppet@production] Parameterize kafka_cluster_name for streams_check job

https://gerrit.wikimedia.org/r/415636

Mentioned in SAL (#wikimedia-operations) [2018-03-06T14:36:33Z] <ottomata> beginning migration of webrequest text varnishkafka logs from Kafka analytics to Kafka jumbo-eqiad T185136

Change 416683 merged by Ottomata:
[operations/puppet@production] Migrate webrequest text varnishkafka to Kafka jumbo

https://gerrit.wikimedia.org/r/416683

Change 416761 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Remove Kafka analytics-eqiad webrequest camus and kafkatee instances

https://gerrit.wikimedia.org/r/416761

Change 416761 merged by Ottomata:
[operations/puppet@production] Remove Kafka analytics-eqiad webrequest camus and kafkatee instances

https://gerrit.wikimedia.org/r/416761

Ottomata updated the task description.
Ottomata moved this task from In Progress to Done on the Analytics-Kanban board.

Change 417308 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Remove force_protocol_version for cache webrequest varnishkafka

https://gerrit.wikimedia.org/r/417308

Change 417308 abandoned by Ottomata:
Remove force_protocol_version for cache webrequest varnishkafka

Reason:
Done in 9468a3970b08d3aa4beca6d91e00cd07abac4e7e

https://gerrit.wikimedia.org/r/417308

Change 425550 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache::ipsec: remove non-jumbo hosts from kafka::nodes

https://gerrit.wikimedia.org/r/425550

Change 425550 merged by Ema:
[operations/puppet@production] role::kafka::analytics: get rid of ipsec

https://gerrit.wikimedia.org/r/425550

Mentioned in SAL (#wikimedia-operations) [2018-04-20T06:26:57Z] <ema> kafka::analytics remove strongswan leftovers T185136