
DRY kafka broker declaration in helmfiles
Open, Medium, Public

Description

Puppet has the authoritative list of Kafka brokers. Helmfiles that use Kafka hardcode that list, so when SRE changes Kafka brokers (as in T279342), the helmfiles must be updated too. This is error-prone and can lead to problems if SRE is not aware of which services depend on Kafka.

Potential solution:

  • Kafka cluster and broker info is rendered by puppet into a default helmfile values file on the deployment server, just like the service mesh definitions (see kafka_config.rb for how puppet builds this data)
  • A modules/app/kafka_1.0.0.tpl that creates define(s) for kafka_bootstrap_servers for specified Kafka clusters and ports
  • Charts can then opt in to using the define in their app-specific configuration templates (see the sketch below)

There is some context in T213561: Discovery for Kafka cluster brokers as well.
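A very rough sketch of what this could look like, assuming the puppet-rendered file sits alongside the existing service mesh defaults. All file paths, key names, hostnames, and the kafka.bootstrapServers helper below are illustrative assumptions, not existing conventions:

# Hypothetical puppet-rendered defaults file on the deployment server,
# e.g. /etc/helmfile-defaults/kafka.yaml (path and layout are made up)
kafka:
  clusters:
    jumbo-eqiad:
      port: 9093   # assuming the TLS listener port
      brokers:
        - kafka-jumbo1001.eqiad.wmnet
        - kafka-jumbo1002.eqiad.wmnet

{{/* Hypothetical modules/app/kafka_1.0.0.tpl helper; name and signature are illustrative */}}
{{- define "kafka.bootstrapServers" -}}
{{- $c := index .Values.kafka.clusters .cluster -}}
{{- range $i, $b := $c.brokers -}}{{ if $i }},{{ end }}{{ $b }}:{{ $c.port }}{{- end -}}
{{- end -}}

A chart's app-specific config template could then opt in with something like:

kafka:
  bootstrap_servers: '{{ include "kafka.bootstrapServers" (dict "Values" .Values "cluster" "jumbo-eqiad") }}'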

Event Timeline

Milimetric moved this task from Incoming to Operational Excellence on the Analytics board.

Change 656253 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Render kafka cluster connection info in helmfile-defaults/general-*.yaml

https://gerrit.wikimedia.org/r/656253

Change 656253 merged by Ottomata:
[operations/puppet@production] Render kafka cluster connection info in helmfile-defaults/general-*.yaml

https://gerrit.wikimedia.org/r/656253

Ottomata renamed this task from DRY kafka broker declaration into helmfiles from puppet to DRY kafka broker declaration in helmfiles. Apr 16 2021, 3:55 PM
Ottomata raised the priority of this task from Low to Medium.
Ottomata updated the task description.
Ottomata added projects: SRE, serviceops.
Ottomata added subscribers: herron, fgiunchedi, colewhite.

Actually, I'm not sure even LVS would help here. The helmfile networkpolicy explicitly lists the IP addresses the service can talk to, so the broker IPs would still have to be updated manually in the networkpolicy stanza of the values.yaml file.
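For illustration, the kind of hand-maintained stanza being referred to looks roughly like this in a service's values.yaml (the CIDRs are made up; the dst_nets layout follows the common networkpolicy convention in deployment-charts, but treat the exact keys as an assumption):

networkpolicy:
  egress:
    enabled: true
    dst_nets:
      # One entry per broker, updated by hand whenever SRE
      # adds, removes, or renumbers a Kafka broker.
      - cidr: 10.64.0.175/32    # kafka-jumbo1001 (illustrative)
        ports:
          - protocol: tcp
            port: 9093
      - cidr: 10.64.16.99/32    # kafka-jumbo1002 (illustrative)
        ports:
          - protocol: tcp
            port: 9093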

@akosiaris @JMeybohm any ideas?

Hi!

Adopting the new functionality in networkpolicy resources has indeed created some tech debt. It's debt we took on deliberately while devoting resources to finishing the migration away from the old way of maintaining those networkpolicies. Now that the old mechanism is gone, I want to revisit this and deduplicate as much as possible.

I have a couple of approaches in mind for that; I'll try to upload a couple of changes this week.

Change 682971 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] kubernetes::deployment_server: also add kafka broker, pass CIDRs

https://gerrit.wikimedia.org/r/682971

Change 683379 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/deployment-charts@master] networkpolicy: add autogenerated egress rules

https://gerrit.wikimedia.org/r/683379

Change 682971 merged by Giuseppe Lavagetto:

[operations/puppet@production] kubernetes::deployment_server: also add kafka broker, pass CIDRs

https://gerrit.wikimedia.org/r/682971

Change 683379 merged by jenkins-bot:

[operations/deployment-charts@master] networkpolicy: add autogenerated egress rules

https://gerrit.wikimedia.org/r/683379

Change 684855 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/deployment-charts@master] eventgate: add kafka egress policy stanza

https://gerrit.wikimedia.org/r/684855

Change 930259 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] eventgate-* - use kafka egress and service mesh

https://gerrit.wikimedia.org/r/930259

Change 930259 abandoned by Ottomata:

[operations/deployment-charts@master] eventgate-* - use kafka egress and service mesh

Reason:

Using the older I885eec19dcfbb759036ea9976f49155a0923ec48 patchset chain

https://gerrit.wikimedia.org/r/930259

Status update:

The networkpolicy for Kafka brokers has been DRYed, but referencing the Kafka broker hostnames in application config has not.

Ottomata updated the task description.

I believe this ticket will be invalidated by the approach that has been tested and agreed upon in T331894: Improve how we address outside k8s infrastructure from within charts (e.g. network policies).
Therefore, we might want to decline this ticket.

+1, or add this as a subtask of that?

Either good with me!

Starting today (at least for the staging-codfw and dse-k8s-eqiad clusters), apps running in Kubernetes can use Kubernetes service discovery to get the IPs of the Kafka clusters running outside of Kubernetes.

brouberol@deploy1002:~$ host kerberos-kdc.external-services 10.192.75.126
Using domain server:
Name: 10.192.75.126
Address: 10.192.75.126#53
Aliases:

Host kerberos-kdc.external-services not found: 3(NXDOMAIN)
brouberol@deploy1002:~$ host kerberos-kdc.external-services.svc.cluster.local 10.192.75.126
Using domain server:
Name: 10.192.75.126
Address: 10.192.75.126#53
Aliases:

kerberos-kdc.external-services.svc.cluster.local has address 10.192.48.190
kerberos-kdc.external-services.svc.cluster.local has address 10.64.0.112
kerberos-kdc.external-services.svc.cluster.local has IPv6 address 2620:0:861:101:10:64:0:112
kerberos-kdc.external-services.svc.cluster.local has IPv6 address 2620:0:860:104:10:192:48:190
brouberol@deploy1002:~$ host kafka-jumbo-eqiad.external-services.svc.cluster.local 10.192.75.126
Using domain server:
Name: 10.192.75.126
Address: 10.192.75.126#53
Aliases:

kafka-jumbo-eqiad.external-services.svc.cluster.local has address 10.64.136.11
kafka-jumbo-eqiad.external-services.svc.cluster.local has address 10.64.134.9
kafka-jumbo-eqiad.external-services.svc.cluster.local has address 10.64.32.106
kafka-jumbo-eqiad.external-services.svc.cluster.local has address 10.64.135.16
kafka-jumbo-eqiad.external-services.svc.cluster.local has address 10.64.130.10
kafka-jumbo-eqiad.external-services.svc.cluster.local has address 10.64.48.140
kafka-jumbo-eqiad.external-services.svc.cluster.local has address 10.64.131.16
kafka-jumbo-eqiad.external-services.svc.cluster.local has address 10.64.48.121
kafka-jumbo-eqiad.external-services.svc.cluster.local has address 10.64.132.21
kafka-jumbo-eqiad.external-services.svc.cluster.local has IPv6 address 2620:0:861:107:10:64:48:140
kafka-jumbo-eqiad.external-services.svc.cluster.local has IPv6 address 2620:0:861:107:10:64:48:121
kafka-jumbo-eqiad.external-services.svc.cluster.local has IPv6 address 2620:0:861:10e:10:64:135:16
kafka-jumbo-eqiad.external-services.svc.cluster.local has IPv6 address 2620:0:861:10f:10:64:136:11
kafka-jumbo-eqiad.external-services.svc.cluster.local has IPv6 address 2620:0:861:103:10:64:32:106
kafka-jumbo-eqiad.external-services.svc.cluster.local has IPv6 address 2620:0:861:109:10:64:130:10
kafka-jumbo-eqiad.external-services.svc.cluster.local has IPv6 address 2620:0:861:10d:10:64:134:9
kafka-jumbo-eqiad.external-services.svc.cluster.local has IPv6 address 2620:0:861:10a:10:64:131:16
kafka-jumbo-eqiad.external-services.svc.cluster.local has IPv6 address 2620:0:861:10b:10:64:132:21

This means we can leverage the default behavior of librdkafka's client.dns.lookup configuration property (introduced in v2.2.0: https://github.com/confluentinc/librdkafka/commit/961946e55fb3f89eb782d4011af4bf5cd3c31f17) to have the client try all IPs when the DNS name resolves to multiple addresses:

~/wmf/puppet production *7 ❯ kafkacfg query --source librdkafka 'name=client.dns.lookup' -v 2.2.0 | jq '.[0]'
{
  "name": "client.dns.lookup",
  "override": null,
  "range": "use_all_dns_ips, resolve_canonical_bootstrap_servers_only",
  "default": "use_all_dns_ips",
  "importance": "low",
  "description": "Controls how the client uses DNS lookups. By default, when the lookup returns multiple IP addresses for a hostname, they will all be attempted for connection before the connection is considered failed. This applies to both bootstrap and advertised servers. If the value is set to `resolve_canonical_bootstrap_servers_only`, each entry will be resolved and expanded into a list of canonical names. NOTE: Default here is different from the Java client's default behavior, which connects only to the first IP address returned for a hostname.  <br>*Type: enum value*",
  "scope": "consumer"
}

The same behavior was added to the Java clients shipped with Kafka 2.6: https://docs.cloudera.com/runtime/7.2.15/kafka-configuring/topics/kafka-client-dns-lookup-property.html
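Putting the two together, a chart could point its Kafka client at the stable in-cluster name instead of an enumerated broker list. A hypothetical values snippet (the key names are illustrative; 9093 assumes the TLS listener port, and the DNS behavior requires librdkafka >= 2.2.0 or Java clients from Kafka >= 2.6):

main_app:
  kafka:
    # One stable bootstrap name; the client tries every IP behind it
    # because client.dns.lookup defaults to use_all_dns_ips.
    bootstrap_servers: kafka-jumbo-eqiad.external-services.svc.cluster.local:9093
    # Spelled out here only for clarity, since it is already the default:
    client.dns.lookup: use_all_dns_ips

Broker additions and removals would then show up automatically via DNS, with no per-chart values edits.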