
Some Observability clusters do not support IPv6.
Closed, Resolved (Public)

Description

Greetings!

While importing DNS into Netbox as part of the transition to automation, we discovered that some clusters do not have IPv6 DNS entries. We interpreted this as intentional, since omitting the record was the mechanism used to keep clients from reaching a machine's IPv6 interfaces when a given service did not support IPv6, and so we excluded these hosts from the import into automation.

We are now triaging these clusters for their potential to support IPv6 in the future. Below are the hosts that were left out of IPv6 DNS and that we think your team is responsible for. If you could take some time to add any information you have about supporting IPv6 on these clusters, specific plans for doing so, or whether it will not be possible in the foreseeable future, it would be greatly appreciated!

If any of these machines don't belong to you, let us know on this ticket or the parent task (T253173). Thanks!

  • centrallog2002.codfw.wmnet
  • centrallog1001.eqiad.wmnet
  • graphite2003.codfw.wmnet
  • graphite1004.eqiad.wmnet
  • logstash[2001-2003,2026-2029].codfw.wmnet
  • logstash[1010-1012,1026-1029].eqiad.wmnet
  • mwlog2002.codfw.wmnet
  • mwlog1002.eqiad.wmnet
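
For anyone scripting against the list above, here is a minimal sketch of expanding the bracketed range notation into individual hostnames. The function name `expand_hosts` is hypothetical and not part of any existing tooling, and it only handles the numeric-range form used in this list (for example `[2001-2003,2026-2029]`), not other bracket notations.

```python
import re


def expand_hosts(pattern: str) -> list[str]:
    """Expand 'name[a-b,c-d].domain' into individual hostnames.

    Hypothetical helper: handles a single bracket group of comma-separated
    numbers or inclusive numeric ranges. Anything else is returned unchanged.
    """
    match = re.fullmatch(r"([^\[\]]*)\[([\d,\-]+)\]([^\[\]]*)", pattern)
    if match is None:
        return [pattern]  # no bracket group: already a single hostname
    prefix, body, suffix = match.groups()
    hosts = []
    for part in body.split(","):
        start, _, end = part.partition("-")
        end = end or start  # a bare number is a range of one
        width = len(start)  # preserve any zero padding
        for number in range(int(start), int(end) + 1):
            hosts.append(f"{prefix}{number:0{width}d}{suffix}")
    return hosts


if __name__ == "__main__":
    for host in expand_hosts("logstash[2001-2003,2026-2029].codfw.wmnet"):
        print(host)
```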

Event Timeline

Hello,

Is there a specific timeline you'd like us to meet with this? Mainly the goal is to understand urgency for prioritization. Thanks!

Hi!

So the idea is that, overall, we'd like all of our clusters to have IPv6 reachability. This is not terribly urgent, just a state that has persisted for a long time and that we'd like to rectify.

The request is along the lines of: what domain-specific knowledge do we need to support IPv6 on these clusters, if that is possible? Really, will anything bad happen if we add IPv6 DNS?

For the above clusters, it looks like some of the other hosts in them already have IPv6 DNS, so adding the rest may be trivial. There are cases where changes to configuration or FERM need to be made to support it, which is largely what this task is asking about.

crusnov added a subscriber: fgiunchedi.

A quick survey of the clusters above:

  • centrallog[12]001 - handles anycast syslog, which appears to be mediated over ipv4 by bird, if I'm reading this right. Because of this, adding the host dns for ipv6 doesn't seem like it would be a problem.
  • graphite - Seems to listen only on ipv4 for most of its services. There are a lot of things running, and all but a very few are on ipv4 only.
  • logstash - All services except for envoy appear to be listening on ipv6
  • thanos-fe - all services except for envoy appear to be listening on ipv6
  • mwlog - The logging services do not appear to be listening on ipv6
fgiunchedi renamed this task from "Some Observability clusters apparently do not support IPv6." to "Some Observability clusters do not support IPv6.". Jul 20 2021, 9:54 AM

Since we seem OK in the current state of gradually transitioning to AAAA for all newly-provisioned hosts, I think we can wait for the "natural" hardware/OS upgrade lifecycle, so that these hosts go away and their replacements have ipv6 records. Thoughts?

The approach makes sense to me to avoid duplicating efforts.

Volans subscribed.

Tentatively re-opening this for the mixed cluster specifically.

Some clusters managed by the Observability team have inconsistent AAAA DNS records for the primary IPv6 of the hosts. Some hosts have the AAAA record in the DNS for their primary IPv6 address, some don't.
See https://wikitech.wikimedia.org/wiki/DNS/Netbox#Mixed_clusters for more details about the possible risks of the current setup and the two alternative actions to move forward.

This is the list of the affected clusters and related hosts as of 04/07/2022:

  • kafka-logging*:
    • have the AAAA record: kafka-logging[2001-2003]
    • lack the AAAA record: kafka-logging[1001-1003]
  • logstash*:
    • have the AAAA record: logstash[2033-2035]
    • lack the AAAA record: logstash[1010-1012,1026-1029,1033-1035,2026-2029]
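
As an illustration of what "mixed" means here, the following is a rough sketch that groups hosts by cluster and reports the clusters where only some hosts have an AAAA record. The function name `find_mixed_clusters` is hypothetical, the input mapping is supplied by the caller rather than read from DNS or Netbox, and deriving the cluster name by stripping trailing digits from the short hostname is an assumption made for this example.

```python
import re
from collections import defaultdict


def find_mixed_clusters(has_aaaa: dict[str, bool]) -> dict[str, dict[str, list[str]]]:
    """Return clusters where some hosts have an AAAA record and some do not.

    `has_aaaa` maps a short hostname (e.g. 'logstash2033') to whether it has
    an AAAA record. The cluster name is assumed to be the hostname with its
    trailing digits removed.
    """
    clusters = defaultdict(lambda: {"have": [], "lack": []})
    for host, present in sorted(has_aaaa.items()):
        cluster = re.sub(r"\d+$", "", host)
        clusters[cluster]["have" if present else "lack"].append(host)
    # A cluster is mixed only if both groups are non-empty.
    return {name: g for name, g in clusters.items() if g["have"] and g["lack"]}


if __name__ == "__main__":
    # Example state taken from the kafka-logging entry in the list above.
    records = {f"kafka-logging{n}": False for n in (1001, 1002, 1003)}
    records.update({f"kafka-logging{n}": True for n in (2001, 2002, 2003)})
    for name, groups in find_mixed_clusters(records).items():
        print(name, groups)
```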
Volans raised the priority of this task from Low to High. Sep 3 2022, 9:36 AM

Raising priority for the inconsistent clusters highlighted in T271138#8061928, as they might cause issues.

  • kafka-logging*:
    • have the AAAA record: kafka-logging[2001-2003]
    • lack the AAAA record: kafka-logging[1001-1003]

IIRC these were sorted out some time ago, but didn't get logged as done on this task yet. Here's DNS as of today:

kafka-logging1001.eqiad.wmnet has address 10.64.16.205
kafka-logging1001.eqiad.wmnet has IPv6 address 2620:0:861:102:10:64:16:205
kafka-logging1002.eqiad.wmnet has address 10.64.32.142
kafka-logging1002.eqiad.wmnet has IPv6 address 2620:0:861:103:10:64:32:142
kafka-logging1003.eqiad.wmnet has address 10.64.48.66
kafka-logging1003.eqiad.wmnet has IPv6 address 2620:0:861:107:10:64:48:66

kafka-logging2001.codfw.wmnet has address 10.192.0.94
kafka-logging2001.codfw.wmnet has IPv6 address 2620:0:860:101:10:192:0:94
kafka-logging2002.codfw.wmnet has address 10.192.16.50
kafka-logging2002.codfw.wmnet has IPv6 address 2620:0:860:102:10:192:16:50
kafka-logging2003.codfw.wmnet has address 10.192.32.24
kafka-logging2003.codfw.wmnet has IPv6 address 2620:0:860:103:10:192:32:24
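
In the output above, each host's IPv6 address appears to reuse the decimal octets of its IPv4 address as the last four groups. Below is a small sketch of that mapping, which could be used to sanity-check a record pair. The function name `embed_ipv4` is hypothetical, and the /64 prefix length is an assumption inferred from the addresses shown, not something stated in this task.

```python
import ipaddress


def embed_ipv4(prefix: str, ipv4: str) -> ipaddress.IPv6Address:
    """Build the IPv6 address whose last four groups spell out the IPv4 octets.

    Hypothetical helper. `prefix` is assumed to be a /64 network such as
    '2620:0:861:102::/64'; each decimal octet is reused verbatim as one group.
    """
    network = ipaddress.IPv6Network(prefix)
    if network.prefixlen != 64:
        raise ValueError("this sketch only handles /64 prefixes")
    ipaddress.IPv4Address(ipv4)  # raises ValueError if the IPv4 is malformed
    # '10.64.16.205' -> groups 0010:0064:0016:0205 (decimal digits read as hex)
    hex_groups = "".join(f"{int(octet, 16):04x}" for octet in ipv4.split("."))
    return network[int(hex_groups, 16)]


if __name__ == "__main__":
    print(embed_ipv4("2620:0:861:102::/64", "10.64.16.205"))
```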

logstash-* hosts have been added to netbox and the sre.dns.netbox cookbook has been run.

colewhite updated the task description.

mwlog* hosts added to netbox and sre.dns.netbox cookbook has been run.

Volans lowered the priority of this task from High to Medium. Sep 14 2022, 4:16 PM

Thanks for fixing the mixed clusters. I'll lower the priority back to the default.

colewhite claimed this task.

All indicated hosts have ipv6 records now.

For the record: we had to revert the thanos-fe* v6 records yesterday since those hosts are not v6 ready yet (cf. T317909). Not sure whether we want to reopen this or file a follow-up task? Either works!

colewhite updated the task description.

Change 875902 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] kafka-logging: we now support ipv6

https://gerrit.wikimedia.org/r/875902

Change 875902 merged by Giuseppe Lavagetto:

[operations/puppet@production] kafka-logging: we now support ipv6

https://gerrit.wikimedia.org/r/875902

fgiunchedi claimed this task.
fgiunchedi updated the task description.

Since we've split the thanos bits into titan hosts, I've removed the thanos-fe / thanos-be hosts from the description, since they are not o11y anymore. Resolving.