Some elastic hosts do not have IPv6 DNS records
Closed, Resolved | Public | 5 Estimated Story Points

Description

Greetings!

While importing DNS into Netbox as part of the transition to automation, we discovered that some clusters have no IPv6 DNS entries. We interpreted this as intentional, since omitting the record was the mechanism used to keep clients off a machine's IPv6 interfaces when a given service did not support IPv6, and so we excluded those entries from the import into automation.

We are now triaging these clusters for their potential to support IPv6 in the future. Below are the hosts that were left out of IPv6 DNS and that we think your team is responsible for. If you could take some time to add any information you have about supporting IPv6 on these clusters, specific plans for doing so, or whether it will not be possible in the foreseeable future, it would be greatly appreciated!

If any of these machines don't belong to you, let us know on this ticket or the parent task (T253173). Thanks!

  • elastic[2025-2060].codfw.wmnet
  • elastic[1032-1067].eqiad.wmnet
  • relforge[1001-1002].eqiad.wmnet
  • wdqs[2001-2008].codfw.wmnet
  • wdqs[1004-1013].eqiad.wmnet
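
For anyone who wants to work through the list above host by host, here is a minimal sketch that expands the bracketed range notation into individual hostnames. It assumes each pattern contains at most one `[start-end]` range with an inclusive numeric span; the function name `expand_host_range` is made up for this illustration and is not part of any tooling mentioned in this task.

```python
import re

# Matches a prefix, one bracketed inclusive numeric range, and a suffix,
# e.g. "elastic[2025-2060].codfw.wmnet".
_RANGE_RE = re.compile(
    r"^(?P<prefix>[^\[\]]*)\[(?P<start>\d+)-(?P<end>\d+)\](?P<suffix>[^\[\]]*)$"
)


def expand_host_range(pattern: str) -> list[str]:
    """Expand one bracketed range into individual hostnames.

    A pattern without a bracketed range is returned unchanged as a
    one-item list. (Hypothetical helper, for illustration only.)
    """
    match = _RANGE_RE.match(pattern)
    if match is None:
        return [pattern]
    start, end = match["start"], match["end"]
    if int(start) > int(end):
        raise ValueError(f"descending range in {pattern!r}")
    width = len(start)  # preserve zero padding such as [01-12]
    return [
        f"{match['prefix']}{number:0{width}d}{match['suffix']}"
        for number in range(int(start), int(end) + 1)
    ]


if __name__ == "__main__":
    for host in expand_host_range("relforge[1001-1002].eqiad.wmnet"):
        print(host)
```

The same helper also handles the shorter form used later in this thread, such as `elastic10[68-84]`, because the prefix may end in digits.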

Acceptance Criteria

  • All clusters are on IPv6

Event Timeline

Hello!

Here's a quick survey of the hosts listed above, along with some potential problems with simply adding AAAA DNS records to these clusters:

  • elastic - the nginx part seems to listen on ipv6, but the java process is not listening on ipv6.
  • es - these are listening on ipv6 but i'm given to believe there are some issues with mysql's grants for ipv6?
  • relforge - same as elastic above
  • wdqs - except for envoy it looks like everything listens on ipv6
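
A survey like the one above can be spot-checked from a client machine. The following is a minimal sketch, not the method actually used here: it tries a TCP connection to a host and port over one specific address family and reports whether it succeeded. The function name `can_connect` is hypothetical, and the hostname and port in the usage note are placeholders.

```python
import socket


def can_connect(
    host: str,
    port: int,
    family: socket.AddressFamily,
    timeout: float = 2.0,
) -> bool:
    """Return True if a TCP connection to host:port succeeds over `family`.

    Pass socket.AF_INET to test IPv4 only, or socket.AF_INET6 to test
    IPv6 only. (Hypothetical helper, for illustration only.)
    """
    try:
        candidates = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    except socket.gaierror:
        # No address of this family, e.g. the host has no AAAA record.
        return False
    for fam, socktype, proto, _canonname, sockaddr in candidates:
        try:
            with socket.socket(fam, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return True
        except OSError:
            # Refused, timed out, or unreachable; try the next address.
            continue
    return False
```

For example, `can_connect("host.example", 443, socket.AF_INET6)` returning False while the `socket.AF_INET` call returns True would match the "not listening on ipv6" case described for the java process. Note that a False result can also mean a missing AAAA record or a firewall rule, so it narrows the problem down rather than pinpointing it.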

Current status is that all newly provisioned hosts that are still in STAGED status in Netbox have the AAAA record for their primary IPv6 address, as opposed to the previously existing ones:

  • elastic10[68-84]
  • elastic20[61-72]

For context, by default provisioning assigns the AAAA record to the primary IPv6 address unless the specific checkbox "Skip IPv6 DNS record." is ticked.
@Gehel @RKemper what is the correct setup for those hosts?

For the moment, elasticsearch is configured explicitly with the IPv4 address of the host. Internal cluster communication does not rely on DNS but on the internal cluster discovery mechanism. nginx listens on IPv6 and proxies all external communication, so having an AAAA DNS entry should not break anything. That being said, we should configure IPv6 properly and do some testing.

Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-network.html

bking renamed this task from "Some Search Platform / Discovery clusters apparently do not support IPv6" to "". Dec 21 2021, 3:39 PM
bking removed bking as the assignee of this task.
bking subscribed.
bking renamed this task from "" to "Some Search Platform / Discovery clusters apparently do not support IPv6". Dec 21 2021, 3:41 PM
Gehel triaged this task as High priority. Feb 22 2022, 8:06 PM
Gehel edited projects, added Discovery-Search; removed Discovery-Search (Current work).
Gehel moved this task from needs triage to Ops / SRE on the Discovery-Search board.
bking renamed this task from "Some Search Platform / Discovery clusters apparently do not support IPv6" to "Some elastic hosts do not have IPv6 DNS entries". May 17 2022, 10:00 PM
bking renamed this task from "Some elastic hosts do not have IPv6 DNS entries" to "Some elastic hosts do not have IPv6 DNS records".
bking claimed this task.

For clarity, client-side IPv6 connectivity to search functions in Wikipedia, Wikimedia Commons, etc. does not require the Elastic hosts themselves to communicate via IPv6, so I changed the title to reflect this.

Current status is that all newly provisioned hosts that are still in STAGED status in Netbox have the AAAA record for their primary IPv6 address, as opposed to the previously existing ones:

  • elastic10[68-84]
  • elastic20[61-72]

For context, by default provisioning assigns the AAAA record to the primary IPv6 address unless the specific checkbox "Skip IPv6 DNS record." is ticked.
@Gehel @RKemper what is the correct setup for those hosts?

Since this comment, several of the AAAA-record-having hosts listed above have been moved successfully into production. So I believe we could add AAAA records to all hosts without an issue, but this is not a focus for our team at the moment.

If you do need us to add AAAA records to the older hosts, please reopen the ticket and we'll be happy to revisit.

Volans reopened this task as Open.EditedMay 18 2022, 8:28 AM

Re-opening because, if there is no technical blocker to having the AAAA records on those hosts and your services are IPv6-ready [1], then we should add them to standardize our infrastructure and remove technical debt.

Depending on how safe you consider the operation, it can be done in either of these two ways:

  1. [safe mode] you pick one or more of them, follow the instructions in [2], and verify that everything works before proceeding to the next
  2. [brave mode] you tell us the list of hosts for which it is totally safe to add the AAAA records in batch, and we add them to Netbox and propagate them to the DNS in a semi-automated way, all at once.

[1] In the sense that the services also listen on their IPv6 address, any ferm rules are set up for IPv6 too, clients can connect via IPv6 without issues, any ACLs are set up for IPv6 too, etc.
[2] https://wikitech.wikimedia.org/wiki/DNS/Netbox#How_can_I_add_the_IPv6_AAAA/PTR_records_to_a_host_that_doesn't_have_it?

Hey Volans, sorry I didn't get to this by end of week as promised; I was sick on Weds and Thurs. Starting Monday, some combination of @RKemper and myself will do a few hosts in "safe mode" as described above. After 3 or 4 are confirmed working, we can finish off in "brave mode." Ryan or @Gehel if you have any objections to this plan, please let us know.

Sounds good to me. Thanks for the update :)

Mentioned in SAL (#wikimedia-operations) [2022-05-23T14:20:34Z] <inflatador> Add AAAA records to relforge1003 and 1004 T271143

Mentioned in SAL (#wikimedia-operations) [2022-05-23T16:28:19Z] <inflatador> adding AAAA records for cloudelastic100[1-6] T271143

Mentioned in SAL (#wikimedia-operations) [2022-05-23T16:44:16Z] <inflatador> add AAAA records to elastic202[5-9] T271143

AAAA records successfully added for elastic202[5-9]:

for n in $(cat codfw.hosts); do quad=$(dig aaaa +short "${n}"); printf "%s : %s\n" "${n}" "${quad}"; done
elastic2025.codfw.wmnet : 2620:0:860:101:10:192:0:77
elastic2026.codfw.wmnet : 2620:0:860:101:10:192:0:78
elastic2027.codfw.wmnet : 2620:0:860:101:10:192:0:79
elastic2028.codfw.wmnet : 2620:0:860:102:10:192:16:191
elastic2029.codfw.wmnet : 2620:0:860:102:10:192:16:192
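
A side note on reading the output above: in these five addresses the last four groups look like the host's IPv4 address written out in decimal (for example `10:192:0:77`). That is an inference from this output only, not something stated in this task, so treat the sketch below as an illustration under that assumption. The function name `embedded_ipv4` is hypothetical.

```python
import ipaddress


def embedded_ipv4(ipv6_text: str) -> ipaddress.IPv4Address:
    """Read the last four groups of an IPv6 address as decimal IPv4 octets.

    ASSUMPTION: the address follows the pattern seen in the dig output
    above, where each of the last four groups is an IPv4 octet written
    in decimal digits. Raises ValueError if a group contains hex
    letters or is larger than 255.
    """
    groups = ipaddress.IPv6Address(ipv6_text).exploded.split(":")[-4:]
    # The groups are hexadecimal on the wire, but under the assumed
    # convention their digits are read as decimal: "0192" -> 192.
    octets = [int(group, 10) for group in groups]
    return ipaddress.IPv4Address(".".join(str(octet) for octet in octets))


if __name__ == "__main__":
    print(embedded_ipv4("2620:0:860:101:10:192:0:77"))
```

This can serve as a quick sanity check that a new AAAA record points at the same machine as the existing A record, provided the assumed convention really holds for the hosts in question.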

@Volans, we are ready to do "brave mode" on the remaining CODFW hosts, which are listed here: https://phabricator.wikimedia.org/P28365

Let us know when these records are active. I'm planning on waiting 48 hours between the activation and rolling out AAAA records for EQIAD, but as always, let us know if you have other suggestions.

That's great. The plan looks good to me. My only suggestion, in case it was not already done, is to check the logs to confirm that clients are already connecting via IPv6.
I'll ping you in your morning so that I can apply the changes for "brave mode" while you're around.
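
One possible way to do the log check suggested above is to count client addresses by IP version. This is a minimal sketch and assumes the client address is the first whitespace-separated field on each log line, as in nginx's default "combined" access log format; the actual log format on these hosts is not given in this task, and the function name `count_clients_by_ip_version` is hypothetical.

```python
import ipaddress
from collections import Counter
from typing import Iterable


def count_clients_by_ip_version(log_lines: Iterable[str]) -> Counter:
    """Count log lines per client IP version.

    ASSUMPTION: the client address is the first whitespace-separated
    field of each line. Lines whose first field is not an IP address
    are counted under "unparsed"; blank lines are skipped.
    """
    counts: Counter = Counter()
    for line in log_lines:
        fields = line.split(maxsplit=1)
        if not fields:
            continue
        try:
            address = ipaddress.ip_address(fields[0])
        except ValueError:
            counts["unparsed"] += 1
            continue
        # IPv4-mapped IPv6 addresses (::ffff:a.b.c.d) are IPv4 clients
        # seen through a dual-stack socket, so count them as IPv4.
        if address.version == 6 and address.ipv4_mapped is not None:
            counts["ipv4"] += 1
        else:
            counts[f"ipv{address.version}"] += 1
    return counts
```

A non-zero `ipv6` count after the AAAA records go live would suggest that clients are picking up the new records; a count of zero is not conclusive on its own, since clients may simply prefer IPv4.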

I've updated Netbox running the following code:

>>> import uuid
>>> request_id = uuid.uuid4()
>>> user = User.objects.get(username='volans')
>>> def update(d):
...     ip = d.primary_ip6
...     ip.dns_name = d.primary_ip4.dns_name
...     log = ip.to_objectchange('update')
...     log.request_id = request_id
...     log.user = user
...     log.save()
...     ip.save()
...
>>> devices = Device.objects.filter(name__startswith='elastic2', primary_ip6__dns_name='').exclude(status='offline')
>>> len(devices)
30
>>> [d.name for d in devices]
['elastic2030', 'elastic2031', 'elastic2032', 'elastic2033', 'elastic2034', 'elastic2036', 'elastic2037', 'elastic2038', 'elastic2039', 'elastic2040', 'elastic2041', 'elastic2042', 'elastic2043', 'elastic2044', 'elastic2045', 'elastic2046', 'elastic2047', 'elastic2048', 'elastic2049', 'elastic2050', 'elastic2051', 'elastic2052', 'elastic2053', 'elastic2054', 'elastic2055', 'elastic2056', 'elastic2057', 'elastic2058', 'elastic2059', 'elastic2060']
>>> for d in devices:
...     update(d)
...
>>>

The changes can be seen in https://netbox.wikimedia.org/extras/changelog/?request_id=a04c63d5-9c44-45e0-b9c7-a53f80e1c482

I've also run the sre.dns.netbox cookbook to propagate the changes (commit SHA1 in the generated dns repo: 391da274f8da206a002a41a42635f6a3ba25f0b3).

Eqiad is also done; pasting only the differences from the above snippet:

>>> devices = Device.objects.filter(name__startswith='elastic1', primary_ip6__dns_name='').exclude(status='offline')
>>> len(devices)
20
>>> [d.name for d in devices]
['elastic1048', 'elastic1049', 'elastic1050', 'elastic1051', 'elastic1052', 'elastic1053', 'elastic1054', 'elastic1055', 'elastic1056', 'elastic1057', 'elastic1058', 'elastic1059', 'elastic1060', 'elastic1061', 'elastic1062', 'elastic1063', 'elastic1064', 'elastic1065', 'elastic1066', 'elastic1067']

The changes can be seen in https://netbox.wikimedia.org/extras/changelog/?request_id=276c42e2-eb85-46cf-b307-3b83a37caa8b

I've also run the sre.dns.netbox cookbook to propagate the changes (commit SHA1 in the generated dns repo: ceedc8b16428b0c0de931758dd909a60e3ee55a3).

This is complete...closing!