⚓ T271142 Some Service Operations clusters apparently do not support IPv6

	Subject	Repo	Branch	Lines +/-
	reports: network, remove rdb from no IPv6 list	operations/software/netbox-extras	master	+0 -1

		Status	Subtype	Assigned	Task
		Open		None	T253173 Some clusters do not have DNS for IPv6 addresses (TRACKING TASK)
		Open		None	T271142 Some Service Operations clusters apparently do not support IPv6

• crusnov created this task. Jan 4 2021, 6:55 PM

Restricted Application added a subscriber: Aklapper. Jan 4 2021, 6:55 PM

Aklapper added a project: IPv6. Jan 4 2021, 7:42 PM

kubernetes[1007-1014].eqiad.wmnet

As I see it these hosts do have IPv6 in DNS (and netbox).

Please let's not add IPv6 to hosts blindly; each group would need to be verified independently to ensure that enabling IPv6 would not break firewall rules (are they even set up?) or grants.

I would err on the side of "we don't need ipv6 in the applayer, and we should not use it until we do", and so decline this task, but I'll wait for others in my team to chime in too. @akosiaris what do you think about this?

Joe triaged this task as Low priority. Jan 5 2021, 10:01 AM

Specifically, the following clusters are reached directly by hostname (and not via LVS) and will need special care:

mc* (including mc-gp*)
rdb*
restbase*
sessionstore*

For those classes of hosts, I'd prefer to be careful adding the IPv6 resolution without prior testing/evaluation.

For the rest of the machines above, which are only accessed via LVS, I don't see any problem in adding AAAA records for the hostname. But I'm also not sure why that would be useful.

Also, the mwlog* servers are not under serviceops care, you should probably ping the observability team instead :)

• crusnov updated the task description. Mar 27 2021, 12:06 AM

Thanks for following up.

Here's a survey of the hosts in question:

mc and mc-gp - the memcache processes are only listening on ipv4
dumpsdata - appears to be listening on ipv6 for all pertinent services
mw - aside from nutcracker all services appear to be listening on ipv6 [since these hosts are accessed through traffic layer i'm assuming this is one we should just not worry about right now?]
parse - as above, except also memcache doesn't seem to be listening on ipv6
rdb - redis only seems to be listening on ipv4
restbase - major services only seem to listen on ipv4
scb - major services only seem to listen on ipv4
sessionstore - major services only seem to listen on ipv4
snapshot - major services only seem to listen on ipv4
thumbor - major services only seem to listen on ipv4
wtp - major services only seem to listen on ipv4
scandium - some services do listen on ipv6, envoy not
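
A survey like the one above could be partly automated. What follows is only a minimal Python sketch, not the method actually used for this task. It assumes the input is the text output of `ss -tlnH` (listening TCP sockets, numeric, no header) and reports ports that are reachable over IPv4 only. The function names and sample addresses are hypothetical. Loopback-only listeners are ignored, since they cannot be reached from other hosts.

```python
import ipaddress
from collections import defaultdict


def classify_listeners(ss_output: str) -> dict[int, set[str]]:
    """Map each listening TCP port to the address families it is exposed on."""
    families: dict[int, set[str]] = defaultdict(set)
    for line in ss_output.splitlines():
        fields = line.split()
        # Expected columns: State Recv-Q Send-Q Local:Port Peer:Port
        if len(fields) < 5 or fields[0] != "LISTEN":
            continue
        addr, _, port = fields[3].rpartition(":")
        if not port.isdigit():
            continue
        if addr == "*":
            # Wildcard dual-stack socket: reachable over both families.
            families[int(port)].update({"ipv4", "ipv6"})
            continue
        host = addr.strip("[]").split("%")[0]  # drop brackets and scope id
        try:
            ip = ipaddress.ip_address(host)
        except ValueError:
            continue
        if ip.is_loopback:
            continue  # e.g. 127.0.0.1 or [::1]: not exposed to other hosts
        families[int(port)].add("ipv6" if ip.version == 6 else "ipv4")
    return dict(families)


def ipv4_only_ports(ss_output: str) -> list[int]:
    """Ports that would be unreachable for a client that connects over IPv6."""
    listeners = classify_listeners(ss_output)
    return sorted(p for p, fams in listeners.items() if fams == {"ipv4"})


# Hypothetical sample, for illustration only.
SAMPLE = """\
LISTEN 0 128 0.0.0.0:22 0.0.0.0:*
LISTEN 0 128 [::]:22 [::]:*
LISTEN 0 1024 10.0.0.5:11211 0.0.0.0:*
LISTEN 0 1024 [::1]:11211 [::]:*
LISTEN 0 4096 *:9100 *:*
"""
print(ipv4_only_ports(SAMPLE))  # prints [11211]
```

A port in the returned list is a candidate for "listening on ipv4 only"; whether that matters still depends on how clients reach the service (hostname vs. LVS vs. hard-coded IP).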

If you'd prefer to decline this ticket it seems perfectly reasonable to me. :) @akosiaris feel free to do so.

akosiaris updated the task description. Mar 30 2021, 8:53 AM

Hi!

TL;DR: Aside from snapshot and dumpsdata that @ArielGlenn is better equipped to answer for, maybe we can do scandium and rdb* but the rest are either high risk/just not worth it/both

Longer version:

In T271142#6954372, @crusnov wrote:

Thanks for following up.

Here's a survey of the hosts in question:

mc and mc-gp - the memcache processes are only listening on ipv4

There is also a redis there, the one powering the MainStash. It's also listening on IPv4 only. Prometheus is listening on IPv6 though. This cluster is a backend cluster for the entirety of the sites. If something goes wrong with the process (e.g. wrong IPv6 addresses are added as AAAA records, making connections to it time out before they fall back to IPv4), all sites will go down as a result.

dumpsdata - appears to be listening on ipv6 for all pertinent services

I'd defer to @ArielGlenn for this one.

mw - aside from nutcracker all services appear to be listening on ipv6 [since these hosts are accessed through traffic layer i'm assuming this is one we should just not worry about right now?]

They are the most worrying ones in fact as they power the sites and in one way or another they are intertwined with pretty much every service or application that exists in our infrastructure. Adding IPv6 DNS entries for those is something that has a great potential for pain. Furthermore, with the migration of mediawiki to kubernetes, this cluster is (in the mid-term - that is 1-1.5 years from now probably) going to be folded into the kubernetes clusters, so any kind of effort would be going down the drain.

parse - as above, except also memcache doesn't seem to be listening on ipv6

Those are essentially the same as mw* boxes, so the same applies.

rdb - redis only seems to be listening on ipv4

This one is probably doable. It powers 4 applications only (https://wikitech.wikimedia.org/wiki/Redis), which means we can go relatively easily and test.

restbase - major services only seem to listen on ipv4

In fact, the major service (nodejs) is listening on IPv6. With the advent of the api-gateway, restbase's future is to eventually be deprecated and removed (see https://www.mediawiki.org/wiki/Core_Platform_Team/Decisions_Architecture_Research_Documentation/Services_Architecture_Recommendations_(2019)#Recommendations_2). Also, this service is making connections to multiple other services (see the diagram in https://www.mediawiki.org/wiki/RESTBase) and multiple other services are making connections to it. It's a constant puzzling factor during outages or alerts. I'd much rather we did not add extra complexity to it, especially if it's going away.

scb - major services only seem to listen on ipv4

Those are done for, will soon be gone for good and hopefully never to return. Decommissioning tasks have been filed already.

sessionstore - major services only seem to listen on ipv4

The problem with this one is that the cassandra nodes talk to each other over the network for their gossip protocol and use DNS for their seeds. If we are to add AAAA records we need to make sure that ferm is also updated to allow IPv6 (it doesn't currently) so that we don't end up having unexpected latency issues during gossiping.

snapshot - major services only seem to listen on ipv4

I'd defer to @ArielGlenn for this one.

thumbor - major services only seem to listen on ipv4

Same as mw* boxes. The service will be migrated to kubernetes and the boxes folded into the kubernetes clusters (which are IPv6 capable already).

wtp - major services only seem to listen on ipv4

Those are being renamed to the parse* clusters. Furthermore they are essentially just mw* boxes these days, so the same as mw* applies.

scandium - some services do listen on ipv6, envoy not

That one is a testing server. I think we can do that without much pain.

If you'd prefer to decline this ticket it seems perfectly reasonable to me. :) @akosiaris feel free to do so.

Having been duly poked, may I ask which services you are looking at, both for dumpsdata* and snapshot*, that listen only on IPv4? Then we can talk about whether they should also handle IPv6 and if so, how to get there. Thanks!

ArielGlenn added a project: Dumps-Generation. Mar 30 2021, 11:50 AM

In T271142#6955077, @akosiaris wrote:

restbase - major services only seem to listen on ipv4

In fact, the major service (nodejs) is listening on IPv6. With the advent of the api-gateway, restbase's future is to eventually be deprecated and removed (see https://www.mediawiki.org/wiki/Core_Platform_Team/Decisions_Architecture_Research_Documentation/Services_Architecture_Recommendations_(2019)#Recommendations_2). Also, this service is making connections to multiple other services (see the diagram in https://www.mediawiki.org/wiki/RESTBase) and multiple other services are making connections to it. It's a constant puzzling factor during outages or alerts. I'd much rather we did not add extra complexity to it, especially if it's going away.

Just to further muddy the waters on this subject: Restbase itself does not natively support ipv6 yet. This shouldn't be a major change, but it is a known issue.

In T271142#6955077, @akosiaris wrote:

Hi!

TL;DR: Aside from snapshot and dumpsdata that @ArielGlenn is better equipped to answer for, maybe we can do scandium and rdb* but the rest are either high risk/just not worth it/both

Ah amazing thank you for your thorough answer. I accept that most of these clusters aren't worth the risk for ipv6 for the time being.

With scandium is it worth it since it, too, is a parsoid node according to puppet, or are there other reasons to add the DNS?

rdb looks straightforward.

• crusnov updated the task description. Mar 30 2021, 3:19 PM

In T271142#6956040, @ArielGlenn wrote:

Having been duly poked, may I ask which services you are looking at, both for dumpsdata* and snapshot*, that listen only on IPv4? Then we can talk about whether they should also handle IPv6 and if so, how to get there. Thanks!

The base question is whether it is safe under current circumstances to add AAAA DNS. Just eyeballing the list of services on the boxes, it looked like they are all already on IPv6. The request for information is whether this assessment is correct (and thus it is safe to add AAAA DNS) and, if it isn't, what we need to do to make it correct (and whether that work is worth it).

Thanks :)

In T271142#6957026, @crusnov wrote:

In T271142#6956040, @ArielGlenn wrote:

Having been duly poked, may I ask which services you are looking at, both for dumpsdata* and snapshot*, that listen only on IPv4? Then we can talk about whether they should also handle IPv6 and if so, how to get there. Thanks!

The base question is whether it is safe under current circumstances to add AAAA DNS. Just eyeballing the list of services on the boxes, it looked like they are all already on IPv6. The request for information is whether this assessment is correct (and thus it is safe to add AAAA DNS) and, if it isn't, what we need to do to make it correct (and whether that work is worth it).

Thanks :)

I don't think it will make a difference but let's start with a testbed host, snapshot1005, and after it's got the entry I'll run some tests there. If that checks out, as I expect, we can do all the snapshots. I'll have to think about safe tests for the dumpsdata hosts, but I have an idea there too for after the snapshots.

In T271142#6957437, @ArielGlenn wrote:

In T271142#6957026, @crusnov wrote:

In T271142#6956040, @ArielGlenn wrote:

Having been duly poked, may I ask which services you are looking at, both for dumpsdata* and snapshot*, that listen only on IPv4? Then we can talk about whether they should also handle IPv6 and if so, how to get there. Thanks!

The base question is whether it is safe under current circumstances to add AAAA DNS. Just eyeballing the list of services on the boxes, it looked like they are all already on IPv6. The request for information is whether this assessment is correct (and thus it is safe to add AAAA DNS) and, if it isn't, what we need to do to make it correct (and whether that work is worth it).

Thanks :)

I don't think it will make a difference but let's start with a testbed host, snapshot1005, and after it's got the entry I'll run some tests there. If that checks out, as I expect, we can do all the snapshots. I'll have to think about safe tests for the dumpsdata hosts, but I have an idea there too for after the snapshots.

Sounds good, if you'd like to ping me on IRC when you want to do this project we can do at least the first stage.

In T271142#6957794, @crusnov wrote:

Sounds good, if you'd like to ping me on IRC when you want to do this project we can do at least the first stage.

okay as discussed I have added ipv6 DNS for snapshot1005. Thanks for your help @ArielGlenn

In T271142#6957884, @crusnov wrote:

In T271142#6957794, @crusnov wrote:

Sounds good, if you'd like to ping me on IRC when you want to do this project we can do at least the first stage.

okay as discussed I have added ipv6 DNS for snapshot1005. Thanks for your help @ArielGlenn

As expected, everything looks fine on snapshot1005. Feel free to proceed to the rest of the snapshot hosts whenever you like.

In T271142#6957011, @crusnov wrote:

In T271142#6955077, @akosiaris wrote:

Hi!

TL;DR: Aside from snapshot and dumpsdata that @ArielGlenn is better equipped to answer for, maybe we can do scandium and rdb* but the rest are either high risk/just not worth it/both

Ah amazing thank you for your thorough answer. I accept that most of these clusters aren't worth the risk for ipv6 for the time being.

With scandium is it worth it since it, too, is a parsoid node according to puppet, or are there other reasons to add the DNS?

It's a testing node, I think we can accept the risk there. From my side, it's fine to add them, it might even uncover some issues in our puppetization. Adding however @ssastry and @Dzahn so they are aware.

rdb looks straightforward.

Yes it does. ferm configuration is just opening up the ports there so it's IPv4/IPv6 agnostic and redis isn't listening on IPv6 there (it is capable though). OS stacks will fall back to IPv4 so everything should be fine. That being said, let's coordinate on this: ping me online before publishing the DNS entries so I can make sure nothing breaks.

• crusnov updated the task description. Mar 31 2021, 3:26 PM

In T271142#6960492, @akosiaris wrote:

rdb looks straightforward.

Yes it does. ferm configuration is just opening up the ports there so it's IPv4/IPv6 agnostic and redis isn't listening on IPv6 there (it is capable though). OS stacks will fall back to IPv4 so everything should be fine. That being said, let's coordinate on this: ping me online before publishing the DNS entries so I can make sure nothing breaks.

Thanks, I'll follow up with a patch for redis to listen on ipv6 and coordinate with you on testing the DNS stuff after that.

jijiki added a subscriber: ssastry. Apr 1 2021, 8:10 AM

Aklapper added a project: Infrastructure-Foundations. Jun 21 2021, 9:00 PM

This task seems to have stalled after crusnov's departure; is someone else expecting to pick it up any time soon?

@ArielGlenn ideally the service owners, who surely know better what the effect of adding AAAA records for their services could be and who their clients are. Of course we (SRE I/F) are available to help. Let me know if you have any specific questions or concerns.

When I look at the netbox entries for the dumpsdata and snapshot hosts, they all show ipv6 addresses listed. Can I assume these are from AAAA records and not dynamically generated ips as used to be the case?

@ArielGlenn Since we introduced Netbox as the source of truth, when provisioning a new host both the primary IPv4 and IPv6 addresses are always generated and assigned to the host, and Puppet configures them on the host. The provisioning script has a flag to skip setting the DNS Name of the IPv6 address in Netbox, which prevents the automatic generation of the AAAA record, to support those services that still can't have AAAA DNS records. For hosts that were already provisioned at the time of the migration, we imported the existing data from the hosts and the DNS, so if there was an AAAA record for a given IPv6 address the DNS Name in Netbox was set; otherwise it was left empty.

This is because clients would default to IPv6 if the AAAA records are set, so adding the records without making sure that everything is set up for it might cause outages.
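
To illustrate why a premature AAAA record is risky, here is a minimal Python sketch of a client that prefers IPv6 and only then falls back to IPv4. It is a deliberate simplification: real resolvers order addresses according to RFC 6724 and real clients differ in timeout handling. The helper names, the injected `try_connect` callable and the addresses are all hypothetical.

```python
import socket
from typing import Callable, Optional

Candidate = tuple[int, str]  # (address family, IP address)


def prefer_ipv6(candidates: list[Candidate]) -> list[Candidate]:
    """Order candidates so that IPv6 addresses are tried first (stable sort)."""
    return sorted(candidates, key=lambda c: c[0] != socket.AF_INET6)


def connect_with_fallback(
    candidates: list[Candidate],
    try_connect: Callable[[int, str], bool],
) -> tuple[Optional[str], list[str]]:
    """Try each address in preference order.

    Returns the address that worked (or None) and the list of attempts.
    Every failed attempt before the successful one costs the client a
    refused connection at best and a full connect timeout at worst.
    """
    attempts: list[str] = []
    for family, addr in prefer_ipv6(candidates):
        attempts.append(addr)
        if try_connect(family, addr):
            return addr, attempts
    return None, attempts


# Hypothetical host with both A and AAAA records whose service is IPv4-only.
records = [(socket.AF_INET, "10.0.0.10"), (socket.AF_INET6, "2001:db8::10")]
ipv4_only_service = lambda family, addr: family == socket.AF_INET
print(connect_with_fallback(records, ipv4_only_service))
```

In this sketch the IPv6 address is attempted first and fails, so the IPv4 connection only happens after that failure. If a firewall silently drops the IPv6 packets instead of rejecting them, that first attempt turns into a timeout, which is the latency/outage risk described above.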

In the case of dumpsdata and snapshot hosts, they have the IPv6 but don't have the DNS Name set, so no automatic AAAA record is generated for them.
See for example:

If you want to add the AAAA DNS records for those you can follow:
https://wikitech.wikimedia.org/wiki/DNS/Netbox#How_can_I_add_the_IPv6_AAAA/PTR_records_to_a_host_that_doesn't_have_it?

Ah rats, I was hoping against hope that the dns records at least for the snaps had been added before Cas's departure. Welp, I'll look into those soon. Thanks for the info.

ArielGlenn moved this task from Backlog to Active on the Dumps-Generation board. Jan 24 2022, 10:10 AM

[resuming this task, let me know if you instead prefer a separate one]
Some clusters managed by the ServiceOps team have inconsistent AAAA DNS records for the primary IPv6 of the hosts. Some hosts have the AAAA record in the DNS for their primary IPv6 address, some don't.
See https://wikitech.wikimedia.org/wiki/DNS/Netbox#Mixed_clusters for more details about the possible risks of the current setup and the two alternative actions to move forward.

This is the list of the affected clusters and related hosts as of 04/07/2022:

mc*:
- have the AAAA record: mc[1037-1054,2038-2041]
- lack the AAAA record: mc[2019-2037,2042-2055]

mw*:
- have the AAAA record: mw[1414-1498,2377-2419]
- lack the AAAA record: mw[1307-1413,2251-2255,2257-2376]

parse*:
- have the AAAA record: parse[1001-1024]
- lack the AAAA record: parse[2001-2020]

rdb*:
- have the AAAA record: rdb[2009-2010]
- lack the AAAA record: rdb[1009-1012,2007-2008]
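
The host-range notation used in the list above can be expanded mechanically to quantify how mixed a cluster is. The following is a minimal Python sketch under the assumption that each expression has at most one bracketed group of comma-separated numbers or ranges; `expand_hosts`, `aaaa_coverage` and `is_mixed` are hypothetical helper names, not part of any existing tool.

```python
import re


def expand_hosts(expr: str) -> list[str]:
    """Expand e.g. 'rdb[1009-1012,2007-2008]' into individual hostnames."""
    match = re.fullmatch(r"(.*?)\[([0-9,\-]+)\](.*)", expr)
    if match is None:
        return [expr]  # plain hostname, nothing to expand
    prefix, ranges, suffix = match.groups()
    hosts = []
    for part in ranges.split(","):
        start, _, end = part.partition("-")
        end = end or start  # a single number is a range of one
        width = len(start)  # preserve zero padding, if any
        for n in range(int(start), int(end) + 1):
            hosts.append(f"{prefix}{n:0{width}d}{suffix}")
    return hosts


def aaaa_coverage(have: str, lack: str) -> float:
    """Fraction of hosts in a cluster that have an AAAA record."""
    n_have = len(expand_hosts(have))
    n_lack = len(expand_hosts(lack))
    return n_have / (n_have + n_lack)


def is_mixed(have: str, lack: str) -> bool:
    """A cluster is mixed when some, but not all, hosts have AAAA records."""
    return 0.0 < aaaa_coverage(have, lack) < 1.0


# Values taken from the rdb* entry in the list above.
print(aaaa_coverage("rdb[2009-2010]", "rdb[1009-1012,2007-2008]"))  # prints 0.25
```

For the rdb* entry this gives 2 of 8 hosts with an AAAA record, i.e. a mixed cluster.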

ArielGlenn mentioned this in T312556: Some Core Platform clusters have inconsistent AAAA DNS records for the primary IPv6 of the hosts. Jul 12 2022, 7:06 AM

jijiki moved this task from Incoming 🐫 to 🙈🙉🙊Backlog on the serviceops board. Sep 28 2022, 2:24 PM

cmooney subscribed. Nov 29 2022, 3:02 PM

rdb*:
have the AAAA record: rdb[2009-2010]
lack the AAAA record: rdb[1009-1012,2007-2008]

rdb1009 and rdb1010 are going to be history pretty soon. In T326171, I've finished all the work for moving their workloads to rdb1013 and rdb1014 and I'll file the decom task on Monday.

With the above out of the way, we are at 50% AAAA records for this cluster with 0 issues, which signals that we can just add AAAA records for the rest of them.

jijiki subscribed. Dec 1 2023, 3:11 PM

Mentioned in SAL (#wikimedia-operations) [2023-12-01T15:58:12Z] <akosiaris> give AAAA and PTR records to scandium T271142

akosiaris updated the task description. Dec 1 2023, 3:58 PM

@Volans, since dumpsdata[1001-1003].eqiad.wmnet and snapshot[1005-1010].eqiad.wmnet are no longer with serviceops, I think we can resolve this one?

@akosiaris I see that:

mw[1349-1413]
mw[2259-2376]
mc[2042-2055]
parse[2001-2020]
mc-gp200[1-3]
mc-gp100[1-3]

are still missing AAAA records while other hosts within the same clusters have AAAA records. I didn't check all ServiceOps hosts, just picked from the list in the task description. If needed I can do a full survey.

Change 979399 had a related patch set uploaded (by Volans; author: Volans):

[operations/software/netbox-extras@master] reports: network, remove rdb from no IPv6 list

https://gerrit.wikimedia.org/r/979399

gerritbot added a project: Patch-For-Review. Dec 1 2023, 5:16 PM

@Volans

All of these (which can be grouped into just 2 categories, mw and mc) have already been deemed dangerous and out of scope per my T271142#6955077. I'll check with the team, but I doubt we have any intention of devoting work to those.

@akosiaris sure, and having a cluster deemed as *not* IPv6 ready is totally ok.
The problem arises when the cluster is mixed, with some hosts with AAAA records and some without, as is the case for the above clusters. See T271142#8061841 and https://wikitech.wikimedia.org/wiki/DNS/Netbox#Mixed_clusters

In T271142#9378778, @Volans wrote:

@akosiaris sure, and having a cluster deemed as *not* IPv6 ready is totally ok.
The problem arises when the cluster is mixed, with some hosts with AAAA records and some without, as is the case for the above clusters. See T271142#8061841 and https://wikitech.wikimedia.org/wiki/DNS/Netbox#Mixed_clusters

mw* clusters have a good probability of not existing by July 2024, so any effort spent there is probably effort wasted.
mc* clusters are strictly addressed via IPv4 IPs (no hostnames at all) and only via mcrouter. IPv6 isn't useful there either, since memcached is listening only on IPv4 anyway. There is only risk in manually adding IPv6 AAAA records here, no benefit.

Another datapoint for the mw*/parse* clusters, they will be migrated to be k8s hosts, that are supposed to have AAAA records, and so you will randomly have k8s hosts with or without AAAA records depending if the mw host had it before or not.

Clement_Goubert mentioned this in T351074: Move servers from the appserver/api cluster to kubernetes. Dec 6 2023, 4:27 PM

Change 979399 merged by jenkins-bot:

[operations/software/netbox-extras@master] reports: network, remove rdb from no IPv6 list

https://gerrit.wikimedia.org/r/979399

Volans mentioned this in rOSNEfff30d9c173c: reports: network, remove rdb from no IPv6 list. Dec 18 2023, 10:29 AM

Maintenance_bot removed a project: Patch-For-Review. Dec 18 2023, 10:30 AM

In T271142#9382040, @Volans wrote:

Another datapoint for the mw*/parse* clusters, they will be migrated to be k8s hosts, that are supposed to have AAAA records, and so you will randomly have k8s hosts with or without AAAA records depending if the mw host had it before or not.

Will that be fixed during the renaming process that will follow later on? If yes, we probably can live with a few months of no AAAA records on some mw*, not yet renamed to k8s hosts.

Regarding the mc* hosts, I've been mulling over this one for some time now, trying to figure out the best solution. Having also discussed this with @Volans, I'll summarize our notes:

1. The critical service, memcached, isn't currently exposed over IPv6 on this set of hosts. This is a configuration thing: the software LISTENs on both IPv4 and IPv6 sockets, albeit just on [::1]:1121[14] for IPv6.
2. The critical client of the critical service, mcrouter, is configured to talk strictly using IPv4 and bypassing DNS.
3. Other services, the ones used for management (e.g. SSH, prometheus), listen on IPv6 and are in fact accessed over IPv6 (e.g. prometheus).
4. An increasing set of hosts have had IPv6 AAAA records for a long while now, without any ill effects. The ones without AAAA are now the minority.
5. Overall, we agree that having a protocol that isn't actively used enabled is wrong, as it increases both maintenance burden and attack surface. It also makes reasoning when debugging more convoluted.
6. The overall preference is to have IPv6 AAAA records enabled. Exceptions are possible, but they should make sense.

Of the above, (1,2) + (5) is the issue. However, part of it is mitigated by (3), leaving mostly the reasoning/debugging aspect, for which we'll adapt by documenting it more clearly.

(4) shows that risk is probably negligible after all. (6) poses the good argument that if we can't justify an exception, we shouldn't have one.

Given all the above and the fact that it was less work to add AAAA records to all hosts than to remove them from all hosts, I've gone ahead and added AAAA records to all mc* hosts.

In T271142#9413333, @akosiaris wrote:

In T271142#9382040, @Volans wrote:

Another datapoint for the mw*/parse* clusters, they will be migrated to be k8s hosts, that are supposed to have AAAA records, and so you will randomly have k8s hosts with or without AAAA records depending if the mw host had it before or not.

Will that be fixed during the renaming process that will follow later on? If yes, we probably can live with a few months of no AAAA records on some mw*, not yet renamed to k8s hosts.

I added a note in T351074: Move servers from the appserver/api cluster to kubernetes so we should be checking if those records exist before moving a host to k8s and adding them if not, which will ultimately solve the problem once we're fully migrated. I agree we can wait for all of those to be done.
