Some Data Persistence clusters apparently do not support IPv6
Open, Medium, Public

Description

Greetings!

While importing DNS data into Netbox as part of the transition to automation, we discovered that some clusters do not have IPv6 DNS entries. We interpreted this as intentional, since omitting the DNS entry was the mechanism used to keep potential clients from reaching a machine's IPv6 interfaces when a given service did not support IPv6, and so we excluded them from the import into automation.

We are now triaging these clusters for their potential to support IPv6 in the future. Below are the hosts that were left out of IPv6 DNS and that we think your team is responsible for. If you could take some time to add any information you have about supporting IPv6 on these clusters, any specific plans for doing so, or a note if it will not be possible in the foreseeable future, it would be greatly appreciated!

If any of these machines don't belong to you, let us know on this ticket or the parent task (T253173). Thanks!

  • db[2071-2140].codfw.wmnet - will not pursue now
  • db[1074-1139,1141-1149].eqiad.wmnet
  • dbstore[1003-1005].eqiad.wmnet
  • pc[2007-2010].codfw.wmnet
  • pc[1007-1010].eqiad.wmnet

Media storage:

  • ms-be[1028-1033,1035-1059,2028-2057]
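The host lists above use a compact bracket notation for numeric ranges. As a reading aid, here is a minimal Python sketch that expands such a pattern into individual hostnames. The function name `expand_hosts` is hypothetical and not part of any tooling mentioned in this task; it only handles plain numeric ranges and comma-separated lists inside brackets.

```python
import re
from itertools import product


def expand_hosts(pattern):
    """Expand bracketed numeric ranges, e.g. 'pc[1007-1010].eqiad.wmnet'.

    Hypothetical helper for illustration only.
    """
    # Split into literal text and bracket groups,
    # e.g. ['pc', '[1007-1010]', '.eqiad.wmnet'].
    parts = re.split(r"(\[[^\]]+\])", pattern)
    choices = []
    for part in parts:
        if part.startswith("["):
            values = []
            for item in part[1:-1].split(","):
                start, _, end = item.partition("-")
                end = end or start  # a single number is a range of one
                width = len(start)  # preserve any zero padding
                values.extend(
                    str(n).zfill(width) for n in range(int(start), int(end) + 1)
                )
            choices.append(values)
        else:
            choices.append([part])
    # Combine every literal part with every value of every bracket group.
    return ["".join(combo) for combo in product(*choices)]
```

For example, `expand_hosts("db[1074-1139,1141-1149].eqiad.wmnet")` yields every db host in those two ranges and skips db1140, which the list above leaves out.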

Event Timeline

  • heze.codfw.wmnet will be deprecated (planned in Q3).
  • ms-be and ms-fe are media storage and not managed by us (yet). @fgiunchedi, could you comment on these?
  • dbproxy hosts are owned by WMCS, @Bstorm, could you comment on these?
LSobanski moved this task from Triage to Refine on the DBA board.
LSobanski added subscribers: Marostegui, Kormat.
LSobanski renamed this task from Some Data Persistence clusters apparently do not support IPv6 to Some Data Persistence DB clusters apparently do not support IPv6. Jan 4 2021, 7:47 PM
LSobanski triaged this task as Medium priority.

The databases are mostly blocked on the grants audit and cleanup (T270101), which is not an easy task.

  • ms-be and ms-fe are media storage and not managed by us (yet). @fgiunchedi, could you comment on these?

I'm fairly sure ms-be hosts can have their ipv6 added to DNS and things should work (modulo ferm / swift daemons reload perhaps). For ms-fe hosts things should similarly work I think (LVS should be fine since we're talking host addresses not service ip addresses). The thanos cluster hosts (which run swift among other things) have ipv6 for the most part and work as expected.

@fgiunchedi Is there any process we should follow to test/make sure everything is okay if we add ipv6 DNS for ms-be and ms-fe?

The easiest I think would be to:

  1. Add an AAAA record for one ms-fe and one ms-be host in codfw (less traffic). I believe this is safe as far as swift/lvs is concerned: swift addresses are v4 only and configured statically, and the lvs address (ms-fe.svc) shouldn't be affected anyway (?)
  2. Check swift logs for obvious errors
  3. Add AAAAs for all ms-fe/ms-be codfw hosts, and check for obvious errors
  4. Add AAAAs for ms-fe/ms-be in eqiad
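To confirm after each step that the intended hosts actually resolve over IPv6, something like the following Python sketch could be used. It is only an illustration: `has_aaaa` and `check_rollout` are hypothetical names, and `socket.getaddrinfo` goes through the system resolver, so it also reflects local overrides such as /etc/hosts rather than authoritative DNS alone.

```python
import socket


def has_aaaa(hostname, resolver=socket.getaddrinfo):
    """Return True if `hostname` resolves to at least one IPv6 address.

    `resolver` is injectable so the check can be exercised without network
    access; by default the system resolver is used.
    """
    try:
        results = resolver(hostname, None, socket.AF_INET6)
    except socket.gaierror:
        # Resolution failed for the IPv6 family: treat as "no AAAA".
        return False
    return len(results) > 0


def check_rollout(hostnames, resolver=socket.getaddrinfo):
    """Map each hostname to whether it currently resolves over IPv6."""
    return {host: has_aaaa(host, resolver) for host in hostnames}
```

Running `check_rollout` over the hosts touched in a given step gives a quick view of which names resolve over IPv6 before moving on to the next step.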

AFAICT, at least on ms-be hosts, the swift processes do not listen on v6. Shouldn't that be addressed too?
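One way to check this on a host is to look at the local-address column of `ss -tln`. The Python sketch below parses that output and reports which address families each listening port is bound on. It assumes the usual column order (State, Recv-Q, Send-Q, Local Address:Port, ...); the function name and the port numbers in the example are illustrative, not taken from the swift configuration. A bare `*` address is labelled `any` rather than guessed at.

```python
def listening_families(ss_output):
    """Map each listening TCP port to the address families it is bound on.

    Expects the text output of `ss -tln` (assumed column order:
    State, Recv-Q, Send-Q, Local Address:Port, Peer Address:Port).
    """
    families = {}
    for line in ss_output.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[0] != "LISTEN":
            continue  # skip the header and anything that is not a listener
        address, _, port = fields[3].rpartition(":")
        if not port.isdigit():
            continue
        if address.startswith("["):
            family = "v6"   # ss prints IPv6 addresses in brackets
        elif address == "*":
            family = "any"  # wildcard; exact meaning not assumed here
        else:
            family = "v4"
        families.setdefault(int(port), set()).add(family)
    return families
```

A port whose set contains only `v4` has no IPv6 listener, which is the situation described above for the swift processes.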

I think this one's done from the DBA perspective. Let me know if you think it's not the case.

I'm re-using this task because it is already on topic; let me know if you prefer a separate one instead.

Some clusters managed by the Data Persistence team have inconsistent AAAA DNS records for the primary IPv6 of the hosts. Some hosts have the AAAA record in the DNS for their primary IPv6 address, some don't.
See https://wikitech.wikimedia.org/wiki/DNS/Netbox#Mixed_clusters for more details about the possible risks of the current setup and the two alternative actions to move forward.

This is the list of the affected clusters and related hosts as of 04/07/2022:

  • ms-be*:
    • have the AAAA record: ms-be[1060-1071,2058-2069]
    • lack the AAAA record: ms-be[1028-1033,1035-1059,2028-2057]
  • db*:

    I know that DBs are out of scope for IPv6 support for now, but I'd like to highlight that one of them has an AAAA record and is caught by the refactored Netbox report. Is this host special in any way, and is it OK that it has the AAAA record?
    • have the AAAA record: db1108 (mariadb::misc::analytics::backup role)
    • lack the AAAA record: all the rest of the DBs.
Volans renamed this task from Some Data Persistence DB clusters apparently do not support IPv6 to Some Data Persistence clusters apparently do not support IPv6. Jul 7 2022, 2:23 PM
Volans updated the task description.

I created T320947 for the ms-be hosts.

db1108 is a Data Engineering host. @BTullis, is an AAAA record expected here?

Any update for the ms-be cluster that is still mixed? Can it be migrated to all have IPv6?

I also noticed that, of all the dbproxy hosts, only one (dbproxy1019) has an AAAA record, which could potentially be an issue.

As these clusters now belong to Data Persistence, we also have:

  • Restbase:
    • restbase[2013-2023], restbase[1019-1030]: no AAAA record
    • restbase[2024-2035], restbase103[1-3]: have AAAA record
  • Thanos-fe:
    • thanos-fe[2001-2003], thanos-fe[1001-1003]: no AAAA record
    • thanos-fe1004, thanos-fe2004: have AAAA record
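The "mixed cluster" condition discussed in this task can be expressed as a small check. The Python sketch below groups hosts by cluster name (the short hostname with its trailing digits removed, which is an assumption about the naming scheme) and reports clusters where some hosts have an AAAA record and some don't. The function name and input shape are hypothetical and are not the Netbox report's actual implementation.

```python
import re
from collections import defaultdict


def mixed_clusters(aaaa_by_host):
    """Return clusters where only some hosts have an AAAA record.

    `aaaa_by_host` maps a hostname to True/False (has an AAAA record).
    The cluster name is assumed to be the short hostname minus its
    trailing digits, e.g. 'thanos-fe1004' -> 'thanos-fe'.
    """
    clusters = defaultdict(lambda: {"with": [], "without": []})
    for host, has_record in sorted(aaaa_by_host.items()):
        name = re.sub(r"\d+$", "", host.split(".")[0])
        clusters[name]["with" if has_record else "without"].append(host)
    # A cluster is mixed only if both groups are non-empty.
    return {
        name: groups
        for name, groups in clusters.items()
        if groups["with"] and groups["without"]
    }
```

With this grouping, a cluster such as thanos-fe is reported as mixed, while a cluster whose hosts all have (or all lack) the record is not.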