Fix IPv6 autoconf issues once and for all, across the fleet.
Open, Medium, Public

Description

From convo w/ @faidon, a better path forward to get rid of IPv6 autoconf confusion in the present/future without relying on the token method (which doesn't work on precise anyways, and is complicated):

  1. Copy the original interface::add_ip6_mapped functionality (translate ipv4 into lower 64 bits, take upper 64 from either an already autoconf-configured address or from rdisc6) down to d-i so that it configures the explicit mapped v6 address at install-time for new installs. This puts the new hosts' v6 on the same footing as v4 is today: configured at install, left alone for runtime puppet. All new hosts installed under this scheme should have IPv6 added alongside IPv4 in DNS as well.
  2. Remove the current interface::add_ip6_mapped functionality from puppet for the hosts it's applied to, without undoing its basic work. That leaves the affected hosts with their static /e/n/i definition, and thus they're in a similar situation to fresh installs with the new hosts above. There's an extra complication here in that we also want to salt over these hosts and undo the effects of the ip token stuff in /e/n/i and be sure they're all left in a stable state with their configured static address: needs some testing.
  3. Deploy code similar to https://gerrit.wikimedia.org/r/#/c/217317/1 in a base class to all hosts (needs updates for non-upstart), which kills autoconf at boot time for all interfaces before the network service ever starts and flushes any current ones, and ensure it gets run at least once on current running hosts as well. This kills all autoconf addresses, and thus hosts that didn't get one from 1/2 above (old add_ip6_mapped hosts, or new installs) won't have IPv6 at all and will communicate with other dual stack hosts over v4 only, rather than using an autoconf ipv6 address to connect. This is a regression of v6 deployment in general, but brings us into a clean, known-good baseline state where we no longer have to deal with traffic from autoconf-style v6 addresses in any access/firewall rules.
  4. Going forward, for hosts where we need to add IPv6 without reinstalling, we'll need a consistent manual method of applying the same work as d-i, such as a one-off script that can be run to write the translated address to /e/n/i and bring it up for the first time (and add v6 DNS records for that host at that time as well).
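
The address translation in step 1 can be sketched as follows. This is a minimal illustration, not the actual interface::add_ip6_mapped or d-i code: the function name `mapped_ipv6` is hypothetical, and the digit-for-digit mapping (each decimal IPv4 octet written as one hex group) is inferred from addresses seen later in this task, such as 10.64.0.80 paired with 2620:0:861:101:10:64:0:80. The /64 prefix is passed in explicitly here; obtaining it from an autoconf address or rdisc6 is out of scope.

```python
import ipaddress


def mapped_ipv6(ipv4: str, prefix: str) -> ipaddress.IPv6Address:
    """Combine a /64 prefix with an interface ID derived from an IPv4 address.

    Hypothetical helper: assumes each decimal octet is reused digit-for-digit
    as one 16-bit hex group (octet 64 becomes group "64", i.e. 0x0064).
    """
    v4 = ipaddress.IPv4Address(ipv4)
    net = ipaddress.IPv6Network(prefix)
    if net.prefixlen != 64:
        raise ValueError("expected a /64 prefix")

    interface_id = 0
    for octet in v4.packed:
        # Reinterpret the decimal digits of the octet as hexadecimal digits.
        group = int(str(octet), 16)
        interface_id = (interface_id << 16) | group

    # Upper 64 bits come from the prefix, lower 64 bits from the IPv4 address.
    return ipaddress.IPv6Address(int(net.network_address) | interface_id)


print(mapped_ipv6("10.64.0.80", "2620:0:861:101::/64"))
```

A one-off script for step 4 could reuse the same function and then write the result to /e/n/i; that part is not sketched here.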

The only additional complication that's come to my mind is that this does not work out for the LVS hosts. I think they need an on-subnet IPv6 address (of any kind, doesn't matter if autoconf) on all of their per-vlan interfaces in order to route IPv6 traffic correctly, and they're currently relying on autoconf for that. We can fix this by defining explicit, manual addresses for them via interface::tagged and adding those to DNS as well.

Details

Repo               Branch      Lines +/-
operations/puppet  production  +12 -377
operations/puppet  production  +5 -0
operations/puppet  production  +3 -0
operations/puppet  production  +1 -0
operations/puppet  production  +2 -0
operations/puppet  production  +4 -0
operations/puppet  production  +24 -0
operations/puppet  production  +1 -0
operations/puppet  production  +1 -0
operations/puppet  production  +3 -0
operations/puppet  production  +3 -0
operations/puppet  production  +23 -0
operations/puppet  production  +5 -0
operations/puppet  production  +6 -0
operations/puppet  production  +2 -0
operations/puppet  production  +1 -0
operations/puppet  production  +2 -0
operations/puppet  production  +5 -0
operations/puppet  production  +1 -0
operations/puppet  production  +1 -0
operations/puppet  production  +1 -0
operations/puppet  production  +1 -0
operations/puppet  production  +1 -0
operations/puppet  production  +4 -0
operations/puppet  production  +1 -0
operations/puppet  production  +1 -0
operations/puppet  production  +1 -0
operations/puppet  production  +1 -0
operations/puppet  production  +1 -0
operations/puppet  production  +1 -0
operations/puppet  production  +3 -0
operations/puppet  production  +2 -0
operations/puppet  production  +2 -0
operations/puppet  production  +2 -0
operations/puppet  production  +3 -0
operations/puppet  production  +2 -0
operations/puppet  production  +2 -0
operations/puppet  production  +2 -0
operations/puppet  production  +1 -0
operations/puppet  production  +2 -0
operations/puppet  production  +1 -0
operations/puppet  production  +1 -0
operations/puppet  production  +4 -0
operations/puppet  production  +1 -0
operations/puppet  production  +2 -0
operations/puppet  production  +2 -0
operations/puppet  production  +2 -0
operations/puppet  production  +1 -0
operations/puppet  production  +1 -0
operations/puppet  production  +13 -0
operations/puppet  production  +6 -0
operations/puppet  production  +3 -0
operations/puppet  production  +3 -0
operations/puppet  production  +2 -0
operations/puppet  production  +1 -0
operations/puppet  production  +5 -0
operations/puppet  production  +2 -0
operations/puppet  production  +16 -0
operations/puppet  production  +1 -0
operations/puppet  production  +4 -0
operations/puppet  production  +3 -0
operations/puppet  production  +2 -0
operations/puppet  production  +1 -0
operations/puppet  production  +3 -0
operations/puppet  production  +3 -0
operations/puppet  production  +1 -0
operations/puppet  production  +1 -0
operations/puppet  production  +1 -0
operations/puppet  production  +3 -0
operations/puppet  production  +2 -0
operations/puppet  production  +1 -0
operations/puppet  production  +3 -0
operations/puppet  production  +2 -0
operations/puppet  production  +1 -0
operations/puppet  production  +1 -0
operations/puppet  production  +1 -0
operations/puppet  production  +2 -0
operations/puppet  production  +2 -0
operations/puppet  production  +2 -0
operations/puppet  production  +3 -5
operations/puppet  production  +1 -0
operations/puppet  production  +3 -0
operations/dns     master      +6 -0
operations/puppet  production  +12 -15
operations/puppet  production  +4 -0
operations/puppet  production  +1 -0

Event Timeline

Change 531266 merged by Jbond:
[operations/puppet@production] puppetboard/puppetdb: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531266

Change 531262 merged by Jbond:
[operations/puppet@production] mariadb::parsercache - codfw: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531262

Change 531164 merged by Jbond:
[operations/puppet@production] mariadb::core_multiinstance - codfw: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531164

Change 531453 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] MW servers - eqiad: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531453

Change 531217 merged by Jbond:
[operations/puppet@production] mariadb::temporary_storage: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531217

Change 531209 merged by Jbond:
[operations/puppet@production] mariadb::proxy - codfw: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531209

Change 531256 merged by Jbond:
[operations/puppet@production] MW servers - eqiad (canary and debug): add ipv6 mapped address

https://gerrit.wikimedia.org/r/531256

Change 531272 merged by Jbond:
[operations/puppet@production] restbase: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531272

Change 531271 merged by Jbond:
[operations/puppet@production] elasticsearch::relforge: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531271

Change 531203 merged by Jbond:
[operations/puppet@production] mariadb::misc::tendril: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531203

Change 531215 merged by Jbond:
[operations/puppet@production] elasticsearch::cirrus - codfw: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531215

Change 531216 merged by Jbond:
[operations/puppet@production] elasticsearch::cirrus - eqiad: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531216

Change 531195 merged by Jbond:
[operations/puppet@production] mariadb::misc::phabricator - codfw: add ipv6 address

https://gerrit.wikimedia.org/r/531195

Change 531255 merged by Jbond:
[operations/puppet@production] mw servers - codfw: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531255

Change 531264 abandoned by Jbond:
prometheus: add ipv6 mapped address

Reason:
not required

https://gerrit.wikimedia.org/r/531264

Change 531244 abandoned by Jbond:
lvs::balancer: add ipv6 mapped address

Reason:
not required

https://gerrit.wikimedia.org/r/531244

Change 531237 abandoned by Jbond:
installserver: add ipv6 mapped address

Reason:
not required

https://gerrit.wikimedia.org/r/531237

Change 531243 merged by Jbond:
[operations/puppet@production] logstash: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531243

Change 531251 merged by Jbond:
[operations/puppet@production] swift - codfw: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531251

Change 531252 merged by Jbond:
[operations/puppet@production] swift - eqiad: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531252

Change 531280 merged by Jbond:
[operations/puppet@production] wqds: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531280

Change 531453 merged by Jbond:
[operations/puppet@production] MW servers - eqiad: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531453

Change 531230 merged by Jbond:
[operations/puppet@production] grafana: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531230

Change 531231 merged by Jbond:
[operations/puppet@production] debug_proxy: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531231

Change 531245 merged by Jbond:
[operations/puppet@production] maps - codfw: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531245

Change 531246 merged by Jbond:
[operations/puppet@production] maps - eqiad: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531246

Change 531236 merged by Jbond:
[operations/puppet@production] graphite: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531236

Change 531239 merged by Jbond:
[operations/puppet@production] webserver_misc_apps: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531239

Change 531242 merged by Jbond:
[operations/puppet@production] openldap: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531242

Change 531258 merged by Jbond:
[operations/puppet@production] swap: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531258

Change 531275 merged by Jbond:
[operations/puppet@production] sessionstore: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531275

Change 531278 merged by Jbond:
[operations/puppet@production] thumbor: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531278

Change 531225 merged by Jbond:
[operations/puppet@production] etherpad: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531225

Change 531157 abandoned by Jbond:
spare::system: add ipv6 mapped addres

Reason:
not required

https://gerrit.wikimedia.org/r/531157

Change 531257 merged by Jbond:
[operations/puppet@production] logging::mediawiki::udp2log: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531257

Change 531207 merged by Jbond:
[operations/puppet@production] mariadb::backups - codfw: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531207

Change 531166 merged by Jbond:
[operations/puppet@production] role::mariadb::misc - codfw: add ipv6 mapped

https://gerrit.wikimedia.org/r/531166

Change 531200 merged by Jbond:
[operations/puppet@production] mariadb::misc::multiinstance: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531200

Change 531267 merged by Jbond:
[operations/puppet@production] redis - codfw: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531267

Change 531268 merged by Jbond:
[operations/puppet@production] redis - eqiad: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531268

Change 531281 merged by Jbond:
[operations/puppet@production] parsoid: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531281

Change 531250 merged by Jbond:
[operations/puppet@production] otrs: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531250

Change 531279 merged by Jbond:
[operations/puppet@production] xhgui::app: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531279

Change 531269 merged by Jbond:
[operations/puppet@production] docker_registry: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531269

Change 531265 merged by Jbond:
[operations/puppet@production] poolcounter: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531265

Change 531274 merged by Jbond:
[operations/puppet@production] eventschemas: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531274

Change 531238 merged by Jbond:
[operations/puppet@production] kafka::main: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531238

Change 531260 merged by Jbond:
[operations/puppet@production] ores: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531260

Change 531224 merged by Jbond:
[operations/puppet@production] etcd::networking: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531224

Change 531222 merged by Jbond:
[operations/puppet@production] etcd::kubernetes: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531222

Change 531247 merged by Jbond:
[operations/puppet@production] mediawiki::memcached - codfw: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531247

Change 531248 merged by Jbond:
[operations/puppet@production] mediawiki::memcached - eqiad: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531248

Change 531218 merged by Jbond:
[operations/puppet@production] failoid: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531218

Change 531205 merged by Jbond:
[operations/puppet@production] mariadb::dbstore_multiinstance - codfw: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531205

Change 531201 merged by Jbond:
[operations/puppet@production] mariadb::sanitarium_multiinstance: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531201

Change 531196 merged by Jbond:
[operations/puppet@production] mariadb::misc::phabricator - eqiad: add ipv6 address

https://gerrit.wikimedia.org/r/531196

Change 531208 merged by Jbond:
[operations/puppet@production] mariadb::backups - eqiad: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531208

Change 531206 merged by Jbond:
[operations/puppet@production] mariadb::dbstore_multiinstance - eqiad: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531206

Change 531167 merged by Jbond:
[operations/puppet@production] role::mariadb::misc - eqiad: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531167

Change 531161 merged by Jbond:
[operations/puppet@production] mariadb::core - codfw: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531161

Hi, I am a bit disconnected from the deployment planning for this. Once all hosts (or all hosts planned above) are migrated, is the puppet line supposed to go in the profile (or role), or in base.pp with some exclusions? It is not clear from the ticket description and comments, or I may have missed it, as it is a long ticket :-D.

Change 531210 merged by Jbond:
[operations/puppet@production] mariadb::proxy - eqiad: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531210

Hi, I am a bit disconnected from the deployment planning for this. Once all hosts (or all hosts planned above) are migrated, is the puppet line supposed to go in the profile (or role), or in base.pp with some exclusions? It is not clear from the ticket description and comments, or I may have missed it, as it is a long ticket :-D.

Sorry for the lack of clarity. Once all servers have the mapped IPv6 address, I plan to move this to the base profile, with some logic to exclude the WMCS servers.

Change 531263 merged by Jbond:
[operations/puppet@production] mariadb::parsercache - eqiad: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531263

Change 531173 merged by Jbond:
[operations/puppet@production] mariadb::core_multiinstance - eqiad: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531173

Change 531233 merged by Jbond:
[operations/puppet@production] backup::offsite: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531233

Sorry for the lack of clarity. Once all servers have the mapped IPv6 address, I plan to move this to the base profile, with some logic to exclude the WMCS servers.

Thanks!

Change 531174 merged by Jbond:
[operations/puppet@production] mariadb::core - eqiad: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531174

Change 531240 merged by Jbond:
[operations/puppet@production] labs::db: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531240

Change 531235 merged by Jbond:
[operations/puppet@production] wmcs::openstack::codfw1dev: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531235

Change 531227 merged by Jbond:
[operations/puppet@production] ganeti: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531227

Change 531241 merged by Jbond:
[operations/puppet@production] wmcs::nfs: add ipv6 mapped address

https://gerrit.wikimedia.org/r/531241

I just reimaged mw2231 for unrelated reasons (broken hardware; the system got swapped with a different server) and the reimage hung. I connected over mgmt, ran puppet manually on the serial console, and found it stuck in the ifup that enables the mapped address (which was enabled for the mw servers at the end of August). We're also seeing that error when the mapped address is enabled retroactively, so maybe we need to adapt the retry/timeout logic in the reimage script (or fix the puppet logic that makes it hang). Just adding this as a one-time observation here; we'll see whether it also occurs for other reimages.

Change 535529 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] ip6_mapped: add missing nodes

https://gerrit.wikimedia.org/r/535529

Change 535529 merged by Jbond:
[operations/puppet@production] ip6_mapped: add missing nodes

https://gerrit.wikimedia.org/r/535529

Change 535544 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] ip6_mapped: add ip6_mapped to profile::standard

https://gerrit.wikimedia.org/r/535544

Change 535544 merged by Jbond:
[operations/puppet@production] ip6_mapped: add ip6_mapped to profile::standard

https://gerrit.wikimedia.org/r/535544

Bump - these issues continue to affect us sometimes. There seem to be some cases where Juniper can mis-route an RA to an interface it doesn't belong on (interface is on vlanX, but gets an RA that should only ever be seen on vlanY). During this past week/weekend's switch issues in codfw, this issue caused all hosts in rack B2 (which are all in the private1-b-codfw vlan 2620:0:860:102:) to receive RAs from the cloud-hosts1-codfw vlan 2620:0:860:118:.

Once this happens, the RA causes a new wrong IPv6 address to appear at the top of the list for the primary interface, and then the next puppet agent run's facter picks this up as the new @ipaddress6 fact, and further we then write it via Augeas into /etc/network/interfaces (that code is there for other good reasons, in the existing legacy design of things!), persisting it for future reboots as a statically-configured address.

I audited the whole fleet this morning to find evidence of any other extant cases (basically, where a single real interface has multiple routable IPv6 addresses in distinct vlan subnets), and found two more in eqiad:

bblack@lvs1017:~$ ip -6 addr show dev eno1np0|grep 2620:
inet6 2620:0:861:104:10:64:0:80/64 scope global deprecated dynamic mngtmpaddr 
inet6 2620:0:861:101:10:64:0:80/64 scope global 
bblack@db1129:~$ ip -6 addr show dev eno1|grep 2620:
inet6 2620:0:861:104:10:64:0:99/64 scope global 
inet6 2620:0:861:101:10:64:0:99/64 scope global

lvs1017 hasn't been rebooted since this happened to it, so its errant :104: address is still marked as dynamic. But in the db1129 case, it has rebooted since and statically configured it from the line added to /e/n/i.
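
A check along the lines of that audit can be sketched as below. This is only an illustration under stated assumptions, not the script that was actually used: the function names are hypothetical, and it simply groups the global-scope addresses from `ip -6 addr show dev <iface>` output by their /64 network and flags an interface that has more than one.

```python
import ipaddress
from collections import defaultdict


def global_prefixes(ip_addr_output: str) -> dict:
    """Group global-scope IPv6 addresses by network, from `ip -6 addr` output."""
    by_prefix = defaultdict(list)
    for line in ip_addr_output.splitlines():
        fields = line.split()
        # Expect lines such as: inet6 <addr>/<len> scope global [flags...]
        if len(fields) < 4 or fields[0] != "inet6" or "global" not in fields:
            continue
        iface = ipaddress.IPv6Interface(fields[1])
        by_prefix[str(iface.network)].append(str(iface.ip))
    return dict(by_prefix)


def has_conflicting_prefixes(ip_addr_output: str) -> bool:
    """True if one interface holds global addresses in more than one subnet."""
    return len(global_prefixes(ip_addr_output)) > 1


# Sample taken from the lvs1017 output quoted above.
sample = """\
inet6 2620:0:861:104:10:64:0:80/64 scope global deprecated dynamic mngtmpaddr
inet6 2620:0:861:101:10:64:0:80/64 scope global
"""

for prefix, addrs in sorted(global_prefixes(sample).items()):
    print(prefix, addrs)
print("conflict:", has_conflicting_prefixes(sample))
```

Note that this does not distinguish dynamic from static addresses; as the db1129 case shows, an errant address can end up in either state.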

(I'm going to fix up these two and the B2 hosts today for now, but it's a one-off fix, not a longer-term solution).

I fixed all these cases noted above for now. Note that in the lvs1017 case, this could've potentially caused a public service outage for IPv6 text-lb. This is because @ipaddress6 was also templated into pybal.conf as the BGP next-hop address. After the fixup, the puppet agent fixed that:

Notice: /Stage[main]/Pybal::Configuration/File[/etc/pybal/pybal.conf]/content: 
--- /etc/pybal/pybal.conf	2022-12-26 14:09:46.755325625 +0000
+++ /tmp/puppet-file20230118-4001-fj19ss	2023-01-18 19:47:03.448134870 +0000
@@ -7,7 +7,7 @@
 bgp-peer-address = [ '208.80.154.196', '208.80.154.197' ]
 #bgp-as-path = 64600 64601
 bgp-nexthop-ipv4 = 10.64.0.80
-bgp-nexthop-ipv6 = 2620:0:861:104:10:64:0:80
+bgp-nexthop-ipv6 = 2620:0:861:101:10:64:0:80

We escaped the outage this would cause by blind luck: nobody's happened to restart pybal on lvs1017 since this problem started on this host ~3 weeks ago (I'm guessing from the RA lifetimes remaining). On the next host reboot or pybal restart, it would've hit us.
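
One way to catch this class of problem before a pybal restart would be a sanity check on the rendered config. The sketch below is hypothetical and not part of pybal or the puppet code: the function names are made up, and the expected subnet is supplied by the caller. It only relies on the `bgp-nexthop-ipv6` key visible in the diff above.

```python
import ipaddress
import re


def nexthop_ipv6(pybal_conf: str):
    """Return the bgp-nexthop-ipv6 value from pybal.conf text, or None."""
    match = re.search(
        r"^\s*bgp-nexthop-ipv6\s*=\s*(\S+)", pybal_conf, re.MULTILINE
    )
    return ipaddress.IPv6Address(match.group(1)) if match else None


def nexthop_in_expected_subnet(pybal_conf: str, expected_subnet: str) -> bool:
    """True if the configured IPv6 next hop lies inside the expected subnet."""
    addr = nexthop_ipv6(pybal_conf)
    return addr is not None and addr in ipaddress.IPv6Network(expected_subnet)


# Config fragments modelled on the lvs1017 diff above.
before = "bgp-nexthop-ipv4 = 10.64.0.80\nbgp-nexthop-ipv6 = 2620:0:861:104:10:64:0:80\n"
after = "bgp-nexthop-ipv4 = 10.64.0.80\nbgp-nexthop-ipv6 = 2620:0:861:101:10:64:0:80\n"

print(nexthop_in_expected_subnet(before, "2620:0:861:101::/64"))
print(nexthop_in_expected_subnet(after, "2620:0:861:101::/64"))
```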

jbond removed jbond as the assignee of this task. Feb 22 2023, 1:03 PM

In the medium term I think we need to carefully consider how this operates, probably as part of a move away from using ifupdown as discussed in T234207, and driving all interface configuration (including things like LVS additional and sub-interfaces) from Netbox.

Once this happens, the RA causes a new wrong IPv6 address to appear at the top of the list for the primary interface

For this specific problem, I wonder if it may make sense to toggle these sysctls to disable RA processing on non-primary interfaces?

net.ipv6.conf.all.accept_ra = 1
net.ipv6.conf.<non_primary_netdev>.accept_ra = 0
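
As a rough sketch of how those settings could be rendered per host, assuming the primary interface name and the full interface list are already known (for example from facts or Netbox). The function name and the interface name `eno2` are hypothetical; only the two sysctl keys come from the suggestion above.

```python
from typing import Iterable, List


def accept_ra_sysctls(primary: str, interfaces: Iterable[str]) -> List[str]:
    """Render sysctl lines that disable RA processing on non-primary interfaces."""
    lines = ["net.ipv6.conf.all.accept_ra = 1"]
    for name in sorted(interfaces):
        if name == primary:
            # The primary interface keeps the default behaviour.
            continue
        lines.append(f"net.ipv6.conf.{name}.accept_ra = 0")
    return lines


# "eno2" is a made-up secondary interface used only for illustration.
print("\n".join(accept_ra_sysctls("eno1", ["eno1", "eno2"])))
```

This only covers RAs arriving on non-primary interfaces; a mis-routed RA arriving on the primary interface itself, as described earlier in this task, would not be affected by it.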