eqiad1: cloudlb: transition DNS clients (VMs) to the new BGP-based recursor VIP
Closed, Resolved (Public)

Description

In the legacy network model, each designate DNS recursor had its own public IPv4 address. In the new model, a single BGP anycast VIP is shared among all the recursors. See also: T307357: Move cloud vps ns-recursor IPs to host/row-independent addressing

So, as part of T341060: openstack eqiad1: introduce cloud-private and cloudlb, we need to transition
from:

  • ns-recursor0.openstack.eqiad1.wikimediacloud.org
  • ns-recursor1.openstack.eqiad1.wikimediacloud.org

to:

  • ns-recursor.openstack.eqiad1.wikimediacloud.org

Note: we are using ns-recursor-next.openstack.eqiad1.wikimediacloud.org as a placeholder while preparing the code changes, etc.
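
For illustration only, here is a minimal Python sketch for checking which of the recursor names listed above still resolve during the transition. The hostnames come from this task; the helper names (`system_lookup`, `resolve_all`) are hypothetical, and the lookup goes through whatever resolver the local system is configured with, so results depend on where it runs.

```python
import socket
from typing import Callable, Dict, Iterable, List

# Hostnames taken from the task description.
LEGACY_NAMES = [
    "ns-recursor0.openstack.eqiad1.wikimediacloud.org",
    "ns-recursor1.openstack.eqiad1.wikimediacloud.org",
]
NEW_NAME = "ns-recursor.openstack.eqiad1.wikimediacloud.org"


def system_lookup(name: str) -> List[str]:
    """Return the sorted IPv4 addresses for name, or [] if the lookup fails."""
    try:
        infos = socket.getaddrinfo(name, None, family=socket.AF_INET)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def resolve_all(
    names: Iterable[str],
    lookup: Callable[[str], List[str]] = system_lookup,
) -> Dict[str, List[str]]:
    """Map each name to its resolved addresses (an empty list means no A record found)."""
    return {name: lookup(name) for name in names}
```

Calling `resolve_all(LEGACY_NAMES + [NEW_NAME])` returns one entry per name; an empty list for a legacy name would suggest its record has already been removed. The `lookup` parameter exists only so the function can be exercised without network access.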


Event Timeline

Change 941383 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudservices1006: prepare service

https://gerrit.wikimedia.org/r/941383

aborrero changed the task status from Open to Stalled. Aug 29 2023, 12:39 PM
aborrero moved this task from Next to Blocked on the User-aborrero board.

Blocked on T342161: Q1:rack/setup/install cloudservices1006.eqiad.wmnet; we need the server to be available.

aborrero changed the task status from Stalled to In Progress. Aug 30 2023, 9:57 AM
aborrero moved this task from Blocked to Doing on the User-aborrero board.

Change 941383 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudservices1006: prepare service

https://gerrit.wikimedia.org/r/941383

Change 954605 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: eqiad1: drop -next suffix from ns-recursor

https://gerrit.wikimedia.org/r/954605

Change 954605 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: eqiad1: drop -next suffix from ns-recursor

https://gerrit.wikimedia.org/r/954605

Mentioned in SAL (#wikimedia-cloud) [2023-09-05T12:45:57Z] <arturo> moved all VMs to ns-recursor.openstack.eqiad1.wikimediacloud.org via project puppet (T345240, T342621)

Mentioned in SAL (#wikimedia-cloud) [2023-09-06T08:47:02Z] <arturo> switch project to new DNS recursor via horizon project hiera (T345240, T342621)

Change 956415 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/dns@master] wmcs: drop cloudservices1004 addresses

https://gerrit.wikimedia.org/r/956415

Change 956417 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: refresh cloudservices1006 ns address

https://gerrit.wikimedia.org/r/956417

Change 956419 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] wmcs: refresh DNS addresses

https://gerrit.wikimedia.org/r/956419

Change 956417 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: refresh cloudservices1006 ns address

https://gerrit.wikimedia.org/r/956417

Change 956419 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] wmcs: refresh DNS addresses

https://gerrit.wikimedia.org/r/956419

Change 956415 merged by Arturo Borrero Gonzalez:

[operations/dns@master] wmcs: drop cloudservices1004 addresses

https://gerrit.wikimedia.org/r/956415

Mentioned in SAL (#wikimedia-cloud) [2023-09-11T12:36:34Z] <arturo> update DNS resolver cloud-wide to use 172.20.255.1 (T342621)
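
As a rough way to verify the cloud-wide resolver change on a given VM, the sketch below parses the text of a resolv.conf file and checks that every `nameserver` entry is the address quoted in the SAL entry above. The function names are hypothetical, and the check assumes that a VM should list only that single address, which this task does not state explicitly.

```python
from typing import List

# Address quoted in the SAL entry; assumed here to be the only expected nameserver.
EXPECTED_RECURSOR = "172.20.255.1"


def nameservers(resolv_conf_text: str) -> List[str]:
    """Extract the addresses of all 'nameserver' lines, in file order."""
    servers = []
    for raw_line in resolv_conf_text.splitlines():
        line = raw_line.strip()
        # resolv.conf comments start with '#' or ';' at the beginning of the line.
        if not line or line.startswith(("#", ";")):
            continue
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def uses_only_expected(resolv_conf_text: str, expected: str = EXPECTED_RECURSOR) -> bool:
    """True if at least one nameserver is configured and all of them equal expected."""
    servers = nameservers(resolv_conf_text)
    return bool(servers) and all(server == expected for server in servers)
```

On a VM this could be fed the contents of `/etc/resolv.conf`, for example `uses_only_expected(open("/etc/resolv.conf").read())`.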

Change 956429 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/homer/public@master] cr-cloud: add ns-recursor.openstack.eqiad1

https://gerrit.wikimedia.org/r/956429

Change 956429 merged by Arturo Borrero Gonzalez:

[operations/homer/public@master] cr-cloud: add ns-recursor.openstack.eqiad1

https://gerrit.wikimedia.org/r/956429

Change 956463 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] wmcs: remove ns-recursorX FQDNs

https://gerrit.wikimedia.org/r/956463

Change 956463 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] wmcs: remove ns-recursorX FQDNs

https://gerrit.wikimedia.org/r/956463

Puppet is failing to run on a bunch of integration project nodes. For example, on integration-agent-docker-1052.integration.eqiad1.wikimedia.cloud:

Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Method call, DNS lookup failed for ns-recursor1.openstack.eqiad1.wikimediacloud.org Resolv::DNS::Resource::IN::A (file: /etc/puppet/modules/resolvconf/manifests/init.pp, line: 25, column: 34) on node integration-agent-docker-1052.integration.eqiad1.wikimedia.cloud

This seems related to this ticket.

integration-puppetmaster-02 has local cherry-picks that have been blocking the git-sync-upstream process since early July. Fixing Puppet git repository updates will resolve the Puppet failures you're seeing.

I removed the offending commit from integration-puppetmaster-02.integration.eqiad.wmflabs:/var/lib/git/operations/puppet. Thanks again @taavi for the pointer to the root problem.
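
The Puppet failure reported above came from a stale checkout that still referenced a removed per-host FQDN. As a sketch only, the following Python helper walks a directory tree (for example, a Puppet checkout) and reports lines that still mention the legacy `ns-recursorN` names. The function name and the choice to skip `.git` directories are assumptions made for illustration; this is not a tool used in this task.

```python
import re
from pathlib import Path
from typing import Iterator, Tuple

# Matches the legacy per-host names (ns-recursor0, ns-recursor1, ...) but not
# the shared name ns-recursor.openstack.eqiad1.wikimediacloud.org.
LEGACY_RE = re.compile(r"ns-recursor\d+\.openstack\.eqiad1\.wikimediacloud\.org")


def find_legacy_references(root: Path) -> Iterator[Tuple[Path, int, str]]:
    """Yield (path, line number, stripped line) for each line naming a legacy recursor."""
    for path in sorted(root.rglob("*")):
        # Skip directories and anything inside a .git directory.
        if not path.is_file() or ".git" in path.parts:
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # ignore binary or unreadable files
        for lineno, line in enumerate(text.splitlines(), start=1):
            if LEGACY_RE.search(line):
                yield path, lineno, line.strip()
```

A call such as `list(find_legacy_references(Path("/var/lib/git/operations/puppet")))` would list any leftover references in that checkout; an empty list means none were found in readable text files.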