cloudservices1005: move to new setup
Closed, Resolved, Public
Assigned To

Authored By

	aborrero
	Sep 11 2023, 2:37 PM

Description

cloudservices1005 is moving to a new network setup.

We should:

drop wikimedia.org domain in favor of .eqiad.wmnet.
drop connection to asw
add private.eqiad.wikimedia.cloud address

Following procedure at https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Rename_while_reimaging

relocate DNS auth address to offer service while in downtime
run sre.hosts.decommission cookbook
re-rack into eqiad D5 per T341494: cloud @ eqiad: hardware re-racking plan
adjust netbox data (per procedure above) and run the cookbooks
adjust new puppet role (mind the new domain)
reimage as new host
Run instructions to put into service: https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/DNS/Designate#Initial_designate/pdns_node_setup

Details

Subject	Repo	Branch	Lines +/-
cloudservices1006: stop announcing ns0.openstack.eqiad1.wikimediacloud.org	operations/puppet	production	+0 -7
openstack: eqiad1: services: enable cloud-private for cloudservices1005	operations/puppet	production	+4 -12
openstack: eqiad1: pdns: recursor: fix list of pdns hosts	operations/puppet	production	+1 -18
cloudservices1005: prepare for reimage and back into service	operations/puppet	production	+36 -45
openstack: eqiad1: pdns: refactor monitor checks	operations/puppet	production	+33 -42
openstack: remove overrides for designate_hosts	operations/puppet	production	+3 -17
openstack: drop references to cloudcontrol1005	operations/puppet	production	+0 -8
wikimediacloud.org: drop 208.80.154.148 from ns0.openstack	operations/dns	master	+0 -2
cloudservices1006: make pdns auth listen on the new ns0.openstack address	operations/puppet	production	+8 -1

Related Objects

Status	Assigned	Task
Resolved	aborrero	T296411 cloud: decide on general idea for having cloud-dedicated hardware provide service in the cloud realm & the internet
Resolved	aborrero	T297596 have cloud hardware servers in the cloud realm using a dedicated LB layer
Resolved	taavi	T341060 openstack eqiad1: introduce cloud-private and cloudlb
Resolved	Jclark-ctr	T341494 cloud @ eqiad: hardware re-racking plan
Resolved	Jclark-ctr	T346042 cloudservices1005: move to new setup
Resolved	aborrero	T346177 Certain systems failing to resolve DNS entries under toolforge.org, wmcloud.org, wmflabs.org, toolserver.org
Resolved	RobH	T346326 markmonitor update: refresh ns0.openstack.eqiad1.wikimediacloud.org glue A record to point to 185.15.56.162
Resolved	fnegri	T346385 cloudservices1006 using 10. address to send DNS NOTIFYs to cloudservices1005

Event Timeline

taavi added a subtask: T346177: Certain systems failing to resolve DNS entries under toolforge.org, wmcloud.org, wmflabs.org, toolserver.org. Sep 12 2023, 6:13 PM

aborrero removed a subtask: T346177: Certain systems failing to resolve DNS entries under toolforge.org, wmcloud.org, wmflabs.org, toolserver.org. Sep 13 2023, 10:25 AM

aborrero added a subtask: T346177: Certain systems failing to resolve DNS entries under toolforge.org, wmcloud.org, wmflabs.org, toolserver.org. Sep 13 2023, 10:31 AM

so here is the plan:

1. We will let the markmonitor updates in T346177: Certain systems failing to resolve DNS entries under toolforge.org, wmcloud.org, wmflabs.org, toolserver.org get applied, for both ns1 and ns0 (as of this writing, we are requesting ns0 to point to 208.80.154.148).
2. We will route 208.80.154.148 to cloudservices1006 and NAT the address on that box to 185.15.56.163 so the pdns_server can handle the requests.
3. This will get both ns0 (208.80.154.148) and ns1 (185.15.56.163) running and serving clients, in a somewhat stable fashion.
4. We will then decom and re-rack cloudservices1005 in the datacenter.
5. When cloudservices1005 is back online, in the new network setup, it should listen on 185.15.56.162.
6. We will request markmonitor to update ns0 to point to 185.15.56.162.

As of this writing, points 1 to 3 are being implemented by @cmooney and me.

aborrero mentioned this in T346177: Certain systems failing to resolve DNS entries under toolforge.org, wmcloud.org, wmflabs.org, toolserver.org. Sep 13 2023, 12:46 PM

Update: the plan outlined above may not work after all. We just did step 2 and things broke, notably the resolver stopped working.

We are investigating.

To clarify: we won't be shutting down cloudservices1005 today.

@aborrero thanks for the update. Please advise when we are good to proceed.

Thanks @aborrero, that plan is what I expected based on our chat earlier.

Re step 2, we should retry when we think we've ironed out our niggles. Re-routing the IP to cloudservices1006, and reverting if needs be, is probably better than running decom for cloudservices1005 and trying to roll back.

@taavi wrote:

   NEXT STEPS

I think the steps to complete this migration without any further user impact are roughly the following:
1. Make cloudservices1006 also answer queries for 185.15.56.162 (new ns0).
2. Update both the wikimediacloud.org zone file and the .org glue records to reference .162 as the ns0 record.
3. Wait for all of the DNS TTLs for ns0 to expire.
4. Revert the routing hacks for 208.80.154.11. Also remove the Netbox record for it.
5. Move cloudservices1005 to the cloudlb network setup.
6. Move .162 (new ns0) from cloudservices1006 to 1005.
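
Step 3 above is simple arithmetic: a resolver that cached the old ns0 record just before the change can keep it for at most that record's TTL, so the longest TTL among the changed records bounds the wait. A minimal sketch of that calculation follows. The function name and the TTL values are hypothetical, since the thread does not state the TTLs of the zone and glue records.

```python
from datetime import datetime, timedelta, timezone


def earliest_safe_time(changed_at: datetime, ttls_seconds: list[int]) -> datetime:
    """Return the time after which no compliant cache should hold the old record.

    A copy fetched just before the change lives for at most its TTL,
    so the longest TTL among the changed records bounds the wait.
    """
    if not ttls_seconds:
        raise ValueError("need at least one TTL")
    return changed_at + timedelta(seconds=max(ttls_seconds))


# Hypothetical values for illustration only: one zone record TTL, one glue TTL.
changed = datetime(2023, 9, 14, 17, 15, tzinfo=timezone.utc)
print(earliest_safe_time(changed, [3600, 86400]))
```

This only covers resolvers that honour TTLs; clients with the old address hardcoded need separate handling.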

@taavi @aborrero that's not a bad plan of action at all.

In terms of step 4 I'm not sure we need to hold off, but in general there is no rush so that's fine.

What's the idea in terms of making cloudservices1006 respond for 185.15.56.162? Do we need to do something similar to what was done for 208.80.154.11? Or can you add that to the 'lo' interface on the box, get pdns to listen on it and BIRD to announce in BGP?

Change 957722 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudservices1006: make pdns auth listen on the new ns0.openstack address

https://gerrit.wikimedia.org/r/957722

gerritbot added a project: Patch-For-Review. Sep 14 2023, 11:46 AM

aborrero closed subtask T346177: Certain systems failing to resolve DNS entries under toolforge.org, wmcloud.org, wmflabs.org, toolserver.org as Resolved. Sep 14 2023, 11:48 AM

Change 957722 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudservices1006: make pdns auth listen on the new ns0.openstack address

https://gerrit.wikimedia.org/r/957722

Maintenance_bot removed a project: Patch-For-Review. Sep 14 2023, 12:10 PM

Mentioned in SAL (#wikimedia-cloud) [2023-09-14T12:11:12Z] <arturo> enable puppet on cloudservices1006 to drop local NAT hacks and enable new DNS auth IP address (T346042)

cloudservices1006 is now replying to DNS auth queries on the 185.15.56.162 address, which will later be handed over to cloudservices1005:

arturo@nostromo:~ $ dig @185.15.56.162 k8s.svc.tools.eqiad1.wikimedia.cloud

; <<>> DiG 9.18.16-1-Debian <<>> @185.15.56.162 k8s.svc.tools.eqiad1.wikimedia.cloud
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 13962
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;k8s.svc.tools.eqiad1.wikimedia.cloud. IN A

;; ANSWER SECTION:
k8s.svc.tools.eqiad1.wikimedia.cloud. 300 IN A	172.16.6.113

;; Query time: 116 msec
;; SERVER: 185.15.56.162#53(185.15.56.162) (UDP)
;; WHEN: Thu Sep 14 14:44:22 CEST 2023
;; MSG SIZE  rcvd: 81

arturo@nostromo:~ $ dig @185.15.56.162 toolforge.org

; <<>> DiG 9.18.16-1-Debian <<>> @185.15.56.162 toolforge.org
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 8114
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;toolforge.org.			IN	A

;; ANSWER SECTION:
toolforge.org.		3600	IN	A	185.15.56.11

;; Query time: 108 msec
;; SERVER: 185.15.56.162#53(185.15.56.162) (UDP)
;; WHEN: Thu Sep 14 14:44:28 CEST 2023
;; MSG SIZE  rcvd: 58
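
What makes the two answers above count as "DNS auth" replies is the `aa` flag in the header together with `status: NOERROR`. As a rough sketch, that check could be automated by parsing dig's text output. The helper names below are made up for illustration, and the parsing assumes the header layout shown in the transcripts above.

```python
import re


def dig_flags(output: str) -> set[str]:
    """Return the header flags (e.g. qr, aa, rd) from dig's text output."""
    # Only the header line starts with ';; flags:'; the EDNS line starts with '; EDNS'.
    match = re.search(r"^;; flags:([^;]*);", output, re.MULTILINE)
    if match is None:
        return set()
    return set(match.group(1).split())


def is_authoritative_noerror(output: str) -> bool:
    """True if the reply has status NOERROR and the 'aa' (authoritative) flag."""
    status = re.search(r"status: (\w+)", output)
    return (
        status is not None
        and status.group(1) == "NOERROR"
        and "aa" in dig_flags(output)
    )


# Header lines copied from the first transcript above.
sample = (
    ";; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 13962\n"
    ";; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1\n"
)
print(is_authoritative_noerror(sample))
```

The output of a real `dig @185.15.56.162 <name>` run could be fed to `is_authoritative_noerror` the same way.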

aborrero updated the task description. Sep 14 2023, 12:48 PM

Mentioned in SAL (#wikimedia-cloud) [2023-09-14T14:42:05Z] <arturo> DNS operation: route 208.80.154.148 to cloudservices1006 in anticipation of cloudservices1005 decom (T346042)

Icinga downtime and Alertmanager silence (ID=fd552e4c-12f5-4380-9775-a70e560609fd) set by cmooney@cumin1001 for 2:00:00 on 1 host(s) and their services with reason: test before full decom

cloudservices1005.wikimedia.org

Change 957767 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/dns@master] wikimediacloud.org: drop 208.80.154.148 from ns0.openstack

https://gerrit.wikimedia.org/r/957767

gerritbot added a project: Patch-For-Review. Sep 14 2023, 4:02 PM

Change 957767 abandoned by Cathal Mooney:

[operations/dns@master] wikimediacloud.org: drop 208.80.154.148 from ns0.openstack

Reason:

Parent change to remove ns-recursor0 IP is not something we need to do before this, submitting separate CR just for this.

https://gerrit.wikimedia.org/r/957767

Maintenance_bot removed a project: Patch-For-Review. Sep 14 2023, 4:31 PM

cmooney closed subtask T346326: markmonitor update: refresh ns0.openstack.eqiad1.wikimediacloud.org glue A record to point to 185.15.56.162 as Resolved. Sep 14 2023, 5:15 PM

cmooney added a subtask: T346385: cloudservices1006 using 10. address to send DNS NOTIFYs to cloudservices1005. Sep 14 2023, 6:24 PM

Update on current progress on above steps:

~~1. Make cloudservices1006 also answer queries for 185.15.56.162 (new ns0).~~
~~2. Update both the wikimediacloud.org zone file and the .org glue records to reference .162 as the ns0 record.~~
~~3. Wait for all of the DNS TTLs for ns0 to expire.~~
~~4. Revert the routing hacks for 208.80.154.11. Also remove the Netbox record for it.~~

5. Move cloudservices1005 to the cloudlb network setup.
6. Move .162 (new ns0) from cloudservices1006 to 1005.

I added sub-task T346385 based on some learnings we had earlier that we need to factor into the remaining steps 5 and 6.

cmooney updated the task description. Sep 14 2023, 6:58 PM

@aborrero most (if not all) of the Toolforge tools are throwing 504's (T346126). Seems related to this. Is there any way we can fix this?

cmooney added a subtask: T346426: Some VPS instances still using ns-recursor0. Sep 15 2023, 8:30 AM

In T346042#9168336, @Brycehughes wrote:

@aborrero most (if not all) of the Toolforge tools are throwing 504's (T346126). Seems related to this. Is there any way we can fix this?

not related.

hey @Jclark-ctr we are now ready to do this migration next week. Starting next Monday 2023-09-18, when could you handle this?

Again, my plan is to shut down the server before you arrive at the DC.

aborrero closed subtask T346426: Some VPS instances still using ns-recursor0 as Resolved. Sep 15 2023, 12:57 PM

Brycehughes unsubscribed. Sep 16 2023, 12:46 AM

aborrero moved this task from Doing to Next on the User-aborrero board. Sep 18 2023, 8:50 AM

aborrero moved this task from Next to Doing on the User-aborrero board.

aborrero moved this task from Doing to Next on the User-aborrero board.

Mentioned in SAL (#wikimedia-cloud) [2023-09-18T11:40:52Z] <arturo> decomission cloudservices1005 T346042 in preparation for re-racking

cookbooks.sre.hosts.decommission executed by aborrero@cumin1001 for hosts: cloudservices1005.wikimedia.org

cloudservices1005.wikimedia.org (PASS)
- Downtimed host on Icinga/Alertmanager
- Found physical host
- Downtimed management interface on Alertmanager
- Wiped all swraid, partition-table and filesystem signatures
- Powered off
- [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
- Configured the linked switch interface(s)
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB

aborrero updated the task description. Sep 18 2023, 11:50 AM

Mentioned in SAL (#wikimedia-cloud) [2023-09-18T11:53:39Z] <taavi> update designate urls in Keystone to point to openstack-next, until cloudlb is serving the main openstack address T346042

Change 958465 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: drop references to cloudcontrol1005

https://gerrit.wikimedia.org/r/958465

Change 958465 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: drop references to cloudcontrol1005

https://gerrit.wikimedia.org/r/958465

Maintenance_bot removed a project: Patch-For-Review. Sep 18 2023, 12:10 PM

Change 958467 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: remove overrides for designate_hosts

https://gerrit.wikimedia.org/r/958467

Change 958467 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: remove overrides for designate_hosts

https://gerrit.wikimedia.org/r/958467

Maintenance_bot removed a project: Patch-For-Review. Sep 18 2023, 12:30 PM

hey @Jclark-ctr (or @VRiley-WMF) this server should be ready to be re-racked into rack D5.

@aborrero I have moved the server physically and in Netbox. I did not delete any interfaces from Netbox. New cable ID: 20220117, port 41 on cloudsw1-d5-eqiad.

Jclark-ctr updated the task description. Sep 18 2023, 2:07 PM

Change 958904 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: eqiad1: pdns: refactor monitor checks

https://gerrit.wikimedia.org/r/958904

gerritbot added a project: Patch-For-Review. Sep 19 2023, 10:45 AM

Change 958904 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: eqiad1: pdns: refactor monitor checks

https://gerrit.wikimedia.org/r/958904

Maintenance_bot removed a project: Patch-For-Review. Sep 19 2023, 11:31 AM

Change 958915 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudservices1005: prepare for reimage and back into service

https://gerrit.wikimedia.org/r/958915

gerritbot added a project: Patch-For-Review. Sep 19 2023, 11:36 AM

I misclicked in Netbox and deleted the whole device entry for cloudservices1005, meaning it is no longer registered in Netbox. Transaction changelog: https://netbox.wikimedia.org/extras/changelog/?request_id=2c28eb69-41e4-4907-b745-f4b2113cb369

I'm currently trying to roll back the change.

aborrero moved this task from Next to Doing on the User-aborrero board. Sep 19 2023, 4:01 PM

Change 958915 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudservices1005: prepare for reimage and back into service

https://gerrit.wikimedia.org/r/958915

Maintenance_bot removed a project: Patch-For-Review. Sep 20 2023, 8:32 AM

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudservices1005.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudservices1005.eqiad.wmnet with OS bullseye executed with errors:

cloudservices1005 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudservices1005.eqiad.wmnet with OS bullseye

aborrero updated the task description. Sep 20 2023, 8:58 AM

Change 959168 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: eqiad1: pdns: recursor: fix list of pdns hosts

https://gerrit.wikimedia.org/r/959168

gerritbot added a project: Patch-For-Review. Sep 20 2023, 9:04 AM

Change 959168 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: eqiad1: pdns: recursor: fix list of pdns hosts

https://gerrit.wikimedia.org/r/959168

Maintenance_bot removed a project: Patch-For-Review. Sep 20 2023, 9:11 AM

Change 959171 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: eqiad1: services: enable cloud-private for cloudservices1005

https://gerrit.wikimedia.org/r/959171

gerritbot added a project: Patch-For-Review. Sep 20 2023, 9:21 AM

Change 959171 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: eqiad1: services: enable cloud-private for cloudservices1005

https://gerrit.wikimedia.org/r/959171

Maintenance_bot removed a project: Patch-For-Review. Sep 20 2023, 9:32 AM

Mentioned in SAL (#wikimedia-operations) [2023-09-20T09:34:24Z] <aborrero@cumin1001> START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "cloudservices1005 - aborrero@cumin1001 - T346042"

Mentioned in SAL (#wikimedia-operations) [2023-09-20T09:35:11Z] <aborrero@cumin1001> END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "cloudservices1005 - aborrero@cumin1001 - T346042"

Mentioned in SAL (#wikimedia-operations) [2023-09-20T09:38:46Z] <aborrero@cumin1001> START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "cloudservices1005 - aborrero@cumin1001 - T346042"

Mentioned in SAL (#wikimedia-operations) [2023-09-20T09:39:46Z] <aborrero@cumin1001> END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "cloudservices1005 - aborrero@cumin1001 - T346042"

Change 959175 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudservices1006: stop announcing ns0.openstack.eqiad1.wikimediacloud.org

https://gerrit.wikimedia.org/r/959175

gerritbot added a project: Patch-For-Review. Sep 20 2023, 9:55 AM

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudservices1005.eqiad.wmnet with OS bullseye completed:

cloudservices1005 (PASS)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309200913_aborrero_18549_cloudservices1005.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
- Updated Netbox status planned -> active
- The sre.puppet.sync-netbox-hiera cookbook was run successfully

Mentioned in SAL (#wikimedia-cloud) [2023-09-20T10:06:03Z] <arturo> running SQL command update domains set master="185.15.56.162:5354 185.15.56.163:5354" on cloudservices1005/1005 (T346042)

Mentioned in SAL (#wikimedia-cloud) [2023-09-20T10:28:20Z] <arturo> running SQL command update domains set master="172.20.1.5:5354 172.20.2.4:5354 185.15.56.162:5354 185.15.56.163:5354"; on cloudservices1005/1006 (T346042)
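
The `master` value in the two SQL commands above is a space-separated list of `address:port` entries. A small sketch of building that string from a list of addresses, so a typo in one entry is caught before it reaches the database, is shown below. The helper name is hypothetical; the port 5354 and the addresses are taken from the log line above, and only IPv4 is handled.

```python
import ipaddress


def masters_value(addresses: list[str], port: int = 5354) -> str:
    """Build a space-separated 'ip:port' list like the one in the SQL update above."""
    parts = []
    for addr in addresses:
        # IPv4Address raises a ValueError subclass on a malformed address.
        ip = ipaddress.IPv4Address(addr)
        parts.append(f"{ip}:{port}")
    return " ".join(parts)


# Addresses copied from the second SAL entry above.
value = masters_value(
    ["172.20.1.5", "172.20.2.4", "185.15.56.162", "185.15.56.163"]
)
print(f'update domains set master="{value}";')
```

When running this against a live database, a parameterised query from the database driver would be a safer way to pass the value than string formatting.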

aborrero updated the task description. Sep 20 2023, 10:34 AM

Change 959175 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudservices1006: stop announcing ns0.openstack.eqiad1.wikimediacloud.org

https://gerrit.wikimedia.org/r/959175

aborrero closed this task as Resolved. Sep 20 2023, 10:56 AM

Maintenance_bot removed a project: Patch-For-Review. Sep 20 2023, 11:10 AM

aborrero mentioned this in T341060: openstack eqiad1: introduce cloud-private and cloudlb. Sep 20 2023, 12:01 PM

aborrero removed a subtask: T346426: Some VPS instances still using ns-recursor0. Sep 20 2023, 12:25 PM

fnegri moved this task from Backlog to Done on the cloud-services-team (FY2023/2024-Q1-Q2) board. Oct 3 2023, 6:00 PM

fnegri closed subtask T346385: cloudservices1006 using 10. address to send DNS NOTIFYs to cloudservices1005 as Resolved. Nov 10 2023, 4:58 PM