cloudservices1005: move to new setup
Closed, Resolved, Public

Description

cloudservices1005 is moving to a new network setup.

We should:

  • drop wikimedia.org domain in favor of .eqiad.wmnet.
  • drop connection to asw
  • add private.eqiad.wikimedia.cloud address

Following the procedure at https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Rename_while_reimaging

Event Timeline

so here is the plan:

  1. we will let the markmonitor updates in T346177: Certain systems failing to resolve DNS entries under toolforge.org, wmcloud.org, wmflabs.org, toolserver.org get applied, for both ns1 and ns0 (as of this writing, we are requesting ns0 to point to 208.80.154.148).
  2. we will route 208.80.154.148 to cloudservices1006 and NAT the address in that box to 185.15.56.163 so the pdns_server can handle the requests
  3. this will get both ns0 (208.80.154.148) and ns1 (185.15.56.163) running and serving clients, in a somewhat stable fashion
  4. we will then decom and re-rack cloudservices1005 in the datacenter
  5. when cloudservices1005 is back online, in the new network setup, it should listen on 185.15.56.162
  6. we will request markmonitor to update ns0 to point to 185.15.56.162

As of this writing, points 1 to 3 are being implemented by @cmooney and me.

Update: the plan outlined above may not work after all. We just did step 2 and things broke, notably the resolver stopped working.

We are investigating.

To clarify: we won't be shutting down cloudservices1005 today.

@aborrero thanks for the update. Please advise when we are good to proceed.

Thanks @aborrero, that plan is what I expected based on our chat earlier.

Re step 2, we should retry when we think we've ironed out our niggles. Re-routing the IP to cloudservices1006, and reverting if needs be, is probably better than running decom for cloudservices1005 and trying to roll back.

@taavi wrote:

   NEXT STEPS

I think the steps to complete this migration without any further user impact are roughly the following:
1. Make cloudservices1006 also answer queries for 185.15.56.162 (new ns0).
2. Update both the wikimediacloud.org zone file and the .org glue records to reference .162 as the ns0 record.
3. Wait for all of the DNS TTLs for ns0 to expire.
4. Revert the routing hacks for 208.80.154.11. Also remove the Netbox record for it.
5. Move cloudservices1005 to the cloudlb network setup.
6. Move .162 (new ns0) from cloudservices1006 to 1005.

@taavi @aborrero that's not a bad plan of action at all.

In terms of step 4 I'm not sure we need to hold off, but in general there is no rush so that's fine.

What's the idea in terms of making cloudservices1006 respond for 185.15.56.162? Do we need to do something similar to what was done for 208.80.154.11? Or can you add that to the 'lo' interface on the box, get pdns to listen on it and BIRD to announce in BGP?

Change 957722 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudservices1006: make pdns auth listen on the new ns0.openstack address

https://gerrit.wikimedia.org/r/957722

Change 957722 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudservices1006: make pdns auth listen on the new ns0.openstack address

https://gerrit.wikimedia.org/r/957722

Mentioned in SAL (#wikimedia-cloud) [2023-09-14T12:11:12Z] <arturo> enable puppet on cloudservices1006 to drop local NAT hacks and enable new DNS auth IP address (T346042)

cloudservices1006 is now replying to DNS auth queries on the 185.15.56.162 address, which will later be handed over to cloudservices1005:

arturo@nostromo:~ $ dig @185.15.56.162 k8s.svc.tools.eqiad1.wikimedia.cloud

; <<>> DiG 9.18.16-1-Debian <<>> @185.15.56.162 k8s.svc.tools.eqiad1.wikimedia.cloud
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 13962
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;k8s.svc.tools.eqiad1.wikimedia.cloud. IN A

;; ANSWER SECTION:
k8s.svc.tools.eqiad1.wikimedia.cloud. 300 IN A	172.16.6.113

;; Query time: 116 msec
;; SERVER: 185.15.56.162#53(185.15.56.162) (UDP)
;; WHEN: Thu Sep 14 14:44:22 CEST 2023
;; MSG SIZE  rcvd: 81

arturo@nostromo:~ $ dig @185.15.56.162 toolforge.org

; <<>> DiG 9.18.16-1-Debian <<>> @185.15.56.162 toolforge.org
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 8114
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;toolforge.org.			IN	A

;; ANSWER SECTION:
toolforge.org.		3600	IN	A	185.15.56.11

;; Query time: 108 msec
;; SERVER: 185.15.56.162#53(185.15.56.162) (UDP)
;; WHEN: Thu Sep 14 14:44:28 CEST 2023
;; MSG SIZE  rcvd: 58
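
The two checks above were read by eye. A rough sketch, not part of the original procedure, of how the ANSWER SECTION of such dig output could be parsed so the same verification can be repeated after each routing change; the `Record` type and `parse_answer_section` helper are hypothetical names introduced here for illustration:

```python
from typing import NamedTuple


class Record(NamedTuple):
    name: str
    ttl: int
    rclass: str
    rtype: str
    value: str


def parse_answer_section(dig_output: str) -> list[Record]:
    """Extract resource records from the ANSWER SECTION of dig's text output."""
    records = []
    in_answer = False
    for line in dig_output.splitlines():
        stripped = line.strip()
        if stripped.startswith(";; ANSWER SECTION:"):
            in_answer = True
            continue
        if in_answer:
            # A blank line or the next ";" comment line ends the section.
            if not stripped or stripped.startswith(";"):
                break
            name, ttl, rclass, rtype, value = stripped.split(None, 4)
            records.append(Record(name, int(ttl), rclass, rtype, value))
    return records


# Excerpt of the second query shown above.
sample = """\
;; QUESTION SECTION:
;toolforge.org.   IN A

;; ANSWER SECTION:
toolforge.org.  3600 IN A 185.15.56.11

;; Query time: 108 msec
"""
answers = parse_answer_section(sample)
print(answers)
```

Feeding it the output of both queries and comparing the parsed records against the expected addresses gives a repeatable pass/fail check instead of a visual one.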

Mentioned in SAL (#wikimedia-cloud) [2023-09-14T14:42:05Z] <arturo> DNS operation: route 208.80.154.148 to cloudservices1006 in anticipation of cloudservices1005 decom (T346042)

Icinga downtime and Alertmanager silence (ID=fd552e4c-12f5-4380-9775-a70e560609fd) set by cmooney@cumin1001 for 2:00:00 on 1 host(s) and their services with reason: test before full decom

cloudservices1005.wikimedia.org

Change 957767 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/dns@master] wikimediacloud.org: drop 208.80.154.148 from ns0.openstack

https://gerrit.wikimedia.org/r/957767

Change 957767 abandoned by Cathal Mooney:

[operations/dns@master] wikimediacloud.org: drop 208.80.154.148 from ns0.openstack

Reason:

Parent change to remove ns-recursor0 IP is not something we need to do before this, submitting separate CR just for this.

https://gerrit.wikimedia.org/r/957767

Update on current progress on above steps:

1. Make cloudservices1006 also answer queries for 185.15.56.162 (new ns0).
2. Update both the wikimediacloud.org zone file and the .org glue records to reference .162 as the ns0 record.
3. Wait for all of the DNS TTLs for ns0 to expire.
4. Revert the routing hacks for 208.80.154.11. Also remove the Netbox record for it.

5. Move cloudservices1005 to the cloudlb network setup.
6. Move .162 (new ns0) from cloudservices1006 to 1005.

I added sub-task T346385, based on some learnings from earlier that we need to factor into the remaining steps 5 and 6.

@aborrero most (if not all) of the Toolforge tools are throwing 504's (T346126). Seems related to this. Is there any way we can fix this?

Not related.

hey @Jclark-ctr we are now ready to do this migration next week. Starting next Monday, 2023-09-18, when could you handle this?

Again, my plan is to shut down the server before you arrive in the DC.

aborrero moved this task from Next to Doing on the User-aborrero board.
aborrero moved this task from Doing to Next on the User-aborrero board.

Mentioned in SAL (#wikimedia-cloud) [2023-09-18T11:40:52Z] <arturo> decomission cloudservices1005 T346042 in preparation for re-racking

cookbooks.sre.hosts.decommission executed by aborrero@cumin1001 for hosts: cloudservices1005.wikimedia.org

  • cloudservices1005.wikimedia.org (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

Mentioned in SAL (#wikimedia-cloud) [2023-09-18T11:53:39Z] <taavi> update designate urls in Keystone to point to openstack-next, until cloudlb is serving the main openstack address T346042

Change 958465 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: drop references to cloudcontrol1005

https://gerrit.wikimedia.org/r/958465

Change 958465 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: drop references to cloudcontrol1005

https://gerrit.wikimedia.org/r/958465

Change 958467 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: remove overrides for designate_hosts

https://gerrit.wikimedia.org/r/958467

Change 958467 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: remove overrides for designate_hosts

https://gerrit.wikimedia.org/r/958467

hey @Jclark-ctr (or @VRiley-WMF) this server should be ready to be re-racked into rack D5.

@aborrero I have moved the server physically and in Netbox. I did not delete any interfaces from Netbox. New cable ID: 20220117, port 41 on cloudsw1-d5-eqiad.

Change 958904 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: eqiad1: pdns: refactor monitor checks

https://gerrit.wikimedia.org/r/958904

Change 958904 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: eqiad1: pdns: refactor monitor checks

https://gerrit.wikimedia.org/r/958904

Change 958915 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudservices1005: prepare for reimage and back into service

https://gerrit.wikimedia.org/r/958915

I misclicked on netbox and deleted the whole device entry for cloudservices1005, meaning it is no longer registered on netbox. Transaction changelog: https://netbox.wikimedia.org/extras/changelog/?request_id=2c28eb69-41e4-4907-b745-f4b2113cb369

I'm currently trying to roll back/undo/etc.

Change 958915 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudservices1005: prepare for reimage and back into service

https://gerrit.wikimedia.org/r/958915

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudservices1005.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudservices1005.eqiad.wmnet with OS bullseye executed with errors:

  • cloudservices1005 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudservices1005.eqiad.wmnet with OS bullseye

Change 959168 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: eqiad1: pdns: recursor: fix list of pdns hosts

https://gerrit.wikimedia.org/r/959168

Change 959168 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: eqiad1: pdns: recursor: fix list of pdns hosts

https://gerrit.wikimedia.org/r/959168

Change 959171 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: eqiad1: services: enable cloud-private for cloudservices1005

https://gerrit.wikimedia.org/r/959171

Change 959171 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: eqiad1: services: enable cloud-private for cloudservices1005

https://gerrit.wikimedia.org/r/959171

Mentioned in SAL (#wikimedia-operations) [2023-09-20T09:34:24Z] <aborrero@cumin1001> START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "cloudservices1005 - aborrero@cumin1001 - T346042"

Mentioned in SAL (#wikimedia-operations) [2023-09-20T09:35:11Z] <aborrero@cumin1001> END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "cloudservices1005 - aborrero@cumin1001 - T346042"

Mentioned in SAL (#wikimedia-operations) [2023-09-20T09:38:46Z] <aborrero@cumin1001> START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "cloudservices1005 - aborrero@cumin1001 - T346042"

Mentioned in SAL (#wikimedia-operations) [2023-09-20T09:39:46Z] <aborrero@cumin1001> END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "cloudservices1005 - aborrero@cumin1001 - T346042"

Change 959175 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudservices1006: stop announcing ns0.openstack.eqiad1.wikimediacloud.org

https://gerrit.wikimedia.org/r/959175

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudservices1005.eqiad.wmnet with OS bullseye completed:

  • cloudservices1005 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309200913_aborrero_18549_cloudservices1005.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Mentioned in SAL (#wikimedia-cloud) [2023-09-20T10:06:03Z] <arturo> running SQL command update domains set master="185.15.56.162:5354 185.15.56.163:5354" on cloudservices1005/1005 (T346042)

Mentioned in SAL (#wikimedia-cloud) [2023-09-20T10:28:20Z] <arturo> running SQL command update domains set master="172.20.1.5:5354 172.20.2.4:5354 185.15.56.162:5354 185.15.56.163:5354"; on cloudservices1005/1006 (T346042)
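
The two logged SQL commands have no WHERE clause, so each one rewrites the master list of every row in the `domains` table. A minimal sketch of that behaviour using an in-memory SQLite database; the table layout and the sample rows are hypothetical stand-ins, and the real backend database is not described in this task:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, minimal stand-in for the `domains` table named in the log.
conn.execute(
    "CREATE TABLE domains (id INTEGER PRIMARY KEY, name TEXT, master TEXT, type TEXT)"
)
conn.executemany(
    "INSERT INTO domains (name, master, type) VALUES (?, ?, ?)",
    [
        ("example-a.invalid", "192.0.2.1:5354", "SLAVE"),
        ("example-b.invalid", "192.0.2.1:5354", "SLAVE"),
    ],
)

# Master list taken from the second logged command.
new_masters = "172.20.1.5:5354 172.20.2.4:5354 185.15.56.162:5354 185.15.56.163:5354"
# No WHERE clause, as in the logged command: every row is rewritten.
cur = conn.execute("UPDATE domains SET master = ?", (new_masters,))
conn.commit()

updated_rows = cur.rowcount
masters = {row[0] for row in conn.execute("SELECT master FROM domains")}
print(updated_rows, masters)
```

Because the statement replaces the whole column value, the second command (which adds the 172.20.x.x addresses) supersedes the first rather than appending to it.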

Change 959175 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudservices1006: stop announcing ns0.openstack.eqiad1.wikimediacloud.org

https://gerrit.wikimedia.org/r/959175