eqiad1: cloudlb: reimage cloudcontrol1005 into new network setup
Closed, Resolved (Public)

Assigned To

Authored By

	aborrero
	Jul 10 2023, 4:38 PM

Description

For T341060 (openstack eqiad1: introduce cloud-private and cloudlb), this task tracks the work to re-rack, reimage, and rename cloudcontrol1005: https://netbox.wikimedia.org/dcim/devices/2613/

From:

cloudcontrol1005.wikimedia.org @ rack eqiad C5 and connected to asw switch

To:

cloudcontrol1005.eqiad.wmnet @ rack eqiad C8 and connected to cloudsw

Procedure: https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Rename_while_reimaging

decommission
re-rack into eqiad C8
netbox: edit the device name, and set its status from DECOMMISSIONING to PLANNED.
re-add the DNS Name field for the management interface
run sre.dns.netbox cookbook
run sre.network.configure-switch-interfaces cookbook
reimage server with new name
verify services are mostly in good shape

Details

Subject	Repo	Branch	Lines +/-
eqiad cloudceph config: add cloudcontrol1005.eqiad.wmnet	operations/puppet	production	+1 -0
cookbooks: remove references to cloudcontrol1005.wikimedia.org	cloud/wmcs-cookbooks	main	+7 -7
clouceph: mon: enable more client networks	operations/puppet	production	+9 -6
eqiad1: depool cloudcontrol1005	operations/puppet	production	+2 -1
openstack: cloudcontrol1005: allow haproxy backend access by cloudlb nodes	operations/puppet	production	+7 -0
openstack: eqiad1: add initial settings for cloudcontrol1005 as functional node	operations/puppet	production	+2 -3
openstack: eqiad1: control: fix typo in cloud_private include	operations/puppet	production	+1 -1
openstack: eqiad1: don't deploy haproxy to cloudcontrol1005	operations/puppet	production	+7 -2
eqiad1: cloudcontrol1005: load cloud-private	operations/puppet	production	+6 -0
cloudcontrol1005: add role with new domain	operations/puppet	production	+5 -1
openstack: nova fullstack: updated harcoded access to the list of controllers	operations/puppet	production	+3 -3
eqiad1: decomission cloudcontrol1005.wikimedia.org	operations/puppet	production	+6 -14

Related Objects

Status	Assigned	Task
Resolved	aborrero	T296411 cloud: decide on general idea for having cloud-dedicated hardware provide service in the cloud realm & the internet
Resolved	aborrero	T297596 have cloud hardware servers in the cloud realm using a dedicated LB layer
Resolved	taavi	T341060 openstack eqiad1: introduce cloud-private and cloudlb
Resolved	aborrero	T341495 eqiad1: cloudlb: reimage cloudcontrol1005 into new network setup

Event Timeline


aborrero triaged this task as Medium priority. Jul 10 2023, 4:38 PM

aborrero updated the task description.

aborrero mentioned this in T341060: openstack eqiad1: introduce cloud-private and cloudlb. Jul 10 2023, 4:44 PM

aborrero mentioned this in T341494: cloud @ eqiad: hardware re-racking plan. Jul 11 2023, 8:46 AM

aborrero updated the task description. Jul 11 2023, 10:33 AM

aborrero renamed this task from "eqiad1: cloudlb project: reimage cloudcontrol1005 into new network setup" to "eqiad1: cloudlb: reimage cloudcontrol1005 into new network setup". Jul 14 2023, 12:25 PM

Change 938235 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] eqiad1: decomission cloudcontrol1005.wikimedia.org

https://gerrit.wikimedia.org/r/938235

gerritbot added a project: Patch-For-Review. Jul 14 2023, 12:41 PM

aborrero moved this task from Backlog to Doing on the User-aborrero board. Jul 17 2023, 10:54 AM

Change 938235 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] eqiad1: decomission cloudcontrol1005.wikimedia.org

https://gerrit.wikimedia.org/r/938235

aborrero updated the task description. Jul 17 2023, 11:14 AM

Change 938831 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: nova fullstack: updated harcoded access to the list of controllers

https://gerrit.wikimedia.org/r/938831

Change 938831 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: nova fullstack: updated harcoded access to the list of controllers

https://gerrit.wikimedia.org/r/938831

Maintenance_bot removed a project: Patch-For-Review. Jul 17 2023, 11:30 AM

cookbooks.sre.hosts.decommission executed by aborrero@cumin1001 for hosts: cloudcontrol1005.wikimedia.org

cloudcontrol1005.wikimedia.org (WARN)
- Downtimed host on Icinga/Alertmanager
- Found physical host
- Management interface not found on Icinga, unable to downtime it
- Wiped all swraid, partition-table and filesystem signatures
- Powered off
- [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
- Configured the linked switch interface(s)
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB

@Jclark-ctr could you please re-rack this server into C8, connect it to cloudsw, and leave it ready for reimage?

aborrero updated the task description. Jul 17 2023, 12:15 PM

Mentioned in SAL (#wikimedia-cloud) [2023-07-17T15:55:41Z] <arturo> cloudcontrol1005 was shutdown earlier today (T341495)

aborrero reassigned this task from aborrero to Jclark-ctr. Jul 18 2023, 12:03 PM

aborrero added a project: ops-eqiad.

aborrero updated Other Assignee, removed: Jclark-ctr.

Maintenance_bot added a project: SRE. Jul 18 2023, 12:29 PM

hey @Jclark-ctr if you have more than one cloud-related task to do on-site, please give highest priority to this one. Thanks!

Jclark-ctr updated the task description. Jul 19 2023, 2:38 PM

Jclark-ctr updated the task description. Jul 19 2023, 2:55 PM

Jclark-ctr moved this task from Backlog to Blocked on the ops-eqiad board. Jul 19 2023, 3:28 PM

aborrero moved this task from Blocked to Next on the User-aborrero board. Jul 20 2023, 9:19 AM

the DC-ops part is done.

aborrero removed a project: SRE. Jul 20 2023, 2:46 PM

aborrero moved this task from Next to Doing on the User-aborrero board. Jul 20 2023, 3:55 PM

Change 940194 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudcontrol1005: add role with new domain

https://gerrit.wikimedia.org/r/940194

gerritbot added a project: Patch-For-Review. Jul 20 2023, 4:00 PM

hey @cmooney when you have a moment could you please check switch port configuration for this host?

I think something somewhere thinks this host should go into the public vlan:

aborrero@cumin1001:~ $ sudo cookbook sre.network.configure-switch-interfaces cloudcontrol1005
START - Cookbook sre.network.configure-switch-interfaces for host cloudcontrol1005
Network device returned invalid data: "<generator object ClusterShellWorker.get_results at 0x7f5066e4b890>". Error: 'NoneType' object is not subscriptable
----- OUTPUT of 'configure exclus...re;rollback;exit' -----
Entering configuration mode
warning: statement not found
[edit interfaces]
+   ge-0/0/4 {
+       description "cloudcontrol1005 {#20200233292}";
+       mtu 9192;
+       unit 0 {
+           family ethernet-switching {
+               interface-mode access;
+               vlan {
+                   members public1-c-eqiad;
+               }
+           }
+       }
+   }
load complete
Exiting configuration mode
================
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'configure exclus...re;rollback;exit'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
==> Commit the above change?
Type "go" to proceed or "abort" to interrupt the execution
> abort
User input is: "abort"
Exception raised while executing cookbook sre.network.configure-switch-interfaces:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 212, in run
    raw_ret = runner.run()
  File "/srv/deployment/spicerack/cookbooks/sre/network/configure-switch-interfaces.py", line 61, in run
    configure_switch_interfaces(self.remote, self.netbox, self.netbox_data, self.verbose)
  File "/srv/deployment/spicerack/cookbooks/sre/network/__init__.py", line 51, in configure_switch_interfaces
    run_junos_commands(remote_host, commands)
  File "/srv/deployment/spicerack/cookbooks/sre/network/__init__.py", line 175, in run_junos_commands
    ask_confirmation('Commit the above change?')
  File "/usr/lib/python3/dist-packages/wmflib/interactive.py", line 137, in ask_confirmation
    raise AbortError('Confirmation manually aborted')
wmflib.interactive.AbortError: Confirmation manually aborted
END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host cloudcontrol1005

Never mind @cmooney, this looks better now after a few changes on the netbox side:

aborrero@cumin1001:~$ sudo cookbook sre.network.configure-switch-interfaces cloudcontrol1005
START - Cookbook sre.network.configure-switch-interfaces for host cloudcontrol1005
Network device returned invalid data: "<generator object ClusterShellWorker.get_results at 0x7f21517e6890>". Error: 'NoneType' object is not subscriptable
----- OUTPUT of 'configure exclus...re;rollback;exit' -----
Entering configuration mode
warning: statement not found
[edit interfaces]
+   ge-0/0/4 {
+       description cloudcontrol1005;
+       mtu 9192;
+       unit 0 {
+           family ethernet-switching {
+               interface-mode access;
+               vlan {
+                   members cloud-hosts1-c8-eqiad;
+               }
+           }
+       }
+   }
load complete
Exiting configuration mode
================
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'configure exclus...re;rollback;exit'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
==> Commit the above change?
Type "go" to proceed or "abort" to interrupt the execution
> go
User input is: "go"
----- OUTPUT of 'configure exclus...confirmed 1;exit' -----
Entering configuration mode
warning: statement not found
[edit interfaces]
+   ge-0/0/4 {
+       description cloudcontrol1005;
+       mtu 9192;
+       unit 0 {
+           family ethernet-switching {
+               interface-mode access;
+               vlan {
+                   members cloud-hosts1-c8-eqiad;
+               }
+           }
+       }
+   }
configuration check succeeds
commit confirmed will be automatically rolled back in 1 minutes unless confirmed
commit complete
Exiting configuration mode
================
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'configure exclus...confirmed 1;exit'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Commited the above change, needs to be confirmed
----- OUTPUT of 'configure;commit check;exit' -----
Entering configuration mode
configuration check succeeds
Exiting configuration mode
================
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'configure;commit check;exit'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Change confirmed
END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcontrol1005
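The only meaningful difference between the two proposed diffs above is the VLAN membership (`public1-c-eqiad` in the aborted run, `cloud-hosts1-c8-eqiad` in the committed one). As a rough illustration of checking this before typing "go", here is a minimal Python sketch that pulls the `members` lines out of a pasted diff. The function names and the trimmed sample diffs are hypothetical and are not part of the cookbook or of spicerack.

```python
import re


def proposed_vlans(diff_text: str) -> list[str]:
    """Return VLAN names found on added 'members <name>;' lines of a Junos diff."""
    return re.findall(r"^\+\s*members\s+([\w-]+);", diff_text, flags=re.MULTILINE)


def safe_to_commit(diff_text: str, expected_vlan: str) -> bool:
    """True only if the diff adds at least one VLAN and all of them are the expected one."""
    vlans = proposed_vlans(diff_text)
    return bool(vlans) and all(vlan == expected_vlan for vlan in vlans)


# Trimmed copies of the two diffs shown in the cookbook output above.
FIRST_ATTEMPT = """\
+   ge-0/0/4 {
+       unit 0 {
+           family ethernet-switching {
+               interface-mode access;
+               vlan {
+                   members public1-c-eqiad;
+               }
+           }
+       }
+   }
"""
SECOND_ATTEMPT = FIRST_ATTEMPT.replace("public1-c-eqiad", "cloud-hosts1-c8-eqiad")

if __name__ == "__main__":
    for label, diff in (("first", FIRST_ATTEMPT), ("second", SECOND_ATTEMPT)):
        print(label, proposed_vlans(diff), safe_to_commit(diff, "cloud-hosts1-c8-eqiad"))
```

With the expected VLAN set to `cloud-hosts1-c8-eqiad`, the first diff is rejected and the second is accepted, which matches the "abort" and "go" answers given in the two runs.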

Change 940194 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudcontrol1005: add role with new domain

https://gerrit.wikimedia.org/r/940194

Change 940197 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] eqiad1: cloudcontrol1005: load cloud-private

https://gerrit.wikimedia.org/r/940197

aborrero updated the task description. Jul 20 2023, 4:30 PM

Change 940197 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] eqiad1: cloudcontrol1005: load cloud-private

https://gerrit.wikimedia.org/r/940197

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudcontrol1005.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudcontrol1005.eqiad.wmnet with OS bullseye executed with errors:

cloudcontrol1005 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudcontrol1005.eqiad.wmnet with OS bullseye

Maintenance_bot removed a project: Patch-For-Review. Jul 20 2023, 5:10 PM

fyi @aborrero I just now downtimed this host in icinga until the 31st due to getting a page. Feel free to delete the downtimes when the reimaging is done.

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudcontrol1005.eqiad.wmnet with OS bullseye executed with errors:

cloudcontrol1005 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202307201655_aborrero_2612776_cloudcontrol1005.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- The reimage failed, see the cookbook logs for the details

Change 940322 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: eqiad1: don't deploy haproxy to cloudcontrol1005

https://gerrit.wikimedia.org/r/940322

Change 940322 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: eqiad1: don't deploy haproxy to cloudcontrol1005

https://gerrit.wikimedia.org/r/940322

Maintenance_bot removed a project: Patch-For-Review. Jul 21 2023, 8:31 AM

Change 940324 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: eqiad1: control: fix typo in cloud_private include

https://gerrit.wikimedia.org/r/940324

Change 940324 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: eqiad1: control: fix typo in cloud_private include

https://gerrit.wikimedia.org/r/940324

Mentioned in SAL (#wikimedia-operations) [2023-07-21T08:45:01Z] <aborrero@cumin1001> START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "cloudcontrol1005 - aborrero@cumin1001 - T341495"

Mentioned in SAL (#wikimedia-operations) [2023-07-21T08:45:45Z] <aborrero@cumin1001> END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "cloudcontrol1005 - aborrero@cumin1001 - T341495"

Mentioned in SAL (#wikimedia-operations) [2023-07-21T08:47:25Z] <aborrero@cumin1001> START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "cloudcontrol1005 - aborrero@cumin1001 - T341495"

Mentioned in SAL (#wikimedia-operations) [2023-07-21T08:53:46Z] <aborrero@cumin1001> END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "cloudcontrol1005 - aborrero@cumin1001 - T341495"

Maintenance_bot removed a project: Patch-For-Review. Jul 21 2023, 9:10 AM

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudcontrol1005.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudcontrol1005.eqiad.wmnet with OS bullseye executed with errors:

cloudcontrol1005 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run failed and the operator aborted
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudcontrol1005.eqiad.wmnet with OS bullseye

Change 940336 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: eqiad1: add initial settings for cloudcontrol1005 as functional node

https://gerrit.wikimedia.org/r/940336

gerritbot added a project: Patch-For-Review. Jul 21 2023, 11:05 AM

Change 940336 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: eqiad1: add initial settings for cloudcontrol1005 as functional node

https://gerrit.wikimedia.org/r/940336

Maintenance_bot removed a project: Patch-For-Review. Jul 21 2023, 11:10 AM

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudcontrol1005.eqiad.wmnet with OS bullseye completed:

cloudcontrol1005 (WARN)
- Downtimed on Icinga/Alertmanager
- Unable to disable Puppet, the host may have been unreachable
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202307211016_aborrero_2822281_cloudcontrol1005.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Updated Netbox data from PuppetDB
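The reimage summaries posted on this task all follow the same shape: a `host (STATUS)` header followed by `- ` step lines. When comparing several attempts, a small parser can help find where each run stopped. The following is a minimal Python sketch under that assumption. The class and function names are hypothetical, the sample is an abridged excerpt of the third failed run above, and the format is inferred only from the summaries in this task.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class CookbookSummary:
    host: str
    status: str
    steps: list


def parse_summary(text: str) -> CookbookSummary:
    """Parse a 'host (STATUS)' header followed by '- step' lines."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    header = re.fullmatch(r"(\S+) \((\w+)\)", lines[0])
    if header is None:
        raise ValueError(f"unexpected header line: {lines[0]!r}")
    steps = [line[2:] for line in lines[1:] if line.startswith("- ")]
    return CookbookSummary(header.group(1), header.group(2), steps)


def last_step_before_failure(summary: CookbookSummary) -> Optional[str]:
    """For a FAIL summary, return the last step reported before the generic failure line."""
    if summary.status != "FAIL":
        return None
    specific = [s for s in summary.steps if not s.startswith("The reimage failed")]
    return specific[-1] if specific else None


# Abridged excerpt of a failed run posted above.
SAMPLE = """\
cloudcontrol1005 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Host up (new fresh bullseye OS)
- First Puppet run failed and the operator aborted
- The reimage failed, see the cookbook logs for the details
"""

if __name__ == "__main__":
    summary = parse_summary(SAMPLE)
    print(summary.host, summary.status, last_step_before_failure(summary))
```

Note that the last reported step is not always the failing one: in the first failed run the final specific step is a successful check, so the cookbook logs remain the authoritative source.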

Change 940342 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: cloudcontrol1005: allow haproxy backend access by cloudlb nodes

https://gerrit.wikimedia.org/r/940342

gerritbot added a project: Patch-For-Review. Jul 21 2023, 11:58 AM

Change 940342 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: cloudcontrol1005: allow haproxy backend access by cloudlb nodes

https://gerrit.wikimedia.org/r/940342

Change 940344 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] eqiad1: depool cloudcontrol1005

https://gerrit.wikimedia.org/r/940344

Change 940930 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] clouceph: mon: enable more client networks

https://gerrit.wikimedia.org/r/940930

Change 940930 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] clouceph: mon: enable more client networks

https://gerrit.wikimedia.org/r/940930

aborrero updated the task description. Jul 24 2023, 12:37 PM

aborrero updated the task description.

As of this writing, the node is healthy. It works as a backend to the new cloudlbs, and the only backends haproxy reports as down are the designate endpoints on the cloudservices nodes, which is expected.

aborrero@cloudlb1001:~ $ /usr/local/lib/nagios/plugins/check_haproxy --check=someup
OK check_someup servers up 12 down 3:
designate-api_backend,cloudservices1004.wikimedia.org
designate-api_backend,cloudservices1005.wikimedia.org
designate-api_backend,BACKEND

aborrero@cloudlb1002:~ $ /usr/local/lib/nagios/plugins/check_haproxy --check=someup
OK check_someup servers up 12 down 3:
designate-api_backend,cloudservices1004.wikimedia.org
designate-api_backend,cloudservices1005.wikimedia.org
designate-api_backend,BACKEND
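The "only designate is down" reading of the output above can be checked mechanically. Below is a minimal Python sketch that parses the `check_someup` text and flags any down entry outside an allowed set of backends. The function names are hypothetical, and the output format is assumed from the two samples pasted in this comment rather than from the plugin's documentation.

```python
import re


def parse_someup(output: str):
    """Parse 'STATE check_someup servers up N down M:' plus 'backend,server' lines."""
    lines = [line.strip() for line in output.strip().splitlines() if line.strip()]
    match = re.match(r"(\w+) check_someup servers up (\d+) down (\d+):", lines[0])
    if match is None:
        raise ValueError(f"unexpected first line: {lines[0]!r}")
    state, up, down = match.group(1), int(match.group(2)), int(match.group(3))
    entries = []
    for line in lines[1:]:
        backend, _, server = line.partition(",")
        entries.append((backend, server))
    return state, up, down, entries


def unexpected_down(entries, expected_backends):
    """Return the down entries whose backend is not in the expected set."""
    return [(b, s) for b, s in entries if b not in expected_backends]


# Output copied from the cloudlb1001 run above.
SAMPLE = """\
OK check_someup servers up 12 down 3:
designate-api_backend,cloudservices1004.wikimedia.org
designate-api_backend,cloudservices1005.wikimedia.org
designate-api_backend,BACKEND
"""

if __name__ == "__main__":
    state, up, down, entries = parse_someup(SAMPLE)
    print(state, up, down, unexpected_down(entries, {"designate-api_backend"}))
```

An empty list from `unexpected_down` corresponds to the situation described in the comment: every down entry belongs to `designate-api_backend`.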

Change 942385 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/wmcs-cookbooks@main] cookbooks: remove references to cloudcontrol1005.wikimedia.org

https://gerrit.wikimedia.org/r/942385

Change 942385 merged by Arturo Borrero Gonzalez:

[cloud/wmcs-cookbooks@main] cookbooks: remove references to cloudcontrol1005.wikimedia.org

https://gerrit.wikimedia.org/r/942385

aborrero mentioned this in rCCKB4ee1132f05af: cookbooks: remove references to cloudcontrol1005.wikimedia.org. Jul 27 2023, 11:14 AM

fnegri moved this task from Backlog to Done on the cloud-services-team (FY2022/2023-Q4) board. Jul 27 2023, 3:13 PM

Change 947886 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] eqiad cloudceph config: add cloudcontrol1005.eqiad.wmnet

https://gerrit.wikimedia.org/r/947886

Change 947886 merged by Andrew Bogott:

[operations/puppet@production] eqiad cloudceph config: add cloudcontrol1005.eqiad.wmnet

https://gerrit.wikimedia.org/r/947886
