eqiad1: cloudlb: reimage cloudcontrol1005 into new network setup
Closed, Resolved, Public

Description

For T341060 (openstack eqiad1: introduce cloud-private and cloudlb), this task tracks the work to re-rack / reimage / rename cloudcontrol1005: https://netbox.wikimedia.org/dcim/devices/2613/

From:

  • cloudcontrol1005.wikimedia.org @ rack eqiad C5 and connected to asw switch

To:

  • cloudcontrol1005.eqiad.wmnet @ rack eqiad C8 and connected to cloudsw
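
The rename above can be illustrated with a minimal sketch. It assumes that only the domain suffix changes and that the short hostname is kept, which is what the From/To pair shows; the helper name `renamed_fqdn` is hypothetical and is not part of any cookbook.

```python
# Assumption: the move only swaps the public domain for the private one.
OLD_DOMAIN = "wikimedia.org"
NEW_DOMAIN = "eqiad.wmnet"


def renamed_fqdn(fqdn: str) -> str:
    """Return the new FQDN for a host that keeps its short name."""
    host, _, domain = fqdn.partition(".")
    if domain != OLD_DOMAIN:
        # Refuse anything that is not in the old public domain.
        raise ValueError(f"unexpected domain: {domain!r}")
    return f"{host}.{NEW_DOMAIN}"


print(renamed_fqdn("cloudcontrol1005.wikimedia.org"))
```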

Procedure: https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Rename_while_reimaging

  • decommission
  • re-rack into eqiad C8
  • netbox: edit the device name, and set its status from DECOMMISSIONING to PLANNED.
  • re-add the DNS Name field for the management interface
  • run sre.dns.netbox cookbook
  • run sre.network.configure-switch-interfaces cookbook
  • reimage server with new name
  • verify services are mostly in good shape

Event Timeline

aborrero triaged this task as Medium priority.Jul 10 2023, 4:38 PM
aborrero updated the task description.
aborrero updated the task description.
aborrero renamed this task from eqiad1: cloudlb project: reimage cloudcontrol1005 into new network setup to eqiad1: cloudlb: reimage cloudcontrol1005 into new network setup.Jul 14 2023, 12:25 PM

Change 938235 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] eqiad1: decomission cloudcontrol1005.wikimedia.org

https://gerrit.wikimedia.org/r/938235

Change 938235 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] eqiad1: decomission cloudcontrol1005.wikimedia.org

https://gerrit.wikimedia.org/r/938235

Change 938831 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: nova fullstack: updated harcoded access to the list of controllers

https://gerrit.wikimedia.org/r/938831

Change 938831 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: nova fullstack: updated harcoded access to the list of controllers

https://gerrit.wikimedia.org/r/938831

cookbooks.sre.hosts.decommission executed by aborrero@cumin1001 for hosts: cloudcontrol1005.wikimedia.org

  • cloudcontrol1005.wikimedia.org (WARN)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
aborrero updated Other Assignee, added: Jclark-ctr.
aborrero moved this task from Doing to Radar on the User-aborrero board.
aborrero added a subscriber: Jclark-ctr.

Could you please @Jclark-ctr re-rack this server into C8, connect to cloudsw and leave it ready for reimage?

Mentioned in SAL (#wikimedia-cloud) [2023-07-17T15:55:41Z] <arturo> cloudcontrol1005 was shutdown earlier today (T341495)

aborrero added a project: ops-eqiad.
aborrero updated Other Assignee, removed: Jclark-ctr.

hey @Jclark-ctr, if you have more than one cloud-related task to do on-site, please give highest priority to this one. Thanks!

aborrero removed a project: ops-eqiad.

the DC-ops part is done.

Change 940194 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudcontrol1005: add role with new domain

https://gerrit.wikimedia.org/r/940194

hey @cmooney when you have a moment could you please check switch port configuration for this host?

I think something somewhere thinks this should go into the public vlan:

aborrero@cumin1001:~ $ sudo cookbook sre.network.configure-switch-interfaces cloudcontrol1005
START - Cookbook sre.network.configure-switch-interfaces for host cloudcontrol1005
Network device returned invalid data: "<generator object ClusterShellWorker.get_results at 0x7f5066e4b890>". Error: 'NoneType' object is not subscriptable
----- OUTPUT of 'configure exclus...re;rollback;exit' -----
Entering configuration mode
warning: statement not found
[edit interfaces]
+   ge-0/0/4 {
+       description "cloudcontrol1005 {#20200233292}";
+       mtu 9192;
+       unit 0 {
+           family ethernet-switching {
+               interface-mode access;
+               vlan {
+                   members public1-c-eqiad;
+               }
+           }
+       }
+   }
load complete
Exiting configuration mode
================
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'configure exclus...re;rollback;exit'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
==> Commit the above change?
Type "go" to proceed or "abort" to interrupt the execution
> abort
User input is: "abort"
Exception raised while executing cookbook sre.network.configure-switch-interfaces:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 212, in run
    raw_ret = runner.run()
  File "/srv/deployment/spicerack/cookbooks/sre/network/configure-switch-interfaces.py", line 61, in run
    configure_switch_interfaces(self.remote, self.netbox, self.netbox_data, self.verbose)
  File "/srv/deployment/spicerack/cookbooks/sre/network/__init__.py", line 51, in configure_switch_interfaces
    run_junos_commands(remote_host, commands)
  File "/srv/deployment/spicerack/cookbooks/sre/network/__init__.py", line 175, in run_junos_commands
    ask_confirmation('Commit the above change?')
  File "/usr/lib/python3/dist-packages/wmflib/interactive.py", line 137, in ask_confirmation
    raise AbortError('Confirmation manually aborted')
wmflib.interactive.AbortError: Confirmation manually aborted
END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host cloudcontrol1005
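
The manual check that led to the abort above (is the proposed VLAN the one expected for this host?) could be sketched roughly as follows. This is a hypothetical helper, not part of the cookbook; it assumes the diff format shown in the output above and assumes that a host connected to cloudsw should only be placed in a VLAN whose name starts with `cloud-hosts`.

```python
import re


def proposed_vlans(diff_text: str) -> list[str]:
    """Extract VLAN names from added 'members' lines of a Junos-style diff."""
    return re.findall(r"^\+\s*members\s+([\w-]+);", diff_text, flags=re.MULTILINE)


# Excerpt of the diff printed by the first cookbook run.
diff = """\
+   ge-0/0/4 {
+       description "cloudcontrol1005 {#20200233292}";
+       mtu 9192;
+       unit 0 {
+           family ethernet-switching {
+               interface-mode access;
+               vlan {
+                   members public1-c-eqiad;
+               }
+           }
+       }
+   }
"""

vlans = proposed_vlans(diff)
# Assumption: anything outside the cloud-hosts VLANs is unexpected here.
unexpected = [v for v in vlans if not v.startswith("cloud-hosts")]
print(vlans, unexpected)
```

A non-empty `unexpected` list would be the signal to answer "abort" and review the Netbox data before retrying.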

Never mind @cmooney, this looks better now after a few changes on the Netbox side:

aborrero@cumin1001:~$ sudo cookbook sre.network.configure-switch-interfaces cloudcontrol1005
START - Cookbook sre.network.configure-switch-interfaces for host cloudcontrol1005
Network device returned invalid data: "<generator object ClusterShellWorker.get_results at 0x7f21517e6890>". Error: 'NoneType' object is not subscriptable
----- OUTPUT of 'configure exclus...re;rollback;exit' -----
Entering configuration mode
warning: statement not found
[edit interfaces]
+   ge-0/0/4 {
+       description cloudcontrol1005;
+       mtu 9192;
+       unit 0 {
+           family ethernet-switching {
+               interface-mode access;
+               vlan {
+                   members cloud-hosts1-c8-eqiad;
+               }
+           }
+       }
+   }
load complete
Exiting configuration mode
================
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'configure exclus...re;rollback;exit'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
==> Commit the above change?
Type "go" to proceed or "abort" to interrupt the execution
> go
User input is: "go"
----- OUTPUT of 'configure exclus...confirmed 1;exit' -----
Entering configuration mode
warning: statement not found
[edit interfaces]
+   ge-0/0/4 {
+       description cloudcontrol1005;
+       mtu 9192;
+       unit 0 {
+           family ethernet-switching {
+               interface-mode access;
+               vlan {
+                   members cloud-hosts1-c8-eqiad;
+               }
+           }
+       }
+   }
configuration check succeeds
commit confirmed will be automatically rolled back in 1 minutes unless confirmed
commit complete
Exiting configuration mode
================
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'configure exclus...confirmed 1;exit'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Commited the above change, needs to be confirmed
----- OUTPUT of 'configure;commit check;exit' -----
Entering configuration mode
configuration check succeeds
Exiting configuration mode
================
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'configure;commit check;exit'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Change confirmed
END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcontrol1005

Change 940194 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudcontrol1005: add role with new domain

https://gerrit.wikimedia.org/r/940194

Change 940197 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] eqiad1: cloudcontrol1005: load cloud-private

https://gerrit.wikimedia.org/r/940197

Change 940197 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] eqiad1: cloudcontrol1005: load cloud-private

https://gerrit.wikimedia.org/r/940197

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudcontrol1005.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudcontrol1005.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcontrol1005 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudcontrol1005.eqiad.wmnet with OS bullseye

fyi @aborrero I just now downtimed this host in icinga until the 31st due to getting a page. Feel free to delete the downtimes when the reimaging is done.

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudcontrol1005.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcontrol1005 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202307201655_aborrero_2612776_cloudcontrol1005.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • The reimage failed, see the cookbook logs for the details

Change 940322 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: eqiad1: don't deploy haproxy to cloudcontrol1005

https://gerrit.wikimedia.org/r/940322

Change 940322 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: eqiad1: don't deploy haproxy to cloudcontrol1005

https://gerrit.wikimedia.org/r/940322

Change 940324 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: eqiad1: control: fix typo in cloud_private include

https://gerrit.wikimedia.org/r/940324

Change 940324 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: eqiad1: control: fix typo in cloud_private include

https://gerrit.wikimedia.org/r/940324

Mentioned in SAL (#wikimedia-operations) [2023-07-21T08:45:01Z] <aborrero@cumin1001> START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "cloudcontrol1005 - aborrero@cumin1001 - T341495"

Mentioned in SAL (#wikimedia-operations) [2023-07-21T08:45:45Z] <aborrero@cumin1001> END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "cloudcontrol1005 - aborrero@cumin1001 - T341495"

Mentioned in SAL (#wikimedia-operations) [2023-07-21T08:47:25Z] <aborrero@cumin1001> START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "cloudcontrol1005 - aborrero@cumin1001 - T341495"

Mentioned in SAL (#wikimedia-operations) [2023-07-21T08:53:46Z] <aborrero@cumin1001> END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "cloudcontrol1005 - aborrero@cumin1001 - T341495"

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudcontrol1005.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudcontrol1005.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcontrol1005 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run failed and the operator aborted
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudcontrol1005.eqiad.wmnet with OS bullseye

Change 940336 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: eqiad1: add initial settings for cloudcontrol1005 as functional node

https://gerrit.wikimedia.org/r/940336

Change 940336 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: eqiad1: add initial settings for cloudcontrol1005 as functional node

https://gerrit.wikimedia.org/r/940336

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudcontrol1005.eqiad.wmnet with OS bullseye completed:

  • cloudcontrol1005 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202307211016_aborrero_2822281_cloudcontrol1005.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Change 940342 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: cloudcontrol1005: allow haproxy backend access by cloudlb nodes

https://gerrit.wikimedia.org/r/940342

Change 940342 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: cloudcontrol1005: allow haproxy backend access by cloudlb nodes

https://gerrit.wikimedia.org/r/940342

Change 940344 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] eqiad1: depool cloudcontrol1005

https://gerrit.wikimedia.org/r/940344

Change 940930 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] clouceph: mon: enable more client networks

https://gerrit.wikimedia.org/r/940930

Change 940930 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] clouceph: mon: enable more client networks

https://gerrit.wikimedia.org/r/940930

aborrero updated the task description.
aborrero updated the task description.

As of this writing, the node is healthy. It works as a backend to the new cloudlbs, and the only backends haproxy reports as down are the designate endpoints on the cloudservices nodes, which is expected.

aborrero@cloudlb1001:~ $ /usr/local/lib/nagios/plugins/check_haproxy --check=someup
OK check_someup servers up 12 down 3:
designate-api_backend,cloudservices1004.wikimedia.org
designate-api_backend,cloudservices1005.wikimedia.org
designate-api_backend,BACKEND

aborrero@cloudlb1002:~ $ /usr/local/lib/nagios/plugins/check_haproxy --check=someup
OK check_someup servers up 12 down 3:
designate-api_backend,cloudservices1004.wikimedia.org
designate-api_backend,cloudservices1005.wikimedia.org
designate-api_backend,BACKEND
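
The claim that only designate backends are down can be checked mechanically. The sketch below is a hypothetical parser for the plugin output pasted above; the format is inferred from these two samples only, not from the plugin's source.

```python
def parse_someup(output: str):
    """Return (up, down, [(backend, server), ...]) from check_haproxy output."""
    lines = [ln.strip() for ln in output.strip().splitlines() if ln.strip()]
    header, entries = lines[0], lines[1:]
    words = header.rstrip(":").split()
    # The header reads "... servers up <N> down <M>:".
    up = int(words[words.index("up") + 1])
    down = int(words[words.index("down") + 1])
    down_entries = [tuple(entry.split(",", 1)) for entry in entries]
    return up, down, down_entries


sample = """\
OK check_someup servers up 12 down 3:
designate-api_backend,cloudservices1004.wikimedia.org
designate-api_backend,cloudservices1005.wikimedia.org
designate-api_backend,BACKEND
"""

up, down, entries = parse_someup(sample)
# Expected state per the comment above: every down entry is a designate one.
only_designate = all(backend == "designate-api_backend" for backend, _ in entries)
print(up, down, only_designate)
```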

Change 942385 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/wmcs-cookbooks@main] cookbooks: remove references to cloudcontrol1005.wikimedia.org

https://gerrit.wikimedia.org/r/942385

Change 942385 merged by Arturo Borrero Gonzalez:

[cloud/wmcs-cookbooks@main] cookbooks: remove references to cloudcontrol1005.wikimedia.org

https://gerrit.wikimedia.org/r/942385

Change 947886 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] eqiad cloudceph config: add cloudcontrol1005.eqiad.wmnet

https://gerrit.wikimedia.org/r/947886

Change 947886 merged by Andrew Bogott:

[operations/puppet@production] eqiad cloudceph config: add cloudcontrol1005.eqiad.wmnet

https://gerrit.wikimedia.org/r/947886