eqiad1: cloudlb: reimage cloudcontrol1005 into new network setup
Closed, Resolved, Public

Description

For T341060 (openstack eqiad1: introduce cloud-private and cloudlb), this task tracks the work to re-rack / reimage / rename cloudcontrol1005: https://netbox.wikimedia.org/dcim/devices/2613/

From:

  • cloudcontrol1005.wikimedia.org @ rack eqiad C5 and connected to asw switch

To:

  • cloudcontrol1005.eqiad.wmnet @ rack eqiad C8 and connected to cloudsw

Procedure: https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Rename_while_reimaging

  • decommission
  • re-rack into eqiad C8
  • netbox: edit the device name, and set its status from DECOMMISSIONING to PLANNED.
  • re-add the DNS Name field for the management interface
  • run sre.dns.netbox cookbook
  • run sre.network.configure-switch-interfaces cookbook
  • reimage server with new name
  • verify services are mostly in good shape
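
The procedure above can be sketched as an ordered checklist that separates manual steps from cookbook runs. This is only an illustration: the cookbook names are the ones mentioned in this task, while the `Step` type, the `PROCEDURE` list and the `render` helper are hypothetical names introduced here. Cookbook arguments are deliberately left out because the task does not list them for every step.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Step:
    """One step of the rename-while-reimaging procedure (hypothetical type)."""
    description: str
    cookbook: Optional[str] = None  # None means the step is done by hand


# Order follows the procedure list in the task description.
PROCEDURE: List[Step] = [
    Step("decommission the host under its old name", "sre.hosts.decommission"),
    Step("re-rack into eqiad C8"),
    Step("netbox: rename the device, set status DECOMMISSIONING -> PLANNED"),
    Step("netbox: re-add the DNS Name field for the management interface"),
    Step("run the DNS cookbook", "sre.dns.netbox"),
    Step("configure the switch port", "sre.network.configure-switch-interfaces"),
    Step("reimage the server with the new name", "sre.hosts.reimage"),
    Step("verify services"),
]


def render(steps: List[Step]) -> List[str]:
    """Return one numbered line per step, marking it as manual or cookbook."""
    lines = []
    for number, step in enumerate(steps, start=1):
        kind = f"cookbook {step.cookbook}" if step.cookbook else "manual"
        lines.append(f"{number}. {step.description} [{kind}]")
    return lines


if __name__ == "__main__":
    print("\n".join(render(PROCEDURE)))
```

Keeping the manual Netbox edits in the same list as the cookbook runs makes it harder to skip them. In this task, the first switch-configuration attempt proposed the wrong VLAN until the Netbox side was corrected.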

Event Timeline

aborrero updated the task description.
aborrero renamed this task from "eqiad1: cloudlb project: reimage cloudcontrol1005 into new network setup" to "eqiad1: cloudlb: reimage cloudcontrol1005 into new network setup". Jul 14 2023, 12:25 PM

Change 938235 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] eqiad1: decomission cloudcontrol1005.wikimedia.org

https://gerrit.wikimedia.org/r/938235

Change 938235 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] eqiad1: decomission cloudcontrol1005.wikimedia.org

https://gerrit.wikimedia.org/r/938235

Change 938831 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: nova fullstack: updated harcoded access to the list of controllers

https://gerrit.wikimedia.org/r/938831

Change 938831 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: nova fullstack: updated harcoded access to the list of controllers

https://gerrit.wikimedia.org/r/938831

cookbooks.sre.hosts.decommission executed by aborrero@cumin1001 for hosts: cloudcontrol1005.wikimedia.org

  • cloudcontrol1005.wikimedia.org (WARN)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
aborrero updated Other Assignee, added: Jclark-ctr.
aborrero moved this task from Doing to Radar/observer on the User-aborrero board.
aborrero added a subscriber: Jclark-ctr.

@Jclark-ctr, could you please re-rack this server into C8, connect it to cloudsw, and leave it ready for reimage?

Mentioned in SAL (#wikimedia-cloud) [2023-07-17T15:55:41Z] <arturo> cloudcontrol1005 was shutdown earlier today (T341495)

aborrero added a project: ops-eqiad.
aborrero updated Other Assignee, removed: Jclark-ctr.

hey @Jclark-ctr, if you have more than one cloud-related task to do on-site, please give highest priority to this one. Thanks!

aborrero removed a project: ops-eqiad.

the DC-ops part is done.

Change 940194 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudcontrol1005: add role with new domain

https://gerrit.wikimedia.org/r/940194

hey @cmooney, when you have a moment, could you please check the switch port configuration for this host?

I think something somewhere thinks this should go into the public VLAN:

aborrero@cumin1001:~ $ sudo cookbook sre.network.configure-switch-interfaces cloudcontrol1005
START - Cookbook sre.network.configure-switch-interfaces for host cloudcontrol1005
Network device returned invalid data: "<generator object ClusterShellWorker.get_results at 0x7f5066e4b890>". Error: 'NoneType' object is not subscriptable
----- OUTPUT of 'configure exclus...re;rollback;exit' -----
Entering configuration mode
warning: statement not found
[edit interfaces]
+   ge-0/0/4 {
+       description "cloudcontrol1005 {#20200233292}";
+       mtu 9192;
+       unit 0 {
+           family ethernet-switching {
+               interface-mode access;
+               vlan {
+                   members public1-c-eqiad;
+               }
+           }
+       }
+   }
load complete
Exiting configuration mode
================
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'configure exclus...re;rollback;exit'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
==> Commit the above change?
Type "go" to proceed or "abort" to interrupt the execution
> abort
User input is: "abort"
Exception raised while executing cookbook sre.network.configure-switch-interfaces:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 212, in run
    raw_ret = runner.run()
  File "/srv/deployment/spicerack/cookbooks/sre/network/configure-switch-interfaces.py", line 61, in run
    configure_switch_interfaces(self.remote, self.netbox, self.netbox_data, self.verbose)
  File "/srv/deployment/spicerack/cookbooks/sre/network/__init__.py", line 51, in configure_switch_interfaces
    run_junos_commands(remote_host, commands)
  File "/srv/deployment/spicerack/cookbooks/sre/network/__init__.py", line 175, in run_junos_commands
    ask_confirmation('Commit the above change?')
  File "/usr/lib/python3/dist-packages/wmflib/interactive.py", line 137, in ask_confirmation
    raise AbortError('Confirmation manually aborted')
wmflib.interactive.AbortError: Confirmation manually aborted
END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host cloudcontrol1005

Never mind, @cmooney, this looks better now after a few changes on the Netbox side:

aborrero@cumin1001:~$ sudo cookbook sre.network.configure-switch-interfaces cloudcontrol1005
START - Cookbook sre.network.configure-switch-interfaces for host cloudcontrol1005
Network device returned invalid data: "<generator object ClusterShellWorker.get_results at 0x7f21517e6890>". Error: 'NoneType' object is not subscriptable
----- OUTPUT of 'configure exclus...re;rollback;exit' -----
Entering configuration mode
warning: statement not found
[edit interfaces]
+   ge-0/0/4 {
+       description cloudcontrol1005;
+       mtu 9192;
+       unit 0 {
+           family ethernet-switching {
+               interface-mode access;
+               vlan {
+                   members cloud-hosts1-c8-eqiad;
+               }
+           }
+       }
+   }
load complete
Exiting configuration mode
================
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'configure exclus...re;rollback;exit'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
==> Commit the above change?
Type "go" to proceed or "abort" to interrupt the execution
> go
User input is: "go"
----- OUTPUT of 'configure exclus...confirmed 1;exit' -----
Entering configuration mode
warning: statement not found
[edit interfaces]
+   ge-0/0/4 {
+       description cloudcontrol1005;
+       mtu 9192;
+       unit 0 {
+           family ethernet-switching {
+               interface-mode access;
+               vlan {
+                   members cloud-hosts1-c8-eqiad;
+               }
+           }
+       }
+   }
configuration check succeeds
commit confirmed will be automatically rolled back in 1 minutes unless confirmed
commit complete
Exiting configuration mode
================
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'configure exclus...confirmed 1;exit'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Commited the above change, needs to be confirmed
----- OUTPUT of 'configure;commit check;exit' -----
Entering configuration mode
configuration check succeeds
Exiting configuration mode
================
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'configure;commit check;exit'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Change confirmed
END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcontrol1005
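
The two cookbook runs above differ only in the VLAN shown in the proposed diff: `public1-c-eqiad` in the aborted run and `cloud-hosts1-c8-eqiad` in the committed one. A minimal sketch of checking that before typing "go" could look like the following. It assumes the diff has been copied into a string; `proposed_vlans` and `safe_to_commit` are hypothetical helpers, not part of the cookbook or of spicerack.

```python
import re
from typing import List

# Abbreviated copies of the two diffs printed by the cookbook in this task.
FIRST_DIFF = """\
[edit interfaces]
+   ge-0/0/4 {
+       unit 0 {
+           family ethernet-switching {
+               interface-mode access;
+               vlan {
+                   members public1-c-eqiad;
+               }
+           }
+       }
+   }
"""

SECOND_DIFF = FIRST_DIFF.replace("public1-c-eqiad", "cloud-hosts1-c8-eqiad")


def proposed_vlans(diff_text: str) -> List[str]:
    """Return the VLAN names found on added 'members <name>;' lines."""
    return re.findall(r"^\+\s*members\s+([\w-]+);", diff_text, flags=re.MULTILINE)


def safe_to_commit(diff_text: str, expected_vlan: str) -> bool:
    """True only if the diff adds exactly the one VLAN we expect."""
    return proposed_vlans(diff_text) == [expected_vlan]


if __name__ == "__main__":
    for name, diff in (("first run", FIRST_DIFF), ("second run", SECOND_DIFF)):
        verdict = "go" if safe_to_commit(diff, "cloud-hosts1-c8-eqiad") else "abort"
        print(f"{name}: {proposed_vlans(diff)} -> {verdict}")
```

The check is deliberately strict: any extra or unexpected VLAN in the diff yields "abort", which matches what was done by hand in the first run.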

Change 940194 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudcontrol1005: add role with new domain

https://gerrit.wikimedia.org/r/940194

Change 940197 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] eqiad1: cloudcontrol1005: load cloud-private

https://gerrit.wikimedia.org/r/940197

Change 940197 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] eqiad1: cloudcontrol1005: load cloud-private

https://gerrit.wikimedia.org/r/940197

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudcontrol1005.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudcontrol1005.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcontrol1005 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudcontrol1005.eqiad.wmnet with OS bullseye

fyi @aborrero, I just downtimed this host in Icinga until the 31st because it triggered a page. Feel free to delete the downtimes when the reimaging is done.

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudcontrol1005.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcontrol1005 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202307201655_aborrero_2612776_cloudcontrol1005.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • The reimage failed, see the cookbook logs for the details

Change 940322 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: eqiad1: don't deploy haproxy to cloudcontrol1005

https://gerrit.wikimedia.org/r/940322

Change 940322 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: eqiad1: don't deploy haproxy to cloudcontrol1005

https://gerrit.wikimedia.org/r/940322

Change 940324 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: eqiad1: control: fix typo in cloud_private include

https://gerrit.wikimedia.org/r/940324

Change 940324 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: eqiad1: control: fix typo in cloud_private include

https://gerrit.wikimedia.org/r/940324

Mentioned in SAL (#wikimedia-operations) [2023-07-21T08:45:01Z] <aborrero@cumin1001> START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "cloudcontrol1005 - aborrero@cumin1001 - T341495"

Mentioned in SAL (#wikimedia-operations) [2023-07-21T08:45:45Z] <aborrero@cumin1001> END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "cloudcontrol1005 - aborrero@cumin1001 - T341495"

Mentioned in SAL (#wikimedia-operations) [2023-07-21T08:47:25Z] <aborrero@cumin1001> START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "cloudcontrol1005 - aborrero@cumin1001 - T341495"

Mentioned in SAL (#wikimedia-operations) [2023-07-21T08:53:46Z] <aborrero@cumin1001> END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "cloudcontrol1005 - aborrero@cumin1001 - T341495"

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudcontrol1005.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudcontrol1005.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcontrol1005 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run failed and the operator aborted
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudcontrol1005.eqiad.wmnet with OS bullseye

Change 940336 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: eqiad1: add initial settings for cloudcontrol1005 as functional node

https://gerrit.wikimedia.org/r/940336

Change 940336 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: eqiad1: add initial settings for cloudcontrol1005 as functional node

https://gerrit.wikimedia.org/r/940336

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudcontrol1005.eqiad.wmnet with OS bullseye completed:

  • cloudcontrol1005 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202307211016_aborrero_2822281_cloudcontrol1005.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
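
The reimage summaries above mix routine steps with the few that explain a WARN or FAIL status. As a rough sketch, the steps worth reading first can be pulled out with a keyword filter. The keyword list is a heuristic chosen from the messages seen in this task, and `parse_summary` is a hypothetical helper, not something the cookbook provides.

```python
import re
from typing import Dict, List, Optional

# Shortened copy of the final (WARN) summary from this task.
SAMPLE = """\
  • cloudcontrol1005 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed previous downtime on Alertmanager (old OS)
    • Automatic Puppet run was successful
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
"""

HOST_LINE = re.compile(r"^\s*•\s+(?P<host>\S+)\s+\((?P<status>[A-Z]+)\)\s*$")
STEP_LINE = re.compile(r"^\s*•\s+(?P<step>.+?)\s*$")

# Heuristic keywords (an assumption), matched case-insensitively.
SUSPECT = ("failed", "unable", "not optimal", "not found")


def parse_summary(text: str) -> Dict[str, object]:
    """Split a summary into host, status, all steps, and suspicious steps."""
    host: Optional[str] = None
    status: Optional[str] = None
    steps: List[str] = []
    for line in text.splitlines():
        if host is None:
            match = HOST_LINE.match(line)
            if match:
                host, status = match["host"], match["status"]
            continue
        match = STEP_LINE.match(line)
        if match:
            steps.append(match["step"])
    flagged = [s for s in steps if any(k in s.lower() for k in SUSPECT)]
    return {"host": host, "status": status, "steps": steps, "flagged": flagged}


if __name__ == "__main__":
    result = parse_summary(SAMPLE)
    print(result["host"], result["status"])
    for step in result["flagged"]:
        print("check:", step)
```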

Change 940342 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: cloudcontrol1005: allow haproxy backend access by cloudlb nodes

https://gerrit.wikimedia.org/r/940342

Change 940342 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: cloudcontrol1005: allow haproxy backend access by cloudlb nodes

https://gerrit.wikimedia.org/r/940342

Change 940344 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] eqiad1: depool cloudcontrol1005

https://gerrit.wikimedia.org/r/940344

Change 940930 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] clouceph: mon: enable more client networks

https://gerrit.wikimedia.org/r/940930

Change 940930 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] clouceph: mon: enable more client networks

https://gerrit.wikimedia.org/r/940930

aborrero updated the task description.

As of this writing, the node is healthy. It works as a backend to the new cloudlbs, and the only backends haproxy sees as down are the designate endpoints on the cloudservices nodes, which is expected.

aborrero@cloudlb1001:~ $ /usr/local/lib/nagios/plugins/check_haproxy --check=someup
OK check_someup servers up 12 down 3:
designate-api_backend,cloudservices1004.wikimedia.org
designate-api_backend,cloudservices1005.wikimedia.org
designate-api_backend,BACKEND

aborrero@cloudlb1002:~ $ /usr/local/lib/nagios/plugins/check_haproxy --check=someup
OK check_someup servers up 12 down 3:
designate-api_backend,cloudservices1004.wikimedia.org
designate-api_backend,cloudservices1005.wikimedia.org
designate-api_backend,BACKEND
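
The "only designate is down" reading of the output above can be expressed as a small check. This is a sketch under assumptions: the output format is inferred solely from the two samples pasted in this task, and `parse_someup` and `only_expected_down` are hypothetical helpers, not part of the `check_haproxy` plugin.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# Output copied from cloudlb1001 in this task.
SAMPLE = """\
OK check_someup servers up 12 down 3:
designate-api_backend,cloudservices1004.wikimedia.org
designate-api_backend,cloudservices1005.wikimedia.org
designate-api_backend,BACKEND
"""

HEADER = re.compile(
    r"^(?P<status>\w+) check_someup servers up (?P<up>\d+) down (?P<down>\d+):"
)


def parse_someup(output: str) -> Dict[str, object]:
    """Parse the header counts and group the down servers by backend."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    match = HEADER.match(lines[0]) if lines else None
    if match is None:
        raise ValueError("unrecognised check_someup output")
    down_by_backend: Dict[str, List[str]] = defaultdict(list)
    for line in lines[1:]:
        backend, server = line.split(",", 1)
        down_by_backend[backend].append(server)
    return {
        "status": match["status"],
        "up": int(match["up"]),
        "down": int(match["down"]),
        "down_by_backend": dict(down_by_backend),
    }


def only_expected_down(parsed: Dict[str, object], expected: Iterable[str]) -> bool:
    """True if every backend with down servers is in the expected set."""
    return set(parsed["down_by_backend"]) <= set(expected)


if __name__ == "__main__":
    parsed = parse_someup(SAMPLE)
    print(parsed["up"], "up,", parsed["down"], "down")
    print("expected state:", only_expected_down(parsed, {"designate-api_backend"}))
```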

Change 942385 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/wmcs-cookbooks@main] cookbooks: remove references to cloudcontrol1005.wikimedia.org

https://gerrit.wikimedia.org/r/942385

Change 942385 merged by Arturo Borrero Gonzalez:

[cloud/wmcs-cookbooks@main] cookbooks: remove references to cloudcontrol1005.wikimedia.org

https://gerrit.wikimedia.org/r/942385

Change 947886 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] eqiad cloudceph config: add cloudcontrol1005.eqiad.wmnet

https://gerrit.wikimedia.org/r/947886

Change 947886 merged by Andrew Bogott:

[operations/puppet@production] eqiad cloudceph config: add cloudcontrol1005.eqiad.wmnet

https://gerrit.wikimedia.org/r/947886