cloudcontrol2001-dev: make it a cloudlb backend
Closed, Resolved (Public)

Description

This task is to track the work to make cloudcontrol2001-dev a cloudlb backend.

Event Timeline

aborrero triaged this task as Medium priority. May 9 2023, 9:34 AM
aborrero created this task.

Change 899614 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudcontrol2001-dev: introduce cloudlb support

https://gerrit.wikimedia.org/r/899614

cookbooks.sre.hosts.decommission executed by aborrero@cumin2002 for hosts: cloudcontrol2001-dev.wikimedia.org

  • cloudcontrol2001-dev.wikimedia.org (FAIL)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • Host steps raised exception: The request failed with code 500 Internal Server Error: {'error': "cannot import name 'domain' from 'validators' (unknown location)", 'exception': 'ImportError', 'netbox_version': '3.2.9', 'python_version': '3.9.2'}

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by aborrero@cumin2002 for hosts: cloudcontrol2001-dev.wikimedia.org

  • cloudcontrol2001-dev.wikimedia.org (FAIL)
    • Unable to find/resolve the mgmt DNS record, using the IP instead: 10.193.2.198
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Host is already powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

  • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)

Don't worry about this; it is an artifact of the half-failed step in the previous run. The wipefs ran successfully:

2023-05-09 09:45:18,722 aborrero 2453874 [INFO actions.py:125 in _action] Wiped all swraid, partition-table and filesystem signatures

Thanks!

Hey @Papaul, could you please physically connect this host to cloudsw1-b1-codfw instead of asw-b1-codfw?

https://netbox.wikimedia.org/dcim/devices/2067/

I guess that's needed before reimaging with the new address / role per https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Rename_while_reimaging

Change 917866 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudcontrol2001-dev: drop references to the old FQDN

https://gerrit.wikimedia.org/r/917866

Change 917866 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudcontrol2001-dev: drop references to the old FQDN

https://gerrit.wikimedia.org/r/917866

The hostname stays the same, but the domain changes from cloudcontrol2001-dev.wikimedia.org to cloudcontrol2001-dev.codfw.wmnet, so the IPv4 address also moves from public to private.
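As a quick sanity check of the rename described above, here is a minimal Python sketch. The helper names `split_fqdn` and `is_private_ipv4` are made up for illustration. The host's actual old and new IPv4 addresses are not quoted in this task, so the address check is only shown as a helper and is not applied to them.

```python
import ipaddress
from typing import Tuple


def split_fqdn(fqdn: str) -> Tuple[str, str]:
    """Split an FQDN into (short hostname, domain) at the first dot."""
    host, _, domain = fqdn.partition(".")
    return host, domain


def is_private_ipv4(address: str) -> bool:
    """Return True if the IPv4 address is in a private/non-global range."""
    return ipaddress.IPv4Address(address).is_private


# FQDNs as quoted in this task.
OLD_FQDN = "cloudcontrol2001-dev.wikimedia.org"
NEW_FQDN = "cloudcontrol2001-dev.codfw.wmnet"

old_host, old_domain = split_fqdn(OLD_FQDN)
new_host, new_domain = split_fqdn(NEW_FQDN)

# The short hostname is expected to be unchanged; only the domain differs.
print("same hostname:", old_host == new_host)
print("domain change:", old_domain, "->", new_domain)
```

Once the real addresses are known (for example from Netbox), `is_private_ipv4` could be used to confirm the new one is private.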

@Papaul The port being used is xe-1/0/25

In case it is relevant: before the decommission step, the cable was #2023, on switch asw-b1-codfw port ge-1/0/18.

Also, note there is some port information in T327919#8699523; I'm not sure whether it is relevant here, though.

@aborrero the move from old switch to new switch is complete.

Change 899614 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudcontrol2001-dev: introduce cloudlb support

https://gerrit.wikimedia.org/r/899614

Change 917910 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/dns@master] wikimedia.cloud: add entry for cloudcontrol2001-dev

https://gerrit.wikimedia.org/r/917910

Change 917910 merged by Arturo Borrero Gonzalez:

[operations/dns@master] wikimedia.cloud: add entry for cloudcontrol2001-dev

https://gerrit.wikimedia.org/r/917910

I got this @Papaul:

aborrero@cumin2002:~ 1 $ sudo cookbook sre.hosts.reimage --os bullseye --new -t T336236 cloudcontrol2001-dev
==> ATTENTION: destructive action for host: cloudcontrol2001-dev
Are you sure to proceed?
Type "go" to proceed or "abort" to interrupt the execution
> go
User input is: "go"
Exception raised while initializing the Cookbook sre.hosts.reimage:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 197, in run
    runner = self.instance.get_runner(args)
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/reimage.py", line 88, in get_runner
    return ReimageRunner(args, self.spicerack)
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/reimage.py", line 111, in __init__
    self.mgmt_fqdn = self.netbox_server.mgmt_fqdn
  File "/usr/lib/python3/dist-packages/spicerack/netbox.py", line 360, in mgmt_fqdn
    raise NetboxError(f"Server {self._server.name} has no management interface with a DNS name set.")
spicerack.netbox.NetboxError: Server cloudcontrol2001-dev has no management interface with a DNS name set.
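For readers unfamiliar with this failure: the cookbook aborts during initialization because it cannot derive a management FQDN for the server. The following is only an illustrative Python sketch of that kind of check, not spicerack's actual implementation. The `MgmtLookupError` class, the `find_mgmt_fqdn` function, and the dictionary shape of the interface records (`mgmt_only`, `dns_name`) are all assumptions made for this example.

```python
from typing import Dict, List


class MgmtLookupError(Exception):
    """Hypothetical error type, standing in for the NetboxError in the traceback."""


def find_mgmt_fqdn(server_name: str, interfaces: List[Dict[str, object]]) -> str:
    """Return the DNS name of the first management interface that has one.

    `interfaces` is a hypothetical, flattened view of the server's interface
    records: each dict may carry a `mgmt_only` flag and a `dns_name` string.
    """
    for iface in interfaces:
        if iface.get("mgmt_only") and iface.get("dns_name"):
            return str(iface["dns_name"])
    raise MgmtLookupError(
        f"Server {server_name} has no management interface with a DNS name set."
    )


# A server whose mgmt interface lost its DNS name (for example after a
# half-completed decommission) would fail the lookup:
try:
    find_mgmt_fqdn("cloudcontrol2001-dev", [{"mgmt_only": True, "dns_name": ""}])
except MgmtLookupError as exc:
    print(exc)
```

Under these assumptions, restoring the DNS name on the management interface record would be enough for the lookup to succeed again.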

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudbackup2001-dev.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudbackup2001-dev.codfw.wmnet with OS bullseye executed with errors:

  • cloudcontrol2001-dev (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudcontrol2001-dev.codfw.wmnet with OS bullseye

Change 917928 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] Update cloudcontrol2001 in netboot.cfg file

https://gerrit.wikimedia.org/r/917928

Change 917928 merged by Papaul:

[operations/puppet@production] Update cloudcontrol2001 in netboot.cfg file

https://gerrit.wikimedia.org/r/917928

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudcontrol2001-dev.codfw.wmnet with OS bullseye executed with errors:

  • cloudcontrol2001-dev (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudcontrol2001-dev.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudcontrol2001-dev.codfw.wmnet with OS bullseye executed with errors:

  • cloudcontrol2001-dev (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudcontrol2001-dev.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudcontrol2001-dev.codfw.wmnet with OS bullseye executed with errors:

  • cloudcontrol2001-dev (FAIL)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudcontrol2001-dev.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudcontrol2001-dev.codfw.wmnet with OS bullseye executed with errors:

  • cloudcontrol2001-dev (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details

Change 917959 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] Add cloudcontrol2001-dev with role insetup

https://gerrit.wikimedia.org/r/917959

Change 917959 merged by Papaul:

[operations/puppet@production] Add cloudcontrol2001-dev with role insetup

https://gerrit.wikimedia.org/r/917959

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudcontrol2001-dev.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudcontrol2001-dev.codfw.wmnet with OS bullseye completed:

  • cloudcontrol2001-dev (PASS)
    • Downtimed on Icinga/Alertmanager
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202305092232_pt1979_3351418_cloudcontrol2001-dev.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

@aborrero I went ahead and set up the node in site.pp with role::insetup::wmcs and reimaged it so you can put the node in service.

Thanks

Thanks!

Change 918390 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudcontrol2001-dev: give it proper role

https://gerrit.wikimedia.org/r/918390

Change 918390 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudcontrol2001-dev: give it proper role

https://gerrit.wikimedia.org/r/918390

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudcontrol2001-dev.codfw.wmnet with OS bullseye

Change 918414 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudlb: stronger openstack_controllers override

https://gerrit.wikimedia.org/r/918414

Change 918414 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudlb: stronger openstack_controllers override

https://gerrit.wikimedia.org/r/918414

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudcontrol2001-dev.codfw.wmnet with OS bullseye completed:

  • cloudcontrol2001-dev (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202305100909_aborrero_3712674_cloudcontrol2001-dev.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

As of today, cloudcontrol2001-dev.codfw.wmnet is a backend to cloudlb200X-dev.codfw.wmnet and they are communicating over the cloud-private vlan.

Change 919210 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/dns@master] update codfw1dev rabbitmq01 cname for new cloudcontrol2001-dev

https://gerrit.wikimedia.org/r/919210

Change 919210 merged by Andrew Bogott:

[operations/dns@master] update codfw1dev rabbitmq01 cname for new cloudcontrol2001-dev

https://gerrit.wikimedia.org/r/919210

Change 919217 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Temporarily mark out refs to cloudrabbit01 in codfw1dev

https://gerrit.wikimedia.org/r/919217

Change 919217 merged by Andrew Bogott:

[operations/puppet@production] Temporarily mark out refs to cloudrabbit01 in codfw1dev

https://gerrit.wikimedia.org/r/919217

Hi, all backups of cloudcontrol2001-dev are failing. Usually this is due to maintenance or a setup defect:

root@backup1001:~$ check_bacula.py cloudcontrol2001-dev.codfw.wmnet-Monthly-1st-Wed-productionEqiad-mysql-srv-backups
id: 509222, ts: 2023-05-11 04:11:31, type: F, status: f, bytes: 0
id: 509381, ts: 2023-05-12 04:11:28, type: F, status: f, bytes: 0
id: 509539, ts: 2023-05-13 04:11:10, type: F, status: f, bytes: 0
id: 509698, ts: 2023-05-14 04:11:14, type: F, status: f, bytes: 0
id: 509868, ts: 2023-05-15 04:14:10, type: F, status: f, bytes: 0
✔️

I can file this on a separate ticket, if needed.

I see what is going on: the backups are running, but they come back empty, which is an unusual setup that we (unintentionally) interpret as a failure.
We can do one of two things: either put some files in /srv/backups (e.g. if dumps should be happening but are not), or ignore failures on this machine.
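To make the "empty backup" case concrete, here is a small Python sketch that parses job lines in the format pasted above and picks out the jobs that transferred zero bytes. It is an illustration only: the regular expression and the function names `parse_jobs` and `empty_jobs` are assumptions based on the pasted output, not part of check_bacula.py.

```python
import re
from typing import Dict, List

# Matches lines such as:
# id: 509222, ts: 2023-05-11 04:11:31, type: F, status: f, bytes: 0
LINE_RE = re.compile(
    r"id: (?P<id>\d+), ts: (?P<ts>[\d\- :]+), "
    r"type: (?P<type>\w), status: (?P<status>\w), bytes: (?P<bytes>\d+)"
)


def parse_jobs(output: str) -> List[Dict[str, object]]:
    """Parse job lines into dicts; lines that do not match are skipped."""
    jobs = []
    for line in output.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            jobs.append(
                {
                    "id": int(match["id"]),
                    "ts": match["ts"],
                    "type": match["type"],
                    "status": match["status"],
                    "bytes": int(match["bytes"]),
                }
            )
    return jobs


def empty_jobs(jobs: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Return the jobs that backed up zero bytes."""
    return [job for job in jobs if job["bytes"] == 0]


# Two of the lines pasted in this task.
SAMPLE = """\
id: 509222, ts: 2023-05-11 04:11:31, type: F, status: f, bytes: 0
id: 509381, ts: 2023-05-12 04:11:28, type: F, status: f, bytes: 0
"""

jobs = parse_jobs(SAMPLE)
print(len(empty_jobs(jobs)), "of", len(jobs), "jobs backed up zero bytes")
```

A check along these lines separates "ran but found nothing to back up" from other failure modes, which is the distinction behind the two options listed above.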

Change 919781 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] bacula: Ignore cloudcontrol2001-dev backup errors (no file backed up)

https://gerrit.wikimedia.org/r/919781

This is unexpected. I'll definitely try to understand why that happened.

It could be a simple race condition: the server was reimaged, and the MySQL backup file had not been produced yet when Bacula tried to back it up.

Change 919781 merged by Jcrespo:

[operations/puppet@production] bacula: Ignore cloudcontrol2001-dev backup errors (no file backed up)

https://gerrit.wikimedia.org/r/919781