cloudcontrol2001-dev: make it a cloudlb backend
Closed, ResolvedPublic

Assigned To

Authored By

	aborrero
	May 9 2023, 9:34 AM

Description

This task is to track the work to make cloudcontrol2001-dev a cloudlb backend.

Details

Subject	Repo	Branch	Lines +/-
bacula: Ignore cloudcontrol2001-dev backup errors (no file backed up)	operations/puppet	production	+1 -0
Temporarily mark out refs to cloudrabbit01 in codfw1dev	operations/puppet	production	+2 -2
update codfw1dev rabbitmq01 cname for new cloudcontrol2001-dev	operations/dns	master	+1 -1
cloudlb: stronger openstack_controllers override	operations/puppet	production	+6 -7
cloudcontrol2001-dev: give it proper role	operations/puppet	production	+3 -1
Add cloudcontrol2001-dev with role insetup	operations/puppet	production	+5 -1
Update cloudcontrol2001 in netboot.cfg file	operations/puppet	production	+1 -1
wikimedia.cloud: add entry for cloudcontrol2001-dev	operations/dns	master	+5 -4
cloudcontrol2001-dev: introduce cloudlb support	operations/puppet	production	+37 -2
cloudcontrol2001-dev: drop references to the old FQDN	operations/puppet	production	+0 -6

Related Objects

Status	Assigned	Task
Resolved	aborrero	T296411 cloud: decide on general idea for having cloud-dedicated hardware provide service in the cloud realm & the internet
Resolved	aborrero	T297596 have cloud hardware servers in the cloud realm using a dedicated LB layer
Resolved	aborrero	T324992 cloudlb: create PoC on codfw
Resolved	aborrero	T332153 cloudlb PoC: prepare backends
Resolved	aborrero	T336236 cloudcontrol2001-dev: make it a cloudlb backend

Event Timeline

aborrero triaged this task as Medium priority. May 9 2023, 9:34 AM

aborrero created this task.

aborrero moved this task from Backlog to In progress on the cloud-services-team (FY2022/2023-Q4) board.

Change 899614 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudcontrol2001-dev: introduce cloudlb support

https://gerrit.wikimedia.org/r/899614

gerritbot added a project: Patch-For-Review. May 9 2023, 9:48 AM

cookbooks.sre.hosts.decommission executed by aborrero@cumin2002 for hosts: cloudcontrol2001-dev.wikimedia.org

cloudcontrol2001-dev.wikimedia.org (FAIL)
- Downtimed host on Icinga/Alertmanager
- Found physical host
- Management interface not found on Icinga, unable to downtime it
- Wiped all swraid, partition-table and filesystem signatures
- Powered off
- Host steps raised exception: The request failed with code 500 Internal Server Error: {'error': "cannot import name 'domain' from 'validators' (unknown location)", 'exception': 'ImportError', 'netbox_version': '3.2.9', 'python_version': '3.9.2'}

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by aborrero@cumin2002 for hosts: cloudcontrol2001-dev.wikimedia.org

cloudcontrol2001-dev.wikimedia.org (FAIL)
- Unable to find/resolve the mgmt DNS record, using the IP instead: 10.193.2.198
- Downtimed host on Icinga/Alertmanager
- Found physical host
- Management interface not found on Icinga, unable to downtime it
- Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
- Host is already powered off
- [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
- Configured the linked switch interface(s)
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

In T336236#8836351, @ops-monitoring-bot wrote:

Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)

Don't worry about this; it is an artifact of the half-failed step in the previous run. The wipefs ran successfully:

2023-05-09 09:45:18,722 aborrero 2453874 [INFO actions.py:125 in _action] Wiped all swraid, partition-table and filesystem signatures

In T336236#8836366, @Volans wrote:

In T336236#8836351, @ops-monitoring-bot wrote:

Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)

Don't worry about this; it is an artifact of the half-failed step in the previous run. The wipefs ran successfully:

Thanks!

Hey @Papaul, could you please physically connect this host to cloudsw1-b1-codfw instead of asw-b1-codfw?

https://netbox.wikimedia.org/dcim/devices/2067/

I guess that's needed before reimaging with the new address / role per https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Rename_while_reimaging

aborrero reassigned this task from aborrero to Papaul. May 9 2023, 11:15 AM

Maintenance_bot added a project: SRE. May 9 2023, 11:29 AM

Change 917866 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudcontrol2001-dev: drop references to the old FQDN

https://gerrit.wikimedia.org/r/917866

Change 917866 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudcontrol2001-dev: drop references to the old FQDN

https://gerrit.wikimedia.org/r/917866

Papaul moved this task from Backlog to Hardware Failure / Troubleshoot on the ops-codfw board. May 9 2023, 1:09 PM

@Papaul The port being used is xe-1/0/25

The hostname stays the same, but the domain is changing from cloudcontrol2001-dev.wikimedia.org to cloudcontrol2001-dev.codfw.wmnet, so the IPv4 address also changes from public to private.
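
To make the rename concrete, here is a minimal Python sketch that splits the two FQDNs from the comment above and checks whether an IPv4 address is private. The helper names are hypothetical and not part of any WMF tooling. The host's new address is not stated in this task, so the only address used is the mgmt IP 10.193.2.198 from the decommission log, purely as a known private example.

```python
import ipaddress
from typing import Tuple

# FQDNs quoted in the comment above.
OLD_FQDN = "cloudcontrol2001-dev.wikimedia.org"
NEW_FQDN = "cloudcontrol2001-dev.codfw.wmnet"


def split_fqdn(fqdn: str) -> Tuple[str, str]:
    """Split an FQDN into (hostname, domain) at the first dot."""
    hostname, _, domain = fqdn.partition(".")
    return hostname, domain


def is_private_ipv4(address: str) -> bool:
    """Return True if the IPv4 address is in a private range."""
    return ipaddress.IPv4Address(address).is_private


def same_hostname(old_fqdn: str, new_fqdn: str) -> bool:
    """Return True if only the domain part differs between the two FQDNs."""
    return split_fqdn(old_fqdn)[0] == split_fqdn(new_fqdn)[0]
```

With these helpers, `same_hostname(OLD_FQDN, NEW_FQDN)` confirms that only the domain changes, and `is_private_ipv4("10.193.2.198")` shows how an address could be checked once the new one is known.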

In T336236#8837014, @Jhancock.wm wrote:

@Papaul The port being used is xe-1/0/25

In case it is relevant: before the decommission step, the cable was #2023, on switch asw-b1-codfw, port ge-1/0/18.

Also, note there is some port information in T327919#8699523, though I'm not sure it is relevant here.

@aborrero the move from old switch to new switch is complete.

Change 899614 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudcontrol2001-dev: introduce cloudlb support

https://gerrit.wikimedia.org/r/899614

Change 917910 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/dns@master] wikimedia.cloud: add entry for cloudcontrol2001-dev

https://gerrit.wikimedia.org/r/917910

Change 917910 merged by Arturo Borrero Gonzalez:

[operations/dns@master] wikimedia.cloud: add entry for cloudcontrol2001-dev

https://gerrit.wikimedia.org/r/917910

Maintenance_bot removed a project: Patch-For-Review. May 9 2023, 4:12 PM

I got this @Papaul:

aborrero@cumin2002:~ 1 $ sudo cookbook sre.hosts.reimage --os bullseye --new -t T336236 cloudcontrol2001-dev
==> ATTENTION: destructive action for host: cloudcontrol2001-dev
Are you sure to proceed?
Type "go" to proceed or "abort" to interrupt the execution
> go
User input is: "go"
Exception raised while initializing the Cookbook sre.hosts.reimage:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 197, in run
    runner = self.instance.get_runner(args)
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/reimage.py", line 88, in get_runner
    return ReimageRunner(args, self.spicerack)
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/reimage.py", line 111, in __init__
    self.mgmt_fqdn = self.netbox_server.mgmt_fqdn
  File "/usr/lib/python3/dist-packages/spicerack/netbox.py", line 360, in mgmt_fqdn
    raise NetboxError(f"Server {self._server.name} has no management interface with a DNS name set.")
spicerack.netbox.NetboxError: Server cloudcontrol2001-dev has no management interface with a DNS name set.
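
The traceback above suggests the reimage cookbook refuses to start when Netbox has no management interface with a DNS name for the server, which fits the earlier decommission run having changed the host's Netbox data. The following is a simplified Python sketch of that kind of check, not spicerack's actual implementation: the `Interface` and `Server` classes, the `mgmt_only` flag, and the example DNS name are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


class NetboxError(Exception):
    """Stand-in for spicerack.netbox.NetboxError (hypothetical re-definition)."""


@dataclass
class Interface:
    name: str
    mgmt_only: bool                  # assumed flag marking a management interface
    dns_name: Optional[str] = None   # DNS name of the IP assigned to the interface


@dataclass
class Server:
    name: str
    interfaces: List[Interface] = field(default_factory=list)


def mgmt_fqdn(server: Server) -> str:
    """Return the DNS name of the first management interface that has one."""
    for iface in server.interfaces:
        if iface.mgmt_only and iface.dns_name:
            return iface.dns_name
    # Same message as in the traceback above.
    raise NetboxError(
        f"Server {server.name} has no management interface with a DNS name set."
    )
```

In this sketch, a server whose management interface has no `dns_name` raises the same message as the traceback, while setting a DNS name (for example a hypothetical `cloudcontrol2001-dev.mgmt.codfw.wmnet`) lets the lookup succeed.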

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudbackup2001-dev.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudbackup2001-dev.codfw.wmnet with OS bullseye executed with errors:

cloudcontrol2001-dev (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudcontrol2001-dev.codfw.wmnet with OS bullseye

Change 917928 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] Update cloudcontrol2001 in netboot.cfg file

https://gerrit.wikimedia.org/r/917928

Change 917928 merged by Papaul:

[operations/puppet@production] Update cloudcontrol2001 in netboot.cfg file

https://gerrit.wikimedia.org/r/917928

Maintenance_bot removed a project: Patch-For-Review. May 9 2023, 5:10 PM

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudcontrol2001-dev.codfw.wmnet with OS bullseye executed with errors:

cloudcontrol2001-dev (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudcontrol2001-dev.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudcontrol2001-dev.codfw.wmnet with OS bullseye executed with errors:

cloudcontrol2001-dev (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudcontrol2001-dev.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudcontrol2001-dev.codfw.wmnet with OS bullseye executed with errors:

cloudcontrol2001-dev (FAIL)
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudcontrol2001-dev.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudcontrol2001-dev.codfw.wmnet with OS bullseye executed with errors:

cloudcontrol2001-dev (FAIL)
- Downtimed on Icinga/Alertmanager
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- The reimage failed, see the cookbook logs for the details

Change 917959 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] Add cloudcontrol2001-dev with role insetup

https://gerrit.wikimedia.org/r/917959

Change 917959 merged by Papaul:

[operations/puppet@production] Add cloudcontrol2001-dev with role insetup

https://gerrit.wikimedia.org/r/917959

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudcontrol2001-dev.codfw.wmnet with OS bullseye

Maintenance_bot removed a project: Patch-For-Review. May 9 2023, 10:30 PM

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudcontrol2001-dev.codfw.wmnet with OS bullseye completed:

cloudcontrol2001-dev (PASS)
- Downtimed on Icinga/Alertmanager
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202305092232_pt1979_3351418_cloudcontrol2001-dev.out
- Checked BIOS boot parameters are back to normal
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
- Updated Netbox status planned -> active
- The sre.puppet.sync-netbox-hiera cookbook was run successfully

@aborrero I went ahead and set up the node in site.pp with role::insetup::wmcs and reimaged it, so you can put the node in service.

Thanks

In T336236#8839386, @Papaul wrote:

@aborrero I went ahead and set up the node in site.pp with role::insetup::wmcs and reimaged it, so you can put the node in service.

Thanks!

Change 918390 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudcontrol2001-dev: give it proper role

https://gerrit.wikimedia.org/r/918390

gerritbot added a project: Patch-For-Review. May 10 2023, 8:47 AM

Change 918390 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudcontrol2001-dev: give it proper role

https://gerrit.wikimedia.org/r/918390

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudcontrol2001-dev.codfw.wmnet with OS bullseye

Maintenance_bot removed a project: Patch-For-Review. May 10 2023, 9:10 AM

Change 918414 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudlb: stronger openstack_controllers override

https://gerrit.wikimedia.org/r/918414

gerritbot added a project: Patch-For-Review. May 10 2023, 9:54 AM

Change 918414 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudlb: stronger openstack_controllers override

https://gerrit.wikimedia.org/r/918414

Maintenance_bot removed a project: Patch-For-Review. May 10 2023, 10:10 AM

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudcontrol2001-dev.codfw.wmnet with OS bullseye completed:

cloudcontrol2001-dev (WARN)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202305100909_aborrero_3712674_cloudcontrol2001-dev.out
- Checked BIOS boot parameters are back to normal
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Updated Netbox data from PuppetDB

As of today, cloudcontrol2001-dev.codfw.wmnet is a backend to cloudlb200X-dev.codfw.wmnet and they are communicating over the cloud-private vlan.

cmooney awarded a token. May 11 2023, 10:35 AM

Change 919210 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/dns@master] update codfw1dev rabbitmq01 cname for new cloudcontrol2001-dev

https://gerrit.wikimedia.org/r/919210

gerritbot added a project: Patch-For-Review. May 11 2023, 6:34 PM

Change 919210 merged by Andrew Bogott:

[operations/dns@master] update codfw1dev rabbitmq01 cname for new cloudcontrol2001-dev

https://gerrit.wikimedia.org/r/919210

Maintenance_bot removed a project: Patch-For-Review. May 11 2023, 7:11 PM

Change 919217 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Temporarily mark out refs to cloudrabbit01 in codfw1dev

https://gerrit.wikimedia.org/r/919217

Change 919217 merged by Andrew Bogott:

[operations/puppet@production] Temporarily mark out refs to cloudrabbit01 in codfw1dev

https://gerrit.wikimedia.org/r/919217

Maintenance_bot removed a project: Patch-For-Review. May 11 2023, 8:11 PM

aborrero mentioned this in T336564: cloudcontrol2005-dev: make it a cloudlb backend. May 12 2023, 9:19 AM

Hi, cloudcontrol2001-dev is failing all of its backups. Usually this is due to maintenance or a setup defect:

root@backup1001:~$ check_bacula.py cloudcontrol2001-dev.codfw.wmnet-Monthly-1st-Wed-productionEqiad-mysql-srv-backups
id: 509222, ts: 2023-05-11 04:11:31, type: F, status: f, bytes: 0
id: 509381, ts: 2023-05-12 04:11:28, type: F, status: f, bytes: 0
id: 509539, ts: 2023-05-13 04:11:10, type: F, status: f, bytes: 0
id: 509698, ts: 2023-05-14 04:11:14, type: F, status: f, bytes: 0
id: 509868, ts: 2023-05-15 04:14:10, type: F, status: f, bytes: 0
✔️

I can file this on a separate ticket, if needed.

I see what is going on: the backups are running, but they come back empty, which is an unusual setup that we interpret as a failure (not intended).
We can do one of two things: either put some files in /srv/backups (e.g. if dumping is not happening but should be), or ignore failures on this machine.
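
As a rough illustration of the "backup ran but returned empty" condition, here is a small Python sketch that parses lines in the format printed by check_bacula.py above and picks out the jobs with `bytes: 0`. It is not part of the backup tooling: the regular expression and the function names are assumptions based only on the five lines pasted in this task.

```python
import re
from typing import Dict, List

# Output lines copied from the check_bacula.py paste above.
SAMPLE = """\
id: 509222, ts: 2023-05-11 04:11:31, type: F, status: f, bytes: 0
id: 509381, ts: 2023-05-12 04:11:28, type: F, status: f, bytes: 0
id: 509539, ts: 2023-05-13 04:11:10, type: F, status: f, bytes: 0
id: 509698, ts: 2023-05-14 04:11:14, type: F, status: f, bytes: 0
id: 509868, ts: 2023-05-15 04:14:10, type: F, status: f, bytes: 0
"""

# Assumed line format, inferred from the sample only.
LINE_RE = re.compile(
    r"id: (?P<id>\d+), "
    r"ts: (?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}), "
    r"type: (?P<type>\w+), status: (?P<status>\w+), bytes: (?P<bytes>\d+)"
)


def parse_jobs(output: str) -> List[Dict[str, str]]:
    """Return one dict per job line found in the output."""
    return [match.groupdict() for match in LINE_RE.finditer(output)]


def empty_jobs(jobs: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return the jobs that transferred zero bytes."""
    return [job for job in jobs if int(job["bytes"]) == 0]
```

Running `empty_jobs(parse_jobs(SAMPLE))` on the paste returns all five jobs, which matches the observation that every run completed without backing up any file.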

Change 919781 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] bacula: Ignore cloudcontrol2001-dev backup errors (no file backed up)

https://gerrit.wikimedia.org/r/919781

gerritbot added a project: Patch-For-Review. May 15 2023, 7:13 AM

In T336236#8849618, @jcrespo wrote:

I see what is going on: the backups are running, but they come back empty, which is an unusual setup that we interpret as a failure (not intended).
We can do one of two things: either put some files in /srv/backups (e.g. if dumping is not happening but should be), or ignore failures on this machine.

This is unexpected. I'll definitely try to understand why that happened.

It could be a mere race condition: the server was reimaged and the mysql backup file wasn't produced before bacula tried to back it up.

Change 919781 merged by Jcrespo:

[operations/puppet@production] bacula: Ignore cloudcontrol2001-dev backup errors (no file backed up)

https://gerrit.wikimedia.org/r/919781

Maintenance_bot removed a project: Patch-For-Review. May 15 2023, 9:12 AM

fnegri moved this task from In progress to Done on the cloud-services-team (FY2022/2023-Q4) board. Jul 27 2023, 3:13 PM
