
Move cloudvirt-wdqs hosts
Closed, Resolved (Public)

Description

We need to move cloudvirt-wdqs hosts to WMCS-dedicated racks with cloudsw devices so that they can connect to the new cloud-private network.

https://netbox.wikimedia.org/search/?q=cloudvirt-wdqs&obj_type=

Event Timeline

@bking two questions

  1. (a repeat of T324147) do y'all still want these servers to do things that can't be done using ceph for file storage?
  2. Right now there's only one VM living on that cluster, 'vespa01'. Does that server need to survive the move documented in this ticket, or can we delete it and leave it for you to rebuild after?

I think we can also solve this in the short term by allowing cloud-host vlan traffic to openstack.eqiad1.wikimediacloud.org via CR firewall in homer.

Change 959706 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/homer/public@master] policies/cr-labs: refresh openstack API endpoints

https://gerrit.wikimedia.org/r/959706

Change 959706 merged by Arturo Borrero Gonzalez:

[operations/homer/public@master] policies/cr-labs: refresh openstack API endpoints

https://gerrit.wikimedia.org/r/959706

Change 959955 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] Fix puppet on cloudvirt-wdqs* until they have been moved

https://gerrit.wikimedia.org/r/959955

Change 959955 merged by Majavah:

[operations/puppet@production] Fix puppet on cloudvirt-wdqs* until they have been moved

https://gerrit.wikimedia.org/r/959955

> @bking two questions
>
>   1. (a repeat of T324147) do y'all still want these servers to do things that can't be done using ceph for file storage?

We still want to keep these servers as they may be used in our graph splitting experiment. Tagging my team lead @Gehel in case he has a different opinion.

>   2. Right now there's only one VM living on that cluster, 'vespa01'. Does that server need to survive the move documented in this ticket, or can we delete it and leave it for you to rebuild after?

I'm assuming no, but @EBernhardson can confirm/deny.

If you don't hear back within a week, feel free to shut down and move the hosts. If it's more urgent than that, ping us in Wikimedia-Search IRC and we'll get back to you as soon as we can. Sorry for not seeing the ping earlier!

> > @bking two questions
> >
> >   1. (a repeat of T324147) do y'all still want these servers to do things that can't be done using ceph for file storage?
>
> We still want to keep these servers as they may be used in our graph splitting experiment. Tagging my team lead @Gehel in case he has a different opinion.

I believe this is correct; we will need the instances for some of the evaluations and testing around graph splitting.

> >   2. Right now there's only one VM living on that cluster, 'vespa01'. Does that server need to survive the move documented in this ticket, or can we delete it and leave it for you to rebuild after?
>
> I'm assuming no, but @EBernhardson can confirm/deny.

The vespa instance can be deleted at any time; it's a test instance that can easily be recreated as necessary.

> If you don't hear back within a week, feel free to shut down and move the hosts. If it's more urgent than that, ping us in Wikimedia-Search IRC and we'll get back to you as soon as we can. Sorry for not seeing the ping earlier!

Thanks!

taavi@cloudcontrol1006 ~ $ os server list --long --all-projects --host cloudvirt-wdqs1001
+--------------------------------------+-------------------+--------+------------+-------------+----------------------------------------+----------------------------------------------+--------------------------------------+-----------------------+--------------------------------------+-------------------+--------------------+-------------------------+
| ID                                   | Name              | Status | Task State | Power State | Networks                               | Image Name                                   | Image ID                             | Flavor Name           | Flavor ID                            | Availability Zone | Host               | Properties              |
+--------------------------------------+-------------------+--------+------------+-------------+----------------------------------------+----------------------------------------------+--------------------------------------+-----------------------+--------------------------------------+-------------------+--------------------+-------------------------+
| d9e946ff-cb7e-49a3-9ca1-b1def7f76a40 | vespa01           | ACTIVE | None       | Running     | lan-flat-cloudinstances2b=172.16.2.195 | debian-11.0-bullseye (deprecated 2023-06-08) | e69cb6f7-e5c7-41de-b08d-8e5739c20de3 | t206636v2             | 524457d1-e000-42a8-a4af-8e4659e966af | nova              | cloudvirt-wdqs1001 |                         |
| 9e4a229b-386f-47f9-a62f-50f3b518559e | canary-wdqs1001-2 | ACTIVE | None       | Running     | lan-flat-cloudinstances2b=172.16.6.108 | debian-11.0-bullseye (deprecated 2023-01-12) | 9a01d3d8-e793-4775-8b81-434f68c687a7 | g3.cores1.ram1.disk20 | bf48880d-0c1b-4c2a-8e8b-778d28b16561 | nova              | cloudvirt-wdqs1001 | description='canary VM' |
+--------------------------------------+-------------------+--------+------------+-------------+----------------------------------------+----------------------------------------------+--------------------------------------+-----------------------+--------------------------------------+-------------------+--------------------+-------------------------+
taavi@cloudcontrol1006 ~ $ os server list --long --all-projects --host cloudvirt-wdqs1002
+--------------------------------------+-------------------+--------+------------+-------------+---------------------------------------+----------------------------------------------+--------------------------------------+-----------------------+--------------------------------------+-------------------+--------------------+-------------------------+
| ID                                   | Name              | Status | Task State | Power State | Networks                              | Image Name                                   | Image ID                             | Flavor Name           | Flavor ID                            | Availability Zone | Host               | Properties              |
+--------------------------------------+-------------------+--------+------------+-------------+---------------------------------------+----------------------------------------------+--------------------------------------+-----------------------+--------------------------------------+-------------------+--------------------+-------------------------+
| 775079dc-6959-47e6-8915-f527626b6ede | canary-wdqs1002-2 | ACTIVE | None       | Running     | lan-flat-cloudinstances2b=172.16.5.45 | debian-11.0-bullseye (deprecated 2023-01-12) | 9a01d3d8-e793-4775-8b81-434f68c687a7 | g3.cores1.ram1.disk20 | bf48880d-0c1b-4c2a-8e8b-778d28b16561 | nova              | cloudvirt-wdqs1002 | description='canary VM' |
+--------------------------------------+-------------------+--------+------------+-------------+---------------------------------------+----------------------------------------------+--------------------------------------+-----------------------+--------------------------------------+-------------------+--------------------+-------------------------+
taavi@cloudcontrol1006 ~ $ os server list --long --all-projects --host cloudvirt-wdqs1003
+--------------------------------------+-------------------+--------+------------+-------------+----------------------------------------+----------------------------------------------+--------------------------------------+-----------------------+--------------------------------------+-------------------+--------------------+-------------------------+
| ID                                   | Name              | Status | Task State | Power State | Networks                               | Image Name                                   | Image ID                             | Flavor Name           | Flavor ID                            | Availability Zone | Host               | Properties              |
+--------------------------------------+-------------------+--------+------------+-------------+----------------------------------------+----------------------------------------------+--------------------------------------+-----------------------+--------------------------------------+-------------------+--------------------+-------------------------+
| e51c094f-95a6-43f6-bea9-3f806769faee | canary-wdqs1003-2 | ACTIVE | None       | Running     | lan-flat-cloudinstances2b=172.16.4.236 | debian-11.0-bullseye (deprecated 2023-01-12) | 9a01d3d8-e793-4775-8b81-434f68c687a7 | g3.cores1.ram1.disk20 | bf48880d-0c1b-4c2a-8e8b-778d28b16561 | nova              | cloudvirt-wdqs1003 | description='canary VM' |
+--------------------------------------+-------------------+--------+------------+-------------+----------------------------------------+----------------------------------------------+--------------------------------------+-----------------------+--------------------------------------+-------------------+--------------------+-------------------------+
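The check in the listings above (which VMs on a hypervisor are something other than canaries) can be sketched in Python. This is a minimal, hypothetical helper that parses the ASCII table text printed by `os server list`; the function names and the rule "a name starting with `canary-` is a canary" are assumptions made for illustration, not WMCS tooling. The sample uses only three of the columns shown above.

```python
def parse_table(text: str) -> list[dict[str, str]]:
    """Parse a '|'-delimited ASCII table into a list of row dicts.

    Border lines (+---+) and shell prompts are skipped. Cells that
    themselves contain '|' are not supported by this sketch.
    """
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not (line.startswith("|") and line.endswith("|")):
            continue
        # Drop the outer pipes, then split the remaining cells.
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        if header is None:
            header = cells
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def non_canary_vms(text: str) -> list[str]:
    """Return the names of VMs that do not look like canaries (assumed prefix rule)."""
    return [row["Name"] for row in parse_table(text)
            if not row["Name"].startswith("canary-")]


# Reduced copy of the cloudvirt-wdqs1001 listing (Name, Status, Host only).
SAMPLE = """\
+-------------------+--------+--------------------+
| Name              | Status | Host               |
+-------------------+--------+--------------------+
| vespa01           | ACTIVE | cloudvirt-wdqs1001 |
| canary-wdqs1001-2 | ACTIVE | cloudvirt-wdqs1001 |
+-------------------+--------+--------------------+
"""

print(non_canary_vms(SAMPLE))  # ['vespa01']
```

In practice, a machine-readable output format from the CLI would likely be more robust than parsing the table, if one is available.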

Mentioned in SAL (#wikimedia-cloud) [2023-10-03T13:08:21Z] <taavi> remove canary VMs from cloudvirt-wdqs hosts T346948

cookbooks.sre.hosts.decommission executed by taavi@cumin1001 for hosts: cloudvirt-wdqs1002.eqiad.wmnet

  • cloudvirt-wdqs1002.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

cookbooks.sre.hosts.decommission executed by taavi@cumin1001 for hosts: cloudvirt-wdqs1003.eqiad.wmnet

  • cloudvirt-wdqs1003.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

cookbooks.sre.hosts.decommission executed by taavi@cumin1001 for hosts: cloudvirt-wdqs1001.eqiad.wmnet

  • cloudvirt-wdqs1001.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
taavi added a project: ops-eqiad.

Hi! Can we please have cloudvirt-wdqs100[1-3] moved to the WMCS racks, preferably E4 or F4? They will all need a single connection to the rack-specific cloud-hosts VLAN.

@taavi What VLAN are these going to be on? I would like to verify with @cmooney that these can go into these racks before I physically move them.

Thanks @Jclark-ctr, yes, these can go in E4 or F4, no problem.

@Jclark-ctr these need a single NIC connected to the cloud-hosts as the primary VLAN, and cloud-instances and cloud-private VLANs trunked (we can take care of those).
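The port layout described here (one NIC, cloud-hosts as the primary VLAN, cloud-instances and cloud-private trunked) can be sketched as a small data structure. This is an illustration only: the class, the function and the VLAN name strings are hypothetical and do not reflect the real Netbox or switch naming.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SwitchPort:
    """One switch port: a single untagged (primary) VLAN plus optional tagged VLANs."""
    port: int
    untagged_vlan: str
    tagged_vlans: tuple[str, ...] = ()

    @property
    def mode(self) -> str:
        # A port carrying any tagged VLANs is treated as a trunk in this sketch.
        return "trunk" if self.tagged_vlans else "access"


def cloudvirt_port(rack: str, port: int) -> SwitchPort:
    """Build the assumed port layout for a cloudvirt host in the given rack.

    The VLAN names below are placeholders, not real identifiers.
    """
    return SwitchPort(
        port=port,
        untagged_vlan=f"cloud-hosts-{rack}",  # rack-specific primary VLAN
        tagged_vlans=("cloud-instances", "cloud-private"),
    )


example = cloudvirt_port("e4", 35)
print(example.mode, example.untagged_vlan, example.tagged_vlans)
```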

Hey @taavi and @cmooney

Just wanted to see if there was a timeframe on this move. Like, a specific time when we know the servers aren't in use and can be powered down? Let us know and we can schedule it. Thank you!

> Just wanted to see if there was a timeframe on this move. Like, a specific time when we know the servers aren't in use and can be powered down? Let us know and we can schedule it. Thank you!

I've already shut these down, they can be moved any time. Thanks!

New locations are as follows

cloudvirt-wdqs1001 - E 4. U 18. port 35. CableID 70824500012

cloudvirt-wdqs1002 - F 4. U 19. port 35. CableID 20220058

> New locations are as follows
>
> cloudvirt-wdqs1001 - E 4. U 18. port 35. CableID 70824500012
>
> cloudvirt-wdqs1002 - F 4. U 19. port 35. CableID 20220058

Thanks Valerie. As @Jclark-ctr pointed out to me, I think there is a problem with placing these hosts in those locations: they are only 1G hosts, and the switches in racks E4/F4 have a constraint which doesn't let us mix and match port speeds (see T349735).

@Jclark-ctr I think you said you managed to find two 10G NICs, perhaps we could leave these two where they are now and use those?

> cloudvirt-wdqs1002 - F 4. U 19. port 35. CableID 20220058

Thanks! I'm getting a duplicate cable ID alert for this one. It looks like that ID is already used on cloudvirt1053 (https://netbox.wikimedia.org/dcim/cables/5400/)?
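The kind of check behind a duplicate cable ID alert can be sketched as follows. This is a simplified illustration and not the actual Netbox report: it assumes the input is a flat list of (cable ID, device) pairs, whereas real cable records have two ends. The sample pairs are the ones mentioned in this ticket.

```python
from collections import defaultdict


def duplicate_cable_ids(cables):
    """Return {cable_id: sorted device names} for IDs attached to more than one device.

    `cables` is an iterable of (cable_id, device) pairs, a simplification
    assumed for this sketch.
    """
    devices_by_id = defaultdict(set)
    for cable_id, device in cables:
        devices_by_id[cable_id].add(device)
    return {
        cable_id: sorted(devices)
        for cable_id, devices in devices_by_id.items()
        if len(devices) > 1
    }


CABLES = [
    ("70824500012", "cloudvirt-wdqs1001"),
    ("20220058", "cloudvirt-wdqs1002"),
    ("20220058", "cloudvirt1053"),
]

print(duplicate_cable_ids(CABLES))
# {'20220058': ['cloudvirt-wdqs1002', 'cloudvirt1053']}
```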

Cookbook cookbooks.sre.hosts.reimage was started by taavi@cumin1001 for host cloudvirt-wdqs1001.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by taavi@cumin1001 for host cloudvirt-wdqs1001.eqiad.wmnet with OS bookworm executed with errors:

  • cloudvirt-wdqs1001 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by taavi@cumin1001 for host cloudvirt-wdqs1001.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by taavi@cumin1001 for host cloudvirt-wdqs1001.eqiad.wmnet with OS bookworm completed:

  • cloudvirt-wdqs1001 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202310262014_taavi_970775_cloudvirt-wdqs1001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

cloudvirt-wdqs1003 has been relocated

cloudvirt-wdqs1003 - C 8. U 21. port 18. CableID 4015

Side note, we had to use a 1 Gig connection since there were no extra 10G cards available to install on this host.

Cookbook cookbooks.sre.hosts.reimage was started by taavi@cumin1001 for host cloudvirt-wdqs1003.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by taavi@cumin1001 for host cloudvirt-wdqs1003.eqiad.wmnet with OS bookworm completed:

  • cloudvirt-wdqs1003 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202310301602_taavi_3554895_cloudvirt-wdqs1003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage was started by taavi@cumin1001 for host cloudvirt-wdqs1002.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by taavi@cumin1001 for host cloudvirt-wdqs1002.eqiad.wmnet with OS bookworm executed with errors:

  • cloudvirt-wdqs1002 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host cloudvirt-wdqs1002.eqiad.wmnet with OS bookworm

@VRiley-WMF cloudvirt-wdqs1002 is showing a media/cable failure when it tries to boot over network:

(Screenshot attached: image.png)

That could be that the NIC we had is faulty, but probably more likely to be the DAC cable. Can you check it's properly connected both sides, and if so maybe replace it with another one? Thanks.

> cloudvirt-wdqs1003 has been relocated
>
> cloudvirt-wdqs1003 - C 8. U 21. port 18. CableID 4015
>
> Side note, we had to use a 1 Gig connection since there were no extra 10G cards available to install on this host.

That's fine, cloudsw1-c8-eqiad is a slightly older model and doesn't have the restriction about setting port speeds in blocks of 4.
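To illustrate the "blocks of 4" restriction on the newer switches, here is a minimal sketch that flags blocks in which ports would need different speeds. It rests on assumptions not stated in this ticket: that ports are numbered from 0 and that blocks are aligned to multiples of 4 (so port 35 would share a block with 32-34). The real behaviour is described in T349735; the speeds in the example are hypothetical.

```python
from collections import defaultdict

BLOCK_SIZE = 4  # "blocks of 4", per the comment above


def port_block(port: int, block_size: int = BLOCK_SIZE) -> range:
    """Return the ports assumed to share one speed setting with `port`."""
    start = (port // block_size) * block_size
    return range(start, start + block_size)


def speed_conflicts(port_speeds: dict[int, int]) -> dict[int, set[int]]:
    """Map block start -> speeds (Mbit/s) for blocks that mix more than one speed."""
    speeds_by_block = defaultdict(set)
    for port, speed in port_speeds.items():
        speeds_by_block[port_block(port).start].add(speed)
    return {
        start: speeds
        for start, speeds in speeds_by_block.items()
        if len(speeds) > 1
    }


# Hypothetical example: a 1G host on port 35 next to a 10G host on port 32.
print(list(port_block(35)))                     # [32, 33, 34, 35]
print(speed_conflicts({32: 10000, 35: 1000}))   # block 32 mixes two speeds
```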

@cmooney I have replaced the DAC cable and updated Netbox with the CableID; also I reseated the NIC for good measure. It is plugged into the same port (35).

Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host cloudvirt-wdqs1002.eqiad.wmnet with OS bookworm executed with errors:

  • cloudvirt-wdqs1002 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host cloudvirt-wdqs1002.eqiad.wmnet with OS bookworm

@Jclark-ctr had a look and found that the NIC riser card wasn't properly seated. After re-seating the card, the server connection seems to be working; the OS is currently installing.

Icinga downtime and Alertmanager silence (ID=b9bd4e38-25ed-4ed0-bdf7-47bd52027bdc) set by cmooney@cumin1001 for 1:00:00 on 1 host(s) and their services with reason: moving switch link from NIC port 2 to port 1

cloudvirt-wdqs1001.eqiad.wmnet

Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host cloudvirt-wdqs1002.eqiad.wmnet with OS bookworm completed:

  • cloudvirt-wdqs1002 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311011659_cmooney_789096_cloudvirt-wdqs1002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

I believe this is all done. Thank you everyone!