We need to move cloudvirt-wdqs hosts to WMCS-dedicated racks with cloudsw devices so that they can connect to the new cloud-private network.
https://netbox.wikimedia.org/search/?q=cloudvirt-wdqs&obj_type=
| taavi | |
| Sep 20 2023, 5:04 PM |
| F41426317: image.png | |
| Nov 1 2023, 2:03 PM |
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | aborrero | T296411 cloud: decide on general idea for having cloud-dedicated hardware provide service in the cloud realm & the internet |
| Resolved | | aborrero | T297596 have cloud hardware servers in the cloud realm using a dedicated LB layer |
| Resolved | | taavi | T341060 openstack eqiad1: introduce cloud-private and cloudlb |
| Resolved | | Jclark-ctr | T341494 cloud @ eqiad: hardware re-racking plan |
| Resolved | | taavi | T346651 cloudvirt: eqiad1: connect them to cloud-private |
| Resolved | | Jclark-ctr | T346948 Move cloudvirt-wdqs hosts |
@bking two questions
I think we can also solve this in the short term by allowing cloud-host vlan traffic to openstack.eqiad1.wikimediacloud.org via CR firewall in homer.
Change 959706 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):
[operations/homer/public@master] policies/cr-labs: refresh openstack API endpoints
Change 959706 merged by Arturo Borrero Gonzalez:
[operations/homer/public@master] policies/cr-labs: refresh openstack API endpoints
Mentioned in SAL (#wikimedia-operations) [2023-09-21T11:08:31Z] <arturo> merging homer CR firewall patch https://gerrit.wikimedia.org/r/c/operations/homer/public/+/959706 for T346948
Change 959955 had a related patch set uploaded (by Majavah; author: Majavah):
[operations/puppet@production] Fix puppet on cloudvirt-wdqs* until they have been moved
Change 959955 merged by Majavah:
[operations/puppet@production] Fix puppet on cloudvirt-wdqs* until they have been moved
We still want to keep these servers, as they may be used in our graph splitting experiment. Tagging my team lead @Gehel in case he has a different opinion.
- Right now there's only one VM living on that cluster, 'vespa01'. Does that server need to survive the move documented in this ticket, or can we delete it and leave it for you to rebuild after?
I'm assuming no, but @EBernhardson can confirm/deny.
If you don't hear back within a week, feel free to shut down and move the hosts. If it's more urgent than that, ping us in Wikimedia-Search IRC and we'll get back to you as soon as we can. Sorry for not seeing the ping earlier!
I believe this is correct, we will need the instances for some of the evaluations and testing around graph splitting.
- Right now there's only one VM living on that cluster, 'vespa01'. Does that server need to survive the move documented in this ticket, or can we delete it and leave it for you to rebuild after?
I'm assuming no, but @EBernhardson can confirm/deny.
The vespa instance can be deleted at any time, it's a test instance that can easily be recreated as necessary.
If you don't hear back within a week, feel free to shut down and move the hosts. If it's more urgent than that, ping us in Wikimedia-Search IRC and we'll get back to you as soon as we can. Sorry for not seeing the ping earlier!
Thanks!
taavi@cloudcontrol1006 ~ $ os server list --long --all-projects --host cloudvirt-wdqs1001

| ID | Name | Status | Task State | Power State | Networks | Image Name | Image ID | Flavor Name | Flavor ID | Availability Zone | Host | Properties |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| d9e946ff-cb7e-49a3-9ca1-b1def7f76a40 | vespa01 | ACTIVE | None | Running | lan-flat-cloudinstances2b=172.16.2.195 | debian-11.0-bullseye (deprecated 2023-06-08) | e69cb6f7-e5c7-41de-b08d-8e5739c20de3 | t206636v2 | 524457d1-e000-42a8-a4af-8e4659e966af | nova | cloudvirt-wdqs1001 | |
| 9e4a229b-386f-47f9-a62f-50f3b518559e | canary-wdqs1001-2 | ACTIVE | None | Running | lan-flat-cloudinstances2b=172.16.6.108 | debian-11.0-bullseye (deprecated 2023-01-12) | 9a01d3d8-e793-4775-8b81-434f68c687a7 | g3.cores1.ram1.disk20 | bf48880d-0c1b-4c2a-8e8b-778d28b16561 | nova | cloudvirt-wdqs1001 | description='canary VM' |

taavi@cloudcontrol1006 ~ $ os server list --long --all-projects --host cloudvirt-wdqs1002

| ID | Name | Status | Task State | Power State | Networks | Image Name | Image ID | Flavor Name | Flavor ID | Availability Zone | Host | Properties |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 775079dc-6959-47e6-8915-f527626b6ede | canary-wdqs1002-2 | ACTIVE | None | Running | lan-flat-cloudinstances2b=172.16.5.45 | debian-11.0-bullseye (deprecated 2023-01-12) | 9a01d3d8-e793-4775-8b81-434f68c687a7 | g3.cores1.ram1.disk20 | bf48880d-0c1b-4c2a-8e8b-778d28b16561 | nova | cloudvirt-wdqs1002 | description='canary VM' |

taavi@cloudcontrol1006 ~ $ os server list --long --all-projects --host cloudvirt-wdqs1003

| ID | Name | Status | Task State | Power State | Networks | Image Name | Image ID | Flavor Name | Flavor ID | Availability Zone | Host | Properties |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| e51c094f-95a6-43f6-bea9-3f806769faee | canary-wdqs1003-2 | ACTIVE | None | Running | lan-flat-cloudinstances2b=172.16.4.236 | debian-11.0-bullseye (deprecated 2023-01-12) | 9a01d3d8-e793-4775-8b81-434f68c687a7 | g3.cores1.ram1.disk20 | bf48880d-0c1b-4c2a-8e8b-778d28b16561 | nova | cloudvirt-wdqs1003 | description='canary VM' |
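For anyone repeating this check on other hosts, the listings above can be scripted. This is only a minimal sketch: it assumes the standard `openstack` client accepts the same arguments as the `os` command used above, and that its JSON output carries a `Name` key matching the column header. The function names and the `canary-` prefix filter are illustrative and not part of any existing tooling.

```python
import json
import subprocess


def non_canary_servers(servers):
    """Return the names of servers that are not canary VMs.

    Assumes each entry is a dict with a "Name" key, mirroring the
    "Name" column in the listings above.
    """
    return [s["Name"] for s in servers if not s["Name"].startswith("canary-")]


def list_servers(host):
    """Run the OpenStack CLI for one hypervisor and parse its JSON output."""
    result = subprocess.run(
        [
            "openstack", "server", "list",
            "--long", "--all-projects",
            "--host", host,
            "-f", "json",
        ],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

Calling `non_canary_servers(list_servers("cloudvirt-wdqs1001"))` on a host with working credentials would then list only the VMs that need a decision before the move.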
Mentioned in SAL (#wikimedia-cloud) [2023-10-03T13:08:21Z] <taavi> remove canary VMs from cloudvirt-wdqs hosts T346948
cookbooks.sre.hosts.decommission executed by taavi@cumin1001 for hosts: cloudvirt-wdqs1002.eqiad.wmnet
cookbooks.sre.hosts.decommission executed by taavi@cumin1001 for hosts: cloudvirt-wdqs1003.eqiad.wmnet
cookbooks.sre.hosts.decommission executed by taavi@cumin1001 for hosts: cloudvirt-wdqs1001.eqiad.wmnet
Hi! Can we please have cloudvirt-wdqs100[1-3] moved to the WMCS racks, preferably E4 or F4? They will all need a single connection to the rack-specific cloud-hosts VLAN.
@Jclark-ctr these need a single NIC connected to the cloud-hosts as the primary VLAN, and cloud-instances and cloud-private VLANs trunked (we can take care of those).
New locations are as follows
cloudvirt-wdqs1001 - E 4. U 18. port 35. CableID 70824500012
cloudvirt-wdqs1002 - F 4. U 19. port 35. CableID 20220058
Thanks Valerie. I think, as @Jclark-ctr pointed out to me, there is a problem with placing these hosts in those locations: they are only 1G hosts, and the switches in racks E4/F4 have a constraint which doesn't let us mix-and-match port speeds (see T349735).
@Jclark-ctr I think you said you managed to find two 10G NICs, perhaps we could leave these two where they are now and use those?
Thanks! I'm getting a duplicate cable ID alert for this one - looks like that ID is already used on cloudvirt1053 (https://netbox.wikimedia.org/dcim/cables/5400/)?
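As an illustration of what such an alert checks, here is a minimal sketch that finds cable IDs shared by more than one host. It works on a plain list of `(host, cable_id)` pairs; the host names and IDs in the example are made up, and this is not how the Netbox report itself is implemented.

```python
from collections import defaultdict


def duplicate_cable_ids(cables):
    """Map each cable ID that appears more than once to the hosts using it."""
    hosts_by_id = defaultdict(list)
    for host, cable_id in cables:
        hosts_by_id[cable_id].append(host)
    return {cid: hosts for cid, hosts in hosts_by_id.items() if len(hosts) > 1}


# Hypothetical inventory; these hosts and IDs are invented for illustration.
example = [
    ("host-a", "1001"),
    ("host-b", "1002"),
    ("host-c", "1001"),
]
print(duplicate_cable_ids(example))  # {'1001': ['host-a', 'host-c']}
```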
Cookbook cookbooks.sre.hosts.reimage was started by taavi@cumin1001 for host cloudvirt-wdqs1001.eqiad.wmnet with OS bookworm
Cookbook cookbooks.sre.hosts.reimage started by taavi@cumin1001 for host cloudvirt-wdqs1001.eqiad.wmnet with OS bookworm executed with errors:
Cookbook cookbooks.sre.hosts.reimage was started by taavi@cumin1001 for host cloudvirt-wdqs1001.eqiad.wmnet with OS bookworm
Cookbook cookbooks.sre.hosts.reimage started by taavi@cumin1001 for host cloudvirt-wdqs1001.eqiad.wmnet with OS bookworm completed:
cloudvirt-wdqs1003 has been relocated
cloudvirt-wdqs1003 - C 8. U 21. port 18. CableID 4015
Side note, we had to use a 1 Gig connection since there were no extra 10G cards available to install on this host.
Cookbook cookbooks.sre.hosts.reimage was started by taavi@cumin1001 for host cloudvirt-wdqs1003.eqiad.wmnet with OS bookworm
Cookbook cookbooks.sre.hosts.reimage started by taavi@cumin1001 for host cloudvirt-wdqs1003.eqiad.wmnet with OS bookworm completed:
Cookbook cookbooks.sre.hosts.reimage was started by taavi@cumin1001 for host cloudvirt-wdqs1002.eqiad.wmnet with OS bookworm
Cookbook cookbooks.sre.hosts.reimage started by taavi@cumin1001 for host cloudvirt-wdqs1002.eqiad.wmnet with OS bookworm executed with errors:
Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host cloudvirt-wdqs1002.eqiad.wmnet with OS bookworm
@VRiley-WMF cloudvirt-wdqs1002 is showing a media/cable failure when it tries to boot over network:
That could be that the NIC we had is faulty, but probably more likely to be the DAC cable. Can you check it's properly connected both sides, and if so maybe replace it with another one? Thanks.
That's fine, cloudsw1-c8-eqiad is a slightly older model and doesn't have the restriction about setting port speeds in blocks of 4.
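To make the "blocks of 4" restriction concrete, here is a minimal sketch. It assumes ports are grouped into aligned blocks of four (0-3, 4-7, and so on) and that every port in a block must run at the same speed. The real grouping and port numbering on the E4/F4 switches is not stated in this task, so the layout below is hypothetical.

```python
BLOCK_SIZE = 4  # from the "blocks of 4" restriction; the alignment is an assumption


def block_of(port):
    """Return the ports sharing a speed block with `port` (assumed 0-based, aligned)."""
    start = (port // BLOCK_SIZE) * BLOCK_SIZE
    return list(range(start, start + BLOCK_SIZE))


def can_attach(port, speed, configured):
    """Check whether `port` can run at `speed` given the speeds already set.

    `configured` maps port number -> speed string (e.g. "1G", "10G") for
    ports that are already in use; unused ports impose no constraint.
    """
    return all(
        configured.get(p, speed) == speed
        for p in block_of(port)
        if p != port
    )
```

Under these assumptions, `can_attach(35, "1G", {32: "10G"})` is false because port 35 would share a block with a 10G port, which is the situation a 1G host runs into on the newer switches.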
@cmooney I have replaced the DAC cable and updated Netbox with the CableID; also I reseated the NIC for good measure. It is plugged into the same port (35).
Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host cloudvirt-wdqs1002.eqiad.wmnet with OS bookworm executed with errors:
Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host cloudvirt-wdqs1002.eqiad.wmnet with OS bookworm
@Jclark-ctr had a look and found that the NIC riser card wasn't properly seated. After re-seating the card the server connection seems to be working, and the OS is currently installing.
Icinga downtime and Alertmanager silence (ID=b9bd4e38-25ed-4ed0-bdf7-47bd52027bdc) set by cmooney@cumin1001 for 1:00:00 on 1 host(s) and their services with reason: moving switch link from NIC port 2 to port 1
cloudvirt-wdqs1001.eqiad.wmnet
Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host cloudvirt-wdqs1002.eqiad.wmnet with OS bookworm completed: