cloudrabbit: connect them via cloudsw and cloud-private
Closed, Resolved (Public)

Description

Now that the work in T341060 (openstack eqiad1: introduce cloud-private and cloudlb) has enabled cloud-private in eqiad1, there is no longer any reason for the cloudrabbit hosts to have a .wikimedia.org domain with a public IPv4 address.

As of today, this affects:

host             original rack  new rack  status
cloudrabbit1001  B2             TBD       ready
cloudrabbit1002  C4             E4        ready
cloudrabbit1003  D4             F4        done

We may consider:

  • reracking each server into a cloudsw-enabled rack
  • reimaging each server with a new IP address and a new domain, .eqiad.wmnet
  • enabling cloud-private on them
  • enabling connectivity with rabbitmq via cloud-private
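
As a rough illustration of the goal above, the following sketch flags a host that still sits on the public domain or a globally routable address. The function name `needs_migration` and the IP addresses in the usage lines are placeholders made up for this example; they are not the real cloudrabbit addresses, and this is not part of any existing tooling.

```python
import ipaddress


def needs_migration(fqdn: str, address: str) -> bool:
    """Return True if the host still uses the public domain or a public IP.

    Hypothetical helper: a host is considered migrated only when it is
    off the .wikimedia.org domain and its address is not globally routable.
    """
    ip = ipaddress.ip_address(address)
    on_public_domain = fqdn.endswith(".wikimedia.org")
    return on_public_domain or ip.is_global


# Placeholder addresses, chosen only to exercise both branches.
print(needs_migration("cloudrabbit1001.wikimedia.org", "8.8.8.8"))
print(needs_migration("cloudrabbit1001.eqiad.wmnet", "10.0.0.1"))
```

The check is deliberately strict: either a public domain or a public address alone is enough to mark the host as not yet migrated.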

Event Timeline

Change 959198 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] hieradata: fix eqiad1 rabbitmq firewall rules

https://gerrit.wikimedia.org/r/959198

Change 959198 merged by Majavah:

[operations/puppet@production] hieradata: fix eqiad1 rabbitmq firewall rules

https://gerrit.wikimedia.org/r/959198

Thanks for the task! Yeah these actually kicked off the whole project to some extent, so good to move them.

I gather they don't need to go behind the cloudlb, or use a shared VIP? If so, hopefully it'd be straightforward to re-provision with the new network setup and make the RabbitMQ service available on cloud-private.

We should move any other hosts that need them to the cloud-specific racks before proceeding.

I'll need to double-check but I think it's enough that everything else can talk to the Rabbit ports on the cloud-private addresses. The remaining cloudcontrols and cloudvirt-wdqs hosts will indeed need to be moved first.

taavi edited projects, added Cloud-VPS; removed User-aborrero.

This can move forward now, although due to the nature of Rabbit this needs to be coordinated to avoid unnecessary downtime. It's fine if this goes to January, I'd much prefer that to a non-coordinated move next week. cloudrabbit1001 is the current cluster leader so it should be moved last.

These hosts will need a single connection with the rack-specific cloud-hosts VLAN as primary and cloud-private trunked. They need to be spread out so there's no more than one node per rack.
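
The two constraints in this comment (move the cluster leader last, and place at most one node per rack) can be sketched as a small pre-flight check. This is only an illustration under assumptions: the function names are hypothetical, and the rack values in the usage lines are the final locations recorded later in this task.

```python
from collections import Counter


def move_order(hosts: list[str], leader: str) -> list[str]:
    """Return the hosts in their given order, with the leader moved to the end."""
    if leader not in hosts:
        raise ValueError(f"{leader} is not in the host list")
    return [h for h in hosts if h != leader] + [leader]


def racks_with_multiple_nodes(placement: dict[str, str]) -> list[str]:
    """Return the racks that would hold more than one node (empty if none)."""
    counts = Counter(placement.values())
    return sorted(rack for rack, n in counts.items() if n > 1)


hosts = ["cloudrabbit1001", "cloudrabbit1002", "cloudrabbit1003"]
print(move_order(hosts, leader="cloudrabbit1001"))

# Rack assignments as recorded later in this task.
placement = {
    "cloudrabbit1001": "C8",
    "cloudrabbit1002": "E4",
    "cloudrabbit1003": "F4",
}
print(racks_with_multiple_nodes(placement))
```

An empty list from `racks_with_multiple_nodes` means the proposed placement satisfies the one-node-per-rack requirement.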

Nice! So the next step here is to decom the current ones and then sync up with DCops to move them to the proper racks. From there we can re-provision them with the proper network settings.

Change 989196 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/dns@master] wikimediacloud.org: point rabbitmq03.eqiad1 to cloudrabbit1002

https://gerrit.wikimedia.org/r/989196

Change 989196 merged by Majavah:

[operations/dns@master] wikimediacloud.org: point rabbitmq03.eqiad1 to cloudrabbit1002

https://gerrit.wikimedia.org/r/989196

cookbooks.sre.hosts.decommission executed by taavi@cumin1002 for hosts: cloudrabbit1003.wikimedia.org

  • cloudrabbit1003.wikimedia.org (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

cloudrabbit1003 is now ready to be moved.

taavi removed taavi as the assignee of this task. (Jan 9 2024, 4:24 PM)
taavi added a project: DC-Ops.

Change 992234 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] site: Add cloudrabbit1003 as insetup

https://gerrit.wikimedia.org/r/992234

Change 992234 merged by Majavah:

[operations/puppet@production] site: Add cloudrabbit1003 as insetup

https://gerrit.wikimedia.org/r/992234

Physically moved the server to F4, U18. Port 4 CableID 2M-20220019

Change 992245 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/homer/public@master] cr-labs: Add temporary term for cloudrabbit hosts

https://gerrit.wikimedia.org/r/992245

Change 992245 merged by jenkins-bot:

[operations/homer/public@master] cr-labs: Add temporary term for cloudrabbit hosts

https://gerrit.wikimedia.org/r/992245

Change 992884 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/dns@master] wikimediacloud.org: Move RabbitMQ traffic to cloudrabbit1003

https://gerrit.wikimedia.org/r/992884

Change 992884 merged by Majavah:

[operations/dns@master] wikimediacloud.org: Move RabbitMQ traffic to cloudrabbit1003

https://gerrit.wikimedia.org/r/992884

Mentioned in SAL (#wikimedia-cloud) [2024-01-25T16:48:36Z] <andrewbogott> taavi just moved all rabbitmq traffic to cloudrabbit1003 as part of T345610

cookbooks.sre.hosts.decommission executed by taavi@cumin1002 for hosts: cloudrabbit[1001-1002].wikimedia.org

  • cloudrabbit1001.wikimedia.org (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • cloudrabbit1002.wikimedia.org (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

VRiley-WMF updated the task description.

cloudrabbit1002 is now in E4, U17. CableID 2M-20220016, Port 3

cloudrabbit1001 is now in C8, U19. CableID 5336, Port 21

Change 993026 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] Move cloudrabbit1001/2 to private vlan

https://gerrit.wikimedia.org/r/993026

Change 993026 merged by Majavah:

[operations/puppet@production] Move cloudrabbit1001/2 to private vlan

https://gerrit.wikimedia.org/r/993026

Mentioned in SAL (#wikimedia-cloud) [2024-01-26T09:01:15Z] <taavi> joining cloudrabbit1001/2 to the cluster on 1003 T345610

taavi claimed this task.

Change 993062 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/dns@master] wikimediacloud.org: Move Rabbit traffic back to all nodes

https://gerrit.wikimedia.org/r/993062

Change 993062 merged by Majavah:

[operations/dns@master] wikimediacloud.org: Move Rabbit traffic back to all nodes

https://gerrit.wikimedia.org/r/993062

Change 998419 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:openstack: rabbitmq: cleanup rabbitmq firewall

https://gerrit.wikimedia.org/r/998419