
EQSIN: Setup VRRP on both routers for the new subnets
Closed, Resolved (Public)

Description

In this task we will set up VRRP on both routers for the new subnets in rack 0604. This will allow the traffic team to re-image nodes in the rack before we receive and install the new switches.

All the steps below need to be done in Netbox; afterwards, run homer on the core routers.

  • Create the FHRP groups for the new VLANs
    • Group ID 3 for public1-604-eqsin (IPv4 and IPv6): use IP addresses 103.102.166.33/27 and 2001:df2:e500:2::1/64
    • Group ID 4 for private1-604-eqsin (IPv4 and IPv6): use IP addresses 10.132.1.1/24 and 2001:df2:e500:102::1/64
  • Create the VLANs
    • public1-604-eqsin with VLAN ID 512
    • private1-604-eqsin with VLAN ID 522
  • Create the prefixes
    • 103.102.166.32/27: assign it to VLAN public1-604-eqsin (512)
    • 2001:df2:e500:2::/64: assign it to VLAN public1-604-eqsin (512)
    • 10.132.1.0/24: assign it to VLAN private1-604-eqsin (522)
    • 2001:df2:e500:102::/64: assign it to VLAN private1-604-eqsin (522)
  • Set up VRRP on routers

For cr2-eqsin

  • Create interface ae1.512 for public1-604-eqsin with ip address 103.102.166.34/27 and 2001:df2:e500:2::2/64
  • Create interface ae1.522 for private1-604-eqsin with ip address 10.132.1.2/24 and 2001:df2:e500:102::2/64
  • Assign the FHRP group 3 with the virtual IP 103.102.166.33/27 to interface ae1.512 with priority 90
  • Assign the FHRP group 3 with the virtual IP 2001:df2:e500:2::1/64 to interface ae1.512 with priority 110
  • Assign the FHRP group 4 with the virtual IP 10.132.1.1/24 to interface ae1.522 with priority 90
  • Assign the FHRP group 4 with the virtual IP 2001:df2:e500:102::1/64 to interface ae1.522 with priority 110

For cr3-eqsin

  • Create interface ae1.512 for public1-604-eqsin with ip address 103.102.166.35/27 and 2001:df2:e500:2::3/64
  • Create interface ae1.522 for private1-604-eqsin with ip address 10.132.1.3/24 and 2001:df2:e500:102::3/64
  • Assign the FHRP group 3 with the virtual IP 103.102.166.33/27 to interface ae1.512 with priority 110
  • Assign the FHRP group 3 with the virtual IP 2001:df2:e500:2::1/64 to interface ae1.512 with priority 90
  • Assign the FHRP group 4 with the virtual IP 10.132.1.1/24 to interface ae1.522 with priority 110
  • Assign the FHRP group 4 with the virtual IP 2001:df2:e500:102::1/64 to interface ae1.522 with priority 90
  • On asw-0604-eqsin
    • Add VLAN public1-604-eqsin to interfaces ae1 and ae2
    • Add VLAN private1-604-eqsin to interfaces ae1 and ae2
  • Homer
    • Add the new VLANs to the data.yaml file
    • Add interfaces ae1.512 and ae1.522 to the DHCP relay in site.yaml under dhcp_relay_ra
  • Test re-image
  • Puppet
    • Update hieradata/common.yaml with the new IPs
  • mediawiki-config
    • Add new IP subnets/prefixes to wmf-config/reverse-proxy.php
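
As a sanity check before entering the plan above in Netbox, the addresses and priorities can be written down as data and verified. The following is only a minimal sketch: the PLAN structure and the helper names check_plan and expected_master are made up for illustration and are not part of Netbox or homer. It assumes the usual VRRP behaviour where, with both routers healthy and preemption enabled, the router with the higher priority becomes master.

```python
import ipaddress

# Values copied from the task description; the dict layout itself is hypothetical.
PLAN = {
    "public1-604-eqsin": {
        "prefixes": ["103.102.166.32/27", "2001:df2:e500:2::/64"],
        "vips": ["103.102.166.33/27", "2001:df2:e500:2::1/64"],
        "routers": {
            "cr2-eqsin": {"addrs": ["103.102.166.34/27", "2001:df2:e500:2::2/64"],
                          "priority": {4: 90, 6: 110}},
            "cr3-eqsin": {"addrs": ["103.102.166.35/27", "2001:df2:e500:2::3/64"],
                          "priority": {4: 110, 6: 90}},
        },
    },
    "private1-604-eqsin": {
        "prefixes": ["10.132.1.0/24", "2001:df2:e500:102::/64"],
        "vips": ["10.132.1.1/24", "2001:df2:e500:102::1/64"],
        "routers": {
            "cr2-eqsin": {"addrs": ["10.132.1.2/24", "2001:df2:e500:102::2/64"],
                          "priority": {4: 90, 6: 110}},
            "cr3-eqsin": {"addrs": ["10.132.1.3/24", "2001:df2:e500:102::3/64"],
                          "priority": {4: 110, 6: 90}},
        },
    },
}


def check_plan(plan):
    """Return a list of human-readable problems; an empty list means the plan is consistent."""
    problems = []
    for vlan, data in plan.items():
        # One prefix per IP version (4 or 6) for each VLAN.
        networks = {n.version: n for n in map(ipaddress.ip_network, data["prefixes"])}
        owners = [("VIP", a) for a in data["vips"]]
        for router, cfg in data["routers"].items():
            owners += [(router, a) for a in cfg["addrs"]]
        seen = set()
        for owner, addr in owners:
            iface = ipaddress.ip_interface(addr)
            if iface.network != networks[iface.version]:
                problems.append(f"{vlan}: {owner} {addr} is not in {networks[iface.version]}")
            if iface.ip in seen:
                problems.append(f"{vlan}: {addr} is assigned more than once")
            seen.add(iface.ip)
    return problems


def expected_master(plan, vlan, version):
    """Router with the highest planned priority for this VLAN and IP version."""
    routers = plan[vlan]["routers"]
    return max(routers, key=lambda name: routers[name]["priority"][version])


if __name__ == "__main__":
    print(check_plan(PLAN))
    for vlan in PLAN:
        for version in (4, 6):
            print(vlan, f"IPv{version}", expected_master(PLAN, vlan, version))
```

With the priorities as planned, this reports cr3-eqsin as the expected IPv4 master and cr2-eqsin as the expected IPv6 master for both VLANs, once both routers are configured.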

Event Timeline

That looks good to me @Papaul good stuff.

If we use vlan IDs 512/522 I guess the plan would be to change the vlan id for the existing vlan when we rename it? That should be ok, I'd recommend we do that during our actual window when migrating to new switches (just in case rename/id change causes a blip in traffic).

@cmooney yes, we will change the VLAN ID and rename the VLAN for rack 0603 during the switch migration, so they will be 511 and 521. See https://phabricator.wikimedia.org/T418439 for the irb interface creation.

VRRP is up on cr2-eqsin

cr2-eqsin> show interfaces terse | match "ae1.512|ae1.522"    
et-0/0/1.512            up    up   aenet    --> ae1.512
et-0/0/1.522            up    up   aenet    --> ae1.522
ae1.512                 up    up   inet     103.102.166.33/27
ae1.522                 up    up   inet     10.132.1.1/24
cr2-eqsin> show vrrp brief | match "ae1.512|ae1.522" 
ae1.512       up              3   master   Active      A  0.315 lcl    103.102.166.34 
ae1.512       up              3   master   Active      A  0.146 lcl    2001:df2:e500:2::2
ae1.522       up              4   master   Active      A  0.172 lcl    10.132.1.2     
ae1.522       up              4   master   Active      A  0.706 lcl    2001:df2:e500:102::2
cr2-eqsin> show vrrp summary | match "ae1.512|ae1.522"
ae1.512       up              3   master          Active    lcl    103.102.166.34     
ae1.512       up              3   master          Active    lcl    2001:df2:e500:2::2 
ae1.522       up              4   master          Active    lcl    10.132.1.2         
ae1.522       up              4   master          Active    lcl    2001:df2:e500:102::2
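
If this check needs repeating, the pasted summary lines can be parsed with a few lines of Python. This is a rough sketch only: the column names are my assumption (the header row is filtered out by the match above), and parse_vrrp_summary and not_master are hypothetical helper names, not part of any existing tooling.

```python
# Lines copied from the "show vrrp summary" output above.
SUMMARY = """\
ae1.512       up              3   master          Active    lcl    103.102.166.34
ae1.512       up              3   master          Active    lcl    2001:df2:e500:2::2
ae1.522       up              4   master          Active    lcl    10.132.1.2
ae1.522       up              4   master          Active    lcl    2001:df2:e500:102::2
"""

# Assumed column order; the real header line is not shown in the paste.
COLUMNS = ("interface", "state", "group", "vr_state", "vr_mode", "type", "address")


def parse_vrrp_summary(text):
    """Turn whitespace-separated summary lines into dicts, skipping anything else."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != len(COLUMNS):
            continue  # blank line or unexpected format
        row = dict(zip(COLUMNS, fields))
        row["group"] = int(row["group"])
        rows.append(row)
    return rows


def not_master(rows):
    """Rows whose VR state is anything other than 'master'."""
    return [r for r in rows if r["vr_state"] != "master"]


if __name__ == "__main__":
    rows = parse_vrrp_summary(SUMMARY)
    print(len(rows), "entries,", len(not_master(rows)), "not master")
```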

Change #1294487 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] Add new Eqsin subnet

https://gerrit.wikimedia.org/r/1294487

@BCornwall hello, can you please provide me with one CP node in rack 604 that I can use later today to test the re-image on the new subnet?
https://netbox.wikimedia.org/dcim/racks/78/
Thanks

Papaul updated the task description.

Hi, @Papaul: cp5032 is depooled/downtimed and ready for reimaging.

@BCornwall thank you, I will do that after lunch; I'm doing some onsite work.

Change #1294487 merged by Papaul:

[operations/puppet@production] Add new Eqsin subnet

https://gerrit.wikimedia.org/r/1294487

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cp5032.eqsin.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cp5032.eqsin.wmnet with OS trixie executed with errors:

  • cp5032 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cp5032.eqsin.wmnet" to get a root shell, but depending on the failure this may not work.

Mentioned in SAL (#wikimedia-operations) [2026-05-28T21:04:06Z] <pt1979@cumin2002> START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "setup new eqsin vlan - pt1979@cumin2002 - T427393"

Mentioned in SAL (#wikimedia-operations) [2026-05-28T21:04:11Z] <pt1979@cumin2002> END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "setup new eqsin vlan - pt1979@cumin2002 - T427393"

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cp5032.eqsin.wmnet with OS trixie

Change #1295101 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/homer/public@master] Add interfaces ae1.512 and 522 to dhcp relay

https://gerrit.wikimedia.org/r/1295101

Change #1295101 merged by Papaul:

[operations/homer/public@master] Add interfaces ae1.512 and 522 to dhcp relay

https://gerrit.wikimedia.org/r/1295101

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cp5032.eqsin.wmnet with OS trixie completed:

  • cp5032 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202605290118_pt1979_2260576_cp5032.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

@BCornwall re-image done on cp5032. The node is now on the new private1-604-eqsin VLAN. The DHCP issue I was having was that ae1.522 was not in the DHCP relay, so I made a patch to add both ae1.512 and ae1.522.

Sweet, thanks! I'll re-pool tomorrow.

Please see the steps below before re-imaging a node into the new VLAN.

  • Netbox

1- Search for the node in Netbox.
2- Click "interface" and find the interface (NIC) that has both the IPv4 and IPv6 addresses. Update the IP addresses by clicking on each IP address and then Edit.
Example:
If the IPv4 address is 10.132.0.16/24, change it to 10.132.1.16/24 (0 to 1).
If the IPv6 address is 2001:df2:e500:101:10:132:0:16/64, change it to 2001:df2:e500:102:10:132:1:16/64 (101 to 102 and 0 to 1).
4- Set up the switch interface on the new VLAN.

  • Search for asw-0604-eqsin
  • click on "interface"
  • Find the interface of the server you are working on
  • click on the interface (xe-1/0/x)
  • click on Edit on the top right corner
  • Navigate to "Untagged VLAN" and change it to 512 if public1-604 or 522 if private1-604

5- Run homer on the switch (sudo homer asw1-eqsin* commit "change vlan for cp5032")
6- Run the netbox dns cookbook
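
The address change in step 2 is mechanical, so it can be sketched in a few lines of Python. This is only an illustration for the private VLAN, assuming every host follows the pattern in the example above (same last octet, with the IPv6 host part mirroring the IPv4 address); renumber is a hypothetical helper, not an existing cookbook function.

```python
import ipaddress

OLD_V4 = ipaddress.ip_network("10.132.0.0/24")   # old private subnet, from the example
NEW_V4 = ipaddress.ip_network("10.132.1.0/24")   # private1-604-eqsin
NEW_V6_PREFIX = "2001:df2:e500:102"              # first four groups of the new /64


def renumber(old_v4: str) -> tuple[str, str]:
    """Map an old private IPv4 address to its new IPv4 and IPv6 addresses."""
    old = ipaddress.ip_interface(old_v4)
    if old.ip not in OLD_V4:
        raise ValueError(f"{old_v4} is not in {OLD_V4}")
    # Keep the host part and move it into the new /24 (the "0 to 1" change).
    host = int(old.ip) - int(OLD_V4.network_address)
    new_v4 = ipaddress.ip_address(int(NEW_V4.network_address) + host)
    # Assumption: the IPv6 address embeds the IPv4 octets as its last four groups.
    suffix = ":".join(str(new_v4).split("."))
    new_v6 = ipaddress.ip_address(f"{NEW_V6_PREFIX}:{suffix}")
    return f"{new_v4}/{NEW_V4.prefixlen}", f"{new_v6}/64"


if __name__ == "__main__":
    print(renumber("10.132.0.16/24"))
```

For 10.132.0.16/24 this gives 10.132.1.16/24 and 2001:df2:e500:102:10:132:1:16/64, matching the example in step 2.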

Please let me know if you have any questions. Note that the re-image of the servers in rack 0604 can be done at any time, based on your availability. It can be one server a day; it all depends on you.

Awesome work @Papaul!

I think possibly you can just reimage with the --move-vlan flag for the cookbook? We should check that and maybe fix any niggles if not, but I suspect it should work ok based on taking the vlan name based on rack.

@cmooney thank you. Yes, the cookbook's move-vlan flag should also work, but we need to test that. I don't think we have done any in POP sites. Also, Eqsin is not per rack yet since it is using VC.

--move-vlan is only made to migrate core DCs from legacy to new per-rack VLANs. Let me know if it's worth spending time implementing support for this use case.

Other than it not being a core-site the use-case seems identical here? I would say it's probably a matter of how much effort is involved, if it's not much totally worth it, if it's a lot of work then no.

ayounsi triaged this task as Medium priority. Mon, Jun 1, 2:20 PM

When pooling cp5032 I noticed that connection to kafka-jumbo1016.eqiad.wmnet:9093 (10.64.154.15 via 10.132.1.1 dev eno12399np0 src 10.132.1.16 uid 0) is timing out. I've depooled again: Is this something to do with the migration?
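
For what it's worth, the route line quoted above can be checked against the new addressing with a short sketch like the one below. It only confirms that the source address sits in the new private subnet and that the next hop is the group 4 virtual IP from the task description; it says nothing about why the connection times out. parse_route and on_new_vlan are hypothetical helper names.

```python
import ipaddress

# Route line as pasted in the comment above.
ROUTE = "10.64.154.15 via 10.132.1.1 dev eno12399np0 src 10.132.1.16 uid 0"

NEW_PRIVATE = ipaddress.ip_network("10.132.1.0/24")   # private1-604-eqsin prefix
PRIVATE_VIP = ipaddress.ip_address("10.132.1.1")      # FHRP group 4 virtual IP


def parse_route(line):
    """Split '<dst> via <gw> dev <if> src <ip> uid <n>' into a dict."""
    tokens = line.split()
    info = {"dst": tokens[0]}
    # The remaining tokens are keyword/value pairs.
    info.update(zip(tokens[1::2], tokens[2::2]))
    return info


def on_new_vlan(info):
    """True if the source is in the new subnet and the gateway is the VRRP VIP."""
    src = ipaddress.ip_address(info["src"])
    via = ipaddress.ip_address(info["via"])
    return src in NEW_PRIVATE and via == PRIVATE_VIP


if __name__ == "__main__":
    route = parse_route(ROUTE)
    print(route["dev"], on_new_vlan(route))
```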

Change #1297221 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] common: Update cp5032 IP address

https://gerrit.wikimedia.org/r/1297221

Change #1297221 merged by BCornwall:

[operations/puppet@production] common: Update cp5032 IP address

https://gerrit.wikimedia.org/r/1297221

@ayounsi helpfully pointed out that I needed to update hieradata/common.yaml with the new IP addresses. Thanks!

Change #1297222 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] common: Fix cp5032 IPv6 address

https://gerrit.wikimedia.org/r/1297222

Change #1297222 merged by BCornwall:

[operations/puppet@production] common: Fix cp5032 IPv6 address

https://gerrit.wikimedia.org/r/1297222

Change #1297232 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/cookbooks@master] sre.hosts: Add eqsin old names to LEGACY_VLANS to support move-vlan

https://gerrit.wikimedia.org/r/1297232

Change #1297237 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/mediawiki-config@master] wmf-config: Add new private1-eqsin subnets

https://gerrit.wikimedia.org/r/1297237

BCornwall added a subscriber: taavi.

I was advised by @taavi to also update mediawiki-config's wmf-config/reverse-proxy.php ranges. I've updated the task description: @cmooney if this is a template somewhere it might be nice to include this new info (as well as the puppet hieradata/common.yaml updates) in it!

Thanks @BCornwall. Tbh I don't really know specifically for the given types of hosts what data structures are used to store the IPs. I do for sure know we have _way_ too much duplication of data in our automation requiring this kind of thing. I've added this to my own list for adding new vlans/subnets anyway, it wasn't on my radar before.

For now I'll submit a patch to update that for the new ranges we added in ulsfo recently following the work there, seems like it was also missed for that.

Actually @BCornwall I'm hoping to test https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1297232 , the goal of which is to make this vlan moving easier for you guys.

No rush, but when you have a moment is it possible to depool another cp node in rack 604, and I can try a reimage with the --move-vlan flag? I don't want to merge the patch before doing a full test just in case.

@cmooney I've depooled cp5030. Have fun!

Thanks!

Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1003 for host cp5030.eqsin.wmnet with OS trixie

Change #1297768 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] cp5030: change IPs in hieradata to match its new ones

https://gerrit.wikimedia.org/r/1297768

Change #1297768 merged by Cathal Mooney:

[operations/puppet@production] cp5030: change IPs in hieradata to match its new ones

https://gerrit.wikimedia.org/r/1297768

Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1003 for host cp5030.eqsin.wmnet with OS trixie completed:

  • cp5030 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Host successfully migrated to the new VLAN
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202606041949_cmooney_1168445_cp5030.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Ok @BCornwall, the reimage seemed to work fine with the --move-vlan flag. I updated the IPs in hiera so I think you should be able to give it a quick health check and repool unless there is anything else. Thanks!

Indeed, cp5030 is doing well, thanks! For the remainder of instances, should there be a separate task/some existing task?

Change #1297232 merged by jenkins-bot:

[operations/cookbooks@master] sre.hosts: Add eqsin old names to LEGACY_VLANS to support move-vlan

https://gerrit.wikimedia.org/r/1297232

Nice one. I think we can probably close this task now.

BCornwall updated the task description.

Change #1297237 merged by jenkins-bot:

[operations/mediawiki-config@master] wmf-config: Update private subnets to include additions

https://gerrit.wikimedia.org/r/1297237

Mentioned in SAL (#wikimedia-operations) [2026-06-10T13:29:37Z] <lucaswerkmeister-wmde@deploy1003> Started scap sync-world: Backport for [[gerrit:1297237|wmf-config: Update private subnets to include additions (T427393)]]

Mentioned in SAL (#wikimedia-operations) [2026-06-10T13:31:41Z] <lucaswerkmeister-wmde@deploy1003> lucaswerkmeister-wmde, brett: Backport for [[gerrit:1297237|wmf-config: Update private subnets to include additions (T427393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2026-06-10T13:36:58Z] <lucaswerkmeister-wmde@deploy1003> Finished scap sync-world: Backport for [[gerrit:1297237|wmf-config: Update private subnets to include additions (T427393)]] (duration: 07m 20s)