
eqiad: rows C/D Upgrade Tracking
Closed, Resolved (Public)

Description

This task will serve as the overall parent tracking task for the migration of the network stack in eqiad rows C and D.

Netbox Entries:
eqiad
eqiad racks
eqiad row c
eqiad row d

Each sub-team will have its own sub-task, allowing each team to schedule the migration of services within its own service groups.
Google Sheet listing of all affected hosts

Timeline

All items should be on site and staged for migration from the week starting Nov 3rd 2025.

Project Tracking Checklist

  • [john] all hw racked and staged in netbox
  • [rob] all sre sub-team tasks filed and dispatched for scheduling
  • [cathal] netbox scripting updated for migration
  • [rob & john] all sre sub-team tasks resolved with hosts migrated to new switches
  • [netops and john] T411781 - lvs1018: remove cross-rack links to rows A, C and D
  • [john or valerie] move all scs connections from nokia back to old juniper switch in each rack for juniper switch config removal by netops
  • [netops] old switch configurations wiped to factory default before power removal and unracking
  • [john] old switches removed from racks & netbox updated to list each switch as offline with no rack assignment, then moved to storage and slated for recycling/destruction.

Migration Checklist Example

  • host migration details on the Google Sheet listing of all affected hosts are followed.
  • depool host from services / maint mode / drain as detailed on the Google Sheet (rough command sketch after this checklist)
  • netbox script, homer run 1 against the nokia stack, move cable, homer run 2 against the juniper stack
  • repool host to services / end maint mode / resume service as detailed on the Google Sheet
  • hand back to service owner for return to service
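
For the depool/repool bookends, a minimal sketch of what a single host window could look like from a cumin host, assuming the host is conftool-pooled and the standard downtime cookbook is used; the host name, duration, and reason are placeholders, and the real depool method for any given host is whatever the Google Sheet says.

  # downtime first so the brief link loss doesn't page anyone (illustrative values)
  sudo cookbook sre.hosts.downtime --hours 2 -r "eqiad row C/D switch migration" 'example1001.eqiad.wmnet'
  # depool (conftool-managed services shown; other hosts use cookbooks or maint mode instead)
  sudo confctl select 'name=example1001.eqiad.wmnet' set/pooled=no
  # ... netbox script, homer runs, and the physical cable move happen here ...
  # repool once the new link is up and the host answers pings again
  sudo confctl select 'name=example1001.eqiad.wmnet' set/pooled=yes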

Scope & Reasoning of Work

The Juniper switch stacks in eqiad rows C and D will be replaced with a Nokia spine/leaf configuration. This will additionally upgrade all racks to be 10G capable; currently only 2 racks in C and 2 racks in D are 10G capable. All hosts will migrate from the old to the new switch without any change in port speed: if they are 1G before the migration, they'll use an SFP-T to remain 1G post migration; if they were in a 10G rack before the migration, they'll remain on 10G, simply moved to the new switch stack.

Associated Tasks

T409800 - Potential MAC dupe issue due to dual switch stacks


Event Timeline

RobH triaged this task as Medium priority.

@cmooney: What do you think is the best way to go about migrating these connections on upcoming C/D updates? The new switch will be online in the rack at the time of the migration of the host, but I mean the actual host/netbox steps?

Please feel free to update the checklist example in this task description. At worst, we'd remove the old connection and re-run network provision, but that wouldn't keep the same IP address, and we likely want to keep the same IP, correct? I'm also not sure what the ideal steps are on the host itself during the migration, and whether there is something simple I'm unaware of.

Considerations:

  • Ideally the IP doesn't change so the dns doesn't change.
  • Ideally no reimage is required.

> @cmooney: What do you think is the best way to go about migrating these connections on upcoming C/D updates? The new switch will be online in the rack at the time of the migration of the host, but I mean the actual host/netbox steps?

I think we can plan it similar to how we did the migrations in codfw from the old to the new switches. So first we have the technical bits to get the new switches in place and configured, ready so that servers can be moved. I'll take care of that side of things; there may also be some individual hosts that need to be moved first (similar to T348128).

Once that is complete we are in a position to move servers one-by-one, and to your question we won't need to reimage or change IPs. Also, thankfully we can mix 1G and 10G ports on the Nokias, so that's one less worry.

We tackled the previous ones on a rack-by-rack basis. We have a Netbox script which can do the work of moving all the host connections in Netbox from one switch to the other (here). However, I'll need to open a task and make some small changes to it to support the Nokias (mostly just the different interface naming/numbering).

On the day it's typically (rough CLI sketch after this list):

  1. Run the script to move the host uplinks in Netbox
  2. Run homer against the new switch to configure its ports for the hosts to be moved
  3. Move the physical connections from old to new switch
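
Roughly what those three steps translate to on the CLI, as an illustrative sketch only: the device names, rack, and commit messages below are placeholders, and the Netbox script itself is run from the Netbox UI.

  # after the Netbox script has re-pointed the uplinks, push the port config to the new Nokia leaf for that rack
  homer 'lsw1-c1-eqiad*' commit "Add server ports ahead of the C1 migration"
  # physically move the cables from the old switch to the new one, then confirm each host is reachable again
  ping -c 3 example1001.eqiad.wmnet
  # finally remove the now-unused port config from the old Juniper stack
  homer 'asw2-c-eqiad*' commit "Remove migrated server ports"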

We typically only have a few seconds of downtime, but sometimes a link doesn't come up (bad cable, not seated fully, etc.). We co-ordinated the last set of them in the below task and sub-tasks, and used the spreadsheet for most of the planning:

T370630: Migrate codfw servers in rows C & D from legacy ASW to LSW

I can definitely help create a similar spreadsheet again to help us plan the moves. The trickiest bit is probably co-ordinating with all the other SREs to make sure everything is downtimed and ready for each window; help with that side of things would be great.

@RobH @Jclark-ctr there is also another way we could try to approach this, so I may as well mention it now before we start planning.

Rack-by-rack is easiest for us as we think of the DC in those terms. But we could try to do it on a "role by role" basis, i.e. talk to each SRE sub-team first and ask them to work with us to produce a schedule to move their own systems. I know that would probably suit the teams better, as the moves could be done in a way that better suited their clustering. I'm not sure exactly what it would look like, but we could think about it; the rack-by-rack approach caused headaches for some teams the last time (say if a particular rack had a lot of one type of host).

Anyway, just throwing it out there. Rack-by-rack worked ok too; I'm not saying we can't do it that way.

@cmooney I’m flexible to try either way. Maybe a mix could work? We could start with roles that aren’t single points of failure and are already live in codfw from the switchover, then ask for services to volunteer. If we don’t get enough volunteers, we can always fall back to a rack-by-rack approach.

Starting with the non-single points of failure should also cut down the rack-by-rack portion quite a bit, so that phase might only be a handful of servers in the end.

This would definitely mean more work for onsite and networking, but it might get more buy-in compared to pure rack-by-rack—since otherwise we’d be asking service owners to be available every day for 3 weeks for just a few servers, instead of 1 focused day for the majority.

I overthought this; we should just move them with an SFP-T to the new port and worry about reimage and migration to full 10G later. My earlier comment assumed upgrading both the switch and the actual link, but that is scope creep on my part!

RobH updated the task description.

The Google Sheet listing of all affected hosts has been updated to remove all hosts decommissioned since its creation (John removed a bunch of hosts from the physical racks since I created the list), and we've added columns auditing the migration process.

RobH updated the task description.

I accidentally pasted the Day 1 update on a subtask:

Day 1 of migrations update:

  • 58 hosts moved total, 242 left according to the Google Sheet
  • We focused on moving hosts that did not require any specific scheduling with service owners and had depool/repool directions we could follow.
  • We skipped ganeti hosts today, as they take time to drain and can be more complex, and stuck to the low-hanging fruit.

Day 2 update:

  • 73 servers moved today, 169 servers remain.
  • We (again) focused on moving hosts that did not require any specific scheduling with service owners and had depool/repool directions we could follow.
    • Additionally, we didn't move any type of service group today (Friday) that we had not previously moved successfully yesterday (Thursday), so as to avoid paging or urgency on a Friday.

So we started with 300 hosts and now have 169 left.

@cmooney A few things we ran into:

an-worker1136 failed to ping after migration. We changed the cable; the port showed link on both the old and new switch, so we moved it back to the old switch. Will need you to check.

We also skipped the D3 migration. Homer said ‘Failed’. The output looked okay, but I didn’t trust it with the ‘Failed’ message.

ERROR:homer:Failed to commit on lsw1-d3-eqiad.mgmt.eqiad.wmnet: Error in path: .system.mirroring.mirroring-instance{.name=="cathaltest"}.mirror-destination.local
[FailedPrecondition] local-mirror-destination is only allowed on subinterfaces with type local-mirror-dest

ERROR:homer:Homer run had issues on 1 devices: ['lsw1-d3-eqiad.mgmt.eqiad.wmnet']
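
For reference, that error is about the port-mirroring test instance on the leaf: in SR Linux a local mirror destination has to point at a subinterface whose type is local-mirror-dest. A hedged sketch of the kind of change that would clear it on the switch CLI is below; the destination interface is a made-up placeholder, and simply deleting the leftover test instance would also do it.

  # on lsw1-d3-eqiad (SR Linux CLI); ethernet-1/10.0 is a placeholder destination port
  enter candidate
  set / interface ethernet-1/10 subinterface 0 type local-mirror-dest
  set / system mirroring mirroring-instance cathaltest mirror-destination local ethernet-1/10.0
  # or, to drop the test instance entirely:
  # delete / system mirroring mirroring-instance cathaltest
  commit now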

Day skip (Monday): no migrations; catch-up day for other tasks for both Rob and John.
Tuesday was a holiday, so it doesn't count.

Day 3 Update (Wednesday):

  • 44 hosts migrated today (175 migrated total), 125 remain.
  • Remaining hosts fall into one of three groups:
    • Group 1: VM hosts (k8s and ganeti): both of these host service groups have updates to their sub-tasks to include directions on how to determine the cluster association of a given host node for inclusion in the cookbook to drain/depool those given hosts. Once that is provided, Rob and John will be able to migrate the majority of these hosts.
    • Group 2: Hosts requiring a set date/time or coordination with service owners: These hosts have notes to schedule and sync with the service owners so a set time can be decided for the migration, or the host services re-balanced to allow for its migration at DC ops discretion.
    • Group 3: (2) restbase hosts remain because we shouldn't migrate them all in a single day.

Day 4 Update:

  • Moved all remaining ganeti hosts today
  • 17 hosts moved today (192 hosts total), 108 hosts remain.
  • All remaining hosts are either k8s hosts (I have a clarification question in on its sub-task) or hosts requiring a scheduled date/time due to SPOF or master/replica failover concerns.

Day 5 Update:

  • 31 hosts moved today, 77 hosts remain
  • got directions from Clement on how to move wikikube hosts effectively; moved half the wikikube hosts in row C, will move the rest of them tomorrow.
  • moved a bunch of db hosts
  • Small OSPF error in homer where rack D6 lost its OSPF info due to a homer run; the root cause has been resolved.

Day 6 Update:

  • 33 hosts moved today, 44 remain
  • all row C wikikube migrated, some of row D wikikube migrated
    • 23 wikikube hosts remain out of the 44 left to move
    • We should get all 23 knocked out on Day 8.
  • (3) pc hosts will be moved next week on the 24th, 25th, or 26th. Manuel is out this Thursday-Friday.
  • (2) lvs hosts will be moved whenever @cmooney would like to schedule being around for it, as they are a bit touchy. lvs1020 is the backup to lvs1019; both must move, so 1020 will move first.
  • other hosts left on the list need details submitted (they've been requested on sub-tasks this week) or scheduling.
    • every single task with pending hosts has been updated to set scheduling for the host migration

Day 7 Update:

  • 22 hosts moved today, 22 remain
    • all wikikube and aux host migrations completed
    • (3) pc hosts in discussion with data-persistence on migrating them Friday, Monday, or Tuesday.
  • (2) lvs hosts will be moved after Cathal coordinates with Brett for patch submission next week
  • Still pending feedback on the remaining host scheduling from yesterday's sub-task updates.

Day 8 Update:

  • 3 hosts moved, 19 remain - 300 hosts total at start of migration
  • John worked with Amir directly today to depool and migrate pc101[678] since the depool and repool time on them was short. This was the last of the data persistence hosts for migration, the sub-task has been resolved.
  • (2) lvs hosts will be moved after Cathal coordinates with Brett for patch submission next week
  • alert1002 will migrate on 2025-12-03 @ 13:00 GMT
  • Followed up directly with Ben Tullis via email, and then he pinged me on IRC, so they are now working out the scheduling for their 12 remaining hosts.
  • Not planning on moving any other hosts today, as it is Friday and we don't want anything possibly complicating things for SREs who have already gone offline for the weekend.

Day 9 Update:

  • 9 hosts moved, 10 remain - 300 hosts total at start of migration
  • John worked with Ben directly to migrate the (8) Data Platform hosts this AM.
  • The last (4) Data Platform hosts are scheduled for migration tomorrow.
  • John and I worked with Scott to get conf1009 migrated today.
  • (2) lvs hosts will be moved after Cathal coordinates with Brett for patch submission this week
  • alert1002 will migrate on 2025-12-03 @ 13:00 GMT T405946
  • (3) ServiceOps hosts remain: wikikube-ctrl1003, kafka-main100[89], feedback pending on sub-task T405950#11402388.

Day 10 Update:

  • 7 hosts moved, 11 remaining - 300 hosts at start of migration
  • John worked with Ben directly to migrate the (4) Data Platform hosts this AM.
  • Rob and John worked with claime on moving wikikube-ctrl1003, kafka-main100[89]
  • John did a physical audit of all old switches and all remaining active links, found 8 additional servers that were missed in the original sheet, and opened ticket T411025
  • alert1002 will migrate on 2025-12-03 @ 13:00 GMT T405946

New host count:

7 hosts moved, 11 remaining - 308 hosts at start of migration (counting the 8 John audited and filed a task for)

Mentioned in SAL (#wikimedia-cloud) [2025-11-26T15:26:04Z] <dhinus> depool clouddb10[17-20] for network maintenance T404609

@Jclark-ctr clouddb10[17-20] are now depooled, but not downtimed. Can you please downtime them yourself when you migrate them? Otherwise I can add a long downtime to all of them.

Icinga downtime and Alertmanager silence (ID=80e83414-993e-4a63-b612-9625174481c7) set by fnegri@cumin1003 for 2:00:00 on 4 host(s) and their services with reason: moving to a new switch

clouddb[1017-1020].eqiad.wmnet
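
That is the standard sre.hosts.downtime cookbook output; the invocation was presumably something like the below (flags shown are an assumption of the usual usage, not a record of the exact command):

  sudo cookbook sre.hosts.downtime --hours 2 -r "moving to a new switch" 'clouddb[1017-1020].eqiad.wmnet'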

Day 11 Update:

  • 10 hosts moved, 3 remain out of 308 total hosts.
  • John did all the moves today working with Andrew.
  • Migrated 8 of the 8 WMCS hosts found and added to T411025; only cloudelastic1010 remains from WMCS.
  • lvs10[19,20] hosts pending migration scheduling with Cathal and Brett.
  • alert1002 scheduled for migration on 2025-12-03 @ 13:00 GMT

Day 12 Update:

  • 2 hosts migrated today, 307 of 308 hosts migrated total
  • alert1002 migration by john/keith/rob completed
  • lvs1020 migration by brett/cathal/john completed
  • lvs1019 will migrate tomorrow

Day 13 Update:

  • all hosts in rows C and D migrated
    • lvs1018 in row B has links into C and D that need removal via T411781 before we can kill the old switch stack.
  • 308 total hosts in C/D moved (plus 1 upcoming row B lvs update)
  • Once the lvs1018 links are removed next week, this task can be handed to netops to confirm a full wipe of config/data off the old Juniper switch stacks in C/D; then John or Valerie can remove the physical switches and update their Netbox entries to 'offline' until they're recycled.
RobH updated the task description.

lvs1018's links were removed from use yesterday, so this project is now on the following steps:

  • [john or valerie] move all scs connections from nokia back to old juniper switch in each rack for juniper switch config removal by netops
  • [netops] old switch configurations wiped to factory default before power removal and unracking
  • [john] old switches removed from racks & netbox updated to list each switch as offline with no rack assignment, then moved to storage and slated for recycling/destruction.
RobH mentioned this in Unknown Object (Task). Dec 15 2025, 2:01 PM
Jclark-ctr updated the task description.