
eqiad row C/D Infrastructure Foundations host migrations
Closed, ResolvedPublic

Description

The network stacks in eqiad rows C and D are being upgraded to all 10G capable switches. Part of this migration will require all systems on the old switches to be moved to the new switch stack.

In previous migrations, we've stepped through the racks one by one, requiring each sub-team to be present for all affected hosts on the day of the migration. In an effort to better scale with the needs and schedules of multiple teams, we're planning to do this migration slightly differently. Rather than setting a single date for each rack, we're providing each sub-team with a list of all its affected hosts, and that sub-team can then provide feedback on the priority and scheduling of the migration of those hosts.

Scheduling Options and Considerations:

  • Provide priority groups for the hosts below, and we can move group 1, then group 2, and so on.
  • Provide specific dates and times for the migrations, and we can coordinate the migration of the required host(s).
  • A mix of the above: easier hosts can go into priority groups, while high-priority or critical hosts can have specific dates/times set.

The checklist of migration steps for each host is still being developed and won't be pasted into each host's task in advance of the move (if the checklist changes, that would be a lot of tasks to update).

The host list is also available on the Google Sheet listing of all affected hosts.

Host(s) List:
aux-k8s-worker1006 D4
aux-k8s-worker1007 D7
bast1003 D1
ganeti1024 C5
ganeti1027 C4
ganeti1028 C7
ganeti1033 D2
ganeti1034 D4
ganeti1037 C7
ganeti1038 D4
ganeti1045 C4
ganeti1046 C4
ganeti1047 C7
ganeti1048 C7
ganeti1049 D2
ganeti1050 D2
ganeti1051 D7
ganeti1052 D7
maps1009 C3 (decommissioned)
maps1010 D3 (decommissioned)
maps1013 C2
maps1014 D2
pki-root1001 D3
pki1002 C6
puppetserver1001 C5
krb1002 C3

Event Timeline

Joanna,

I'm not exactly sure who on your team to assign this to as point of contact, so I'm assigning it to you as team manager to reassign as needed. Apologies for the extra step; I just don't want to step on any toes by assigning this to the incorrect person.

We need to work with someone on the migration of the above host list. The actual downtime should only be a minute or two per host as the netbox/homer scripts run and the physical connection is moved from the old switch to the new switch (mounted directly adjacent to the old switch).

Please assign to whomever I should coordinate with for these migrations, thank you in advance!

cmooney triaged this task as Medium priority. Oct 6 2025, 2:37 PM

@LSobanski,

Is there anything else I can provide to help get feedback on the host list in the task description for these hosts' migrations onto the new network stack?

The service groups seem to be as follows:

aux-k8s-worker100[67]:
bast1003:
ganeti10[24-52]:
maps10[09-14]:
pki-root1001:
pki1002:
puppetserver1001:
krb1002:

Please update the task description with the date/time or planning required for each host's migration. A great example from another sub-team is T405943, which lists each host and whether it requires scheduling or can be moved without other steps.

Each host will experience a minute or two of network connectivity loss as it's migrated from the port on the old switch stack to the port on the new switch stack. No functional changes to the hosts are required, only a note of whether they require depooling and downtime (and scheduling) or whether simply moving the cable is enough.

This work is planned to start on October 15th and continue through the end of the month. Your feedback is required for the successful completion of this project.

Please advise.

LSobanski added subscribers: cmooney, LSobanski.

@RobH here's a summary of what needs to happen with the hosts, @cmooney will be coordinating the specifics:

  • aux-k8s-worker100[67]: can be drained at any time, we have a cookbook
  • bast1003 D1: announce on IRC, avoid large maintenance windows
  • ganeti10[24-52]:
    • ganeti1024 C5, ganeti1027 C4: have legacy NICs so drain slower (15 minutes)
    • all others: drain fast (2 minutes)
  • maps10[09-14]:
    • maps1009, maps1010: may be gone (decommissioned) if we wait a bit, worst case scenario we depool eqiad
    • maps1013, maps1014: replicas, no action needed
  • pki-root1001: no action needed
  • pki1002: no action needed
  • puppetserver1001: no action needed, heads up so people don't run Puppet
  • krb1002: no action needed

I've updated the Google Sheet: https://docs.google.com/spreadsheets/d/13ow4JxrsQdz8KSsdBBNwvlrAuGKo8OHWcnR4RhXTYc0/edit?usp=sharing

For the 'no action needed' hosts I've noted to just move the port, with no OS/Icinga/service-group actions. For the ganeti hosts that need to be depooled, is that something you want to handle, or should DC Ops handle it at the time of migration? If it's us, can you detail here the steps we need to follow?

We'll depool batches of servers which can then be switched over. It depends entirely on the VMs the nodes are running; for some we can also simply not depool them, just update the cables, and accept a brief network outage.

Please note this migration has shifted from Oct 15th start date to Nov 1 start date.

Day 1 of migrations update:

  • 58 hosts moved in total, 242 left according to the Google Sheet
  • We focused on moving hosts that did not require any specific scheduling with service owners and had depool/repool directions we could follow.
  • We skipped ganeti hosts today since they take time to drain and can be more complex; we stuck to the low-hanging fruit.

@RobH ganeti1024 and ganeti1033 are drained and can be migrated.

Migration complete for these two hosts

Moritz,

When we discussed draining ganeti nodes in IRC, I didn't realize the command requires the cluster, and I'm uncertain how to determine what cluster a given ganeti node is in.

The command we discussed in IRC: sre.ganeti.drain-node

sudo cookbook sre.ganeti.drain-node ganeti1045.eqiad.wmnet
usage: cookbook [GLOBAL_ARGS] sre.ganeti.drain-node [-h] [-t TASK_ID] --cluster CLUSTER [--group GROUP] [--full] [--reboot] node
cookbook [GLOBAL_ARGS] sre.ganeti.drain-node: error: the following arguments are required: --cluster

The wikitech page only mentions how to shut down a node, using a wholly different command: https://wikitech.wikimedia.org/wiki/Ganeti#Reboot/Shutdown_for_maintenance_a_node and doesn't mention how to query which cluster a given node is in. Can you please reply with the commands needed to determine a ganeti node's cluster, so I can include that in my drain-node command?

@RobH I've drained the next two hosts: ganeti1027 and ganeti1034 can be migrated next.

When these are done and you want to drain the next ones: The list of virtualisation groups is in Netbox: https://netbox.wikimedia.org/virtualization/clusters/

For our current migration in eqiad, the cluster name is always eqiad and the group is either C or D. E.g. the command for ganeti1034 was:

cookbook sre.ganeti.drain-node --cluster eqiad --group D ganeti1034.eqiad.wmnet

This needs to be run in a screen/tmux session.

(It's a little more complex in codfw, which also has a test cluster and the routed Ganeti cluster)
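As an illustration (not an exact recipe; the session name ganeti-drain is just an example), the drain run inside a tmux session on the cumin host could look like this:

tmux new-session -s ganeti-drain
sudo cookbook sre.ganeti.drain-node --cluster eqiad --group C ganeti1045.eqiad.wmnet
# detach with Ctrl-b d while it runs; reattach later with:
tmux attach -t ganeti-drain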

ganeti1028
ganeti1047
ganeti1048
ganeti1037
ganeti1045
ganeti1046
All migrated to new switch port after having the drain command run successfully against each.

Still more to move, just updating as I go.

All ganeti hosts have been migrated to their new switch ports in eqiad rows C/D.

@LSobanski,

The only two Infrastructure-Foundations hosts left to migrate are

  • aux-k8s-worker100[67]: can be drained at any time, we have a cookbook

However, this doesn't quite make it clear whether I should run the cookbook and move them whenever works for me, or whether your team should run the cookbook and wants to set a date/time to work with us, or have us ping you via IRC.

If it's something I can run myself, can you advise exactly what cookbook and flags I'd use to migrate each of these two hosts?

Mentioned in SAL (#wikimedia-operations) [2025-11-14T06:54:37Z] <moritzm> rebalance eqiad/C following switch migration T405945

Mentioned in SAL (#wikimedia-operations) [2025-11-14T12:47:34Z] <moritzm> rebalance eqiad/D following switch migration T405945

RobH raised the priority of this task from Medium to High. Nov 17 2025, 4:23 PM

Still looking for input on how to run the cookbook and move these last two hosts.

@RobH please proceed at your convenience -- these two hosts are not in active service.

They are still in the insetup:: role.

In the future, on other k8s worker hosts, you could run the cookbook as follows:

sudo cookbook sre.k8s.pool-depool-node -t T405945 --k8s-cluster aux-eqiad depool aux-k8s-worker1006.eqiad.wmnet

A depool should take only a couple of minutes, and the node is ready for maintenance as soon as the cookbook completes.

Once work is finished, replace depool with pool and repeat.
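To illustrate, the full cycle for one of the hosts would look roughly like this (same cookbook and flags as above, with pool substituted once the port move is done):

sudo cookbook sre.k8s.pool-depool-node -t T405945 --k8s-cluster aux-eqiad depool aux-k8s-worker1006.eqiad.wmnet
# move the cable / migrate the switch port, then repool:
sudo cookbook sre.k8s.pool-depool-node -t T405945 --k8s-cluster aux-eqiad pool aux-k8s-worker1006.eqiad.wmnet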

Thank you for the update, we'll likely move these two hosts tomorrow!

Please note we didn't get to these two today, will do tomorrow!

All Infrastructure-Foundations hosts in eqiad rows C/D have been migrated to the new switch stacks.