Transition codfw data persistence external storage (es) hosts to 10G
Closed, Resolved · Public

Description

pc hosts were migrated to 10G (T378715) and we'd like to do the same with es hosts.

Steps:

  1. Depool the host, verify the idrac console works
  2. Find the 10G interface name in the output of $ ip link
  3. Edit /etc/network/interfaces: replace the name of the 1G interface (e.g. eno1) with the 10G one (e.g. ens3f0np0)
  4. Power down the host
  5. Move it to U23 (so it can be connected to port 22)
  6. In Netbox:
    1. Move the two IPs to the new interface (use "add/assign IP on the new interface")
    2. Edit the cable to point to the new interface, change its color/type/ID if needed
    3. Rename ge-0/0/30 to xe-0/0/22 while changing its type to 10G
  7. Run homer on the switch: sudo homer lsw1-XX-codfw* commit "Move XX to 10G TXXXX"
  8. Ensure a DNS cookbook run is NOOP
  9. Change the primary PXE NIC by running the provision cookbook
  10. Power up the host
  11. Verify connectivity, verify Puppet runs clean
  • es2035
  • es2036
  • es2037
  • es2038
  • es2039
  • es2040
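
Step 3 above can be sketched as a small script. This is only an illustration, assuming the file is in the usual ifupdown format and that the old interface name should be replaced wherever it appears as a whole token; the function name swap_interface and the sample address are hypothetical and not part of the actual procedure.

```python
import re


def swap_interface(config_text: str, old: str, new: str) -> str:
    """Return config_text with every whole-token occurrence of `old` replaced by `new`.

    The look-around assertions keep a longer name such as "eno12" from being
    touched when the interface being replaced is "eno1".
    """
    pattern = re.compile(rf"(?<!\w){re.escape(old)}(?!\w)")
    # A callable replacement avoids any backslash interpretation in `new`.
    return pattern.sub(lambda _match: new, config_text)


# Hypothetical sample content (documentation address range, not a real host).
SAMPLE = """auto eno1
iface eno1 inet static
    address 192.0.2.10/24
"""

if __name__ == "__main__":
    print(swap_interface(SAMPLE, "eno1", "ens3f0np0"))
```

Reviewing the result before writing it back to /etc/network/interfaces is advisable, since a wrong interface name leaves the host unreachable after power-up.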

Event Timeline

Restricted Application added a subscriber: Aklapper.
Marostegui triaged this task as Medium priority. Jul 18 2025, 7:02 AM
Marostegui moved this task from Triage to Ready on the DBA board.

According to https://netbox.wikimedia.org/dcim/devices/?q=es20
es2020 to es2025 are now offline.
es2026 to es2034 are almost 5 years old (2020-08-25)

So probably only es2035 to es2040 are worth taking care of.

Thank you - I copied the hostnames from the previous task, and I didn't check :)
I will adjust this task then!

Thank you

Are we doing these one at a time as well? We can start scheduling them this week. I plan to be onsite every day this week.

Yeah, we'll do one at a time, to be on the safe side. Do you want me to get es2035 ready for tomorrow?

yeah that would work for me, ty!

Mentioned in SAL (#wikimedia-operations) [2025-07-22T06:54:55Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Depool es2035 T399927', diff saved to https://phabricator.wikimedia.org/P79550 and previous config saved to /var/cache/conftool/dbconfig/20250722-065454-root.json

es2035 has been updated. Good news, I didn't have to physically move that one. There were no other 1G servers in the region on the switch that it occupied.
- 10G cable has been run.
- Netbox entries have been updated.
- Ran the netbox and switch-interface cookbooks.
- Updated PXE boot in the BIOS.

The port speed for that range on the switch will likely need to be updated. Someone with root access will need to do that, and a homer run is likely needed as well. The server is on and ready for you.

We can do es2036 tomorrow if you'd like.

Ran homer and manually removed the config forcing it at 1G, host is up.

@Marostegui lemme know when you want to do es2036

I can have it ready today if you'd like

Mentioned in SAL (#wikimedia-operations) [2025-07-24T16:34:39Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Depool es2036 T399927', diff saved to https://phabricator.wikimedia.org/P79855 and previous config saved to /var/cache/conftool/dbconfig/20250724-163439-root.json

@Marostegui today or tomorrow is fine.

es2036 is ready for you

es2036 done

[   21.582858] bnxt_en 0000:4b:00.0 eno12399np0: NIC Link is Up, 10000 Mbps (NRZ) full duplex, Flow control: none
[   21.582868] bnxt_en 0000:4b:00.0 eno12399np0: FEC autoneg off encoding: None
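
The negotiated speed in kernel messages like the two above could be checked with a short parser. This is a minimal sketch, assuming the message format matches the bnxt_en lines quoted here; the function name link_speed_mbps is hypothetical.

```python
from __future__ import annotations

import re

# Matches messages of the form "<iface>: NIC Link is Up, <speed> Mbps ...".
LINK_UP = re.compile(r"(?P<iface>\S+): NIC Link is Up, (?P<mbps>\d+) Mbps")


def link_speed_mbps(line: str) -> tuple[str, int] | None:
    """Return (interface, speed in Mbps) for a link-up message, else None."""
    match = LINK_UP.search(line)
    if match is None:
        return None
    return match.group("iface"), int(match.group("mbps"))


if __name__ == "__main__":
    line = (
        "[   21.582858] bnxt_en 0000:4b:00.0 eno12399np0: NIC Link is Up, "
        "10000 Mbps (NRZ) full duplex, Flow control: none"
    )
    result = link_speed_mbps(line)
    if result is not None:
        iface, mbps = result
        print(f"{iface} negotiated {mbps} Mbps")
```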

@Jhancock.wm es2037 is ready for you - homer was run.

@Marostegui 2037 is moved and updated. All yours!

We can schedule 2038 for Monday or Tuesday if you want.

Thanks, I will get es2038 ready by Monday

@Marostegui es2038 is moved, updated, and powered up!

For es2039: it's not going to fit cleanly into our racking scheme. There isn't anywhere in the rack I can put a 2U server that fits in with the U matching the switch port. So we can break convention or move it to an adjacent rack.

If we want to break convention, I can leave it in its current physical location and use a port on the switch that is otherwise unusable (42 or 20).

Or, for moving to a different rack, D6 would be the best candidate.

Thank you!

I don't mind either way. However es2039 is now a master, so I have to switch it over (it is not a big deal). What is preferred from an on-site point of view and from a netops point of view @ayounsi?
It would be fine if the IP changes too, but I don't know how the whole process of changing the DNS underneath would work. From a database point of view, we'd need to know the new IP beforehand though.

You can probably skip 2039 for now and jump to 2040 until we figure out what's best for 2039.

For 2039, as these hosts are still using the row-wide IPs, it could be moved to any rack in codfw row D. That would follow the same procedure as the current one, plus running sre.puppet.sync-netbox-hiera to update its rack fact.

But of course we would prefer to have the host re-IP'd to per-rack VLANs: https://wikitech.wikimedia.org/wiki/Vlan_migration

From a database point of view, we'd need to know the new IP beforehand though.

That's a bit difficult as our automation picks the first available IP during provisioning. We could try to reserve an IP and un-reserve it right before provisioning so that formerly reserved IP becomes the first available. But we would need to make sure that nobody provisions another server at the same time.
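
The reserve-then-release idea can be illustrated with a toy model. This is only a sketch, assuming "first available" means the lowest unused host address in the subnet; the real provisioning automation may behave differently, and the function name first_available and the 192.0.2.0/29 documentation subnet are hypothetical.

```python
from __future__ import annotations

import ipaddress


def first_available(subnet: str, taken: set[str]) -> ipaddress.IPv4Address | None:
    """Return the lowest host address in `subnet` that is not in `taken`."""
    used = {ipaddress.ip_address(addr) for addr in taken}
    for host in ipaddress.ip_network(subnet).hosts():
        if host not in used:
            return host
    return None


if __name__ == "__main__":
    subnet = "192.0.2.0/29"  # documentation range, not a real VLAN
    allocated = {"192.0.2.1", "192.0.2.2"}
    reserved = "192.0.2.3"  # placeholder held for the host being moved

    # While the reservation exists, any other provisioning skips it.
    print(first_available(subnet, allocated | {reserved}))

    # Once the reservation is removed, the same address is first in line,
    # provided nobody else provisions in between.
    print(first_available(subnet, allocated))
```

The gap between removing the reservation and provisioning is the race window mentioned above: another provisioning run in that interval would take the formerly reserved address.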

Works for me - @Jhancock.wm works for you?

I have no preferences here, whatever makes more sense for you.

Ok, don't worry, I can live with knowing the IP afterwards, I will simply remove this host entirely from dbctl and then add it back once we know the IP. It is not such a very big deal.

Since the server is relatively new, I'd prefer to move it to a new rack. If we leave it where it is and break convention, it'll break convention for a few years. I'm fine with waiting until we can get a plan for es2039.

Let me know when you'd like to move es2040. It has no issues with being moved within the same rack.

@Marostegui do you have time to knock out es2040 this week?

Sorry. Manuel is out. es2040 is easily doable. For when do you want it done?

Talking to Papaul. I'm doing es2040 right now. Will do es2039 later.

Mentioned in SAL (#wikimedia-operations) [2025-08-21T14:30:40Z] <ladsgroup@cumin1002> dbctl commit (dc=all): 'Depool es2040 T399927', diff saved to https://phabricator.wikimedia.org/P81661 and previous config saved to /var/cache/conftool/dbconfig/20250821-143039-ladsgroup.json

@Ladsgroup es2040 is done and updated. Thank you

Thanks! Replication has caught up. I'm repooling it now.

@Ladsgroup
I have two proposals for es2039.

  1. We leave it where it is and use port 43 on the switch. It'll be using a port that would otherwise not be in use.
  2. We move it entirely to rack D6.

I'd personally prefer the first, but the second is probably just as doable. Let me know if you have a preference.

Changing its rack would also allow us to change its IP to per rack vlans: https://wikitech.wikimedia.org/wiki/Vlan_migration

Mentioned in SAL (#wikimedia-operations) [2025-08-26T11:25:09Z] <ladsgroup@cumin1002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on es1039.eqiad.wmnet with reason: Glow up (T399927)

Mentioned in SAL (#wikimedia-operations) [2025-08-26T11:25:26Z] <ladsgroup@cumin1002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on es2039.codfw.wmnet with reason: Glow up (T399927)

es2039 is shut down and ready for you. I'd rather not move it for now. Let me know if I picked the wrong NIC interface.

@Ladsgroup es2039's cable port has been moved and all the Netbox entries have been updated. Let us know if you need any further assistance with this last one.

Thank you. I started the replication and it looks like it's working fine (and quite fast too!!!!)

Repooling the host. If homer has also been run, we can close this ticket.

Jhancock.wm claimed this task.
Jhancock.wm updated the task description.

I looked at them and they seem to be random replicas in random sections. I think they probably need rebalancing to reduce their load (or more replicas added) rather than needing 10G, especially since that would make them special, so it'd be harder to move such a replica to another section and so on. IMHO, we don't need that, but let's ask Manuel when he is back in a couple of weeks?