Tracking List: Relocating servers to free up 10G switch space in codfw
Closed, Resolved, Public, Request

Description

The new 10G switches assign port speeds in groups of 4. This causes a problem: if a single 1G server is connected to a group (quadrant), we cannot rack 10G servers on the ports around it. We've identified the following servers as each being the only 1G server in its quadrant. If we relocate all 21 servers in this list, we should be able to free up 84 ports (21 quadrants of 4 ports each) for use as 10G.

-We only need to move them within the rack. This eliminates the need to re-IP.
-We should be able to use the "Move a server within the same row" script in Netbox.
-We want to schedule these moves for whenever works best for the service owners.
-We have marked which ones are EoL. If they can be decommissioned, that would be even better, but is not required.
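
The quadrant reasoning above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the port numbering (starting at 0, with each group covering 4 consecutive ports), the `links` data shape, and the `host-*` names are hypothetical and do not come from Netbox or the switches themselves.

```python
from collections import defaultdict

PORTS_PER_GROUP = 4  # port speed is set per group of 4 ports (from the description)


def group_of(port: int) -> int:
    """Index of the 4-port group a port belongs to (assumes ports numbered from 0)."""
    return port // PORTS_PER_GROUP


def lone_1g_servers(links: dict[int, tuple[str, str]]) -> list[str]:
    """Return servers that are the only 1G link in their port group.

    `links` maps a switch port number to (server name, speed),
    where speed is "1G" or "10G". Hypothetical data shape.
    """
    by_group = defaultdict(list)
    for port, (server, speed) in links.items():
        by_group[group_of(port)].append((server, speed))

    lone = []
    for members in by_group.values():
        one_gig = [server for server, speed in members if speed == "1G"]
        if len(one_gig) == 1:
            lone.append(one_gig[0])
    return sorted(lone)


def ports_freed(servers_moved: int) -> int:
    """Each relocated lone 1G server frees its whole group for 10G use."""
    return servers_moved * PORTS_PER_GROUP


# Hypothetical example: host-a is alone in its group; host-b and host-c share one.
example_links = {
    0: ("host-a", "1G"),
    4: ("host-b", "1G"),
    5: ("host-c", "1G"),
    8: ("host-d", "10G"),
}
print(lone_1g_servers(example_links))
print(ports_freed(21))
```

With the 21 servers listed in this task, `ports_freed(21)` gives the 84 ports mentioned in the description.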

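| Server Name | Owner | Current Rack | Current U | EOL | Proposed Rack | Proposed U | Scheduled | Individual Move ticket (if needed) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| conf2005 | PEND | C3 | 21 | no | C3 | 44 | | |
| mw2278 | JMeybohm | D3 | 8 | yes | D3 | 17 | going to be decommed | T354791 |
| cassandra-dev2001 | eevans | B5 | 21 | no | B5 | 40 | Completed | |
| db2132 | Marostegui | A5 | 2 | yes | A5 | 5 | decom completed | T383697 |
| db2166 | Marostegui | C5 | 40 | no | C5 | 36 | Completed | |
| db2189 | Marostegui | B8 | 37 | no | B8 | 25 | Completed | |
| dns2004 | sukhbir | B8 | 3 | no | B8 | 11 | Completed | |
| ganeti2020 | Moritz | B8 | 21 | no | B8 | 13 | Completed | |
| gerrit2002 | Jelto | D8 | 12 | no | D8 | 26 | Completed | |
| kafka-main2010 | JMeybohm | D6 | 33 | no | D6 | 26 | Completed | T381788 |
| maps2009 | Moritz | B6 | 21 | no | B6 | 30 | Completed | |
| mw2259 | JMeybohm | B3 | 9 | yes | B3 | 30 | decom completed | T384043 |
| wikikube-worker2013 | JMeybohm | A5 | 21 | no | A5 | 19 | Completed | |
| wikikube-worker2036 | JMeybohm | B8 | 7 | no | B8 | 12 | Completed | |
| wikikube-worker2088 | JMeybohm | B8 | 44 | no | B8 | 26 | Completed | |
| wikikube-worker2095 | JMeybohm | B5 | 27 | no | B5 | 33 | Completed | |
| wikikube-worker2175 | JMeybohm | D5 | 21 | no | D5 | 31 | Completed | |
| wikikube-worker2186 | JMeybohm | D3 | 13 | no | D3 | 18 | Completed | |
| wikikube-worker2227 | JMeybohm | D3 | 21 | yes | D3 | 40 | Completed | |
| wikikube-worker2229 | JMeybohm | C6 | 18 | yes | C6 | 31 | Completed | |
| wikikube-worker2230 | JMeybohm | C6 | 21 | yes | C6 | 35 | Completed | |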

Event Timeline

Restricted Application added a subscriber: Aklapper. (Jan 14 2025, 5:36 PM)

@Jelto when do you think would be the best time for you or someone on your team to help us relocate some of those mw and wikikube-worker nodes? Thanks

I can mostly just speak for the gerrit2002 host, which can be moved at short notice because it's a replica.

For the maintenance on the wikikube-worker and mw nodes we have to coordinate with serviceops.

@Jelto thanks, please let us know what time works best for you for gerrit2002. Thanks

mw2259 and mw2278 are to be decommed (T354791, T384043)
mw2355 is now wikikube-worker2229 (T383862)
mw2356 is now wikikube-worker2230 (T383862)

All the wikikube-workers can be shut down on short notice (also multiple at once, so we could do batches).

Mentioned in SAL (#wikimedia-operations) [2025-01-22T11:10:20Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Depool db2166 T383709', diff saved to https://phabricator.wikimedia.org/P72216 and previous config saved to /var/cache/conftool/dbconfig/20250122-111019-marostegui.json

Change #1113432 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db2166: Disable notifications

https://gerrit.wikimedia.org/r/1113432

Change #1113432 merged by Marostegui:

[operations/puppet@production] db2166: Disable notifications

https://gerrit.wikimedia.org/r/1113432

Marostegui subscribed.

@Papaul db2166 can be moved anytime. The host has been powered off.

How about Monday 27th Jan 15:30 UTC (09:30 CST)?

@Marostegui I'll move it this morning.

@Jelto That time works for us.

Thank you both!

@Marostegui db2166 is moved, updated, and pinging.

Change #1113718 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db2189: Disable notifications

https://gerrit.wikimedia.org/r/1113718

Mentioned in SAL (#wikimedia-operations) [2025-01-23T06:41:41Z] <marostegui> Powering off db2189 for onsite maintenance T383709

Change #1113718 merged by Marostegui:

[operations/puppet@production] db2189: Disable notifications

https://gerrit.wikimedia.org/r/1113718

Mentioned in SAL (#wikimedia-operations) [2025-01-23T06:42:42Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Depool db2189 T383709', diff saved to https://phabricator.wikimedia.org/P72237 and previous config saved to /var/cache/conftool/dbconfig/20250123-064241-marostegui.json

@Marostegui db2189 is moved, updated, and pinging!

Thanks! I will take it from here.

@JMeybohm, what do you think of this schedule for getting these moved?
wikikube-worker2229 Friday Jan 24 15:30UTC (9:30 local time)
wikikube-worker2230 Friday Jan 24 15:30UTC
wikikube-worker2227 Friday Jan 24 15:30UTC
wikikube-worker2013 Tuesday Jan 28 15:30UTC
wikikube-worker2036 Tuesday Jan 28 15:30UTC
wikikube-worker2088 Tuesday Jan 28 15:30UTC
wikikube-worker2095 Wednesday Jan 29 15:30UTC
wikikube-worker2175 Wednesday Jan 29 15:30UTC
wikikube-worker2186 Wednesday Jan 29 15:30UTC

Works for me!

cassandra-dev2001 can be moved at your leisure (no coordination is needed).

depool host wikikube-worker[2227,2229-2230].codfw.wmnet by jayme@cumin1002 with reason: None

Cookbook cookbooks.sre.k8s.pool-depool-node started by jayme@cumin1002 depool for host wikikube-worker[2227,2229-2230].codfw.wmnet completed:

  • wikikube-worker[2227,2229-2230].codfw.wmnet (PASS)
    • Host wikikube-worker[2227,2229-2230].codfw.wmnet depooled from wikikube-codfw

pool host wikikube-worker[2227,2229-2230].codfw.wmnet by jayme@cumin1002 with reason: None

Cookbook cookbooks.sre.k8s.pool-depool-node started by jayme@cumin1002 pool for host wikikube-worker[2227,2229-2230].codfw.wmnet completed:

  • wikikube-worker[2227,2229-2230].codfw.wmnet (PASS)
    • Host wikikube-worker[2227,2229-2230].codfw.wmnet pooled in wikikube-codfw
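
The bracketed host expressions in the cookbook output above, such as `wikikube-worker[2227,2229-2230].codfw.wmnet`, name several hosts at once. A minimal sketch of how such an expression expands, assuming a single bracket group holding comma-separated numbers and inclusive ranges without zero padding. The `expand_hosts` helper is hypothetical and written only for illustration; it is not the parser the tooling itself uses.

```python
import re


def expand_hosts(expr: str) -> list[str]:
    """Expand 'name[1,3-4].suffix' into individual host names.

    Handles one bracket group with comma-separated numbers and
    inclusive ranges; anything else is returned unchanged.
    """
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", expr)
    if match is None:
        return [expr]  # no bracket group: treat as a single host

    prefix, body, suffix = match.groups()
    hosts = []
    for part in body.split(","):
        start, _, end = part.partition("-")
        first = int(start)
        last = int(end) if end else first  # "2227" alone is a range of one
        for number in range(first, last + 1):
            hosts.append(f"{prefix}{number}{suffix}")
    return hosts


for host in expand_hosts("wikikube-worker[2227,2229-2230].codfw.wmnet"):
    print(host)
```

Under these assumptions the expression covers wikikube-worker2227, wikikube-worker2229, and wikikube-worker2230, which matches the first batch in the schedule.
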
Papaul added a subscriber: MoritzMuehlenhoff.

@Jhancock.wm you can move ganeti2020 anytime today. Once done just ping @MoritzMuehlenhoff . Thanks.

@MoritzMuehlenhoff ganeti2020 has been moved, updated, and pings.

depool host wikikube-worker[2013,2036,2088].codfw.wmnet by jayme@cumin1002 with reason: None

Cookbook cookbooks.sre.k8s.pool-depool-node started by jayme@cumin1002 depool for host wikikube-worker[2013,2036,2088].codfw.wmnet completed:

  • wikikube-worker[2013,2036,2088].codfw.wmnet (PASS)
    • Host wikikube-worker[2013,2036,2088].codfw.wmnet depooled from wikikube-codfw

@Jhancock.wm wikikube-worker[2013,2036,2088].codfw.wmnet have been shut down, lmk when you are done and they are back up so I can double check and repool

Mentioned in SAL (#wikimedia-operations) [2025-01-28T15:06:13Z] <jelto@cumin1002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on gerrit2002.wikimedia.org with reason: NIC port switch -t T383709

Mentioned in SAL (#wikimedia-operations) [2025-01-28T16:00:33Z] <jelto@cumin1002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on gerrit2002.wikimedia.org with reason: NIC port switch -t T383709

pool host wikikube-worker[2013,2036,2088].codfw.wmnet by jayme@cumin1002 with reason: None

Cookbook cookbooks.sre.k8s.pool-depool-node started by jayme@cumin1002 pool for host wikikube-worker[2013,2036,2088].codfw.wmnet completed:

  • wikikube-worker[2013,2036,2088].codfw.wmnet (PASS)
    • Host wikikube-worker[2013,2036,2088].codfw.wmnet pooled in wikikube-codfw

depool host wikikube-worker[2095,2175,2186].codfw.wmnet by jayme@cumin1002 with reason: None

Cookbook cookbooks.sre.k8s.pool-depool-node started by jayme@cumin1002 depool for host wikikube-worker[2095,2175,2186].codfw.wmnet completed:

  • wikikube-worker[2095,2175,2186].codfw.wmnet (PASS)
    • Host wikikube-worker[2095,2175,2186].codfw.wmnet depooled from wikikube-codfw

@Jhancock.wm wikikube-worker[2095,2175,2186].codfw.wmnet have been shut down, lmk when you are done and they are back up so I can double check and repool

Icinga downtime and Alertmanager silence (ID=c24ad8f7-3e57-4f83-8a1f-c507313e344f) set by jayme@cumin1002 for 1 day, 0:00:00 on 3 host(s) and their services with reason: extending downtime

wikikube-worker[2095,2175,2186].codfw.wmnet

pool host wikikube-worker[2095,2175,2186].codfw.wmnet by jayme@cumin1002 with reason: None

Cookbook cookbooks.sre.k8s.pool-depool-node started by jayme@cumin1002 pool for host wikikube-worker[2095,2175,2186].codfw.wmnet completed:

  • wikikube-worker[2095,2175,2186].codfw.wmnet (PASS)
    • Host wikikube-worker[2095,2175,2186].codfw.wmnet pooled in wikikube-codfw

@MoritzMuehlenhoff when is a good time next week to move maps2009? I'm normally on site from 15:00 UTC onward.

@JMeybohm I think conf2005 belongs to your team, but I'm not certain. Please lmk if it does and, if so, when would be a good time to move it.

Mentioned in SAL (#wikimedia-operations) [2025-02-26T15:28:27Z] <moritzm> depooled maps2009 for server move T383709

Mentioned in SAL (#wikimedia-operations) [2025-02-26T16:54:23Z] <moritzm> repooled maps2009 after completed server move T383709

@Scott_French would you be able to help me move conf2005 (or know who could) to clear up some space in the racks?

@Jhancock.wm - Thanks for flagging! Yes, I can help you with that. I'll open a task specifically for conf2005 later today, as the procedure is a bit involved.

@Scott_French honestly, since everything else went so well, we don't need to move it if it's very involved. I honestly did not expect to be able to move as many as we did. So I'm okay leaving this one where it is so as not to cause any interruptions.

Ah, that's good to know, @Jhancock.wm. If leaving it in place isn't causing any trouble, then I certainly have no objections to that option. On our end, it would probably be about 30m of work on both ends of the move (each of prep and cleanup), so it's not particularly onerous or anything - it just involves touching some things we frequently don't, so some extra care is needed.

We can leave it. The last server should be getting decommissioned soon enough, so I'm going to consider this resolved.

Thank you everyone for your help! This really helped us out in codfw. Have a great weekend!