Request additional mgmt IP range for frack servers
Closed, Resolved, Public

Description

We may need an additional IP range for the frack servers in codfw, at least temporarily while upgrades are happening. We have two we can use, as discussed on IRC. We have several more servers coming that will need IPs. We are trying to coordinate decommissions and refreshes, but it is very tight. Please make adjustments to this ticket as needed, and let me know what additional information you need.

Event Timeline

We will need to migrate the whole range to a new prefix :( Running 2 ranges is going to be a pain long term, and would need much more work on the automation side than a migration.

Rough steps; I think these are the least disruptive to prod:

  1. Grow the 10.195.0.0/24 to a /23 (make sure any filters/acls are updated)
  2. Allocate 10.195.1.0/26 to management
  3. Add 10.195.0.1/26 to pfw3-codfw:reth0.2140
  4. Update IDRAC on all the existing OOB to use the new prefix
  5. Decommission 10.195.0.96/27

We will need to migrate the whole range to a new prefix :( Running 2 ranges is going to be a pain long term, and would need much more work on the automation side than a migration.

Yeah agreed. It sucks but it is what it is.

Rough steps; I think these are the least disruptive to prod:

  1. Grow the 10.195.0.0/24 to a /23 (make sure any filters/acls are updated)
  2. Allocate 10.195.1.0/26 to management

To me that seems a little conservative. Could we instead:

  1. Grow the 10.195.0.0/24 allocation to a /21
  2. Allocate 10.195.2.0/23 to management (make sure any filters/acls are updated).
  3. Add 10.195.2.1/23 to pfw3-codfw:reth0.2140
  4. Update IDRAC on all the existing OOB to use the new prefix
  5. Decommission 10.195.0.96/27

Not that we should ever expect 500+ hosts on the mgmt network here. But doubling the size from 29 available IPs to 61 seems a little too prudent in my book. I can't really think of any downsides of the bigger allocation (still lots of space in the overall codfw allocation), and we'd be kicking ourselves in 5 years if for some reason we ended up having to renumber again. Anyway just a thought.

I agree with the two options. However, there is also a possibility that frack will take a new rack if we do the codfw expansion, and we don't know yet how many more hosts will be added. A /23 is about 510 hosts, and I don't think frack will have that many, so we would end up with a lot of IPs that we will not use. A /25 (~126 hosts) makes more sense to me.
Right now we have close to 40 hosts in a rack. If we have a second rack, that will be roughly another 42 hosts, for a total of 82 hosts across two racks, and we would still have IPs for the firewall and the switches.
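For reference, the host counts quoted in this discussion can be reproduced with a short sketch using Python's standard `ipaddress` module. The helper name `usable_hosts` is made up for illustration, and subtracting one extra address for the gateway is an assumption taken from the "29 available IPs to 61" figures above.

```python
import ipaddress


def usable_hosts(prefix_len: int) -> int:
    """Usable IPv4 host addresses for a prefix length.

    The network and broadcast addresses are excluded. The base address
    10.195.0.0 is only a placeholder; the count depends on the prefix
    length alone.
    """
    net = ipaddress.ip_network(f"10.195.0.0/{prefix_len}", strict=False)
    return net.num_addresses - 2


for plen in (27, 26, 25, 24, 23):
    hosts = usable_hosts(plen)
    # One more address is assumed to be taken by the gateway on the firewall.
    print(f"/{plen}: {hosts} usable, {hosts - 1} after the gateway")
```

With roughly 82 hosts expected across two racks, plus the firewall and switches, a /26 would be too small and a /25 leaves some headroom.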

Or we could just use an IPv6 /64 and stop worrying about space :)

Thinking more globally, if we were to redo the production mgmt network, we would split it by smaller routed subnets to not have a huge /16 failure domain, so a /23 for frack mgmt seems like a bit too much.
A /25 seems like a fine middle ground. Not a hard blocker/preference for me though.

Or we could just use an IPv6 /64 and stop worrying about space :)

One day :)

Thinking more globally, if we were to redo the production mgmt network, we would split it by smaller routed subnets to not have a huge /16 failure domain, so a /23 for frack mgmt seems like a bit too much.

I would make a distinction between the number of hosts in a vlan, or devices it stretches across, and the size of the attached subnet. I don't disagree with maybe breaking the mgmt subnet up more, but at the same time we don't have close to 65,000 devices on it (that wouldn't work clearly). We've used a /16 for convenience in the addressing plan - not because we think we'll have anything close to that number of hosts. Similar with per-rack private vlan /24s etc.

My instincts are similar here, and to use at least a /24, for silly human-readable reasons. But a /25 is also going to be fine, it addresses my main fear which was somehow running out of space in future, so +1 from me if we want to do that.

OK, +1 for /25, so we are all okay. Thanks.

joanna_borun triaged this task as High priority.

Mentioned in SAL (#wikimedia-operations) [2024-07-22T20:46:10Z] <topranks> applying additional address to pfw3-codfw reth0.2140 to provide space for new hosts (T370164)

Change #1056017 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/dns@master] Add include in 10/8 reverse zone for new frack codfw mgmt range

https://gerrit.wikimedia.org/r/1056017

Change #1056017 merged by Cathal Mooney:

[operations/dns@master] Add include in 10/8 reverse zone for new frack codfw mgmt range

https://gerrit.wikimedia.org/r/1056017

Change #1056026 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Add new mgmt range for frack codfw to network defs

https://gerrit.wikimedia.org/r/1056026

OK I have allocated 10.195.1.0/25 in Netbox and configured 10.195.1.1 as a secondary IP on pfw3-codfw on the mgmt interface.

New IPs for host mgmt interfaces can now be assigned from this range and should work. We will need to set up a plan to renumber the existing hosts on 10.195.0.96/27 to new IPs on the new range.
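As a starting point for the renumbering plan, a minimal sketch like the following could flag which management addresses still sit in the old range. The inventory dictionary and its host names are hypothetical placeholders, not real Netbox data; in practice the list would come from whatever source of truth is in use.

```python
import ipaddress

OLD_RANGE = ipaddress.ip_network("10.195.0.96/27")
NEW_RANGE = ipaddress.ip_network("10.195.1.0/25")


def needs_renumber(addr: str) -> bool:
    """True if the address is still inside the old mgmt range."""
    return ipaddress.ip_address(addr) in OLD_RANGE


# Hypothetical inventory for illustration only.
inventory = {
    "example-host-a": "10.195.0.100",
    "example-host-b": "10.195.1.6",
}

to_move = sorted(name for name, ip in inventory.items() if needs_renumber(ip))
print(to_move)
```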

Change #1056029 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Widen netmak for allowed in BGP prefixes codfw frack

https://gerrit.wikimedia.org/r/1056029

Change #1056029 merged by jenkins-bot:

[operations/homer/public@master] Widen netmak for allowed in BGP prefixes codfw frack

https://gerrit.wikimedia.org/r/1056029

Change #1056026 merged by Cathal Mooney:

[operations/puppet@production] Add new mgmt range for frack codfw to network defs

https://gerrit.wikimedia.org/r/1056026

Gonna close this one, I see hosts have been assigned to the new range and are reachable

cmooney@cumin1002:~$ ping pay-lb2001.mgmt.frack.codfw.wmnet 
PING pay-lb2001.mgmt.frack.codfw.wmnet (10.195.1.6) 56(84) bytes of data.
64 bytes from wmf11992.mgmt.frack.codfw.wmnet (10.195.1.6): icmp_seq=1 ttl=61 time=30.7 ms
64 bytes from wmf11992.mgmt.frack.codfw.wmnet (10.195.1.6): icmp_seq=2 ttl=61 time=30.8 ms

We cannot ssh into the /25 frack mgmt network. On the /27 we used to run
``
ssh -L 8000:10.195.0.196:443 frbast-codfw.wikimedia.org
``
to access the mgmt network, and this does not work for the /25. Maybe there is a firewall filter we forgot to set.

ayounsi added a subscriber: Dwisehaupt.

@Dwisehaupt could you send a patch to add 10.195.1.0/25 to subnet-administration-codfw in the pfw filters (alongside the current 10.195.0.64/28)?

Sorry for the delay, I was out last week.

This should be fixed. The connections to the mgmt subnet were previously masked by a wider /24 iptables rule on the bastion hosts. I have broken this out more explicitly in the configuration, added the new 10.195.1.0/25 range, pushed the changes, and reloaded iptables. (FYI, it was already added to the PFW config in T371137 and is even more fodder for us to move to aerleon or such for consistency among iptables and SRX config.)

@Papaul Let me know if you encounter any connectivity issues now.
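The masking described above can be illustrated with a small sketch. It assumes only that the wider /24 rule covered the old /27, in which case that /24 must be the supernet computed here; the actual iptables rule text is not shown in this ticket.

```python
import ipaddress

old_range = ipaddress.ip_network("10.195.0.96/27")
new_range = ipaddress.ip_network("10.195.1.0/25")

# The only /24 that can contain the old /27 is its /24 supernet.
wide_rule = old_range.supernet(new_prefix=24)

print(wide_rule)                       # the /24 that masked the old range
print(old_range.subnet_of(wide_rule))  # old range was covered by it
print(new_range.subnet_of(wide_rule))  # new range falls outside it
```

This is why connections to the old range worked without an explicit mgmt rule, while the new /25 needed its own entry.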

@Dwisehaupt I will test when I get my key and let you know. Thanks.

Dwisehaupt claimed this task.
Dwisehaupt moved this task from Triage to Done on the fundraising-tech-ops board.

This looks like it's all complete and we are working well with the new hosts in the new range. Going to close this one.