Possibility to transition some codfw data persistence hosts to 10G
Closed, ResolvedPublic

Description

Follow up from parent task T360297: Take advantage of 10Gb NICs in the new network stack

db[2136,2139-2182,2185-2189,2191-2195,2206-2220].codfw.wmnet
es[2020-2040].codfw.wmnet
pc[2011-2016].codfw.wmnet
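
To get a feel for how many servers these patterns cover, here is a minimal sketch that expands the bracket notation into individual host names. The helper name `expand_hosts` is made up for illustration and only handles one bracket group with plain, non-zero-padded numbers, which is all the patterns above need; it is not the tool used in production.

```python
import re


def expand_hosts(pattern: str) -> list[str]:
    """Expand e.g. 'pc[2011-2013,2016].codfw.wmnet' into individual host names.

    Hypothetical helper: supports a single [...] group of comma-separated
    numbers or inclusive ranges, without zero padding.
    """
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", pattern)
    if match is None:
        return [pattern]  # no bracket expression: already a single host
    prefix, body, suffix = match.groups()
    hosts = []
    for part in body.split(","):
        start, _, end = part.partition("-")
        end = end or start  # a lone number is a range of one
        for number in range(int(start), int(end) + 1):
            hosts.append(f"{prefix}{number}{suffix}")
    return hosts


if __name__ == "__main__":
    for pattern in (
        "db[2136,2139-2182,2185-2189,2191-2195,2206-2220].codfw.wmnet",
        "es[2020-2040].codfw.wmnet",
        "pc[2011-2016].codfw.wmnet",
    ):
        print(pattern, len(expand_hosts(pattern)))
```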

All those hosts are currently using their 1G NIC while they also have a 10G NIC. Since we recently replaced their top of rack switches, it's now possible to use that 10G NIC.
Also, some are up for a refresh, for example es[2020-2025] are from 2019, so most likely not worth tackling here.

Please let us know if you would be interested in doing the transition. This is not mandatory: from a Netops point of view they can stay on 1G until they get refreshed. Knowing whether that's useful to you (and for how many servers) will help us prioritize the work as well as write appropriate automation.

Depending on server placement, they might need to be physically re-racked in the same rack (more details on the task). No re-image is needed, but a short downtime is required.

It can also be an opportunity to transition them out of their legacy vlans (more info in https://wikitech.wikimedia.org/wiki/Vlan_migration ).

If you prefer to keep them at 1G, feel free to close the task.

Event Timeline

Cool, nothing urgent. In that case please let us know, when you can, which hosts you want to migrate (or which ones are not worth it); we can then figure out a plan of attack.

I assume no IP changes would happen right?

I assume no IP changes would happen right?

That's correct.

Let's start with the pc* hosts.

Cool, let's start with pc2011 to validate the workflow, then if all good we can iterate faster on the other hosts.

  1. Depool the host, verify the idrac console works
  2. Find the 10G interface name in $ ip link
  3. Edit /etc/network/interfaces: replace the name of the 1G interface (e.g. eno1) with the 10G one (e.g. ens3f0np0)
  4. Power down the host
  5. Move it to U23 (so it can be connected to port 22)
  6. In Netbox:
    1. Move the two IPs to the new interface (use "add/assign IP on the new interface")
    2. Edit the cable to point to the new interface, change its color/type/ID if needed
    3. Rename ge-0/0/30 to xe-0/0/22 while changing its type to 10G
  7. Run homer on the switch lsw1-a5-codfw
  8. Ensure a DNS cookbook run is NOOP
  9. Change the primary PXE NIC by running the provision cookbook
  10. Power up the host
  11. Verify connectivity, verify Puppet runs clean
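
Step 3 is the easiest one to forget, so here is a minimal sketch of the rename as a pure text transformation. The function name `rename_interface` and the sample stanza (including the placeholder address) are illustrative assumptions, not the actual contents of any host's file. The word-boundary checks are there so that renaming `eno1` does not also touch a name like `eno10`.

```python
import re


def rename_interface(config: str, old: str, new: str) -> str:
    """Replace whole-word occurrences of interface `old` with `new`.

    Hypothetical helper: a suffix such as '.100' or ':0' after the name is
    preserved, but longer names that merely start with `old` are left alone.
    """
    pattern = re.compile(rf"(?<![\w-]){re.escape(old)}(?![\w-])")
    # A callable replacement avoids any backslash interpretation in `new`.
    return pattern.sub(lambda _match: new, config)


if __name__ == "__main__":
    # Placeholder stanza for illustration only (192.0.2.0/24 is a documentation range).
    sample = (
        "auto eno1\n"
        "iface eno1 inet static\n"
        "    address 192.0.2.10/24\n"
    )
    print(rename_interface(sample, "eno1", "ens3f0np0"))
```

Doing the substitution on a copy and diffing it against the original before writing it back keeps the change reviewable while the host is still reachable.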

@Marostegui after the summit, if you can take care of the depool, I can sync up with @Papaul for all the other steps, depending on Papaul's schedule.

I am happy to resume this work whenever you guys want.

@Jhancock.wm would it be possible to find a slot to work on that ?

Hey, I can help move the ones already planned on wednesday or thursday.

I can also look into the list at large and see if there are some we can change to 10G that won't have to be moved (i.e. 4 of them clustered together)

I think we agreed on pc2011 for now (we can't move all the 4 pc at once), so I can get pc2011 ready for you to move it on Wed. Let me know if that's ok. Thanks!

sorry clarification. I'll be looking for groupings of servers that we don't need to physically move to get on the 10G. Just updating the physical link and the config on the switch. there should be some to make things easier on everyone in the list.

@Jhancock.wm sure, what I mean is that I cannot depool all pc* hosts at once, that's why I suggested to start with pc2011 to make sure it all goes as we plan.

Mentioned in SAL (#wikimedia-operations) [2025-06-10T13:32:10Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Depool pc1 T378715', diff saved to https://phabricator.wikimedia.org/P77524 and previous config saved to /var/cache/conftool/dbconfig/20250610-133207-marostegui.json

pc2011 can be moved anytime now. It is off.

Mentioned in SAL (#wikimedia-operations) [2025-06-10T14:51:37Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Repool pc1 T378715', diff saved to https://phabricator.wikimedia.org/P77541 and previous config saved to /var/cache/conftool/dbconfig/20250610-145137-marostegui.json

Everything went fine here! What's the next one ? :)

We can do:
pc2012, pc2013, pc2014, pc2015, pc2016, pc2017 and pc2018

@Jhancock.wm let me know if those new U/ports allocation work for you.

pc2012 - U25 - xe-0/0/24
pc2013 - U24 - xe-0/0/23
pc2014 - U4 - xe-0/0/3
pc2015 - U24 - xe-0/0/23
pc2016 - U41 - xe-0/0/40
pc2017 - already at 10G
pc2018 - already at 10G
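
The allocations in this task all pair rack unit N with switch port N-1 (U25 with xe-0/0/24, U4 with xe-0/0/3, and so on). A minimal sketch of that observed pattern follows; the function name is made up, and the rule is an assumption drawn from the examples here rather than a documented policy, since the port actually used can differ if the slot is taken.

```python
def default_port_for_rack_unit(unit: int) -> str:
    """Return the 10G port name that pairs with a rack unit in this task's examples.

    Assumption: port number = rack unit - 1, on FPC 0 / PIC 0.
    """
    if unit < 1:
        raise ValueError("rack units start at 1")
    return f"xe-0/0/{unit - 1}"


if __name__ == "__main__":
    for unit in (25, 24, 4, 41):
        print(f"U{unit} -> {default_port_for_rack_unit(unit)}")
```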

We cannot do them at the same time, but we can do one per day.

@Marostegui i am free to do a move on thursday and friday mornings. exact time slot is up to you. I should be onsite starting at 1400 UTC (9am CDT) on both days. we can move pc2012 on thursday and pc2013 on friday if you have time.

Thanks! I'll get pc2012 ready for tomorrow!

Mentioned in SAL (#wikimedia-operations) [2025-06-19T05:07:26Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Depool pc2 T378715', diff saved to https://phabricator.wikimedia.org/P78392 and previous config saved to /var/cache/conftool/dbconfig/20250619-050725-root.json

@Marostegui everything has been done with two notes.

  • I can't do a homer run (lack permission), so that will still need to be done.
  • i used port 23 on the switch because i had something installed in the originally picked spot.

other than that, everything is updated in the bios and the server is ready to boot.
We can move pc2013 on friday if you want to.

  • I can't do a homer run (lack permission), so that will still need to be done.

As discussed you ran the configure-switch-interfaces cookbook which is all that is needed here. The switch is properly configured now on the new port.

pc2012 is still down and it paged.

I've powered it up myself and I can see the prompt, but I guess there's still something missing within the network as the host isn't reachable.

cumin1002:~# ping pc2012.codfw.wmnet -c5
PING pc2012.codfw.wmnet (10.192.16.55) 56(84) bytes of data.

--- pc2012.codfw.wmnet ping statistics ---
5 packets transmitted, 0 received, 100% packet loss, time 4075ms

After talking to @ayounsi looks like the network/interfaces step was missing.
I've done it via mgmt console and the host is back.
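
For the "verify connectivity" step, a minimal sketch that pulls the loss percentage out of a ping summary like the one above, so a check can fail loudly instead of relying on someone reading the output. The function name is hypothetical and the regex only targets the iputils-style summary line shown in this task.

```python
import re

# Matches e.g. "5 packets transmitted, 0 received, 100% packet loss, time 4075ms"
SUMMARY = re.compile(
    r"(\d+) packets transmitted, (\d+) received, .*?([\d.]+)% packet loss"
)


def packet_loss_percent(ping_output: str) -> float:
    """Return the packet loss percentage reported in a ping summary."""
    match = SUMMARY.search(ping_output)
    if match is None:
        raise ValueError("no ping summary line found")
    return float(match.group(3))


if __name__ == "__main__":
    output = (
        "--- pc2012.codfw.wmnet ping statistics ---\n"
        "5 packets transmitted, 0 received, 100% packet loss, time 4075ms\n"
    )
    loss = packet_loss_percent(output)
    print("unreachable" if loss == 100.0 else f"{loss}% loss")
```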

Mentioned in SAL (#wikimedia-operations) [2025-06-20T07:59:44Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Repool pc2 T378715', diff saved to https://phabricator.wikimedia.org/P78514 and previous config saved to /var/cache/conftool/dbconfig/20250620-075944-root.json

@Marostegui do you want to do another server tomorrow?

@Marostegui do you want to do another server tomorrow?

I will be out for the whole week. Let's sync up when I am back!

@Marostegui I'm ready to take care of more of these this week, when you are ready

@Marostegui I'm ready to take care of more of these this week, when you are ready

Manuel is out this week too. I can help. Do you need them shut down too?

Thanks, steps are listed on T378715#10524038
At least steps 1, 2, 3, 4, 7, 11 for the service owner, everything else for DCops

Mentioned in SAL (#wikimedia-operations) [2025-07-01T10:07:30Z] <ladsgroup@cumin1002> dbctl commit (dc=all): 'Depool pc3 T378715', diff saved to https://phabricator.wikimedia.org/P78729 and previous config saved to /var/cache/conftool/dbconfig/20250701-100729-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2025-07-01T10:08:34Z] <ladsgroup@cumin1002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Switch to 10G (T378715)

pc2013 is ready. One thing: I saw two 10G interfaces:

root@pc2013:~# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: ens3f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether e4:3d:1a:0b:73:e0 brd ff:ff:ff:ff:ff:ff
    altname enp175s0f0np0
3: ens3f1np1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether e4:3d:1a:0b:73:e1 brd ff:ff:ff:ff:ff:ff
    altname enp175s0f1np1
4: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 2c:ea:7f:a4:09:02 brd ff:ff:ff:ff:ff:ff
    altname enp4s0f0
5: eno2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 2c:ea:7f:a4:09:03 brd ff:ff:ff:ff:ff:ff
    altname enp4s0f1

Both ens3f0np0 and ens3f1np1 seem to be 10G (10000baseT/Full)

I went with ens3f0np0
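
For step 2, a minimal sketch that parses `ip link` output like the above and lists the interfaces that are not loopback and currently DOWN, i.e. the candidates for the new NIC. The function names are hypothetical. Note that `ip link` does not report link speed, so this only narrows the list; the supported speeds still have to be checked separately (for example with ethtool).

```python
import re

# One header line per interface, e.g.
# "2: ens3f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT ..."
LINK = re.compile(
    r"^\d+: (?P<name>[^:@]+)(?:@\S+)?: <(?P<flags>[^>]*)> .*?state (?P<state>\S+)",
    re.MULTILINE,
)


def list_links(ip_link_output: str) -> list[tuple[str, str, list[str]]]:
    """Return (name, state, flags) for every interface header line."""
    return [
        (m["name"], m["state"], m["flags"].split(","))
        for m in LINK.finditer(ip_link_output)
    ]


def candidate_nics(ip_link_output: str) -> list[str]:
    """Names of non-loopback interfaces whose state is DOWN."""
    return [
        name
        for name, state, flags in list_links(ip_link_output)
        if "LOOPBACK" not in flags and state == "DOWN"
    ]


if __name__ == "__main__":
    sample = """\
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: ens3f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
4: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
"""
    print(candidate_nics(sample))
```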

pc2013 has been moved.
U18
xe-0/0/17
port description updated. ips moved. server is powering up.
ran the switch-interface cookbook instead of a homer run.
@Ladsgroup all yours. also that was the correct port =)

let me know if y'all want to do another one this week or if you'd like to wait. it is unlikely I will be on site on Thursday or Friday.

Mentioned in SAL (#wikimedia-operations) [2025-07-01T16:44:06Z] <ladsgroup@cumin1002> dbctl commit (dc=all): 'Repool pc3 T378715', diff saved to https://phabricator.wikimedia.org/P78734 and previous config saved to /var/cache/conftool/dbconfig/20250701-164405-ladsgroup.json

pc2013 is fully done, I'll do pc2014 tomorrow morning

Mentioned in SAL (#wikimedia-operations) [2025-07-02T06:15:18Z] <ladsgroup@cumin1002> dbctl commit (dc=all): 'Depool pc4 T378715', diff saved to https://phabricator.wikimedia.org/P78735 and previous config saved to /var/cache/conftool/dbconfig/20250702-061517-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2025-07-02T06:28:22Z] <ladsgroup@cumin1002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Switch to 10G (T378715)

pc2014 has been shut down and is ready for dcops

pc1014 can't get replication from pc2014 because Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Could not find first log file name in binary log index file'

I think that since more than a day has passed, the binlog has been rotated out. Fixing it isn't hard, just a bit annoying.

Mentioned in SAL (#wikimedia-operations) [2025-07-03T07:52:26Z] <ladsgroup@cumin1002> dbctl commit (dc=all): 'Repool pc4 T378715', diff saved to https://phabricator.wikimedia.org/P78744 and previous config saved to /var/cache/conftool/dbconfig/20250703-075225-ladsgroup.json

When do you have time to do pc2015?

i have time to do pc2015 tomorrow.

Sounds good, I do the usual stuff tomorrow morning and hand it over to you.

We will do it next week instead.

@Jhancock.wm let me know which day works for you for pc2015 next week.

I think we both forgot about this one. is pc2015 ready and i missed it? or should we try for tomorrow?

I even set up a calendar for it but I missed it. I will try to have it ready for tomorrow, sorry about it.

@Jhancock.wm pc2015 is now down. The interfaces file was changed. Let me know when you are done from your side so I can run homer

Mentioned in SAL (#wikimedia-operations) [2025-07-16T06:19:27Z] <marostegui> Poweroff pc2015 for 10G migration T378715

@Marostegui server has been moved, entries updated in netbox. bios settings updated. i ran the dns cookbook and the switch-interface cookbook but server isn't pingable. does it need a homer run? lmk if you need me to do anything on this end.

@Jhancock.wm I just ran it, but I am not sure it worked. Maybe @cmooney or @ayounsi can tell me what I am missing:

[15:38:41] marostegui@cumin1002:~$ sudo homer lsw1-c5-codfw* commit "Move pc2015 to 10G T378715"
WARNING:homer.capirca:Netbox capirca.GetHosts script is > 3 days old.
INFO:homer.devices:Initialized 106 devices
INFO:homer:Committing config for query lsw1-c5-codfw* with message: Move pc2015 to 10G T378715
INFO:homer.devices:Matched 1 device(s) for query 'lsw1-c5-codfw*'
INFO:homer:Generating configuration for lsw1-c5-codfw.mgmt.codfw.wmnet
Change for lsw1-c5-codfw.mgmt.codfw.wmnet:

[edit interfaces]
-   ge-0/0/32 {
-       description pc2015;
-       mtu 9192;
-       unit 0 {
-           family ethernet-switching {
-               interface-mode access;
-               vlan {
-                   members private1-c-codfw;
-               }
-           }
-       }
-   }
[edit class-of-service interfaces]
-    ge-0/0/32 {
-        scheduler-map wmf_map;
-        unit 0 {
-            classifiers {
-                dscp v4_classifier;
-                dscp-ipv6 v6_classifier;
-            }
-        }
-    }
[edit protocols sflow]
-    interfaces ge-0/0/32.0;

Type "yes" or "no" to commit or abort the commit for this device, "all" or "none" to commit or abort the commit for this device and all next devices with the same diff.
> yes
INFO:homer.transports.junos:Committing the change on lsw1-c5-codfw.mgmt.codfw.wmnet
INFO:homer:Homer run completed successfully on 1 devices: ['lsw1-c5-codfw.mgmt.codfw.wmnet']

I thought this step had to be completed once you did your part, so maybe it is all messed up and we have to start again?

@Jhancock.wm I have verified via idrac that /etc/network/interfaces was looking good. And just powered off the host again. So maybe we can resume again at point #8 from T378715#10524038

pc2015 is connected to port xe-0/0/23 which was in a group still configured at 1G:

[edit chassis fpc 0 pic 0]
-     port 20 {
-         speed 1G;
-     }
[edit interfaces]
-   ge-0/0/20 {
-       description DISABLED;
-       disable;
-   }

That's an automation issue and will be fixed with https://gerrit.wikimedia.org/r/c/operations/software/homer/deploy/+/1167564

I fixed it manually. That port is now properly configured but is down.

It should be all good when you bring the host back up.

Thank you!
For my knowledge, which step was missed/not done correctly?

Host is up:

[   19.547927] bnxt_en 0000:4b:00.0 eno12399np0: NIC Link is Up, 10000 Mbps (NRZ) full duplex, Flow control: none

For my knowledge, which step was missed/not done correctly?

The steps were done correctly, but there is an edge case that our automation handles improperly: when the new 10G port sits in a 4-port group that is still configured for 1G. The patch above will fix that issue in the long term.
I checked pc2016 and the issue shouldn't happen there.
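
To make the edge case concrete: the diff earlier set the speed on port 20 while pc2015 was cabled to port 23, which is consistent with ports being grouped in fours aligned on multiples of four. A minimal sketch under that assumption (the function name and the alignment rule are inferred from this task, not taken from the switch documentation):

```python
GROUP_SIZE = 4  # assumption: speed is configured per group of four ports


def port_group(port: int, group_size: int = GROUP_SIZE) -> range:
    """Return the range of ports sharing a speed setting with `port`."""
    if port < 0:
        raise ValueError("port numbers are non-negative")
    first = (port // group_size) * group_size
    return range(first, first + group_size)


if __name__ == "__main__":
    for port in (23, 40):
        group = port_group(port)
        print(f"port {port}: group {group.start}-{group.stop - 1}")
```

Under this assumption, before moving a host to a new port it is worth checking whether any port in the same group is still configured for 1G.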

I will have it ready for tomorrow.

@Jhancock.wm pc2016 is now off. I also ran homer already.

@Marostegui server had been moved and everything updated. looks like it still needs a little push. can't get it to ping but everything checks out.

pc2016 is back up.
@Jhancock.wm let's do pc2017 on Monday?

pc2017 and 2018 are already at 10G (T378715#10902792) so we're good for the pc*. Nice job everyone !
Do you want to migrate the DB or ES hosts too (listed in the task description) ?

I'd be happy to start on the es hosts (we can probably create a specific task for it) - @jcrespo this would also be good for backup speeds, correct? Any objections there?

Closing this as completed - will create a task for es hosts.
Thanks everyone for the help

Change #1091250 abandoned by Arnaudb:

[operations/software@master] dbtools: command line helper to evaluate a host, or a group of hosts

https://gerrit.wikimedia.org/r/1091250