
codfw:(3) wikikube-ctrl NIC upgrade to 10G
Closed, Resolved · Public · Request

Description

Quote/Hardware Request & Specifications

wikikube-ctrl hosts are network-bound, which is contributing to capacity issues (T366094). 10G NICs would resolve the issue.

Need By Date

This section should detail when the requesting group/team needs these online and accessible to them (OS installed and puppet calling in.)

If your request ties to a quarterly goal, please list links to that goal here.

Budget Details

Using spares from decom'd servers

Refresh / Replacement / Expanding / New Service

Upgrade to 10G NICs for:

  • wikikube-ctrl2001
  • wikikube-ctrl2002
  • wikikube-ctrl2003

Hostname / Racking / Installation Details

Hostnames:

  • wikikube-ctrl2001
  • wikikube-ctrl2002
  • wikikube-ctrl2003

Networking Setup: Speed:10G.
No changes otherwise.

Quote Review

This section will list/link to each quote for review.

Order Details

This section will be updated to list the order details.

Event Timeline

RobH moved this task from Backlog to Quote Requested on the procurement board.
RobH added projects: DC-Ops, ops-codfw, serviceops.
RobH subscribed.

WMF quote request for Dell USA - NIC upgrades to existing wikikube-ctrl fleet in eqiad - T366204

Dell Team,

We have three older hosts in our Ashburn facility into which we would like to install Broadcom 57414 Dual Port 10/25GbE NICs.

R440 service tags: 9WXL4Z2, DFRB8B3, DHH98B3

Ideally we'd like to install (1) Broadcom 57414 Dual Port 10/25GbE NIC per host. However, if the 57414 isn't available for these, we can go with the Broadcom 57416 or 57412 (depending on what you can sell to install into these older R440s).

Would you please check for one of the above Broadcom NICs to quote (3) for installation into these hosts?

  • Site: eqiad (Ashburn)
  • Ref: T366204

Thanks in advance,

Dell Team,

Actually, we need a second quote for this exact same thing for our servers in our other site codfw.

So updating this request for 2 quotes:
Quote 1

  • eqiad service tags: 9WXL4Z2, DFRB8B3, DHH98B3
  • (3) Broadcom 57414 Dual Port 10/25GbE NIC per
  • Site: eqiad (Ashburn, VA)
  • Ref: T366204

Quote 2

  • codfw service tags: 1JYB613, 1JXB613, GZ298B3
  • (3) Broadcom 57414 Dual Port 10/25GbE NIC per
  • Site: codfw (Carrollton, TX)
  • Ref: T366205

Thanks in advance!

wiki_willy removed a project: procurement.
wiki_willy added subscribers: Papaul, wiki_willy.

Removing the procurement tag, since we have 10g cards available from decom'd hosts. @Papaul - can you work with @kamila on getting these upgraded and migrated to 10g switches (if needed)? Thanks, Willy

wiki_willy shifted this object from the Restricted Space space to the S1 Public space. (May 29 2024, 6:48 PM)
wiki_willy removed a project: procurement.
wiki_willy moved this task from Procurement to Backlog on the ops-codfw board.
wiki_willy updated the task description.
wiki_willy removed a subscriber: RobH.

@Papaul could you please let me know when would be a good time for you to do this? We don't have any specific time requirements, just that earlier would be nice. I would like to do it in two steps (first just 1 server, then the other 2) for capacity reasons. I am in CEST, so US mornings would work well for me, but I can also decom in advance and reimage the next day if you'd prefer to do it asynchronously, just let me know.

@kamila There is some planning that we need to do around this.
We will need to relocate those servers for you to be able to use the 10G interface on the switch side.
Are you planning on reimaging the ones in rows A and B into the new per-rack VLAN? We will have to move those nodes to another rack within the same row so we can use the block-of-4 interface configuration on the switch side. Right now we don't have U space in those racks to move those servers within the same rack.
The one in row C will have to move to a 10G rack, since right now it is in C6, which is a 1G rack.
Please provide me with your plan and we can work on making it work best for you. Thanks
@Jhancock.wm please keep an eye on this; we will have to do some server relocations.

proposed relocations for each server. lemme know if that works for you @Papaul

wikikube-ctrl2001
current rack/U: B6-U13
proposed: B7-U43

wikikube-ctrl2002
current rack/U: C6-U31
proposed: C7-U38

wikikube-ctrl2003
current rack/U: A3-U15
proposed: A2-U42
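
A quick way to sanity-check the proposal against the "another rack, within the same row" constraint mentioned earlier in the thread. This is only a sketch: the `<row><rack>-U<unit>` parsing and the helper names are assumptions made for illustration, not part of any Netbox or DC-Ops tooling.

```python
# Proposed moves copied from the comment above: host -> (current, proposed).
RELOCATIONS = {
    "wikikube-ctrl2001": ("B6-U13", "B7-U43"),
    "wikikube-ctrl2002": ("C6-U31", "C7-U38"),
    "wikikube-ctrl2003": ("A3-U15", "A2-U42"),
}


def parse_location(location: str) -> tuple[str, int, int]:
    """Split a location such as 'B6-U13' into (row, rack number, rack unit).

    Assumes a single-letter row followed by the rack number (hypothetical
    format inferred from the three locations listed in this task).
    """
    rack, unit = location.split("-U")
    return rack[0], int(rack[1:]), int(unit)


def stays_in_row(current: str, proposed: str) -> bool:
    """True when the proposed rack is in the same row as the current one."""
    return parse_location(current)[0] == parse_location(proposed)[0]


for host, (current, proposed) in RELOCATIONS.items():
    print(f"{host}: {current} -> {proposed}, same row: {stays_in_row(current, proposed)}")
```

All three proposed moves stay within their current row while changing rack.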

@Papaul Thanks for the additional details!

I think moving to the new per-rack VLAN shouldn't be a problem for us. I am planning to reimage the machines anyway because of the move.

@Jhancock.wm 's relocation proposal seems reasonable to me, so if it works for you, I'm good with it.

I don't really have constraints other than "fully upgrade + move + re-IP + allow time for reimaging and re-pooling one server before moving to the other two". I don't care about ordering, whatever is easiest for you. I also don't have specific time requirements, other than having advance notice so I can shut the nodes down gracefully and then having time to finish up the first machine and verify things are working before proceeding. (I will be doing my part during more or less EU daytime, so some waiting due to timezone differences might pop up depending on specific timing.)

What other information do you need from my side?

Thanks!

@kamila your plan works for us as well. Just depool and power off the first server you want to move, in your time zone, and let us know which one. When we are on site in our time zone, we will put in the 10G NIC, move it to the new rack, do all the Netbox changes, and hand it back to you for reimaging. Once you are happy with it, we can move on to the next one.

Thanks

@kamila I think the cleanest way to do this may be to decom the server. Let me know what you think.

@Jhancock.wm thanks for the relocation proposal. Works for me.

@kamila your plan works for us as well. Just depool and power off the first server you want to move, in your time zone, and let us know which one. When we are on site in our time zone, we will put in the 10G NIC, move it to the new rack, do all the Netbox changes, and hand it back to you for reimaging. Once you are happy with it, we can move on to the next one.

That sounds good to me. I have decommed wikikube-ctrl2003, it's ready to go. (Note that I started with 2003, not the first one, sorry about that.)

@kamila no problem we can move that one. Once done we will update the task.

@kamila 2003 is ready for reimage.

I don't understand why they need to be moved to get upgraded to 10G. If we take wikikube-ctrl2001 as an example, the switch in rack B6 has plenty of available, ready-to-use 10G ports (for example 44-47).

Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl2003.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl2003.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl2003.codfw.wmnet with OS bullseye completed:

  • wikikube-ctrl2003 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202406171123_kamila_3090519_wikikube-ctrl2003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

@ayounsi wikikube-ctrl2001 is racked in U13. If we move it and plug it into one of ports 44-47, it will mess up the hard work we have been doing in codfw to match the U space to the switch port. In codfw, if a server is racked in U1 it needs to be connected to [xe|ge]-0/0/0. This approach makes onsite work easier.
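
The convention described above (a server in U1 goes to port 0) can be written down as a one-line rule. The sketch below assumes the mapping is simply "port = rack unit - 1" on FPC 0 / PIC 0 of the top-of-rack switch; that off-by-one is inferred from the single U1 example in the comment, and the function name and defaults are hypothetical.

```python
def expected_interface(rack_unit: int, media: str = "xe", fpc: int = 0, pic: int = 0) -> str:
    """Return the switch interface a server in `rack_unit` would use.

    Assumption: U1 maps to port 0, U2 to port 1, and so on. `media` is the
    Junos interface type prefix ("xe" for 10G, "ge" for 1G).
    """
    if rack_unit < 1:
        raise ValueError("rack units are numbered from 1")
    return f"{media}-{fpc}/{pic}/{rack_unit - 1}"


print(expected_interface(1))        # server in U1
print(expected_interface(13, "ge"))  # 1G server in U13
print(expected_interface(43))       # server in U43
```

Under this assumed rule a server in U13 belongs on port 12, which is why plugging it into ports 44-47 would break the U-to-port match. Virtual-chassis switches number their FPC per member, so the `fpc` default of 0 would not apply there.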

cookbooks.sre.hosts.decommission executed by kamila@cumin1002 for hosts: wikikube-ctrl2001.codfw.wmnet

  • wikikube-ctrl2001.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

cookbooks.sre.hosts.decommission executed by kamila@cumin1002 for hosts: wikikube-ctrl2002.codfw.wmnet

  • wikikube-ctrl2002.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

wikikube-ctrl2003 looks happy, thanks for the help!

I have decommed wikikube-ctrl2001 and 2002, they're good to go.

@kamila glad to know 2003 looks happy. We will move 2001 and 2002 today and let you know when they are ready for reimage.

All the Netbox part is done, waiting.

Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl2001.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl2001.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl2002.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl2002.codfw.wmnet with OS bullseye executed with errors:

  • wikikube-ctrl2002 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" wikikube-ctrl2002.codfw.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl2002.codfw.wmnet with OS bullseye

@kamila 2001 and 2002 are ready

papaul@lsw1-b7-codfw> show interfaces descriptions | match wiki*
xe-0/0/42       up    up   wikikube-ctrl2001
papaul@asw-c-codfw> show interfaces descriptions | match wikikube*
xe-7/0/36       up    up   wikikube-ctrl2002
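
For checking several hosts at once, output like the above can be turned into a lookup table. This is a rough sketch that assumes the four-column layout shown here (interface, admin state, link state, description); the regex and function name are illustrative only, and the sample combines lines from the two switches.

```python
import re

# Lines copied from the two switch outputs above.
SAMPLE = """\
xe-0/0/42       up    up   wikikube-ctrl2001
xe-7/0/36       up    up   wikikube-ctrl2002
"""

# interface, admin state, link state, free-text description
LINE = re.compile(
    r"^(?P<ifname>\S+)\s+(?P<admin>up|down)\s+(?P<link>up|down)\s+(?P<descr>.+)$"
)


def parse_descriptions(text: str) -> dict[str, tuple[str, str, str]]:
    """Map each description to (interface, admin state, link state).

    Lines that do not match the assumed layout (prompts, headers) are skipped.
    """
    result = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            result[match["descr"].strip()] = (
                match["ifname"],
                match["admin"],
                match["link"],
            )
    return result


for host, (ifname, admin, link) in parse_descriptions(SAMPLE).items():
    print(f"{host}: {ifname} admin={admin} link={link}")
```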

Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl2001.codfw.wmnet with OS bullseye executed with errors:

  • wikikube-ctrl2001 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run failed and logged in /var/log/spicerack/sre/hosts/reimage/202406191854_kamila_3592797_wikikube-ctrl2001.out, asking the operator what to do
    • First Puppet run failed and logged in /var/log/spicerack/sre/hosts/reimage/202406192015_kamila_3592797_wikikube-ctrl2001.out, asking the operator what to do
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202406200800_kamila_3592797_wikikube-ctrl2001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" wikikube-ctrl2001.codfw.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl2001.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl2001.codfw.wmnet with OS bullseye executed with errors:

  • wikikube-ctrl2001 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" wikikube-ctrl2001.codfw.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl2001.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl2001.codfw.wmnet with OS bullseye executed with errors:

  • wikikube-ctrl2001 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" wikikube-ctrl2001.codfw.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl2002.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl2002.codfw.wmnet with OS bullseye executed with errors:

  • wikikube-ctrl2002 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run failed and logged in /var/log/spicerack/sre/hosts/reimage/202406191908_kamila_3593609_wikikube-ctrl2002.out, asking the operator what to do
    • First Puppet run failed and the operator aborted
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" wikikube-ctrl2002.codfw.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl2002.codfw.wmnet with OS bullseye executed with errors:

  • wikikube-ctrl2002 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" wikikube-ctrl2002.codfw.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl2002.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl2002.codfw.wmnet with OS bullseye executed with errors:

  • wikikube-ctrl2002 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" wikikube-ctrl2002.codfw.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl2002.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl2002.codfw.wmnet with OS bullseye executed with errors:

  • wikikube-ctrl2002 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" wikikube-ctrl2002.codfw.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl2002.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl2002.codfw.wmnet with OS bullseye completed:

  • wikikube-ctrl2002 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run failed and logged in /var/log/spicerack/sre/hosts/reimage/202406210902_kamila_180519_wikikube-ctrl2002.out, asking the operator what to do
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202406211015_kamila_180519_wikikube-ctrl2002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Done, thanks a lot for the help @Papaul !