
replace onboard NIC in kafka-jumbo100[1-6]
Open, High, Public

Description

This task will track the swap of the onboard NIC from the 1GB to the 10GB+1GB NICs being ordered via T227148.

Once the network cards are received on T227148, this can proceed.

Re-image question: we need an answer on whether these hosts will be re-imaged for the new network card (perhaps @elukey will know). Since it is an onboard network card, a re-image would result in the fewest unexpected issues from the change, but it may not be something that can happen in quick succession across this fleet.

kafka-jumbo1001:

  • - determine which 10G rack (within the same row) this can move into (staying in the same row avoids an IP change)
  • - @Jclark-ctr lists on this task, in advance of the move, which rack and port number the 10G connection will use
  • - @RobH sets up the network port with the above info: asw-a4-eqiad:39
  • - @elukey and @Jclark-ctr schedule downtime for this host and take it offline
  • - @Jclark-ctr moves the server to the new 10G rack and plugs it into the 10G port listed above, then ensures the system POSTs (the OS may or may not work; we just want to ensure it POSTs and sees the new hardware in the BIOS)
  • - @RobH removes the switch config for the old 1G port asw2-a-eqiad:ge-1/0/7
  • - @elukey takes over host troubleshooting and will attempt to fix via serial (to avoid a reimage) or reimage (without formatting /srv) as needed
  • - @elukey returns the system to service

kafka-jumbo1002:

  • - determine which 10G rack (within the same row) this can move into (staying in the same row avoids an IP change). No change: already in a 10G rack.
  • - @Jclark-ctr lists on this task, in advance of the move, which rack and port number the 10G connection will use. No change: already in a 10G rack.
  • - @elukey and @Jclark-ctr schedule downtime for this host and take it offline
  • - @Jclark-ctr moves the server to the new 10G rack and plugs it into the 10G port listed above, then ensures the system POSTs (the OS may or may not work; we just want to ensure it POSTs and sees the new hardware in the BIOS)
  • - @elukey takes over host troubleshooting and will attempt to fix via serial (to avoid a reimage) or reimage (without formatting /srv) as needed
  • - @elukey returns the system to service

kafka-jumbo1003:

  • - determine which 10G rack (within the same row) this can move into (staying in the same row avoids an IP change)
  • - @Jclark-ctr lists on this task, in advance of the move, which rack and port number the 10G connection will use
  • - @RobH sets up the network port with the above info: asw-b2-eqiad:35
  • - @elukey and @Jclark-ctr schedule downtime for this host and take it offline
  • - @Jclark-ctr moves the server to the new 10G rack and plugs it into the 10G port listed above, then ensures the system POSTs (the OS may or may not work; we just want to ensure it POSTs and sees the new hardware in the BIOS)
  • - @RobH removes the switch config for the old 1G port asw2-b-eqiad:ge-1/0/16
  • - @elukey takes over host troubleshooting and will attempt to fix via serial (to avoid a reimage) or reimage (without formatting /srv) as needed
  • - @elukey returns the system to service

kafka-jumbo1004:

  • - determine which 10G rack (within the same row) this can move into (staying in the same row avoids an IP change). No change: already in a 10G rack.
  • - @Jclark-ctr lists on this task, in advance of the move, which rack and port number the 10G connection will use. No change: already in a 10G rack.
  • - @elukey and @Jclark-ctr schedule downtime for this host and take it offline
  • - @Jclark-ctr moves the server to the new 10G rack and plugs it into the 10G port listed above, then ensures the system POSTs (the OS may or may not work; we just want to ensure it POSTs and sees the new hardware in the BIOS)
  • - @elukey takes over host troubleshooting and will attempt to fix via serial (to avoid a reimage) or reimage (without formatting /srv) as needed
  • - @elukey returns the system to service

kafka-jumbo1005:

  • - determine which 10G rack (within the same row) this can move into (staying in the same row avoids an IP change). No change: already in a 10G rack.
  • - @Jclark-ctr lists on this task, in advance of the move, which rack and port number the 10G connection will use. No change: already in a 10G rack.
  • - @elukey and @Jclark-ctr schedule downtime for this host and take it offline
  • - @Jclark-ctr moves the server to the new 10G rack and plugs it into the 10G port listed above, then ensures the system POSTs (the OS may or may not work; we just want to ensure it POSTs and sees the new hardware in the BIOS)
  • - @elukey takes over host troubleshooting and will attempt to fix via serial (to avoid a reimage) or reimage (without formatting /srv) as needed
  • - @elukey returns the system to service

kafka-jumbo1006:

  • - determine which 10G rack (within the same row) this can move into (staying in the same row avoids an IP change)
  • - @Jclark-ctr lists on this task, in advance of the move, which rack and port number the 10G connection will use
  • - @RobH sets up the network port with the above info: asw-d7-eqiad:36
  • - @elukey and @Jclark-ctr schedule downtime for this host and take it offline
  • - @Jclark-ctr moves the server to the new 10G rack and plugs it into the 10G port listed above, then ensures the system POSTs (the OS may or may not work; we just want to ensure it POSTs and sees the new hardware in the BIOS)
  • - @RobH removes the switch config for the old 1G port asw2-d-eqiad:ge-1/0/5
  • - @elukey takes over host troubleshooting and will attempt to fix via serial (to avoid a reimage) or reimage (without formatting /srv) as needed
  • - @elukey returns the system to service

Event Timeline

RobH created this task. Oct 23 2019, 9:13 PM
RobH added a parent task: Unknown Object (Task).
fdans triaged this task as High priority. Oct 24 2019, 4:44 PM
fdans moved this task from Incoming to Operational Excellence on the Analytics board.
RobH removed a subscriber: RobH. Oct 24 2019, 5:55 PM
elukey added a comment. Edited Oct 28 2019, 10:38 AM

We can definitely reimage if that is the best path suggested by SRE, but if possible I'd do it manually (temporarily commenting out the partman recipe) to avoid wiping the /srv partition in the process. Data is replicated across 3 brokers by default, but bootstrapping a node without data is a bit overkill if not needed.

Would that be OK with you, @RobH?

elukey added a subscriber: RobH. Oct 28 2019, 10:39 AM

Agree with Luca! As long as Kafka's data is maintained, re-imaging should be the same as a downtime for Kafka.

Reporting a chat with Rob on IRC. We could do the following as a test:

  1. Start with kafka-jumbo1001: schedule downtime and stop kafka. Also run systemctl mask kafka so the host can be freely shut down and started.
  2. Shut down, add the new NIC, wire it and boot again.
  3. Via the mgmt console, adjust the configuration on the host (/etc/interfaces, 70-local-persistent-net.rules, etc.).
  4. Test that basic networking works.

After step 4, if basic networking works then kafka can be started; otherwise a reimage is needed.
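
As a rough illustration of step 4 and the decision that follows it, here is a minimal Python sketch. It assumes Linux iputils `ping` is available on the host. The function names (`ping_once`, `basic_networking_ok`, `decide`) and the choice of probe targets are hypothetical and were not part of the IRC discussion.

```python
import subprocess
from typing import Callable, Iterable


def ping_once(host: str) -> bool:
    """Return True if one ICMP echo to `host` succeeds (Linux iputils ping)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],  # one packet, 2 second timeout
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def basic_networking_ok(
    targets: Iterable[str],
    probe: Callable[[str], bool] = ping_once,
) -> bool:
    """Step 4: treat networking as working only if every target answers."""
    targets = list(targets)
    # An empty target list proves nothing, so it counts as a failure.
    return bool(targets) and all(probe(target) for target in targets)


def decide(network_ok: bool) -> str:
    """Decision after step 4: start kafka if networking works, else reimage."""
    # Assumption: kafka was masked in step 1, so it has to be unmasked
    # (systemctl unmask kafka) before it can be started again.
    return "start kafka" if network_ok else "reimage"
```

For example, `decide(basic_networking_ok(["192.0.2.1"]))` would probe a single address; 192.0.2.1 is a documentation placeholder standing in for whatever gateway or peer is chosen as the target.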

When we are ready we can coordinate to add the new NIC to kafka-jumbo1001 :)

These are projected to be in eqiad Dec 5th

@elukey The cards have been received. Please message me on IRC and we can start scheduling the replacement.

RobH added a comment.Dec 9 2019, 4:51 PM

Please note that I am around and able to assist for this.

@Jclark-ctr: I recommend you snag 6 DAC cables and figure out where each of these new systems is going to go. Then you can plug in and label the DAC cables, and I can set up their ports before you ever move an actual server.

So please list the rack and port for each server's eventual move and I can set up the ports.

I've added checklist steps for each of these port setups.

RobH updated the task description. Dec 9 2019, 4:53 PM
RobH added a comment. Dec 9 2019, 6:18 PM

We can definitely reimage if that is the best path suggested by SRE, but if possible I'd do it manually (temporarily commenting out the partman recipe) to avoid wiping the /srv partition in the process. Data is replicated across 3 brokers by default, but bootstrapping a node without data is a bit overkill if not needed.
Would that be OK with you, @RobH?

I think it's fine for you to reimage (without wiping /srv) for this, since we already know the partition layout works (with a FULL disk reimage). The existing partman recipe has been tested, confirmed, and reused a lot.

RobH reassigned this task from Cmjohnson to Jclark-ctr. Dec 9 2019, 6:25 PM
RobH updated the task description.

Please note I've chatted with @wiki_willy, @Jclark-ctr, & @elukey about this, and I've updated all of the checklists for each server with the following:

  • - determine which 10G rack (within the same row) this can move into (staying in the same row avoids an IP change)
  • - @Jclark-ctr lists on this task, in advance of the move, which rack and port number the 10G connection will use
  • - @RobH sets up the network port with the above info
  • - @elukey and @Jclark-ctr schedule downtime for this host and take it offline
  • - @Jclark-ctr moves the server to the new 10G rack and plugs it into the 10G port listed above, then ensures the system POSTs (the OS may or may not work; we just want to ensure it POSTs and sees the new hardware in the BIOS)
  • - @elukey takes over host troubleshooting and will attempt to fix via serial (to avoid a reimage) or reimage (without formatting /srv) as needed
  • - @elukey returns the system to service
Jclark-ctr added a comment. Edited Dec 10 2019, 10:44 PM

@RobH
Server           New Rack  Switchport
kafka-jumbo1001  a4        39
kafka-jumbo1003  b2        35
kafka-jumbo1006  d7        36

Jclark-ctr updated the task description. Dec 10 2019, 10:57 PM
RobH updated the task description. Dec 11 2019, 9:26 PM
RobH updated the task description. Dec 11 2019, 9:28 PM
RobH added a comment. Dec 11 2019, 9:31 PM

@RobH
Server           New Rack  Switchport
kafka-jumbo1001  a4        39
kafka-jumbo1003  b2        35
kafka-jumbo1006  d7        36

kafka-jumbo1001's 1G interface is on asw2-a-eqiad:ge-1/0/7, renamed kafka-jumbo1001-old
kafka-jumbo1001's 10G interface is on asw2-a-eqiad:xe-4/0/39, named kafka-jumbo1001

kafka-jumbo1003's 1G interface is on asw2-b-eqiad:ge-1/0/16, renamed kafka-jumbo1003-old
kafka-jumbo1003's 10G interface is on asw2-b-eqiad:xe-2/0/35, named kafka-jumbo1003

kafka-jumbo1006's 1G interface is on asw2-d-eqiad:ge-1/0/5, renamed kafka-jumbo1006-old
kafka-jumbo1006's 10G interface is on asw2-d-eqiad:xe-7/0/36, named kafka-jumbo1006

Once these hosts have been moved, their old interfaces need to be removed from the switch. I've updated the checklist for these three hosts with that step. The port setup for the three moves is done.
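
The interface list above can be captured as data so the later cleanup is easy to track. This is a minimal Python sketch that uses only the switch and interface names given in this comment. The names `Move`, `MOVES`, `interface_kind`, `check_moves` and `ports_to_remove` are made up for illustration. It relies on the Junos convention that `ge-` interfaces are 1G and `xe-` interfaces are 10G, and it assumes the old and new ports of a host should sit on the same switch stack.

```python
from typing import Dict, Iterable, List, NamedTuple


class Move(NamedTuple):
    old_port: str  # 1G interface, renamed "<host>-old" on the switch
    new_port: str  # 10G interface, named "<host>" on the switch


# Values copied from the comment above.
MOVES: Dict[str, Move] = {
    "kafka-jumbo1001": Move("asw2-a-eqiad:ge-1/0/7", "asw2-a-eqiad:xe-4/0/39"),
    "kafka-jumbo1003": Move("asw2-b-eqiad:ge-1/0/16", "asw2-b-eqiad:xe-2/0/35"),
    "kafka-jumbo1006": Move("asw2-d-eqiad:ge-1/0/5", "asw2-d-eqiad:xe-7/0/36"),
}


def interface_kind(port: str) -> str:
    """Return the media prefix of a 'switch:interface' string, e.g. 'ge' or 'xe'."""
    _switch, interface = port.split(":", 1)
    return interface.split("-", 1)[0]


def check_moves(moves: Dict[str, Move]) -> List[str]:
    """Return readable problems; an empty list means the data looks consistent."""
    problems = []
    for host, move in sorted(moves.items()):
        if interface_kind(move.old_port) != "ge":
            problems.append(f"{host}: old port {move.old_port} is not a ge- interface")
        if interface_kind(move.new_port) != "xe":
            problems.append(f"{host}: new port {move.new_port} is not an xe- interface")
        if move.old_port.split(":", 1)[0] != move.new_port.split(":", 1)[0]:
            problems.append(f"{host}: old and new ports are on different switches")
    return problems


def ports_to_remove(moves: Dict[str, Move], moved_hosts: Iterable[str]) -> List[str]:
    """List the old 1G ports to delete for hosts that have already been moved."""
    return [
        f"{moves[host].old_port} ({host}-old)"
        for host in sorted(set(moved_hosts))
        if host in moves
    ]
```

Calling `ports_to_remove(MOVES, ["kafka-jumbo1003"])` after that host has moved would list only its old 1G port, which keeps the removal step from touching a port that is still in use.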

@Jclark-ctr where are you with these?

@Cmjohnson No NIC installed and no host moved yet. @RobH has helped with the 10G interfaces.

@Cmjohnson @Jclark-ctr let's sync about next steps whenever you have time!