
replace onboard NIC in kafka-jumbo100[1-6]
Closed, ResolvedPublic

Description

This task will track the swap of the onboard NIC from the 1G NICs to the 10G+1G NICs being ordered via T227148.

Once the network cards are received on T227148, this can proceed.

Re-image question: we need an answer on whether these hosts will be re-imaged for the new network card (perhaps @elukey will know). Since it is an onboard network card, a re-image would result in the fewest unexpected issues from the change, but it may not be something that can be done in quick succession across this fleet.

kafka-jumbo1001:

  • - determine which 10G rack (within the same row) this can move into (staying in the same row avoids an IP change).
  • - @Jclark-ctr lists on this task which rack and port# the 10G connection will use, in advance of the move
  • - @RobH sets up network port with the above info. asw-a4-eqiad:39
  • - @elukey and @Jclark-ctr schedule downtime for this host, and put it offline
  • - @Jclark-ctr moves server to new 10G rack and plugs into the 10G port he listed above, then ensures the system POSTs (the OS may or may not work; we just want to ensure it POSTs and sees the new hardware in the BIOS)
  • - @RobH removes switch config for old 1G port asw2-a-eqiad:ge-1/0/7
  • - @elukey takes over host troubleshooting and will attempt to fix via serial (to avoid reimage) or reimage (and avoid formatting /srv) as needed.
  • - @elukey returns system to service

kafka-jumbo1002:

  • - determine which 10G rack (within the same row) this can move into (staying in the same row avoids an IP change). No changes: already in a 10G rack.
  • - @Jclark-ctr lists on this task which rack and port# the 10G connection will use, in advance of the move. No changes: already in a 10G rack.
  • - @elukey and @Jclark-ctr schedule downtime for this host, and put it offline
  • - @Jclark-ctr moves server to new 10G rack and plugs into the 10G port he listed above, then ensures the system POSTs (the OS may or may not work; we just want to ensure it POSTs and sees the new hardware in the BIOS)
  • - @elukey takes over host troubleshooting and will attempt to fix via serial (to avoid reimage) or reimage (and avoid formatting /srv) as needed.
  • - @elukey returns system to service

kafka-jumbo1003:

  • - determine which 10G rack (within the same row) this can move into (staying in the same row avoids an IP change).
  • - @Jclark-ctr lists on this task which rack and port# the 10G connection will use, in advance of the move
  • - @RobH sets up network port with the above info. asw-b2-eqiad:35
  • - @elukey and @Jclark-ctr schedule downtime for this host, and put it offline
  • - @Jclark-ctr moves server to new 10G rack and plugs into the 10G port he listed above, then ensures the system POSTs (the OS may or may not work; we just want to ensure it POSTs and sees the new hardware in the BIOS)
  • - @RobH removes switch config for old 1G port asw2-b-eqiad:ge-1/0/16
  • - @elukey takes over host troubleshooting and will attempt to fix via serial (to avoid reimage) or reimage (and avoid formatting /srv) as needed.
  • - @elukey returns system to service

kafka-jumbo1004:

  • - determine which 10G rack (within the same row) this can move into (staying in the same row avoids an IP change). No changes: already in a 10G rack.
  • - @Jclark-ctr lists on this task which rack and port# the 10G connection will use, in advance of the move. No changes: already in a 10G rack.
  • - @elukey and @Jclark-ctr schedule downtime for this host, and put it offline
  • - @Jclark-ctr moves server to new 10G rack and plugs into the 10G port he listed above, then ensures the system POSTs (the OS may or may not work; we just want to ensure it POSTs and sees the new hardware in the BIOS)
  • - @elukey takes over host troubleshooting and will attempt to fix via serial (to avoid reimage) or reimage (and avoid formatting /srv) as needed.
  • - @elukey returns system to service

kafka-jumbo1005:

  • - determine which 10G rack (within the same row) this can move into (staying in the same row avoids an IP change). No changes: already in a 10G rack.
  • - @Jclark-ctr lists on this task which rack and port# the 10G connection will use, in advance of the move. No changes: already in a 10G rack.
  • - @elukey and @Jclark-ctr schedule downtime for this host, and put it offline
  • - @Jclark-ctr moves server to new 10G rack and plugs into the 10G port he listed above, then ensures the system POSTs (the OS may or may not work; we just want to ensure it POSTs and sees the new hardware in the BIOS)
  • - @elukey takes over host troubleshooting and will attempt to fix via serial (to avoid reimage) or reimage (and avoid formatting /srv) as needed.
  • - @elukey returns system to service

kafka-jumbo1006:

  • - determine which 10G rack (within the same row) this can move into (staying in the same row avoids an IP change).
  • - @Jclark-ctr lists on this task which rack and port# the 10G connection will use, in advance of the move
  • - @RobH sets up network port with the above info. asw-d7-eqiad:36
  • - @elukey and @Jclark-ctr schedule downtime for this host, and put it offline
  • - @Jclark-ctr moves server to new 10G rack and plugs into the 10G port he listed above, then ensures the system POSTs (the OS may or may not work; we just want to ensure it POSTs and sees the new hardware in the BIOS)
  • - @RobH removes switch config for old 1G port asw2-d-eqiad:ge-1/0/5
  • - @elukey takes over host troubleshooting and will attempt to fix via serial (to avoid reimage) or reimage (and avoid formatting /srv) as needed.
  • - @elukey returns system to service

Event Timeline

RobH created this task. Oct 23 2019, 9:13 PM
RobH added a parent task: Unknown Object (Task).
fdans triaged this task as High priority. Oct 24 2019, 4:44 PM
fdans moved this task from Incoming to Operational Excellence on the Analytics board.
RobH removed a subscriber: RobH. Oct 24 2019, 5:55 PM
elukey added a comment. Edited Oct 28 2019, 10:38 AM

We can definitely reimage if it is the best path suggested by SRE, but if possible I'd do it manually (temporarily commenting out the partman recipe) so that the /srv partition is not wiped in the process. Data is replicated across 3 brokers by default, but bootstrapping a node without data is overkill if it is not needed.

Would it be ok for you @RobH ?

elukey added a subscriber: RobH. Oct 28 2019, 10:39 AM

Agree with Luca! As long as Kafka's data is maintained, re-imaging should be the same as a downtime for Kafka.

Reporting a chat with Rob on IRC. We could do the following as test:

  1. start with kafka-jumbo1001: schedule downtime and stop kafka. Also run systemctl mask kafka so the host can be freely shut down and started.
  2. shut down, add the new NIC, wire it and boot again
  3. via mgmt console, adjust the configuration on the host (/etc/network/interfaces, 70-local-persistent-net.rules, etc.)
  4. test that basic networking works

After 4), if networking works then kafka can be started; otherwise a reimage is needed.
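
The test procedure above can be written down as an ordered plan. The following is only a minimal sketch: the function name `nic_swap_test_plan` and the `MANUAL:` prefix are made up for illustration, the unit name `kafka` is taken from the `systemctl mask kafka` step, and nothing is executed. It simply makes the ordering explicit, in particular that the unit stays masked until networking has been verified.

```python
from typing import List


def nic_swap_test_plan(host: str, unit: str = "kafka") -> List[str]:
    """Return the steps of the test procedure, in order, as plain strings.

    Nothing is run here. Steps done on site or on the mgmt console are
    prefixed with MANUAL (a hypothetical convention for this sketch).
    """
    return [
        f"MANUAL: schedule downtime for {host}",
        f"systemctl stop {unit}",
        # Masking keeps the unit from starting during the intermediate boots.
        f"systemctl mask {unit}",
        "MANUAL: shut down, add the new NIC, wire it and boot again",
        "MANUAL: via mgmt console, adjust the network configuration",
        "MANUAL: test that basic networking works",
        # Only reached if networking works; otherwise a reimage is needed.
        f"systemctl unmask {unit}",
        f"systemctl start {unit}",
    ]


if __name__ == "__main__":
    for number, step in enumerate(nic_swap_test_plan("kafka-jumbo1001"), start=1):
        print(f"{number}. {step}")
```

Keeping the plan as data rather than running commands directly means it can be reviewed on the task before anyone touches the host.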

When we are ready we can coordinate to add the new NIC to kafka-jumbo1001 :)

These are projected to be in eqiad Dec 5th

@elukey Received the cards. Please message me on IRC and we can start scheduling the replacement.

RobH added a comment. Dec 9 2019, 4:51 PM

Please note that I am around and able to assist for this.

@Jclark-ctr: I recommend you snag 6 DAC cables, and figure out where each of these new systems is going to go. Then you can plug in and label the DAC cables, and I can setup their ports before you ever move an actual server.

So please list off the rack & port for each server's eventual move and I can setup the ports.

I've added checklist steps for each of these port setups.

RobH updated the task description. Dec 9 2019, 4:53 PM
RobH added a comment. Dec 9 2019, 6:18 PM

We can definitely reimage if it is the best path suggested by SRE, but if possible I'd do it manually (temporarily commenting out the partman recipe) so that the /srv partition is not wiped in the process. Data is replicated across 3 brokers by default, but bootstrapping a node without data is overkill if it is not needed.

Would it be ok for you @RobH ?

I think it's fine for you to reimage (and exclude wiping /srv) for this, considering we know the partition layout (with FULL disk reimage) works already. The existing partman recipe has been tested/confirmed/reused a lot.

RobH reassigned this task from Cmjohnson to Jclark-ctr. Dec 9 2019, 6:25 PM
RobH updated the task description.

Please note I've chatted with @wiki_willy, @Jclark-ctr, & @elukey about this, and I've updated all of the checklists for each server with the following:

  • - determine which 10G rack (within the same row) this can move into (staying in the same row avoids an IP change).
  • - @Jclark-ctr lists on this task which rack and port# the 10G connection will use, in advance of the move
  • - @RobH sets up network port with the above info.
  • - @elukey and @Jclark-ctr schedule downtime for this host, and put it offline
  • - @Jclark-ctr moves server to new 10G rack and plugs into the 10G port he listed above, then ensures the system POSTs (the OS may or may not work; we just want to ensure it POSTs and sees the new hardware in the BIOS)
  • - @elukey takes over host troubleshooting and will attempt to fix via serial (to avoid reimage) or reimage (and avoid formatting /srv) as needed.
  • - @elukey returns system to service

Jclark-ctr added a comment. Edited Dec 10 2019, 10:44 PM

@RobH
Server           New Rack  Switchport
kafka-jumbo1001  a4        39
kafka-jumbo1003  b2        35
kafka-jumbo1006  d7        36

Jclark-ctr updated the task description. Dec 10 2019, 10:57 PM
RobH updated the task description. Dec 11 2019, 9:26 PM
RobH updated the task description. Dec 11 2019, 9:28 PM
RobH added a comment. Dec 11 2019, 9:31 PM

@RobH
Server           New Rack  Switchport
kafka-jumbo1001  a4        39
kafka-jumbo1003  b2        35
kafka-jumbo1006  d7        36

kafka-jumbo1001's 1G interface is on asw2-a-eqiad:ge-1/0/7, renamed kafka-jumbo1001-old
kafka-jumbo1001's 10G interface is on asw2-a-eqiad:xe-4/0/39, named as kafka-jumbo1001

kafka-jumbo1003's 1G interface is on asw2-b-eqiad:ge-1/0/16, renamed kafka-jumbo1003-old
kafka-jumbo1003's 10G interface is on asw2-b-eqiad:xe-2/0/35, named as kafka-jumbo1003

kafka-jumbo1006's 1G interface is on asw2-d-eqiad:ge-1/0/5, renamed kafka-jumbo1006-old
kafka-jumbo1006's 10G interface is on asw2-d-eqiad:xe-7/0/36, named as kafka-jumbo1006

Once these hosts have been moved, their old interfaces need to be removed from the switch. I've updated the checklist for these three hosts with that step. The ports for the three moves are all done.
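
The interface names in the listing above follow the Junos `type-fpc/pic/port` form, where `ge` is a 1 Gbit/s and `xe` a 10 Gbit/s Ethernet port. As a rough sketch, a small parser can confirm that every "old" port is 1G and every new one is 10G before any switch config is removed. `parse_port` and `expected_10g_port` are hypothetical helpers; the rack-to-interface mapping in `expected_10g_port` is only a pattern inferred from the three hosts listed here, not a documented convention.

```python
import re
from typing import NamedTuple

# Junos interface-type prefixes: ge = 1 Gbit/s Ethernet, xe = 10 Gbit/s Ethernet.
SPEED_GBPS = {"ge": 1, "xe": 10}

PORT_RE = re.compile(
    r"^(?P<switch>[\w-]+):(?P<type>[a-z]+)-(?P<fpc>\d+)/(?P<pic>\d+)/(?P<port>\d+)$"
)


class SwitchPort(NamedTuple):
    switch: str
    speed_gbps: int
    fpc: int
    pic: int
    port: int


def parse_port(name: str) -> SwitchPort:
    """Split a name such as 'asw2-a-eqiad:ge-1/0/7' into its parts."""
    m = PORT_RE.match(name)
    if m is None or m["type"] not in SPEED_GBPS:
        raise ValueError(f"unrecognised switch port: {name!r}")
    return SwitchPort(
        m["switch"], SPEED_GBPS[m["type"]], int(m["fpc"]), int(m["pic"]), int(m["port"])
    )


def expected_10g_port(rack: str, port: int) -> str:
    """Guess the 10G interface for a rack such as 'a4'.

    ASSUMPTION: row letter -> switch stack, rack number -> FPC. This only
    reproduces the three examples on this task.
    """
    row, rack_number = rack[0], int(rack[1:])
    return f"asw2-{row}-eqiad:xe-{rack_number}/0/{port}"


if __name__ == "__main__":
    moves = [
        ("asw2-a-eqiad:ge-1/0/7", "asw2-a-eqiad:xe-4/0/39", "a4", 39),
        ("asw2-b-eqiad:ge-1/0/16", "asw2-b-eqiad:xe-2/0/35", "b2", 35),
        ("asw2-d-eqiad:ge-1/0/5", "asw2-d-eqiad:xe-7/0/36", "d7", 36),
    ]
    for old, new, rack, port in moves:
        print(old, parse_port(old).speed_gbps, new, parse_port(new).speed_gbps,
              new == expected_10g_port(rack, port))
```

A check like this would flag a checklist entry that labels an `xe-` interface as the old 1G port.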

@Jclark-ctr where are you with these?

@Cmjohnson no NIC installed or host moved yet. @RobH has helped with the 10G interfaces.

@Cmjohnson @Jclark-ctr let's sync about next steps whenever you have time!

RobH removed a subscriber: RobH. Mar 3 2020, 6:17 PM
elukey moved this task from Backlog to Q1 2020/2021 on the Analytics-Clusters board.
elukey added a subscriber: RobH.
RobH removed a subscriber: RobH. Jun 10 2020, 3:41 PM

@elukey I am on site every Tuesday and Thursday and usually arrive at 9:00am EST. Message me on IRC to work out a schedule that works for you.

Thanks a lot! We don't have a lot of time during these days, would it be ok to schedule something early next Q? (In July I mean)

@elukey Sounds good. I will be taking a vacation in August, so July would be best.

Aklapper removed a project: Analytics. Jul 4 2020, 7:59 AM

@elukey can you let me know your availability for scheduling this project?

@elukey can you let me know your availability for scheduling this project?

Any time that you are free would be ok. We can schedule the first host during one of your mornings to test the procedure (me shutting down kafka, then you adding the NIC, then OS config, etc.).

@Jclark-ctr should we sync about this to schedule the first host (when you have time of course)?

@elukey will these hosts be part of the failover?

elukey added a comment. Sep 1 2020, 6:31 AM

@Jclark-ctr they will not, but we can do one host at a time anyway when you have time!

elukey reassigned this task from Jclark-ctr to Cmjohnson. Oct 12 2020, 9:26 AM
elukey added a subscriber: RobH.

@Cmjohnson if you have some time in the next few days, can we swap the NIC on one node only (to verify the procedure and make sure that the NICs are ok, etc.)?

@elukey I am sorry but I have to push these off to the first week of November. Let's coordinate a schedule next week.

After booting kafka-jumbo1006 with the 10g nic:

[Fri Oct 30 15:05:14 2020] bnx2x 0000:01:00.0: firmware: failed to load bnx2x/bnx2x-e2-7.13.1.0.fw (-2)
[Fri Oct 30 15:05:14 2020] firmware_class: See https://wiki.debian.org/Firmware for information about missing firmware
[Fri Oct 30 15:05:14 2020] bnx2x 0000:01:00.0: Direct firmware load for bnx2x/bnx2x-e2-7.13.1.0.fw failed with error -2
[Fri Oct 30 15:05:14 2020] bnx2x: [bnx2x_func_hw_init:6003(eno1)]Error loading firmware
[Fri Oct 30 15:05:14 2020] bnx2x: [bnx2x_nic_load:2732(eno1)]HW init failed, aborting

Change 637711 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/puppet@production] adding new mac address for update 10G nic kafka-jumbo1006

https://gerrit.wikimedia.org/r/637711

We had to roll back the NIC on 1006. We need to install firmware-bnx2x on all nodes before doing any work (checked with Faidon since it is a non-free package). The drivers are usually added at d-i/install time, but since we are not reimaging, we need to do it manually.
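
Since the same failure would hit every node that is not reimaged, a pre-flight check may be worth having. The sketch below is illustrative only: `missing_firmware` and `firmware_present` are hypothetical helper names, and `/lib/firmware` is assumed to be the firmware search path on these hosts. It pulls the firmware file name out of kernel messages like the ones quoted above (error -2 is ENOENT, i.e. file not found) and checks whether that file exists.

```python
import re
from pathlib import Path
from typing import List

# Matches kernel lines such as:
#   "Direct firmware load for bnx2x/bnx2x-e2-7.13.1.0.fw failed with error -2"
FW_FAIL_RE = re.compile(r"Direct firmware load for (\S+) failed with error (-?\d+)")


def missing_firmware(kernel_log: str) -> List[str]:
    """Return firmware paths the kernel could not find (error -2 is ENOENT)."""
    found: List[str] = []
    for path, err in FW_FAIL_RE.findall(kernel_log):
        if int(err) == -2 and path not in found:
            found.append(path)
    return found


def firmware_present(relative_path: str, root: str = "/lib/firmware") -> bool:
    """Check for the firmware file under root (assumed default search path)."""
    return (Path(root) / relative_path).is_file()


SAMPLE_LOG = """\
[Fri Oct 30 15:05:14 2020] bnx2x 0000:01:00.0: firmware: failed to load bnx2x/bnx2x-e2-7.13.1.0.fw (-2)
[Fri Oct 30 15:05:14 2020] bnx2x 0000:01:00.0: Direct firmware load for bnx2x/bnx2x-e2-7.13.1.0.fw failed with error -2
[Fri Oct 30 15:05:14 2020] bnx2x: [bnx2x_nic_load:2732(eno1)]HW init failed, aborting
"""

if __name__ == "__main__":
    for fw in missing_firmware(SAMPLE_LOG):
        state = "present" if firmware_present(fw) else "MISSING"
        print(f"{fw}: {state}")
```

Running the presence check on each remaining host before its maintenance window would show whether firmware-bnx2x still needs to be installed there.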

RobH removed a subscriber: RobH. Oct 30 2020, 7:54 PM
elukey added a comment. Nov 2 2020, 7:31 AM

firmware-bnx2x installed manually on kafka-jumbo1006, we can retry the switch anytime to see if it works.

elukey added a comment. Nov 3 2020, 7:12 AM

Reporting a conversation with Chris over email about when to do the maintenance:

The 4th works for me at 1130EST

If possible I'd just swap the NIC on jumbo1006 and let it bake for a couple of days before proceeding with the rest.

elukey added a comment. Nov 3 2020, 4:21 PM

Today I swapped the NIC on kafka-jumbo1006 with Chris. There was no need for /etc/network/interfaces changes; firmware-bnx2x was sufficient (no renaming of the eno1 interface, etc.).

Next step is then to swap NICs on 1001->1005.

elukey added a comment. Nov 4 2020, 4:24 PM

Reporting in here a chat with Chris - the maintenance is postponed to tomorrow (5th)

Cmjohnson closed this task as Resolved. Nov 5 2020, 8:21 PM

this has been completed

Change 637711 merged by Cmjohnson:
[operations/puppet@production] adding new mac address for update 10G nic kafka-jumbo1006

https://gerrit.wikimedia.org/r/637711