replace onboard NIC in kafka-jumbo100[1-6]
Closed, Resolved, Public

Description

This task will track the swap of the onboard NIC from the 1G card to the 10G+1G NICs being ordered via T227148.

Once the network cards are received on T227148, this can proceed.

Re-image question: we need an answer on whether these hosts will be re-imaged for the new network card (perhaps @elukey will know). As it is an onboard network card, a re-image would result in the fewest unexpected issues from the change, but it may not be something that can occur in quick succession across this fleet.

kafka-jumbo1001:

  • - determine which 10G rack (within the same row) this can move into (staying in the same row avoids an IP change)
  • - @Jclark-ctr lists on this task, in advance of the move, which rack and port# the 10G connection will be on
  • - @RobH sets up the network port with the above info: asw-a4-eqiad:39
  • - @elukey and @Jclark-ctr schedule downtime for this host and take it offline
  • - @Jclark-ctr moves the server to the new 10G rack and plugs it into the 10G port listed above, then ensures the system POSTs (the OS may or may not work; we just want to ensure it POSTs and sees the new hardware in the BIOS)
  • - @RobH removes the switch config for the old 1G port asw2-a-eqiad:ge-1/0/7
  • - @elukey takes over host troubleshooting and will attempt to fix via serial console (to avoid a reimage) or reimage (while avoiding formatting /srv) as needed
  • - @elukey returns the system to service

kafka-jumbo1002:

  • - determine which 10G rack (within the same row) this can move into (staying in the same row avoids an IP change). No change needed: already in a 10G rack.
  • - @Jclark-ctr lists on this task, in advance of the move, which rack and port# the 10G connection will be on. No change needed: already in a 10G rack.
  • - @elukey and @Jclark-ctr schedule downtime for this host and take it offline
  • - @Jclark-ctr moves the server to the new 10G rack and plugs it into the 10G port listed above, then ensures the system POSTs (the OS may or may not work; we just want to ensure it POSTs and sees the new hardware in the BIOS)
  • - @elukey takes over host troubleshooting and will attempt to fix via serial console (to avoid a reimage) or reimage (while avoiding formatting /srv) as needed
  • - @elukey returns the system to service

kafka-jumbo1003:

  • - determine which 10G rack (within the same row) this can move into (staying in the same row avoids an IP change)
  • - @Jclark-ctr lists on this task, in advance of the move, which rack and port# the 10G connection will be on
  • - @RobH sets up the network port with the above info: asw-b2-eqiad:35
  • - @elukey and @Jclark-ctr schedule downtime for this host and take it offline
  • - @Jclark-ctr moves the server to the new 10G rack and plugs it into the 10G port listed above, then ensures the system POSTs (the OS may or may not work; we just want to ensure it POSTs and sees the new hardware in the BIOS)
  • - @RobH removes the switch config for the old 1G port asw2-b-eqiad:ge-1/0/16
  • - @elukey takes over host troubleshooting and will attempt to fix via serial console (to avoid a reimage) or reimage (while avoiding formatting /srv) as needed
  • - @elukey returns the system to service

kafka-jumbo1004:

  • - determine which 10G rack (within the same row) this can move into (staying in the same row avoids an IP change). No change needed: already in a 10G rack.
  • - @Jclark-ctr lists on this task, in advance of the move, which rack and port# the 10G connection will be on. No change needed: already in a 10G rack.
  • - @elukey and @Jclark-ctr schedule downtime for this host and take it offline
  • - @Jclark-ctr moves the server to the new 10G rack and plugs it into the 10G port listed above, then ensures the system POSTs (the OS may or may not work; we just want to ensure it POSTs and sees the new hardware in the BIOS)
  • - @elukey takes over host troubleshooting and will attempt to fix via serial console (to avoid a reimage) or reimage (while avoiding formatting /srv) as needed
  • - @elukey returns the system to service

kafka-jumbo1005:

  • - determine which 10G rack (within the same row) this can move into (staying in the same row avoids an IP change). No change needed: already in a 10G rack.
  • - @Jclark-ctr lists on this task, in advance of the move, which rack and port# the 10G connection will be on. No change needed: already in a 10G rack.
  • - @elukey and @Jclark-ctr schedule downtime for this host and take it offline
  • - @Jclark-ctr moves the server to the new 10G rack and plugs it into the 10G port listed above, then ensures the system POSTs (the OS may or may not work; we just want to ensure it POSTs and sees the new hardware in the BIOS)
  • - @elukey takes over host troubleshooting and will attempt to fix via serial console (to avoid a reimage) or reimage (while avoiding formatting /srv) as needed
  • - @elukey returns the system to service

kafka-jumbo1006:

  • - determine which 10G rack (within the same row) this can move into (staying in the same row avoids an IP change)
  • - @Jclark-ctr lists on this task, in advance of the move, which rack and port# the 10G connection will be on
  • - @RobH sets up the network port with the above info: asw-d7-eqiad:36
  • - @elukey and @Jclark-ctr schedule downtime for this host and take it offline
  • - @Jclark-ctr moves the server to the new 10G rack and plugs it into the 10G port listed above, then ensures the system POSTs (the OS may or may not work; we just want to ensure it POSTs and sees the new hardware in the BIOS)
  • - @RobH removes the switch config for the old 1G port asw2-d-eqiad:ge-1/0/5
  • - @elukey takes over host troubleshooting and will attempt to fix via serial console (to avoid a reimage) or reimage (while avoiding formatting /srv) as needed
  • - @elukey returns the system to service

Event Timeline

RobH added a parent task: Unknown Object (Task).
fdans moved this task from Incoming to Operational Excellence on the Analytics board.

We can definitely reimage if that is the best path suggested by SRE, but if possible I'd do it manually (temporarily commenting out the partman recipe) to avoid the /srv partition being wiped in the process. Data is replicated across 3 brokers by default, but bootstrapping a node without data is a bit overkill if not needed.

Would that be OK with you, @RobH?

Agree with Luca! As long as Kafka's data is maintained, re-imaging should be the same as a downtime for Kafka.

Reporting a chat with Rob on IRC. We could do the following as test:

  1. start with kafka-jumbo1001: schedule downtime and stop kafka. Also run systemctl mask kafka so the host can be freely shut down and started without kafka coming up.
  2. shut down, add the new NIC, wire it and boot again
  3. via the mgmt console, adjust the configuration on the host (/etc/network/interfaces, 70-local-persistent-net.rules, etc.)
  4. test that basic networking works

If basic networking works after step 4, kafka can be started; otherwise a reimage is needed.
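
As an illustration of what "basic networking works" in step 4 might look like in practice, here is a minimal sketch, not the check actually used on this task. It assumes the interface is named eno1 (the name that shows up in the kernel log further down this task) and relies on the standard Linux sysfs files /sys/class/net/<iface>/operstate and /sys/class/net/<iface>/speed. The function names and the 10000 Mb/s threshold are hypothetical.

```python
from pathlib import Path


def link_status(iface: str, sysfs_root: str = "/sys/class/net") -> dict:
    """Read link state and negotiated speed for one interface from sysfs."""
    base = Path(sysfs_root) / iface
    if not base.is_dir():
        # The kernel did not create the interface, or it has another name.
        return {"present": False, "up": False, "speed_mbps": None}

    operstate = (base / "operstate").read_text().strip()
    try:
        # 'speed' is reported in Mb/s; reading it can fail while the link is down.
        speed = int((base / "speed").read_text().strip())
    except (OSError, ValueError):
        speed = None
    return {"present": True, "up": operstate == "up", "speed_mbps": speed}


def looks_like_10g(status: dict, min_mbps: int = 10000) -> bool:
    """True when the interface exists, is up, and reports at least min_mbps."""
    speed = status["speed_mbps"]
    return bool(
        status["present"]
        and status["up"]
        and speed is not None
        and speed >= min_mbps
    )


if __name__ == "__main__":
    status = link_status("eno1")  # interface name is an assumption
    print(status, "10G OK" if looks_like_10g(status) else "needs attention")
```

A link that is up at 10G only shows that the NIC, firmware and cabling are fine; reachability of the gateway and the other brokers would still need to be checked before unmasking kafka.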

When we are ready we can coordinate to add the new NIC to kafka-jumbo1001 :)

These are projected to be in eqiad Dec 5th

@elukey The cards have been received. Please message me on IRC and we can start scheduling the replacement.

Please note that I am around and able to assist for this.

@Jclark-ctr: I recommend you snag 6 DAC cables and figure out where each of these systems is going to go. Then you can plug in and label the DAC cables, and I can set up their ports before you ever move an actual server.

So please list the rack and port for each server's eventual move and I can set up the ports.

I've added checklist steps for each of these port setups.

I think it's fine for you to reimage (and exclude wiping /srv) for this, considering we already know the partition layout works (with a FULL disk reimage). The existing partman recipe has been tested, confirmed and reused a lot.

RobH updated the task description. (Show Details)

Please note I've chatted with @wiki_willy, @Jclark-ctr, & @elukey about this, and I've updated all of the checklists for each server with the following:

  • - determine which 10G rack (within the same row) this can move into (staying in the same row avoids an IP change)
  • - @Jclark-ctr lists on this task, in advance of the move, which rack and port# the 10G connection will be on
  • - @RobH sets up the network port with the above info
  • - @elukey and @Jclark-ctr schedule downtime for this host and take it offline
  • - @Jclark-ctr moves the server to the new 10G rack and plugs it into the 10G port listed above, then ensures the system POSTs (the OS may or may not work; we just want to ensure it POSTs and sees the new hardware in the BIOS)
  • - @elukey takes over host troubleshooting and will attempt to fix via serial console (to avoid a reimage) or reimage (while avoiding formatting /srv) as needed
  • - @elukey returns system to service

@RobH
Server            New Rack   Switchport
kafka-jumbo1001   a4         39
kafka-jumbo1003   b2         35
kafka-jumbo1006   d7         36

kafka-jumbo1001's 1G interface is on asw2-a-eqiad:ge-1/0/7, renamed kafka-jumbo1001-old
kafka-jumbo1001's 10G interface is on asw2-a-eqiad:xe-4/0/39, named kafka-jumbo1001

kafka-jumbo1003's 1G interface is on asw2-b-eqiad:ge-1/0/16, renamed kafka-jumbo1003-old
kafka-jumbo1003's 10G interface is on asw2-b-eqiad:xe-2/0/35, named kafka-jumbo1003

kafka-jumbo1006's 1G interface is on asw2-d-eqiad:ge-1/0/5, renamed kafka-jumbo1006-old
kafka-jumbo1006's 10G interface is on asw2-d-eqiad:xe-7/0/36, named kafka-jumbo1006

Once these hosts have been moved, their old interfaces need to be removed from the switch. I've updated the checklist for these three hosts with that step. The port setup for the three moves is done.
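
The switch:interface strings above follow the Junos interface naming convention, where the prefix gives the media type (ge for 1-gigabit Ethernet, xe for 10-gigabit Ethernet) and the three numbers are FPC/PIC/port. As a rough sketch (the SwitchPort type, the parse_switch_port helper and the MOVES table are hypothetical and not part of any existing tooling), the old and new ports listed in this comment can be parsed and sanity-checked like this:

```python
import re
from typing import NamedTuple


class SwitchPort(NamedTuple):
    switch: str
    media: str  # "ge" = 1G, "xe" = 10G
    fpc: int
    pic: int
    port: int

    @property
    def speed_gbps(self) -> int:
        return {"ge": 1, "xe": 10}[self.media]


_PORT_RE = re.compile(
    r"^(?P<switch>[\w-]+):(?P<media>ge|xe)-(?P<fpc>\d+)/(?P<pic>\d+)/(?P<port>\d+)$"
)


def parse_switch_port(text: str) -> SwitchPort:
    """Parse e.g. 'asw2-a-eqiad:xe-4/0/39' into its components."""
    m = _PORT_RE.match(text.strip())
    if m is None:
        raise ValueError(f"unrecognised switch port: {text!r}")
    return SwitchPort(
        m["switch"], m["media"], int(m["fpc"]), int(m["pic"]), int(m["port"])
    )


# (old 1G port, new 10G port), copied from the comment above.
MOVES = {
    "kafka-jumbo1001": ("asw2-a-eqiad:ge-1/0/7", "asw2-a-eqiad:xe-4/0/39"),
    "kafka-jumbo1003": ("asw2-b-eqiad:ge-1/0/16", "asw2-b-eqiad:xe-2/0/35"),
    "kafka-jumbo1006": ("asw2-d-eqiad:ge-1/0/5", "asw2-d-eqiad:xe-7/0/36"),
}

if __name__ == "__main__":
    for host, (old, new) in MOVES.items():
        old_port, new_port = parse_switch_port(old), parse_switch_port(new)
        # Staying on the same switch stack means staying in the same row.
        same_row = old_port.switch == new_port.switch
        print(host, f"{old_port.speed_gbps}G -> {new_port.speed_gbps}G", "same row:", same_row)
```

In each new port the FPC number matches the rack number from the table (a4, b2, d7), and the port number matches the switchport.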

@Cmjohnson No NIC has been installed and no host has been moved yet. @RobH has helped with the 10G interfaces.

@Cmjohnson @Jclark-ctr let's sync about next steps whenever you have time!

@elukey I am on site every Tuesday and Thursday, and usually arrive at 9:00 am EST. Message me on IRC to work out a schedule that works for you.

Thanks a lot! We don't have a lot of time at the moment. Would it be OK to schedule something early next quarter (in July, I mean)?

@elukey Sounds good. I will be taking a vacation in August, so July would be best.

@elukey can you let me know your availability for scheduling this project?

Any time that you are free would be OK. We could schedule the first host during one of your mornings to test the procedure (me shutting down kafka, then you adding the NIC, then OS config, etc.).

@Jclark-ctr should we sync about this to schedule the first host (when you have time of course)?

@elukey Will these hosts be part of the failover?

@Jclark-ctr They will not, but we can do one host at a time anyway when you have time!

elukey added a subscriber: RobH.

@Cmjohnson If you have some time in the next few days, can we swap the NIC on one node only (to verify the procedure, make sure that the NICs are OK, etc.)?

@elukey I am sorry, but I have to push these off to the first week of November. Let's coordinate a schedule next week.

After booting kafka-jumbo1006 with the 10g nic:

[Fri Oct 30 15:05:14 2020] bnx2x 0000:01:00.0: firmware: failed to load bnx2x/bnx2x-e2-7.13.1.0.fw (-2)
[Fri Oct 30 15:05:14 2020] firmware_class: See https://wiki.debian.org/Firmware for information about missing firmware
[Fri Oct 30 15:05:14 2020] bnx2x 0000:01:00.0: Direct firmware load for bnx2x/bnx2x-e2-7.13.1.0.fw failed with error -2
[Fri Oct 30 15:05:14 2020] bnx2x: [bnx2x_func_hw_init:6003(eno1)]Error loading firmware
[Fri Oct 30 15:05:14 2020] bnx2x: [bnx2x_nic_load:2732(eno1)]HW init failed, aborting
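
The -2 in these messages is the kernel's negated errno for ENOENT (no such file or directory), meaning the firmware blob was not found on disk. A small sketch, assuming log lines in the format shown above, for pulling the missing firmware paths out of dmesg output (the missing_firmware helper is hypothetical):

```python
import errno
import re
from typing import List, Tuple

_FW_RE = re.compile(
    r"Direct firmware load for (?P<path>\S+) failed with error (?P<err>-?\d+)"
)


def missing_firmware(dmesg_text: str) -> List[Tuple[str, str]]:
    """Return (firmware path, errno name) for each failed direct firmware load."""
    found = []
    for m in _FW_RE.finditer(dmesg_text):
        code = abs(int(m["err"]))
        # Fall back to the bare number if the code has no symbolic name.
        found.append((m["path"], errno.errorcode.get(code, str(code))))
    return found


SAMPLE = (
    "[Fri Oct 30 15:05:14 2020] bnx2x 0000:01:00.0: Direct firmware load for "
    "bnx2x/bnx2x-e2-7.13.1.0.fw failed with error -2\n"
    "[Fri Oct 30 15:05:14 2020] bnx2x: [bnx2x_nic_load:2732(eno1)]HW init failed, aborting\n"
)

if __name__ == "__main__":
    for path, err in missing_firmware(SAMPLE):
        print(f"missing firmware {path} ({err})")
```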

Change 637711 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/puppet@production] adding new mac address for update 10G nic kafka-jumbo1006

https://gerrit.wikimedia.org/r/637711

We had to roll back the NIC on 1006. We need to install firmware-bnx2x on all nodes before doing any work (checked with Faidon, since it is a non-free package). The firmware is usually added at d-i/install time, but since we are not reimaging we need to do it manually.
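
Before retrying on any node, it may be worth confirming that the firmware file the driver asked for is actually on disk. The sketch below assumes the kernel's default firmware search directories under /lib/firmware (including the updates/ and per-release subdirectories) and uses the file name from the kernel log above. The find_firmware helper is hypothetical.

```python
import platform
from pathlib import Path
from typing import Optional


def find_firmware(
    name: str, root: str = "/lib/firmware", release: Optional[str] = None
) -> Optional[Path]:
    """Return the first matching firmware file, or None if it is not installed."""
    release = release or platform.release()
    base = Path(root)
    # Assumed search order: updates/<release>, updates, <release>, then the root.
    for directory in (base / "updates" / release, base / "updates", base / release, base):
        candidate = directory / name
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    wanted = "bnx2x/bnx2x-e2-7.13.1.0.fw"  # name taken from the kernel log
    hit = find_firmware(wanted)
    print(f"{wanted}: {'found at ' + str(hit) if hit else 'NOT installed'}")
```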

firmware-bnx2x installed manually on kafka-jumbo1006, we can retry the switch anytime to see if it works.

Reporting a conversation with Chris over email about when to do the maintenance:

The 4th works for me at 1130EST

If possible I'd just swap the NIC on jumbo1006 and let it bake for a couple of days before proceeding with the rest.

Today I swapped the NIC on kafka-jumbo1006 with Chris and there was no need for /etc/network/interfaces changes, firmware-bnx2x was sufficient (no renaming of the eno1 interface etc..).

Next step is then to swap NICs on 1001->1005.

Reporting in here a chat with Chris - the maintenance is postponed to tomorrow (5th)

This has been completed.

Change 637711 merged by Cmjohnson:
[operations/puppet@production] adding new mac address for update 10G nic kafka-jumbo1006

https://gerrit.wikimedia.org/r/637711