
rack/setup/install kafka-main200[1-5]
Closed, ResolvedPublic

Description

This task will track the racking, setup, and installation of the 5 new kafka hosts received in T221656.

Rack proposal:

  • kafka-main2001 on row A rack A4 U20
  • kafka-main2002 on row B rack B4 U3
  • kafka-main2003 on row C rack C7 U12
  • kafka-main2004 on row D rack D4 U15
  • kafka-main2005 on row D rack D7 U17

kafka-main2001: A4:xe-4/0/19

  • - receive in system on procurement task T221656
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - RAID 10 512kb
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup
    • end on-site specific steps
  • - production dns entries
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation (stretch) - update netbox status to staged & IMMEDIATELY RUN/SIGN PUPPET
  • - puppet accept/initial run
  • - handoff for service implementation - service implementation team must change status from staged to active

kafka-main2002: B4:xe-4/0/2

  • - receive in system on procurement task T221656
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - RAID 10, 512kb
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup
    • end on-site specific steps
  • - production dns entries
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation (stretch) - update netbox status to staged & IMMEDIATELY RUN/SIGN PUPPET
  • - puppet accept/initial run
  • - handoff for service implementation - service implementation team must change status from staged to active

kafka-main2003: C7:xe-7/0/8

  • - receive in system on procurement task T221656
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - RAID 10
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup
    • end on-site specific steps
  • - production dns entries
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation (stretch) - update netbox status to staged & IMMEDIATELY RUN/SIGN PUPPET
  • - puppet accept/initial run
  • - handoff for service implementation - service implementation team must change status from staged to active

kafka-main2004: D4:xe-4/0/14

  • - receive in system on procurement task T221656
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - RAID 10
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup
    • end on-site specific steps
  • - production dns entries
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation (stretch) - update netbox status to staged & IMMEDIATELY RUN/SIGN PUPPET
  • - puppet accept/initial run
  • - handoff for service implementation - service implementation team must change status from staged to active

kafka-main2005: D7:xe-7/0/11

  • - receive in system on procurement task T221656
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - RAID 10
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup
    • end on-site specific steps
  • - production dns entries
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation (stretch) - update netbox status to staged & IMMEDIATELY RUN/SIGN PUPPET
  • - puppet accept/initial run
  • - handoff for service implementation - service implementation team must change status from staged to active

Event Timeline

Papaul triaged this task as Medium priority. May 16 2019, 10:50 PM
Papaul added a subscriber: Ottomata.

@herron, it might be worth renaming these hosts! E.g. kafka-main200[4-8] would be more appropriate! :)

herron renamed this task from rack/setup/install kafka200[4-8] to rack/setup/install kafka-main200[1-5]. May 17 2019, 2:44 PM
herron updated the task description.

Good point! And if we number from 200[1-5] it should simplify mapping of broker IDs between old and new hosts too. I updated the description to reflect this, but if you think it's best to keep the 200[4-8] suffix, I'm happy to go that route instead.
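As a rough illustration of the numbering idea above, here is a minimal Python sketch that expands the bracket range from the task title into hostnames and derives a number from each hostname suffix. The helper names are hypothetical, and treating the trailing digits as an ID is only an assumption for illustration; the thread does not state the actual broker ID scheme.

```python
import re


def expand_range(pattern):
    """Expand one single-digit bracket range, e.g. 'kafka-main200[1-5]'."""
    m = re.fullmatch(r"(.*)\[(\d)-(\d)\](.*)", pattern)
    if m is None:
        return [pattern]  # no bracket range: return the name unchanged
    prefix, lo, hi, suffix = m.groups()
    return [f"{prefix}{d}{suffix}" for d in range(int(lo), int(hi) + 1)]


def suffix_id(hostname):
    """Hypothetical mapping: use the trailing digits of the hostname as an ID."""
    m = re.search(r"(\d+)$", hostname)
    if m is None:
        raise ValueError(f"no numeric suffix in {hostname!r}")
    return int(m.group(1))


hosts = expand_range("kafka-main200[1-5]")
ids = {host: suffix_id(host) for host in hosts}
for host, host_id in ids.items():
    print(host, host_id)
```

With hosts numbered from 2001, the suffix of each new host lines up directly with whatever existing host carries the same number, which is the simplification the comment is pointing at.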

@Ottomata can you please provide me with the partman recipe for those systems? And for the RAID config, do you want to use 256KB for the RAID stripe size?

can you please provide me with the partman recipe for those systems

raid10-gpt-srv-lvm-ext4.cfg would work, but it uses only disks. I think you should be able to make a new modified one to use all 8.

do you want to use 256KB for the RAID stripe size?

Hm, hadn't thought about this. I'd think a largeish stripe size would be good for Kafka. Reads are usually very sequential. Writes are done in batches, but the sizes can vary based on the volume of particular topic-partitions. I do see quite a few smallish files for indexes.

I just looked at kafka1001, but I'm not exactly sure if I found the right thing. Is mdadm 'chunk size' the same as the stripe size? If so, it is set at 512K there. I don't think we changed it intentionally for other kafka hosts, so perhaps mdadm's default is fine?
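On the terminology question: in mdadm, the chunk size is the amount of data written to one member device before moving on to the next, and a full stripe is the chunk size multiplied by the number of data-bearing devices. Hardware controllers often use "stripe size" for the per-disk element, which would correspond to the mdadm chunk. A minimal sketch of the arithmetic, assuming an 8-disk RAID 10 that keeps two copies of each chunk; the function name is illustrative only:

```python
def full_stripe_kib(chunk_kib, disks, copies=2):
    """Full stripe width in KiB for a RAID 10 array.

    Each chunk is stored `copies` times, so only disks // copies
    devices carry distinct data within one stripe. This sketch only
    handles the case where `disks` is a multiple of `copies`.
    """
    if disks % copies != 0:
        raise ValueError("disks must be a multiple of copies in this sketch")
    return chunk_kib * (disks // copies)


print(full_stripe_kib(512, 8))  # 2048 KiB, i.e. 2 MiB per full stripe
print(full_stripe_kib(256, 8))  # 1024 KiB
```

So a 512K chunk on these 8-disk hosts would mean sequential reads and writes of about 2 MiB touch every data-bearing disk once.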

@Ottomata thanks for the updates. Can you please make the necessary modifications to raid10-gpt-srv-lvm-ext4.cfg?

Please note these are erroneously showing a status of 'staged' in netbox, even though they are not yet installed with an OS and have not yet run puppet.

I have changed all of the kafka-main200[1-5] to state 'planned' in netbox. Once the OS is installed (and initial puppet run is completed immediately after OS installation), it can change to staged.

Can you please make the necessary modifications to raid10-gpt-srv-lvm-ext4.cfg?

Ping @herron on this one, SRE is handling these. :)

Change 511952 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] partman: add 8 disk raid10 layout

https://gerrit.wikimedia.org/r/511952

Change 511952 merged by Herron:
[operations/puppet@production] partman: add 8 disk raid10 layout

https://gerrit.wikimedia.org/r/511952

Hey @Papaul, I added a raid10-gpt-srv-lvm-ext4-8disks.cfg for the initial installs on these.

Once they are up and running I'll do a little benchmarking to try and see if there isn't a better sweet spot in terms of stripe/chunk sizes, and update the config if needed. Thanks!

Change 512069 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/dns@master] DNS: Add mgmt and production DNS for kafka-main200[1-5]

https://gerrit.wikimedia.org/r/512069

Change 512069 merged by Herron:
[operations/dns@master] DNS: Add mgmt and production DNS for kafka-main200[1-5]

https://gerrit.wikimedia.org/r/512069

Change 512236 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/puppet@production] DHCP: Add MAC address for kafka-main2001

https://gerrit.wikimedia.org/r/512236

Change 512236 merged by Dzahn:
[operations/puppet@production] DHCP: Add MAC address for kafka-main2001

https://gerrit.wikimedia.org/r/512236

@herron can't find a partman recipe

┌────────────────────────┤ [!!] Partition disks ├─────────────────────────┐
│                                                                         │
│ The installer can guide you through partitioning a disk (using          │
│ different standard schemes) or, if you prefer, you can do it            │
│ manually. With guided partitioning you will still have a chance later   │
│ to review and customise the results.                                    │
│                                                                         │
│ If you choose guided partitioning for an entire disk, you will next     │
│ be asked which disk should be used.                                     │
│                                                                         │
│ Partitioning method:                                                    │
│                                                                         │
│          Guided - use entire disk                                       │
│          Guided - use entire disk and set up LVM                        │
│          Guided - use entire disk and set up encrypted LVM              │
│          Manual                                                         │
│                                                                         │
│     <Go Back>                                                           │
│

ok, no worries I'll poke at this for a bit and try to get it installed

Change 512307 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] netboot: assign kafka-main[12]00[1-5] 8 disk raid10 partman config

https://gerrit.wikimedia.org/r/512307

Change 512307 merged by Herron:
[operations/puppet@production] netboot: assign kafka-main[12]00[1-5] 8 disk raid10 partman config

https://gerrit.wikimedia.org/r/512307

Kafka-main2001 is installed. I updated the netboot config to assign the partman config to these hostnames, and switched the hardware controller to HBA mode. Then it completed the install using the 8 disk md raid10 config. Now to do some testing!
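To illustrate how a hostname pattern like the one in the patch title selects a partman config, here is a minimal Python sketch. It assumes shell-style glob matching and uses a hypothetical one-entry table; only the pattern and the recipe filename come from this task, and the real netboot config may be structured differently.

```python
from fnmatch import fnmatchcase

# Hypothetical lookup table: (hostname glob, partman recipe file).
RECIPES = [
    ("kafka-main[12]00[1-5]", "raid10-gpt-srv-lvm-ext4-8disks.cfg"),
]


def recipe_for(hostname):
    """Return the first recipe whose glob matches, or None if nothing matches."""
    for pattern, recipe in RECIPES:
        if fnmatchcase(hostname, pattern):
            return recipe
    return None


print(recipe_for("kafka-main2001"))  # raid10-gpt-srv-lvm-ext4-8disks.cfg
print(recipe_for("kafka-main1003"))  # raid10-gpt-srv-lvm-ext4-8disks.cfg
print(recipe_for("kafka-main2006"))  # None
```

A host with no matching entry gets no recipe in this sketch, which would be consistent with the interactive "Partition disks" prompt seen before the netboot change was merged.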

Papaul updated the task description.

@herron Everything is done on my end; all that is left is the OS install. Let me know if you have any questions.

@herron also, after the OS install, please remember to change the Netbox status to "staged".

I did some testing of various software and hardware raid configurations and wrote up a summary at https://wikitech.wikimedia.org/wiki/Kafka/Kafka-main-raid-performance-testing-2019

The tl;dr is that a 512K chunk size did indeed test well, providing quite good performance when combined with the noop scheduler. Surprisingly (or maybe not so much, since we standardized on a somewhat older generation of raid cards), hardware raid10 tested quite a bit worse than software raid10. The difference was significant enough that I think it justifies moving forward with sw raid.

So, IMO the optimal path forward for these hosts will be raid cards in HBA mode, and OS installed using the raid10-gpt-srv-lvm-ext4-8disks.cfg partman config. Barring objections I'll complete the builds using this layout.
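For checking which I/O scheduler a disk is actually using, Linux exposes /sys/block/<device>/queue/scheduler, where the active scheduler is shown in square brackets. A minimal Python sketch of parsing that format follows; the function name is illustrative and the sample strings are made up for the example, not captured from these hosts.

```python
import re


def active_scheduler(text):
    """Return the bracketed (active) scheduler name, or None if none is marked."""
    m = re.search(r"\[([^\]]+)\]", text)
    return m.group(1) if m else None


print(active_scheduler("[noop] deadline cfq"))  # noop
print(active_scheduler("noop deadline [cfq]"))  # cfq
```

On a real host the input would come from reading the sysfs file for each member disk, for example open("/sys/block/sda/queue/scheduler").read(), where sda is a placeholder device name.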

@Papaul could you have a look at kafka-main2002? It seems to be stuck, at least I'm not able to open a console or power cycle.

/admin1-> racadm serveraction powercycle
Unable to perform requested operation.
/admin1-> racadm serveraction powerstatus
Server power status: ON
/admin1-> racadm serveraction powerdown
Unable to perform requested operation.
/admin1-> racadm serveraction hardreset
Unable to perform requested operation.

Today I tried to perform OS installs on kafka-main200[345] but was not seeing DHCP requests from these hosts make it to the installNNNN hosts yet.

Also, when I tried to switch kafka-main2004 raid card to HBA mode I didn't see the pause/prompt in the bootup sequence for raid setup. On the other hosts there is a pause/prompt to press ctrl+R to enter raid setup, but for some reason it seems to be missing currently on kafka-main2004.

To recap open items:

  1. kafka-main2002 is erroring when issued racadm serveraction power commands via mgmt interface T223493#5226697
  2. kafka-main200[2345] are not yet able to net boot
  3. kafka-main2004 is not prompting for raid card setup on boot, and needs to be put into HBA mode

@Papaul if it's alright I'll kick this over to you to have a look. Please let me know if there's anything I can do to help out!

  1. kafka-main200[2345] are not yet able to net boot

13:25 < papaul> herron: the problem is i added just kafka-main2001 to the DHCP file and not the others that is why you can not boot the other servers
13:25 < papaul> since you wanted to do only the test on 2001

I would expect though that the DHCP requests would make it to the install servers, with or without entries in the dhcp config file.

kafka-main2002, after power drain:

/admin1-> racadm serveraction powercycle
Server power operation initiated successfully
/admin1->

I would expect though that the DHCP requests would make it to the install servers, with or without entries in the dhcp config file.

It depends on which interface is selected for PXE boot. By default those servers boot from the 1Gb interface, and there is nothing plugged into the 1Gb interface. You have to go into the BIOS and disable the 1Gb interfaces. I am doing all of that, so no need to bother.

Change 514030 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/puppet@production] DHCP: Add MAC address entries for kafka-main200[2-5]

https://gerrit.wikimedia.org/r/514030

@herron You can merge the DHCP code and you should be good.

Change 514030 merged by Herron:
[operations/puppet@production] DHCP: Add MAC address entries for kafka-main200[2-5]

https://gerrit.wikimedia.org/r/514030

Change 514107 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] install_server: tweak raid10 8disk partman layout

https://gerrit.wikimedia.org/r/514107

Change 514107 merged by Herron:
[operations/puppet@production] install_server: tweak raid10 8disk partman layout

https://gerrit.wikimedia.org/r/514107

Kafka-main200[123], and kafka-main2005 are installed, have had the initial puppet run applied and are now marked "staged" in netbox.

Kafka-main2004 however is still not seeing the raid card, and the install is failing due to no disks found. @Papaul did you have a chance to take a look at this?

herron updated the task description.

That did the trick! All of the new codfw kafka-main hosts are now installed and ready for service setup.

herron updated the task description.

Tracking service implementation in T225005

herron mentioned this in Unknown Object (Task). Jun 21 2019, 5:19 PM