
Q1:rack/setup/install kafka-stretch200[12]
Closed, Resolved · Public

Description

This task will track the racking, setup, and OS installation of kafka-stretch200[12]

Hostname / Racking / Installation Details

Hostnames: kafka-stretch200[12]
Racking Proposal: In different rows please.
Networking Setup:

  • # of Connections: 1
  • Speed: 10G.
  • Vlan: Private
  • AAAA records: Yes
  • Additional IP records: no

Partitioning/Raid: Same as kafka-jumbo1009
OS Distro: Bullseye
Sub-team Technical Contact: @Ottomata and @BTullis

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

kafka-stretch2001:
  • - receive in system on procurement task T311864 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp and site.pp role(insetup); cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
kafka-stretch2002:
  • - receive in system on procurement task T311864 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp and site.pp role(insetup); cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
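The checklist steps above map roughly onto the command sequence below. This is a dry-run sketch only: the host name comes from this task, but the exact cookbook and homer flags are assumptions and should be checked against current SRE documentation before running anything.

```shell
#!/bin/sh
# Dry-run sketch of the DNS / network / reimage steps from a cumin host.
# The invocations are illustrative, not authoritative.
run() { echo "+ $*"; }   # print the command instead of executing it

HOST=kafka-stretch2001.codfw.wmnet

# Propagate the mgmt and production DNS records added in Netbox
run sudo cookbook sre.dns.netbox "add records for ${HOST}"

# Commit the switch-port configuration defined in Netbox
run sudo homer "${HOST}*" commit "configure port for ${HOST}"

# Reimage with the OS requested in the task description (Bullseye)
run sudo cookbook sre.hosts.reimage --os bullseye --new "${HOST%%.*}"
```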

Event Timeline

RobH mentioned this in Unknown Object (Task).
RobH unsubscribed.

@Ottomata @BTullis in the description it says "# of Connections: 2" can I please have more details? Thanks.

I'm not sure why it says that. @RobH is it possible that is leftover from some phab template copy/paste?

@Ottomata,

I copy and paste what's in the procurement task, which shows your update on Friday, July 29th with a diff showing you put 2, not 1, in that field: https://phabricator.wikimedia.org/transactions/detail/PHID-XACT-TASK-wyu2o3akscitvmw/

That is what you listed on the ordering task, so I put it on the racking task. If this should instead be just 1 connection (seems likely), then that is the information @Papaul is seeking.

Hm, then it is my own copy/paste error! One connection. Will edit task.

Change 824311 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] Add kafka-stretch200[12] to site.pp and to netboot.cfg

https://gerrit.wikimedia.org/r/824311

Change 824311 merged by Papaul:

[operations/puppet@production] Add kafka-stretch200[12] to site.pp and to netboot.cfg

https://gerrit.wikimedia.org/r/824311

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kafka-stretch2001.codfw.wmnet with OS bullseye

@Ottomata I'm getting the error below on kafka-stretch2001. I will check the HW side; if you have a minute, can you please double-check the partman recipe?

thanks

reuse-parts: Recipe device matching failed
ERROR: /dev/mapper/vg--data-srv matches zero devices

All devices:
/dev/sda
/dev/sdb

    <Go Back>                          <Continue>

Hm, in netboot.cfg, I see that kafka-jumbo nodes are currently set to use partman/custom/reuse-kafka-jumbo.cfg. That recipe has /dev/mapper/vg--data-srv|1 ext4 keep /srv. Is it possible that reuse-kafka-jumbo.cfg is meant for reimaging existing Kafka nodes?

There is also a partman/custom/kafka-jumbo.cfg which looks like it does the right thing:

# configuration:
#  * hardware raid on kafka-jumbo hosts
#  * sda hw raid1 (Flex Bay): 2 * 1TB / 2 * 500GB
#  * sdb hw raid10: 12 * 4TB
#
# * GPT partitions:
#   - boot 300MB (biosgrub type, see below)
#   - LVM
#   - /:    ext4, max of /dev/sda (varies across hosts)
#   - /srv: ext4, max of /dev/sdb

So perhaps we should use partman/custom/kafka-jumbo.cfg ?
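For context, the key difference is the `keep` action in the line quoted above: a reuse-style recipe tries to match an existing device and preserve its data, which can only succeed on a host that already has that volume; a fresh-install recipe creates the layout from scratch. A rough sketch of the two styles (illustrative only, not the actual contents of either file):

```
# reuse-style line (quoted above from reuse-kafka-jumbo.cfg): match an
# EXISTING device and keep its data -- fails on a freshly racked host,
# which is exactly the "matches zero devices" error seen here
/dev/mapper/vg--data-srv|1 ext4 keep /srv

# fresh-install style (standard debian-installer preseed; fragment only):
# create the LVM layout from scratch instead of matching existing devices
d-i partman-auto/method string lvm
d-i partman-auto/disk string /dev/sda /dev/sdb
```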

Thanks, I will try with partman/custom/kafka-jumbo.cfg.

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kafka-stretch2001.codfw.wmnet with OS bullseye executed with errors:

  • kafka-stretch2001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Change 824530 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] Update partman for kafka-stretch200[12]

https://gerrit.wikimedia.org/r/824530

Change 824530 merged by Papaul:

[operations/puppet@production] Update partman for kafka-stretch200[12]

https://gerrit.wikimedia.org/r/824530

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kafka-stretch2001.codfw.wmnet with OS bullseye

@Ottomata on the new recipe i am getting

                383.6 GB is too small
You asked for 383.6 GB to be used for guided partitioning, but the
selected partitioning recipe requires at least 6.0 TB.

    <Go Back>

I think the recipe is detecting the SSD as /dev/sdb and the HDD as /dev/sda.

Hm, okay, so do we need a new recipe then? This might be a recipe that will be reused for many Config I hosts.

Yes, if we can make a new one that works for Config I that would be great. Here is the HW RAID setting that I have for the server:

Virtual Disk 238    Online    RAID-10    22353 GB     HDD
Virtual Disk 239    Online    RAID-1     446.63 GB    SSD
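A quick way to confirm which kernel device landed on which array is the ROTA (rotational) flag from `lsblk`: 1 means rotational (HDD), 0 means non-rotational (SSD). A minimal sketch that classifies devices from `lsblk -dn -o NAME,ROTA`-style output, here fed a sample mirroring the broken ordering seen on this host:

```shell
#!/bin/sh
# Classify disks as SSD/HDD from `lsblk -dn -o NAME,ROTA`-style output.
classify() {
  awk '{ print $1, ($2 == "0" ? "SSD" : "HDD") }'
}

# Sample input mirroring the broken ordering (sda = HDD array)
printf 'sda 1\nsdb 0\n' | classify
# prints:
#   sda HDD
#   sdb SSD

# On a live host you would run instead:
#   lsblk -dn -o NAME,ROTA | classify
```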

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kafka-stretch2001.codfw.wmnet with OS bullseye executed with errors:

  • kafka-stretch2001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

@Papaul, I asked @RobH in #wikimedia-dcops on IRC:

i think the recipe is detecting the SSD as /dev/sdb and the HDD as /dev/sda

yeah, thats a new controller issue
there is a fix to swap them back

He's going to log into the host and try to swap them.

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host kafka-stretch2001.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host kafka-stretch2001.codfw.wmnet with OS bullseye executed with errors:

  • kafka-stretch2001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

@Ottomata thanks, will look into it once I am done with this Dell call.

I've gone ahead and checked and followed the comments posted on T297913#8041258 and it is still setting the SSDs as SDB in the installer.

I know we ran into this issue on the first batch of H750 controller hosts, but this fix on the task doesn't seem to be viable here. (The boot device SSD is already selected).

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host kafka-stretch2001.codfw.wmnet with OS bullseye


Screen Shot 2022-08-18 at 11.16.21 AM.png (704×1 px, 141 KB)

The SSDs are showing as ID 239 and the HDDs as ID 238 in VD view. Setting the SSDs to the bootable option doesn't seem to fix the detection order in the bullseye installer. I'm searching older phab tasks linked off T297913 to try to determine how we figured this out before, but I don't recall directly.

While I got the error that everyone else got (no space for the data drive, as it's detecting the SSDs as the data mount), it seems to be running successfully on my second attempt at wiping the config, rebuilding the RAID arrays, setting the boot order, and running the script.

In double checking the checklist steps, I can see that these haven't yet been received in on the coupa PO, so I unchecked that box in the kafka-stretch2002 section of the checklist in the task description.

RobH added a parent task: Unknown Object (Task). Aug 18 2022, 6:39 PM

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host kafka-stretch2001.codfw.wmnet with OS bullseye completed:

  • kafka-stretch2001 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202208181817_robh_1283571_kafka-stretch2001.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kafka-stretch2002.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kafka-stretch2002.codfw.wmnet with OS bullseye executed with errors:

  • kafka-stretch2002 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kafka-stretch2002.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kafka-stretch2002.codfw.wmnet with OS bullseye completed:

  • kafka-stretch2002 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202208182039_pt1979_639827_kafka-stretch2002.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

@Ottomata all yours thanks for the help

Papaul updated the task description.

Complete

Hello, just FYI I reimaged kafka-stretch2002 because /dev/sda and /dev/sdb were the wrong way around.

image.png (337×956 px, 77 KB)

I've been investigating the install issues with kafka-stretch1001 in T314156: Q1:rack/setup/install kafka-stretch100[12] but I think it's still something strange about the way that the RAID controller is configured.

Now kafka-stretch2001 is the only one of these four kafka-stretch hosts left with the drive order reversed.

btullis@cumin1001:~$ sudo cumin kafka-stretch* 'lsblk'
4 hosts will be targeted:
kafka-stretch[2001-2002].codfw.wmnet,kafka-stretch[1001-1002].eqiad.wmnet
Ok to proceed on 4 hosts? Enter the number of affected hosts to confirm or "q" to quit 4
===== NODE GROUP =====                                                                                                                                                                                             
(1) kafka-stretch2001.codfw.wmnet                                                                                                                                                                                  
----- OUTPUT of 'lsblk' -----                                                                                                                                                                                      
NAME         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT                                                                                                                                                                  
sda            8:0    0  21.8T  0 disk                                                                                                                                                                             
└─sda1         8:1    0  21.8T  0 part 
  └─vg1-srv  254:0    0  17.5T  0 lvm  /srv
sdb            8:16   0 446.6G  0 disk 
├─sdb1         8:17   0   285M  0 part 
└─sdb2         8:18   0 446.3G  0 part 
  └─vg0-root 254:1    0 357.1G  0 lvm  /
===== NODE GROUP =====                                                                                                                                                                                             
(3) kafka-stretch2002.codfw.wmnet,kafka-stretch[1001-1002].eqiad.wmnet                                                                                                                                             
----- OUTPUT of 'lsblk' -----                                                                                                                                                                                      
NAME         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT                                                                                                                                                                  
sda            8:0    0 446.6G  0 disk                                                                                                                                                                             
├─sda1         8:1    0   285M  0 part 
└─sda2         8:2    0 446.3G  0 part 
  └─vg0-root 254:0    0 357.1G  0 lvm  /
sdb            8:16   0  21.8T  0 disk 
└─sdb1         8:17   0  21.8T  0 part 
  └─vg1-srv  254:1    0  17.5T  0 lvm  /srv
================

Reimaging it now.

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host kafka-stretch2001.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host kafka-stretch2001.codfw.wmnet with OS bullseye executed with errors:

  • kafka-stretch2001 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202212151459_btullis_1821231_kafka-stretch2001.out
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details

OK, I think that both of these two hosts are set up correctly now. The failure in the cookbook above was only a delayed install of perccli64, which doesn't warrant another reimage.

So all four kafka-stretch servers now have the correct RAID configuration and ordering of /dev/sda and /dev/sdb.

btullis@cumin1001:~$ sudo cumin kafka-stretch* lsblk
4 hosts will be targeted:
kafka-stretch[2001-2002].codfw.wmnet,kafka-stretch[1001-1002].eqiad.wmnet
Ok to proceed on 4 hosts? Enter the number of affected hosts to confirm or "q" to quit 4
===== NODE GROUP =====                                                                                                                                                                                             
(4) kafka-stretch[2001-2002].codfw.wmnet,kafka-stretch[1001-1002].eqiad.wmnet                                                                                                                                      
----- OUTPUT of 'lsblk' -----                                                                                                                                                                                      
NAME         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT                                                                                                                                                                  
sda            8:0    0 446.6G  0 disk                                                                                                                                                                             
├─sda1         8:1    0   285M  0 part 
└─sda2         8:2    0 446.3G  0 part 
  └─vg0-root 254:0    0 357.1G  0 lvm  /
sdb            8:16   0  21.8T  0 disk 
└─sdb1         8:17   0  21.8T  0 part 
  └─vg1-srv  254:1    0  17.5T  0 lvm  /srv
================                                                                                                                                                                                                   
PASS |█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (4/4) [00:00<00:00,  4.60hosts/s]
FAIL |                                                                                                                                                                             |   0% (0/4) [00:00<?, ?hosts/s]
100.0% (4/4) success ratio (>= 100.0% threshold) for command: 'lsblk'.
100.0% (4/4) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.