
Q1:rack/setup/install kafka-stretch200[12]
Closed, Resolved · Public

Description

This task will track the racking, setup, and OS installation of kafka-stretch200[12]

Hostname / Racking / Installation Details

Hostnames: kafka-stretch200[12]
Racking Proposal: In different rows please.
Networking Setup:

  • # of Connections: 1
  • Speed: 10G.
  • Vlan: Private
  • AAAA records: Yes
  • Additional IP records: no

Partitioning/Raid: Same as kafka-jumbo1009
OS Distro: Bullseye
Sub-team Technical Contact: @Ottomata and @BTullis

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

kafka-stretch2001:
  • - receive in system on procurement task T311864 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp and site.pp role(insetup); cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
kafka-stretch2002:
  • - receive in system on procurement task T311864 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp and site.pp role(insetup); cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
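The checklist steps above map roughly onto the command sequence below. This is a dry-run sketch only: the host name comes from this task, but the exact cookbook and homer flags are assumptions and should be checked against current SRE documentation before running anything.

```shell
#!/bin/sh
# Dry-run sketch of the DNS / network / reimage steps from a cumin host.
# The invocations are illustrative, not authoritative.
run() { echo "+ $*"; }   # print the command instead of executing it

HOST=kafka-stretch2001.codfw.wmnet

# Propagate the mgmt and production DNS records added in Netbox
run sudo cookbook sre.dns.netbox "add records for ${HOST}"

# Commit the switch-port configuration defined in Netbox
run sudo homer "${HOST}*" commit "configure port for ${HOST}"

# Reimage with the OS requested in the task description (Bullseye)
run sudo cookbook sre.hosts.reimage --os bullseye --new "${HOST%%.*}"
```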

Event Timeline

RobH mentioned this in Unknown Object (Task).
RobH unsubscribed.

@Ottomata @BTullis in the description it says "# of Connections: 2" can I please have more details? Thanks.

I'm not sure why it says that. @RobH is it possible that is leftover from some phab template copy/paste?

@Ottomata,

I copy and paste what's in the procurement task, which shows your update on Friday, July 29th with a diff showing you put 2, not 1, in that field: https://phabricator.wikimedia.org/transactions/detail/PHID-XACT-TASK-wyu2o3akscitvmw/

That is what you listed on the ordering task, so I put it on the racking task. If this should instead be just 1 connection (seems likely), then that is the information @Papaul is seeking.

Hm, then it is my own copy/paste error! One connection. Will edit task.

Change 824311 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] Add kafka-stretch200[12] to site.pp and to netboot.cfg

https://gerrit.wikimedia.org/r/824311

Change 824311 merged by Papaul:

[operations/puppet@production] Add kafka-stretch200[12] to site.pp and to netboot.cfg

https://gerrit.wikimedia.org/r/824311

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kafka-stretch2001.codfw.wmnet with OS bullseye

@Ottomata I'm getting the error below on kafka-stretch2001. I will check the HW side; if you have a minute, can you please double-check the partman recipe?

thanks

reuse-parts: Recipe device matching failed
ERROR: /dev/mapper/vg--data-srv matches zero devices

All devices:
/dev/sda
/dev/sdb

    <Go Back>                          <Continue>

Hm, in netboot.cfg, I see that kafka-jumbo nodes are currently set to use partman/custom/reuse-kafka-jumbo.cfg. That recipe has /dev/mapper/vg--data-srv|1 ext4 keep /srv. Is it possible that reuse-kafka-jumbo.cfg is meant for reimaging existing Kafka nodes?

There is also a partman/custom/kafka-jumbo.cfg which looks like it does the right thing:

# configuration:
#  * hardware raid on kafka-jumbo hosts
#  * sda hw raid1 (Flex Bay): 2 * 1TB / 2 * 500GB
#  * sdb hw raid10: 12 * 4TB
#
# * GPT partitions:
#   - boot 300MB (biosgrub type, see below)
#   - LVM
#   - /:    ext4, max of /dev/sda (varies across hosts)
#   - /srv: ext4, max of /dev/sdb

So perhaps we should use partman/custom/kafka-jumbo.cfg ?
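For context, the key difference is the `keep` action in the line quoted above: a reuse-style recipe tries to match an existing device and preserve its data, which can only succeed on a host that already has that volume; a fresh-install recipe creates the layout from scratch. A rough sketch of the two styles (illustrative only, not the actual contents of either file):

```
# reuse-style line (quoted above from reuse-kafka-jumbo.cfg): match an
# EXISTING device and keep its data -- fails on a freshly racked host,
# which is exactly the "matches zero devices" error seen here
/dev/mapper/vg--data-srv|1 ext4 keep /srv

# fresh-install style (standard debian-installer preseed; fragment only):
# create the LVM layout from scratch instead of matching existing devices
d-i partman-auto/method string lvm
d-i partman-auto/disk string /dev/sda /dev/sdb
```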

Thanks, I will try with partman/custom/kafka-jumbo.cfg.

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kafka-stretch2001.codfw.wmnet with OS bullseye executed with errors:

  • kafka-stretch2001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Change 824530 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] Update partman for kafka-stretch200[12]

https://gerrit.wikimedia.org/r/824530

Change 824530 merged by Papaul:

[operations/puppet@production] Update partman for kafka-stretch200[12]

https://gerrit.wikimedia.org/r/824530

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kafka-stretch2001.codfw.wmnet with OS bullseye

@Ottomata on the new recipe i am getting

                383.6 GB is too small
You asked for 383.6 GB to be used for guided partitioning, but the
selected partitioning recipe requires at least 6.0 TB.

    <Go Back>

I think the recipe is detecting the SSD as /dev/sdb and the HDD as /dev/sda.

Hm, okay, so do we need a new recipe then? This might be a recipe that will be reused for many Config I hosts.

Yes, if we can make a new one that works for Config I that would be great. Here is the HW RAID setting that I have for the server:

Virtual Disk 238    Online    RAID-10    22353 GB     HDD
Virtual Disk 239    Online    RAID-1     446.63 GB    SSD
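A quick way to confirm which kernel device landed on which array is the ROTA (rotational) flag from `lsblk`: 1 means rotational (HDD), 0 means non-rotational (SSD). A minimal sketch that classifies devices from `lsblk -dn -o NAME,ROTA`-style output, here fed a sample mirroring the broken ordering seen on this host:

```shell
#!/bin/sh
# Classify disks as SSD/HDD from `lsblk -dn -o NAME,ROTA`-style output.
classify() {
  awk '{ print $1, ($2 == "0" ? "SSD" : "HDD") }'
}

# Sample input mirroring the broken ordering (sda = HDD array)
printf 'sda 1\nsdb 0\n' | classify
# prints:
#   sda HDD
#   sdb SSD

# On a live host you would run instead:
#   lsblk -dn -o NAME,ROTA | classify
```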

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kafka-stretch2001.codfw.wmnet with OS bullseye executed with errors:

  • kafka-stretch2001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

@Papaul, I asked @RobH in #wikimedia-dcops on IRC:

i think the recipe is detecting the SSD as /dev/sdb and the HDD as /dev/sda

yeah, thats a new controller issue
there is a fix to swap them back

He's going to log into the host and try to swap them.

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host kafka-stretch2001.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host kafka-stretch2001.codfw.wmnet with OS bullseye executed with errors:

  • kafka-stretch2001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

@Ottomata thanks, will look into it once I am done with this Dell call.

I've gone ahead and checked and followed the comments posted on T297913#8041258 and it is still setting the SSDs as SDB in the installer.

I know we ran into this issue on the first batch of H750 controller hosts, but this fix on the task doesn't seem to be viable here. (The boot device SSD is already selected).

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host kafka-stretch2001.codfw.wmnet with OS bullseye


Screen Shot 2022-08-18 at 11.16.21 AM.png (704×1 px, 141 KB)

The SSDs are showing as ID 239 and the HDDs as ID 238 in VD view. Setting the SSDs to the bootable option doesn't seem to fix the detection order in the bullseye installer. I'm searching older phab tasks linked off T297913 to try to determine how we figured this out before, but I don't recall directly.

While I got the error that everyone else got (no space for the data drive, as it's detecting the SSDs as the data mount), it seems to be running successfully on my second attempt at wiping the config, rebuilding the RAID arrays, setting the boot order, and running the script.

In double checking the checklist steps, I can see that these haven't yet been received in on the coupa PO, so I unchecked that box in the kafka-stretch2002 section of the checklist in the task description.

RobH added a parent task: Unknown Object (Task). Aug 18 2022, 6:39 PM

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host kafka-stretch2001.codfw.wmnet with OS bullseye completed:

  • kafka-stretch2001 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202208181817_robh_1283571_kafka-stretch2001.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kafka-stretch2002.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kafka-stretch2002.codfw.wmnet with OS bullseye executed with errors:

  • kafka-stretch2002 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kafka-stretch2002.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kafka-stretch2002.codfw.wmnet with OS bullseye completed:

  • kafka-stretch2002 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202208182039_pt1979_639827_kafka-stretch2002.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

@Ottomata all yours thanks for the help

Papaul updated the task description.

Complete

Hello, just FYI I reimaged kafka-stretch2002 because /dev/sda and /dev/sdb were the wrong way around.

image.png (337×956 px, 77 KB)

I've been investigating the install issues with kafka-stretch1001 in T314156: Q1:rack/setup/install kafka-stretch100[12] but I think it's still something strange about the way that the RAID controller is configured.

Now kafka-stretch2001 is the only one of these four kafka-stretch hosts left with the drive order reversed.

btullis@cumin1001:~$ sudo cumin kafka-stretch* 'lsblk'
4 hosts will be targeted:
kafka-stretch[2001-2002].codfw.wmnet,kafka-stretch[1001-1002].eqiad.wmnet
Ok to proceed on 4 hosts? Enter the number of affected hosts to confirm or "q" to quit 4
===== NODE GROUP =====                                                                                                                                                                                             
(1) kafka-stretch2001.codfw.wmnet                                                                                                                                                                                  
----- OUTPUT of 'lsblk' -----                                                                                                                                                                                      
NAME         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT                                                                                                                                                                  
sda            8:0    0  21.8T  0 disk                                                                                                                                                                             
└─sda1         8:1    0  21.8T  0 part 
  └─vg1-srv  254:0    0  17.5T  0 lvm  /srv
sdb            8:16   0 446.6G  0 disk 
├─sdb1         8:17   0   285M  0 part 
└─sdb2         8:18   0 446.3G  0 part 
  └─vg0-root 254:1    0 357.1G  0 lvm  /
===== NODE GROUP =====                                                                                                                                                                                             
(3) kafka-stretch2002.codfw.wmnet,kafka-stretch[1001-1002].eqiad.wmnet                                                                                                                                             
----- OUTPUT of 'lsblk' -----                                                                                                                                                                                      
NAME         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT                                                                                                                                                                  
sda            8:0    0 446.6G  0 disk                                                                                                                                                                             
├─sda1         8:1    0   285M  0 part 
└─sda2         8:2    0 446.3G  0 part 
  └─vg0-root 254:0    0 357.1G  0 lvm  /
sdb            8:16   0  21.8T  0 disk 
└─sdb1         8:17   0  21.8T  0 part 
  └─vg1-srv  254:1    0  17.5T  0 lvm  /srv
================

Reimaging it now.

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host kafka-stretch2001.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host kafka-stretch2001.codfw.wmnet with OS bullseye executed with errors:

  • kafka-stretch2001 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202212151459_btullis_1821231_kafka-stretch2001.out
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details

OK, I think that both of these two hosts are set up correctly now. The failure in the cookbook above was only a delayed install of perccli64, which doesn't warrant another reimage.

So all four kafka-stretch servers now have the correct RAID configuration and ordering of /dev/sda and /dev/sdb.

btullis@cumin1001:~$ sudo cumin kafka-stretch* lsblk
4 hosts will be targeted:
kafka-stretch[2001-2002].codfw.wmnet,kafka-stretch[1001-1002].eqiad.wmnet
Ok to proceed on 4 hosts? Enter the number of affected hosts to confirm or "q" to quit 4
===== NODE GROUP =====                                                                                                                                                                                             
(4) kafka-stretch[2001-2002].codfw.wmnet,kafka-stretch[1001-1002].eqiad.wmnet                                                                                                                                      
----- OUTPUT of 'lsblk' -----                                                                                                                                                                                      
NAME         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT                                                                                                                                                                  
sda            8:0    0 446.6G  0 disk                                                                                                                                                                             
├─sda1         8:1    0   285M  0 part 
└─sda2         8:2    0 446.3G  0 part 
  └─vg0-root 254:0    0 357.1G  0 lvm  /
sdb            8:16   0  21.8T  0 disk 
└─sdb1         8:17   0  21.8T  0 part 
  └─vg1-srv  254:1    0  17.5T  0 lvm  /srv
================                                                                                                                                                                                                   
PASS |█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (4/4) [00:00<00:00,  4.60hosts/s]
FAIL |                                                                                                                                                                             |   0% (0/4) [00:00<?, ?hosts/s]
100.0% (4/4) success ratio (>= 100.0% threshold) for command: 'lsblk'.
100.0% (4/4) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.