
Q1: rack/setup/install kafka-stretch100[12]
Closed, Resolved · Public

Description

This task will track the racking, setup, and OS installation of kafka-stretch100[12]

Hostname / Racking / Installation Details

Hostnames: kafka-stretch100[12]
Racking Proposal: In different rows please.
Networking Setup:

  • # of Connections: 1
  • Speed: 10G.
  • Vlan: Private
  • AAAA records: Yes
  • Additional IP records: no

Partitioning/Raid: Same as kafka-jumbo1009
OS Distro: Bullseye
Sub-team Technical Contact: @Ottomata and @BTullis

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

kafka-stretch1001:
  • - receive in system on procurement task T311865 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp and site.pp with role(insetup); cp systems use role(insetup::nofirm) instead.
  • - OS installation & initial puppet run via the sre.hosts.reimage cookbook.
kafka-stretch1002:
  • - receive in system on procurement task T311865 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp and site.pp with role(insetup); cp systems use role(insetup::nofirm) instead.
  • - OS installation & initial puppet run via the sre.hosts.reimage cookbook (a hedged sketch of the puppet and reimage steps follows below).
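
For reference, a rough sketch of what the operations/puppet and reimage steps above might look like. This is illustrative only: the netboot recipe name, file paths and host regex are assumptions (the partitioning should mirror kafka-jumbo1009), and the real patch for these hosts is tracked in Gerrit.

# modules/install_server/files/autoinstall/netboot.cfg (illustrative entry)
kafka-stretch100[12]) echo partman/custom/kafka-jumbo.cfg ;; \

# manifests/site.pp (illustrative)
node /^kafka-stretch100[12]\.eqiad\.wmnet$/ {
    role(insetup)
}

# OS installation & initial puppet run, from an active cumin host:
sudo cookbook sre.hosts.reimage --os bullseye -t T314156 kafka-stretch1001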

Event Timeline

RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
RobH added a parent task: Unknown Object (Task).
RobH mentioned this in Unknown Object (Task).
RobH unsubscribed.

Adding Event Platform tag, we decided to get this hardware to hopefully better support multi DC event stream processing.

kafka-stretch1001 E3 U17 Port 17 cableid 20220230
kafka-stretch1002 F3 U17 Port 17 cableid 20220229

@Cmjohnson just checking in on these. Status update? Not a huge hurry, but we might want to start working on these in late October / early November.

Change 836914 had a related patch set uploaded (by Cmjohnson; author: Cmjohnson):

[operations/puppet@production] Adding site.pp and netboo for kafka-stretch

https://gerrit.wikimedia.org/r/836914

Change 836914 merged by Cmjohnson:

[operations/puppet@production] Adding site.pp and netboo for kafka-stretch

https://gerrit.wikimedia.org/r/836914

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host kafka-stretch1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host kafka-stretch1001.eqiad.wmnet with OS bullseye executed with errors:

  • kafka-stretch1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host kafka-stretch1001.eqiad.wmnet with OS bullseye

@Ottomata this is failing in the installer because of the RAID configuration. I probably do not have it set correctly. Can you give me the specific hardware RAID config? E.g. 2 SSDs in RAID 1 and the larger disks in RAID 10. Thanks!

What's the error you are getting? See https://phabricator.wikimedia.org/T314160#8166075 and below. In codfw, sda and sdb were mapped to the wrong drives. sda should be SSDs and sdb should be the HDDs. Is that happening here too?
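
One quick way to check the sda/sdb mapping from the installer shell or a rescue environment (a generic check, not part of the cookbook):

# ROTA=0 means non-rotational (SSD), ROTA=1 means a spinning disk. On these
# hosts sda should be the ~446 GB SSD virtual drive and sdb the large HDD array.
lsblk -d -o NAME,SIZE,ROTA,MODEL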

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host kafka-stretch1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host kafka-stretch1001.eqiad.wmnet with OS bullseye executed with errors:

  • kafka-stretch1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

@Papaul When I try to image these servers, the process fails immediately. This is the error I receive. Any ideas on what is wrong?

Running IPMI command: ipmitool -I lanplus -H kafka-stretch1001.mgmt.eqiad.wmnet -U root -E chassis power status
START - Cookbook sre.hosts.reimage for host kafka-stretch1001.eqiad.wmnet with OS bullseye
Updated Phabricator task T314156

  • OUTPUT of 'puppet node clea...1001.eqiad.wmnet' -----

kafka-stretch1001.eqiad.wmnet

100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'puppet node clea...1001.eqiad.wmnet'.

  • OUTPUT of 'puppet node deac...1001.eqiad.wmnet' -----

Submitted 'deactivate node' for kafka-stretch1001.eqiad.wmnet with UUID 93af01fd-021b-47d7-b5a6-d8995f7bfb35

100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'puppet node deac...1001.eqiad.wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Removed from Puppet and PuppetDB if present

  • OUTPUT of 'puppet ca --disa...1001.eqiad.wmnet' -----

Nothing was deleted

100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'puppet ca --disa...1001.eqiad.wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Deleted any existing Puppet certificate
Host kafka-stretch1001.eqiad.wmnet already missing on Debmonitor
Removed from Debmonitor if present
Exception raised while executing cookbook sre.hosts.reimage:
Traceback (most recent call last):

File "/usr/lib/python3/dist-packages/spicerack/dhcp.py", line 232, in push_configuration
  self._hosts.run_sync(
File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 520, in run_sync
  return self._execute(
File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 720, in _execute
  raise RemoteExecutionError(ret, "Cumin execution failed")

spicerack.remote.RemoteExecutionError: Cumin execution failed (exit_code=2)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 234, in run
  raw_ret = runner.run()
File "/srv/deployment/spicerack/cookbooks/sre/hosts/reimage.py", line 492, in run
  with self.dhcp.config(self.dhcp_config):
File "/usr/lib/python3.9/contextlib.py", line 117, in __enter__
  return next(self.gen)
File "/usr/lib/python3/dist-packages/spicerack/dhcp.py", line 299, in config
  self.push_configuration(dhcp_config)
File "/usr/lib/python3/dist-packages/spicerack/dhcp.py", line 239, in push_configuration
  raise DHCPError(f"target file {filename} exists") from exc

spicerack.dhcp.DHCPError: target file ttyS1-115200/kafka-stretch1001.conf exists
The reimage failed, see the cookbook logs for the details
Reimage executed with errors:

  • kafka-stretch1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Updated Phabricator task T314156
END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-stretch1001.eqiad.wmnet with OS bullseye
cmjohnson@cumin1001:~$

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host kafka-stretch1002.eqiad.wmnet with OS bullseye

@Cmjohnson try to delete the kafka-stretch1001.conf on install1003 and try again and let me know
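
For the record, a sketch of the cleanup being suggested here. The filename comes from the error message above; the directory on install1003 is an assumption (it matches the ttyS1-115200 path used by the DHCP automation):

# On install1003: remove the stale DHCP automation snippet, then restart DHCP
sudo rm /etc/dhcp/automation/ttyS1-115200/kafka-stretch1001.conf
sudo systemctl restart isc-dhcp-server.service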

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host kafka-stretch1002.eqiad.wmnet with OS bullseye completed:

  • kafka-stretch1002 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202212091627_cmjohnson_297670_kafka-stretch1002.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

kafka-stretch1002 was installed without an issue. I started over with kafka-stretch1001, but the mgmt IP address changed and the provisioning script didn't work. I asked @Jclark-ctr to change it manually when he gets an opportunity.
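
A hedged sketch of the kind of manual mgmt IP change being requested, via racadm on the iDRAC; the address values are placeholders, not the real ones:

racadm set iDRAC.IPv4.Address <new-mgmt-ip>
racadm set iDRAC.IPv4.Netmask <netmask>
racadm set iDRAC.IPv4.Gateway <gateway>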

I tried a reinstall of kafka-stretch2002 with slightly different RAID controller settings, but that didn't work either. The following is captured from a rescue environment after the failed install.

root@kafka-stretch2002:/# lsblk                                                 
NAME         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda            8:0    0  21.8T  0 disk 
|-sda1         8:1    0   285M  0 part 
`-sda2         8:2    0  21.8T  0 part 
  `-vg0-root 253:0    0  17.5T  0 lvm  /
sdb            8:16   0 446.6G  0 disk 
`-sdb1         8:17   0 446.6G  0 part 
root@kafka-stretch2002:/#

Tried once more.

I think it might be OK on kafka-stretch2002 now.
It's successfully run the installer and booted.

lsblk looks like this now.

root@kafka-stretch2002:~# lsblk
NAME         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda            8:0    0 446.6G  0 disk 
├─sda1         8:1    0   285M  0 part 
└─sda2         8:2    0 446.3G  0 part 
  └─vg0-root 254:0    0 357.1G  0 lvm  /
sdb            8:16   0  21.8T  0 disk 
└─sdb1         8:17   0  21.8T  0 part 
  └─vg1-srv  254:1    0  17.5T  0 lvm  /srv

I should have written the comment above on the kafka-stretch ticket for codfw (T314160), despite the fact that it was resolved.

Mentioned in SAL (#wikimedia-operations) [2022-12-13T17:22:23Z] <btullis> btullis@install1003:/etc/dhcp/automation/ttyS1-115200$ sudo systemctl restart isc-dhcp-server.service T314156

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host kafka-stretch1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host kafka-stretch1001.eqiad.wmnet with OS bullseye executed with errors:

  • kafka-stretch1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

OK, I cleaned up the failed bit of DHCP automation that was causing the cookbook to fail on kafka-stretch1001.
Now we're back to the situation where the RAID configuration appears to be incorrect.

(Screenshot of the Debian installer partitioning error; transcribed below.)

You asked for 383.6 GB to be used for guided partitioning, but the selected partitioning recipe requires at least 6.0 TB.

This is the same error that we currently have with kafka-jumbo101[0-5]: T306939#8410293

Also, I've just seen this again with kafka-stretch2002 whilst investigating this incorrect ordering of /dev/sda and /dev/sdb: T314160#8464012
That second one I fixed by recreating the RAID configuration manually.

@Cmjohnson - @RobH - @Papaul - Should I try manually recreating the RAID configuration on this host (kafka-stretch1001), or is it something that you would rather investigate?

When I looked at this boot ordering before, I said (T297913#8037638) that I'd update the RAID setup page on wikitech about this hardware combination, but I clearly didn't actually do that. So I'm sorry about that.

Do we have a record of how the RAID setup was originally done on these hosts? I see that the provisioning cookbook can't tackle the RAID config yet, so is it done manually each time?
https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/Dell_Documentation#Initial_System_Setup

@BTullis yes, if you want to recreate the raid manually then please do.

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host kafka-stretch1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host kafka-stretch1001.eqiad.wmnet with OS bullseye completed:

  • kafka-stretch1001 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202212151236_btullis_1791883_kafka-stretch1001.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

kafka-stretch1001 worked OK with the new RAID config.

I'm just going to rebuild kafka-stretch1002 because although the drives are in the right order on this host, the RAID type is RAID6 instead of RAID10.

btullis@kafka-stretch1002:~$ sudo perccli64 /c0 /vall show
CLI Version = 007.1910.0000.0000 Oct 08, 2021
Operating system = Linux 5.10.0-19-amd64
Controller = 0
Status = Success
Description = None


Virtual Drives :
==============

---------------------------------------------------------------
DG/VD TYPE  State Access Consist Cache Cac sCC       Size Name 
---------------------------------------------------------------
1/238 RAID1 Optl  RW     Yes     RWBD  -   OFF 446.625 GB      
0/239 RAID6 Optl  RW     Yes     RWBD  -   OFF  36.381 TB      
---------------------------------------------------------------

VD=Virtual Drive| DG=Drive Group|Rec=Recovery
Cac=CacheCade|Rec=Recovery|OfLn=OffLine|Pdgd=Partially Degraded|Dgrd=Degraded
Optl=Optimal|dflt=Default|RO=Read Only|RW=Read Write|HD=Hidden|TRANS=TransportReady
B=Blocked|Consist=Consistent|R=Read Ahead Always|NR=No Read Ahead|WB=WriteBack
FWB=Force WriteBack|WT=WriteThrough|C=Cached IO|D=Direct IO|sCC=Scheduled
Check Consistency

The three other kafka-stretch hosts are already set to RAID10.
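
A rough sketch of the manual rebuild described above, using perccli64. The virtual drive ID (239) comes from the output above, but the enclosure:slot IDs and pdperarray value are assumptions and must be confirmed with "perccli64 /c0 show" before running anything:

# Delete the misconfigured RAID6 virtual drive, then recreate the HDD
# array as RAID10 (enclosure:slot IDs below are placeholders).
sudo perccli64 /c0/v239 del force
sudo perccli64 /c0 add vd type=raid10 drives=64:2-13 pdperarray=2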

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host kafka-stretch1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host kafka-stretch1002.eqiad.wmnet with OS bullseye completed:

  • kafka-stretch1002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202212151351_btullis_1805796_kafka-stretch1002.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

I think that these are all done now.