
Q1: rack/setup/install kafka-stretch100[12]
Closed, Resolved · Public

Description

This task will track the racking, setup, and OS installation of kafka-stretch100[12]

Hostname / Racking / Installation Details

Hostnames: kafka-stretch100[12]
Racking Proposal: In different rows please.
Networking Setup:

  • # of Connections: 1
  • Speed: 10G.
  • Vlan: Private
  • AAAA records: Yes
  • Additional IP records: no

Partitioning/Raid: Same as kafka-jumbo1009
OS Distro: Bullseye
Sub-team Technical Contact: @Ottomata and @BTullis

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

kafka-stretch1001:
  • - receive in system on procurement task T311865 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp and site.pp with role(insetup); cp systems use role(insetup::nofirm) instead.
  • - OS installation & initial puppet run via the sre.hosts.reimage cookbook.
kafka-stretch1002:
  • - receive in system on procurement task T311865 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp and site.pp with role(insetup); cp systems use role(insetup::nofirm) instead.
  • - OS installation & initial puppet run via the sre.hosts.reimage cookbook (a hedged sketch of the puppet and reimage steps follows below).
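
For reference, a rough sketch of what the operations/puppet and reimage steps above might look like. This is illustrative only: the netboot recipe name, file paths and host regex are assumptions (the partitioning should mirror kafka-jumbo1009), and the real patch for these hosts is tracked in Gerrit.

# modules/install_server/files/autoinstall/netboot.cfg (illustrative entry)
kafka-stretch100[12]) echo partman/custom/kafka-jumbo.cfg ;; \

# manifests/site.pp (illustrative)
node /^kafka-stretch100[12]\.eqiad\.wmnet$/ {
    role(insetup)
}

# OS installation & initial puppet run, from an active cumin host:
sudo cookbook sre.hosts.reimage --os bullseye -t T314156 kafka-stretch1001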

Event Timeline

RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
RobH added a parent task: Unknown Object (Task).
RobH mentioned this in Unknown Object (Task).
RobH unsubscribed.

Adding Event Platform tag, we decided to get this hardware to hopefully better support multi DC event stream processing.

kafka-stretch1001 E3 U17 Port 17 cableid 20220230
kafka-stretch1002 F3 U17 Port 17 cableid 20220229

@Cmjohnson just checking in on these. Status update? Not a huge hurry, but we might want to start working on these in late October / early November.

Change 836914 had a related patch set uploaded (by Cmjohnson; author: Cmjohnson):

[operations/puppet@production] Adding site.pp and netboo for kafka-stretch

https://gerrit.wikimedia.org/r/836914

Change 836914 merged by Cmjohnson:

[operations/puppet@production] Adding site.pp and netboo for kafka-stretch

https://gerrit.wikimedia.org/r/836914

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host kafka-stretch1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host kafka-stretch1001.eqiad.wmnet with OS bullseye executed with errors:

  • kafka-stretch1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host kafka-stretch1001.eqiad.wmnet with OS bullseye

@Ottomata this is failing in the installer because of the RAID configuration. I probably do not have it set correctly. Can you give me the specific hardware RAID config? E.g. 2 SSDs in RAID 1 and the larger disks in RAID 10. Thanks!

What's the error you are getting? See https://phabricator.wikimedia.org/T314160#8166075 and below. In codfw, sda and sdb were mapped to the wrong drives. sda should be SSDs and sdb should be the HDDs. Is that happening here too?
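
One quick way to check the sda/sdb mapping from the installer shell or a rescue environment (a generic check, not part of the cookbook):

# ROTA=0 means non-rotational (SSD), ROTA=1 means a spinning disk. On these
# hosts sda should be the ~446 GB SSD virtual drive and sdb the large HDD array.
lsblk -d -o NAME,SIZE,ROTA,MODEL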

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host kafka-stretch1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host kafka-stretch1001.eqiad.wmnet with OS bullseye executed with errors:

  • kafka-stretch1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

@Papaul When I try to image these servers, the process fails immediately. This is the error I receive. Any ideas on what is wrong?

Running IPMI command: ipmitool -I lanplus -H kafka-stretch1001.mgmt.eqiad.wmnet -U root -E chassis power status
START - Cookbook sre.hosts.reimage for host kafka-stretch1001.eqiad.wmnet with OS bullseye
Updated Phabricator task T314156

  • OUTPUT of 'puppet node clea...1001.eqiad.wmnet' -----

kafka-stretch1001.eqiad.wmnet

100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'puppet node clea...1001.eqiad.wmnet'.

  • OUTPUT of 'puppet node deac...1001.eqiad.wmnet' -----

Submitted 'deactivate node' for kafka-stretch1001.eqiad.wmnet with UUID 93af01fd-021b-47d7-b5a6-d8995f7bfb35

100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'puppet node deac...1001.eqiad.wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Removed from Puppet and PuppetDB if present

  • OUTPUT of 'puppet ca --disa...1001.eqiad.wmnet' -----

Nothing was deleted

100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'puppet ca --disa...1001.eqiad.wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Deleted any existing Puppet certificate
Host kafka-stretch1001.eqiad.wmnet already missing on Debmonitor
Removed from Debmonitor if present
Exception raised while executing cookbook sre.hosts.reimage:
Traceback (most recent call last):

File "/usr/lib/python3/dist-packages/spicerack/dhcp.py", line 232, in push_configuration
  self._hosts.run_sync(
File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 520, in run_sync
  return self._execute(
File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 720, in _execute
  raise RemoteExecutionError(ret, "Cumin execution failed")

spicerack.remote.RemoteExecutionError: Cumin execution failed (exit_code=2)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 234, in run
  raw_ret = runner.run()
File "/srv/deployment/spicerack/cookbooks/sre/hosts/reimage.py", line 492, in run
  with self.dhcp.config(self.dhcp_config):
File "/usr/lib/python3.9/contextlib.py", line 117, in __enter__
  return next(self.gen)
File "/usr/lib/python3/dist-packages/spicerack/dhcp.py", line 299, in config
  self.push_configuration(dhcp_config)
File "/usr/lib/python3/dist-packages/spicerack/dhcp.py", line 239, in push_configuration
  raise DHCPError(f"target file {filename} exists") from exc

spicerack.dhcp.DHCPError: target file ttyS1-115200/kafka-stretch1001.conf exists
The reimage failed, see the cookbook logs for the details
Reimage executed with errors:

  • kafka-stretch1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Updated Phabricator task T314156
END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-stretch1001.eqiad.wmnet with OS bullseye
cmjohnson@cumin1001:~$

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host kafka-stretch1002.eqiad.wmnet with OS bullseye

@Cmjohnson try to delete the kafka-stretch1001.conf on install1003 and try again and let me know
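
For the record, a sketch of the cleanup being suggested here. The filename comes from the error message above; the directory on install1003 is an assumption (it matches the ttyS1-115200 path used by the DHCP automation):

# On install1003: remove the stale DHCP automation snippet, then restart DHCP
sudo rm /etc/dhcp/automation/ttyS1-115200/kafka-stretch1001.conf
sudo systemctl restart isc-dhcp-server.service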

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host kafka-stretch1002.eqiad.wmnet with OS bullseye completed:

  • kafka-stretch1002 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202212091627_cmjohnson_297670_kafka-stretch1002.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

kafka-stretch1002 was installed without an issue. I started over with kafka-stretch1001, but the mgmt IP address changed and the provisioning script didn't work. I asked @Jclark-ctr to change it manually when he gets an opportunity.
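
A hedged sketch of the kind of manual mgmt IP change being requested, via racadm on the iDRAC; the address values are placeholders, not the real ones:

racadm set iDRAC.IPv4.Address <new-mgmt-ip>
racadm set iDRAC.IPv4.Netmask <netmask>
racadm set iDRAC.IPv4.Gateway <gateway>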

I tried a reinstall of kafka-stretch2002 with slightly different RAID controller settings, but that didn't work either. The following is captured from a rescue environment after the failed install.

root@kafka-stretch2002:/# lsblk                                                 
NAME         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda            8:0    0  21.8T  0 disk 
|-sda1         8:1    0   285M  0 part 
`-sda2         8:2    0  21.8T  0 part 
  `-vg0-root 253:0    0  17.5T  0 lvm  /
sdb            8:16   0 446.6G  0 disk 
`-sdb1         8:17   0 446.6G  0 part 
root@kafka-stretch2002:/#

Tried once more.

I think it might be OK on kafka-stretch2002 now.
It's successfully run the installer and booted.

lsblk looks like this now.

root@kafka-stretch2002:~# lsblk
NAME         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda            8:0    0 446.6G  0 disk 
├─sda1         8:1    0   285M  0 part 
└─sda2         8:2    0 446.3G  0 part 
  └─vg0-root 254:0    0 357.1G  0 lvm  /
sdb            8:16   0  21.8T  0 disk 
└─sdb1         8:17   0  21.8T  0 part 
  └─vg1-srv  254:1    0  17.5T  0 lvm  /srv

I should have written the comment above on the kafka-stretch ticket for codfw (T314160), despite the fact that it was resolved.

Mentioned in SAL (#wikimedia-operations) [2022-12-13T17:22:23Z] <btullis> btullis@install1003:/etc/dhcp/automation/ttyS1-115200$ sudo systemctl restart isc-dhcp-server.service T314156

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host kafka-stretch1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host kafka-stretch1001.eqiad.wmnet with OS bullseye executed with errors:

  • kafka-stretch1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

OK, I cleaned up the failed bit of DHCP automation that was causing the cookbook to fail on kafka-stretch1001.
Now we're back to the situation where the RAID configuration appears to be incorrect.

(Screenshot of the Debian installer partitioning error; transcribed below.)

You asked for 383.6 GB to be used for guided partitioning, but the selected partitioning recipe requires at least 6.0 TB.

This is the same error that we currently have with kafka-jumbo101[0-5]: T306939#8410293

Also, I've just seen this again with kafka-stretch2002 whilst investigating this incorrect ordering of /dev/sda and /dev/sdb: T314160#8464012
That second one I fixed by recreating the RAID configuration manually.

@Cmjohnson - @RobH - @Papaul - Should I try manually recreating the RAID configuration on this host (kafka-stretch1001), or is it something that you would rather investigate?

When I looked at this boot ordering before, I said (T297913#8037638) that I'd update the RAID setup page on wikitech about this hardware combination, but I clearly didn't actually do that. So I'm sorry about that.

Do we have a record of how the RAID setup was originally done on these hosts? I see that the provisioning cookbook can't tackle the RAID config yet, so is it done manually each time?
https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/Dell_Documentation#Initial_System_Setup

@BTullis yes, if you want to recreate the raid manually then please do.

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host kafka-stretch1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host kafka-stretch1001.eqiad.wmnet with OS bullseye completed:

  • kafka-stretch1001 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202212151236_btullis_1791883_kafka-stretch1001.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

kafka-stretch1001 worked OK with the new RAID config.

I'm just going to rebuild kafka-stretch1002 because although the drives are in the right order on this host, the RAID type is RAID6 instead of RAID10.

btullis@kafka-stretch1002:~$ sudo perccli64 /c0 /vall show
CLI Version = 007.1910.0000.0000 Oct 08, 2021
Operating system = Linux 5.10.0-19-amd64
Controller = 0
Status = Success
Description = None


Virtual Drives :
==============

---------------------------------------------------------------
DG/VD TYPE  State Access Consist Cache Cac sCC       Size Name 
---------------------------------------------------------------
1/238 RAID1 Optl  RW     Yes     RWBD  -   OFF 446.625 GB      
0/239 RAID6 Optl  RW     Yes     RWBD  -   OFF  36.381 TB      
---------------------------------------------------------------

VD=Virtual Drive| DG=Drive Group|Rec=Recovery
Cac=CacheCade|Rec=Recovery|OfLn=OffLine|Pdgd=Partially Degraded|Dgrd=Degraded
Optl=Optimal|dflt=Default|RO=Read Only|RW=Read Write|HD=Hidden|TRANS=TransportReady
B=Blocked|Consist=Consistent|R=Read Ahead Always|NR=No Read Ahead|WB=WriteBack
FWB=Force WriteBack|WT=WriteThrough|C=Cached IO|D=Direct IO|sCC=Scheduled
Check Consistency

The three other kafka-stretch hosts are already set to RAID10.
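
A rough sketch of the manual rebuild described above, using perccli64. The virtual drive ID (239) comes from the output above, but the enclosure:slot IDs and pdperarray value are assumptions and must be confirmed with "perccli64 /c0 show" before running anything:

# Delete the misconfigured RAID6 virtual drive, then recreate the HDD
# array as RAID10 (enclosure:slot IDs below are placeholders).
sudo perccli64 /c0/v239 del force
sudo perccli64 /c0 add vd type=raid10 drives=64:2-13 pdperarray=2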

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host kafka-stretch1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host kafka-stretch1002.eqiad.wmnet with OS bullseye completed:

  • kafka-stretch1002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202212151351_btullis_1805796_kafka-stretch1002.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

I think that these are all done now.