
Q1:rack/setup/install kafka-logging100[45]
Closed, Resolved (Public)

Description

This task will track the racking, setup, and OS installation of kafka-logging100[45]

Hostname / Racking / Installation Details

Hostnames: kafka-logging100[45]
Racking Proposal: Per T313960#8128445, do not rack these in the same row as the existing kafka-logging hosts: kafka-logging1001=A2, kafka-logging1002=C2, kafka-logging1003=D4.
Networking Setup: # of Connections: 1, Speed: 10G, Vlan: Private, AAAA records: N
Partitioning/Raid: HW Raid: Y, Partman recipe and/or desired Raid Level: raid50, using same config as other kafka-logging hosts
OS Distro: Bullseye (default unless otherwise specified)
Sub-team Technical Contact: Who should our on-sites contact with any questions involving system racking and setup? @herron

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

kafka-logging1004:
  • - receive in system on procurement task T313959 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
kafka-logging1005:
  • - receive in system on procurement task T313959 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.

Event Timeline

RobH added a parent task: Unknown Object (Task).
RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
RobH added a subscriber: Jclark-ctr.

@herron,

The ordering task lacked racking details, but since we had all the info for the codfw kafka-logging order already, I was able to figure out most of them.

Can you advise what racks/rows you'd ideally prefer these in (or just avoid the existing kafka-logging rows?) via comment and assign to @Jclark-ctr for followup? Thanks!

fgiunchedi subscribed.

Hi @RobH,

re: racking since this is an expansion please allocate to new rows (compared to existing kafka-logging hosts)

thank you!

RobH updated the task description.
RobH unsubscribed.

kafka-logging1004: E2, U30, port 30, 20220047
kafka-logging1005: F2, U30, port 30, 20220048
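
The racking constraint from the description (avoid rows that already hold kafka-logging hosts) can be checked against these placements with a small sketch. The rack labels come from this task; the helper names `row_of` and `conflicting_hosts` and the data layout are assumptions made for illustration, not part of any existing tooling.

```python
# Racks already holding kafka-logging hosts, per the task description.
EXISTING_RACKS = ["A2", "C2", "D4"]

# Placements reported in the comment above.
PROPOSED = {"kafka-logging1004": "E2", "kafka-logging1005": "F2"}


def row_of(rack: str) -> str:
    """Return the row letter of a rack label such as 'E2' (hypothetical helper)."""
    return rack.strip().upper()[0]


def conflicting_hosts(proposed: dict, existing_racks: list) -> list:
    """List proposed hosts whose row already holds a kafka-logging host."""
    used_rows = {row_of(rack) for rack in existing_racks}
    return sorted(host for host, rack in proposed.items() if row_of(rack) in used_rows)


print(conflicting_hosts(PROPOSED, EXISTING_RACKS))  # [] means no row is shared
```

An empty list means rows E and F do not overlap with rows A, C, or D.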

Papaul subscribed.

HW raid setup on kafka-logging1004

Change 835254 had a related patch set uploaded (by Cmjohnson; author: Cmjohnson):

[operations/puppet@production] Adding new kafka-logging servers to site.pp

https://gerrit.wikimedia.org/r/835254

Change 835254 merged by Cmjohnson:

[operations/puppet@production] Adding new kafka-logging servers to site.pp

https://gerrit.wikimedia.org/r/835254

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host kafka-logging1004.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host kafka-logging1004.eqiad.wmnet with OS bullseye executed with errors:

  • kafka-logging1004 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host kafka-logging1004.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host kafka-logging1004.eqiad.wmnet with OS bullseye executed with errors:

  • kafka-logging1004 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

@herron for the RAID setup, are all the disks in RAID 50? I do not think the OS will install with that setup. There are 8 750GB SSDs.

Partitioning/Raid: HW Raid: Y, Partman recipe and/or desired Raid Level: raid50, using same config as other kafka-logging hosts

Yes, the other kafka-logging hosts were switched to raid50 (hardware) to provide additional capacity vs raid10. It should appear to the OS as a single device.
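
As a rough illustration of the capacity difference, the arithmetic for eight 750 GB SSDs is sketched below. The RAID 50 layout of two RAID 5 spans of four disks each is an assumption (the span configuration is not stated in this task), and the figures ignore controller and filesystem overhead.

```python
def raid10_usable_gb(disks: int, size_gb: int) -> int:
    """RAID 10 mirrors every disk, so half of the raw capacity is usable."""
    return (disks // 2) * size_gb


def raid50_usable_gb(disks: int, size_gb: int, spans: int = 2) -> int:
    """RAID 50 stripes across RAID 5 spans; each span gives up one disk to parity."""
    if disks % spans != 0 or disks // spans < 3:
        raise ValueError("each RAID 5 span needs an equal share of at least 3 disks")
    return (disks - spans) * size_gb


print(raid10_usable_gb(8, 750))  # 3000
print(raid50_usable_gb(8, 750))  # 4500
```

Under these assumptions RAID 50 yields about 1.5 times the usable space of RAID 10 on the same disks.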

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host kafka-logging1004.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host kafka-logging1004.eqiad.wmnet with OS bullseye executed with errors:

  • kafka-logging1004 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host kafka-logging1004.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host kafka-logging1004.eqiad.wmnet with OS bullseye executed with errors:

  • kafka-logging1004 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host kafka-logging1004.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host kafka-logging1004.eqiad.wmnet with OS bullseye completed:

  • kafka-logging1004 (WARN)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202210191608_cmjohnson_2267946_kafka-logging1004.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

@Jclark-ctr can you look at kafka-logging1005 and make sure the network cable is connected and in the right port? Sorry to bug you on this, but the install script fails immediately after typing go and the mgmt password.

@jbond if you have time tomorrow: I got the error below on kafka-logging1004. I checked that the upgrade completed with no issue, but the cookbook failed with the error below. Thanks.

Failed to perform GET request to https://10.65.3.35/redfish/v1/Systems/System.Embedded.1?$select=BiosVersion

@Volans I tried to run the reimage cookbook on kafka-logging1005 and I am getting the error below:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 335, in query
    hosts = query.Query(self._config).execute(query_string)
  File "/usr/lib/python3/dist-packages/cumin/query.py", line 65, in execute
    raise InvalidQueryError(
cumin.backends.InvalidQueryError: Unable to parse the query 'D{kafka-logging1005:.eqiad.wmnet}' neither with the default backend 'puppetdb' nor with the global grammar:
puppetdb: Expected end of text, found '{'  (at char 1), (line:1, col:2)
global: Expected end of text, found ':'  (at char 17), (line:1, col:18)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 219, in run
    runner = self.instance.get_runner(args)
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/reimage.py", line 88, in get_runner
    return ReimageRunner(args, self.spicerack)
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/reimage.py", line 138, in __init__
    self.remote_host = self.remote.query(f'D{{{self.fqdn}}}')  # Use the Direct backend instead
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 337, in query
    raise RemoteError("Failed to execute Cumin query") from e
spicerack.remote.RemoteError: Failed to execute Cumin query

Apart from the fact that the host is in planned state in Netbox and hence --new is required, the problem is that the DNS record is wrong in Netbox:
https://netbox.wikimedia.org/ipam/ip-addresses/11782/
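
The query in the traceback, 'D{kafka-logging1005:.eqiad.wmnet}', contains a stray colon in the hostname, presumably carried over from the wrong DNS record. A minimal sketch of catching this before a query is built follows. The functions `is_valid_fqdn` and `direct_query` are hypothetical and are not how spicerack or cumin handle it; only the `D{...}` query form is taken from the traceback.

```python
import re

# One DNS label: letters, digits, hyphens; no leading or trailing hyphen; 1-63 chars.
_LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def is_valid_fqdn(name: str) -> bool:
    """Return True if every dot-separated label looks like a valid hostname label."""
    if not name or len(name) > 253:
        return False
    return all(_LABEL.fullmatch(label) is not None for label in name.split("."))


def direct_query(fqdn: str) -> str:
    """Build a query string in the same D{...} form as the one in the traceback."""
    if not is_valid_fqdn(fqdn):
        raise ValueError(f"refusing to build a query from malformed FQDN: {fqdn!r}")
    return f"D{{{fqdn}}}"


print(direct_query("kafka-logging1005.eqiad.wmnet"))
print(is_valid_fqdn("kafka-logging1005:.eqiad.wmnet"))
```

The second call returns False because the label `kafka-logging1005:` contains a character that is not allowed in a hostname.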

@jbond if you have time tomorrow: I got the error below on kafka-logging1004. I checked that the upgrade completed with no issue, but the cookbook failed with the error below. Thanks.

Failed to perform GET request to https://10.65.3.35/redfish/v1/Systems/System.Embedded.1?$select=BiosVersion

I took a look at this and, as you mentioned, the upgrade completed with no issue. It looks like the redfish endpoint was having issues when we tried to make this call. I'm not really sure what to do right now other than adding some more arbitrary retries and sleeps. I'm tempted to say let's wait and see whether this was a one-off or not.
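
For reference, the "retries and sleeps" idea mentioned above might look roughly like the sketch below. It is a generic retry helper with exponential backoff, not the cookbook's actual code; the function `retry`, its parameters, and the simulated `fake_get` endpoint are all assumptions for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(call: Callable[[], T], attempts: int = 5, delay: float = 2.0,
          backoff: float = 2.0, sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `call` until it succeeds, sleeping longer after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except Exception:
            if attempt == attempts:
                raise  # out of retries: surface the last error
            sleep(delay)
            delay *= backoff
    raise AssertionError("unreachable")


# Simulated flaky endpoint (hypothetical): fails twice, then answers.
_responses = iter([ConnectionError("busy"), ConnectionError("busy"), {"status": "ok"}])


def fake_get() -> dict:
    item = next(_responses)
    if isinstance(item, Exception):
        raise item
    return item


print(retry(fake_get, sleep=lambda seconds: None))  # no real sleeping in the demo
```

Passing `sleep` as a parameter keeps the helper testable without waiting; whether retries are worthwhile at all depends on whether the failure turns out to be a one-off.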

@jbond I think the issue was what @Volans mentioned above. I didn't have the issue with another node that I worked on yesterday (kafka-jumbo1010). Thanks to both of you.

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kafka-logging1005.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kafka-logging1005.eqiad.wmnet with OS bullseye completed:

  • kafka-logging1005 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202211181524_pt1979_555542_kafka-logging1005.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)
Papaul updated the task description.

@herron this is complete