
Q1:rack/setup/install logstash103[67]
Closed, Resolved (Public)

Description

This task will track the racking, setup, and OS installation of logstash1036.eqiad.wmnet and logstash1037.eqiad.wmnet.

Hostname / Racking / Installation Details

Hostnames: logstash1036,logstash1037
Racking Proposal: Avoid sharing racks with other logstash* hosts if possible
Networking Setup: 10G, private, AAAA record ok, copy logstash1035 for connection redundancy (I think it's just one, though)
Partitioning/Raid: Copy logstash1035
OS Distro: Buster
Sub-team Technical Contact: herron | cwhite

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

logstash1036:
  • - receive in system on procurement task T311871 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
logstash1037:
  • - receive in system on procurement task T311871 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.

Related Objects

Status: Resolved
Assigned: Cmjohnson

Event Timeline

RobH updated the task description.
RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
RobH added a parent task: Unknown Object (Task).
RobH unsubscribed.

logstash1036 E1 U26 Port 26 Cableid 20220234
logstash1037 F1 U26 Port 26 Cableid 20220233
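
The racking notes above can be checked against the racking proposal (avoid sharing racks between logstash* hosts) with a small script. This is only a sketch: the field layout is inferred from the two lines above, the function names are hypothetical, and it only compares the hosts it is given, so existing logstash* hosts would have to be added to the input by hand.

```python
import re
from collections import defaultdict

# Racking notes copied verbatim from the comment above.
NOTES = """\
logstash1036 E1 U26 Port 26 Cableid 20220234
logstash1037 F1 U26 Port 26 Cableid 20220233
"""

# Assumed layout: <host> <rack> U<unit> Port <port> Cableid <cable id>
LINE_RE = re.compile(
    r"^(?P<host>\S+)\s+(?P<rack>[A-Z]\d+)\s+U(?P<unit>\d+)"
    r"\s+Port\s+(?P<port>\d+)\s+Cableid\s+(?P<cable>\d+)$"
)


def parse_racking(text):
    """Return one dict per racking line that matches the assumed layout."""
    records = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            records.append(match.groupdict())
    return records


def shared_racks(records):
    """Return {rack: [hosts]} for every rack holding more than one host."""
    by_rack = defaultdict(list)
    for record in records:
        by_rack[record["rack"]].append(record["host"])
    return {rack: hosts for rack, hosts in by_rack.items() if len(hosts) > 1}


records = parse_racking(NOTES)
print(shared_racks(records))  # {} (the two new hosts sit in different racks)
```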

@Cmjohnson just a heads up, there was a discrepancy with the port naming here.

You entered the port names for both of these servers as "ge-0/0/26", but as the connections are at 10G they should be named "xe-0/0/26". The report caught the mismatch of speeds within the same block of 4: https://netbox.wikimedia.org/extras/reports/results/3824193/

Apologies, I'd been working on some input validation in Netbox to prevent this from happening (https://gerrit.wikimedia.org/r/c/operations/software/netbox-extras/+/812376), but it's not merged yet.

For now I just renamed them manually in Netbox and re-ran homer to sort it out, so no action is needed there.
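
As a rough illustration of the check described in this comment (1G ports named ge-, 10G ports named xe-, and no mix of speeds inside one block of 4 ports), here is a minimal sketch. It is not the Netbox report or the netbox-extras validator linked above. The function name is hypothetical, it assumes blocks start at port numbers divisible by 4, and the neighbouring ports 24, 25 and 27 in the example are made up for illustration.

```python
import re
from collections import defaultdict

# Interface names of the form ge-0/0/26 (1G) or xe-0/0/26 (10G).
IFACE_RE = re.compile(r"^(?P<prefix>ge|xe)-(?P<fpc>\d+)/(?P<pic>\d+)/(?P<port>\d+)$")


def mixed_speed_blocks(interfaces, block_size=4):
    """Return the (fpc, pic, block index) keys whose ports mix ge- and xe- names.

    Assumption: a block covers ports [n * block_size, (n + 1) * block_size).
    """
    prefixes_by_block = defaultdict(set)
    for name in interfaces:
        match = IFACE_RE.match(name)
        if not match:
            continue  # ignore anything that is not a ge-/xe- interface
        key = (int(match["fpc"]), int(match["pic"]), int(match["port"]) // block_size)
        prefixes_by_block[key].add(match["prefix"])
    return sorted(key for key, prefixes in prefixes_by_block.items() if len(prefixes) > 1)


# Hypothetical neighbours (24, 25, 27) around the port from this task (26).
as_entered = ["xe-0/0/24", "xe-0/0/25", "ge-0/0/26", "xe-0/0/27"]
after_rename = ["xe-0/0/24", "xe-0/0/25", "xe-0/0/26", "xe-0/0/27"]

print(mixed_speed_blocks(as_entered))    # [(0, 0, 6)]
print(mixed_speed_blocks(after_rename))  # []
```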

@Jclark-ctr lsw1-e1-eqiad is not reporting any module in port 26. Is that properly cabled up? It needs to be re-checked anyway; it could be a faulty DAC if there is one there.

lsw1-f1-eqiad sees the cable from logstash1037 fine, so no problem in the other rack.

@cmooney sorry, the DAC was not seated completely. All good now.

Change 868107 had a related patch set uploaded (by Cmjohnson; author: Cmjohnson):

[operations/puppet@production] Adding logstash1036-37 to site.pp and netboot cfg

https://gerrit.wikimedia.org/r/868107

Change 868107 merged by Cmjohnson:

[operations/puppet@production] Adding logstash1036-37 to site.pp and netboot cfg

https://gerrit.wikimedia.org/r/868107

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host logstash1036.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host logstash1036.eqiad.wmnet with OS buster executed with errors:

  • logstash1036 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host logstash1037.eqiad.wmnet with OS buster

@Jclark-ctr I am getting a media test failure for logstash1037. Can you check the cable, please?

logstash1037 F1 U26 Port 26

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host logstash1037.eqiad.wmnet with OS buster executed with errors:

  • logstash1037 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host logstash1036.eqiad.wmnet with OS buster

@Jclark-ctr I am also getting a media test failure on logstash1036; the DAC cable may be plugged into the wrong port.

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host logstash1036.eqiad.wmnet with OS buster executed with errors:

  • logstash1036 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host logstash1036.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host logstash1037.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host logstash1036.eqiad.wmnet with OS buster completed:

  • logstash1036 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202301181859_pt1979_1696954_logstash1036.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)
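
Comparing the step list of the failed runs with this successful one shows where the earlier attempts stopped. The sketch below copies the step names from the cookbook summaries in this task and leaves out the final "The reimage failed" line. The helper name is hypothetical, and the cookbook logs remain the authoritative source for the actual failure.

```python
# Steps reported by the failed runs (identical for both hosts).
FAILED_STEPS = [
    "Removed from Puppet and PuppetDB if present",
    "Deleted any existing Puppet certificate",
    "Removed from Debmonitor if present",
    "Forced PXE for next reboot",
    "Host rebooted via IPMI",
]

# First steps reported by the successful run.
PASSED_STEPS = FAILED_STEPS + [
    "Host up (Debian installer)",
    "Host up (new fresh buster OS)",
]


def first_missing_step(failed, passed):
    """Return the first step of the passing run that the failed run never reached."""
    for index, step in enumerate(passed):
        if index >= len(failed) or failed[index] != step:
            return step
    return None


print(first_missing_step(FAILED_STEPS, PASSED_STEPS))  # Host up (Debian installer)
```

The failed runs never reached the Debian installer after the PXE reboot, which is consistent with the media test failures reported in the comments.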

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host logstash1037.eqiad.wmnet with OS buster completed:

  • logstash1037 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202301181919_pt1979_1699359_logstash1037.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)
Papaul added subscribers: herron, Papaul.

@herron @colewhite this is complete

lmata moved this task from Inbox to Radar on the observability board.