Page MenuHomePhabricator

Q2:(Need By: 2021-12-17) rack/setup/install elastic108[4-8]
Closed, ResolvedPublic

Description

This task will track the racking, setup, and OS installation of elastic108[4-8]

Hostname / Racking / Installation Details

Hostnames: elastic108[4-8]
Racking Proposal:

If possible, we'd like to match the rows of the hosts being replaced:

elastic1048 - A6
elastic1049 - B4
elastic1050 - B4
elastic1051 - C7
elastic1052 - C7

If matching the rows exactly is not possible, just try to balance the 5 new (refresh) hosts across the available rows as best as possible.

Additionally, please note that these must be placed in rows with 10G ports

Networking/Subnet/VLAN/IP: 10G, same subnet as existing elastic10* servers (private VLAN)
Partitioning/Raid: RAID0 software (elasticsearch-raid0.cfg - already configured for elastic* in netboot.cfg)
OS Distro: Buster (default unless otherwise specified)

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

elastic1084:

  • - receive in system on procurement task T291655 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initital puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

elastic1085:

  • - receive in system on procurement task T291655 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initital puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

elastic1086:

  • - receive in system on procurement task T291655 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initital puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

elastic1087:

  • - receive in system on procurement task T291655 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initital puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

elastic1088:

  • - receive in system on procurement task T291655 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initital puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

Once the system(s) above have had all checkbox steps completed, this task can be resolved.

Event Timeline

RobH mentioned this in Unknown Object (Task).
RobH added a parent task: Unknown Object (Task).
RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
RobH removed a subscriber: RobH.

@RKemper rack A6 is not 10g rack b4 has no space. Are there any other requirements before i rack in same rows 10g racks?

@RKemper rack A6 is not 10g rack b4 has no space. Are there any other requirements before i rack in same rows 10g racks?

@Jclark-ctr No hard requirements, we just want to try to have the whole fleet distributed across available rows as best as reasonably possible (but not at the cost of not being in a 10G row, ofc). Let me know if you want me to give specific suggestions and I'll take a closer look on Monday; otherwise I trust whatever you think is best.

Jclark-ctr added a subscriber: Jclark-ctr.

elastic1084 A4 U3 Cableid#1214202101 Port#12
elastic1085 B7 U12 Cableid#1214202102 Port#5
elastic1086 B7 U32 Cableid#1214202103 Port#35
elastic1087 C7 U21 Cableid#1214202104 Port#31
elastic1088 C7 U41 Cableid#1214202105 Port#33

I've setup the above hosts with the exception of elastic1085. I'm running some additional tests on in. I've updated the task description accordingly. I plan to give back also the last one fully configured tomorrow morning. Thanks for the collaboration

@Cmjohnson @Jclark-ctr All done from my side. BIOS/iDRAC should be all set up. It was done via the new cookbook. If you can have a double check to make sure all looks good that would be great.

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host elastic1084.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host elastic1085.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host elastic1086.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host elastic1087.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host elastic1088.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host elastic1086.eqiad.wmnet with OS buster executed with errors:

  • elastic1086 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host elastic1088.eqiad.wmnet with OS buster executed with errors:

  • elastic1088 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host elastic1084.eqiad.wmnet with OS buster completed:

  • elastic1084 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202201031600_cmjohnson_28496_elastic1084.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host elastic1085.eqiad.wmnet with OS buster completed:

  • elastic1085 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202201031604_cmjohnson_31414_elastic1085.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host elastic1087.eqiad.wmnet with OS buster completed:

  • elastic1087 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202201031605_cmjohnson_2406_elastic1087.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host elastic1086.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host elastic1088.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host elastic1086.eqiad.wmnet with OS buster completed:

  • elastic1086 (WARN)
    • Downtimed on Icinga
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202201031629_cmjohnson_28189_elastic1086.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host elastic1088.eqiad.wmnet with OS buster completed:

  • elastic1088 (WARN)
    • Downtimed on Icinga
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202201031631_cmjohnson_29380_elastic1088.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged
Cmjohnson updated the task description. (Show Details)

completed