
Q1:rack/setup/install druid10[09-11]
Closed, Resolved, Public

Description

This task will track the racking, setup, and OS installation of druid10[09-11]

Hostname / Racking / Installation Details

Hostnames: druid10[09-11]
Racking Proposal: Please distribute anywhere between rows A-F
Networking Setup: 1 x 1Gb connection each please - private vlan (analytics vlan not required)
Partitioning/Raid: HW Raid: N, Partman recipe partman/raid10-8dev.cfg
OS Distro: Bullseye
Sub-team Technical Contact: Data-Engineering team - @BTullis SRE contact
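
The bracketed range in `druid10[09-11]` is shorthand for three hosts. As a minimal sketch of how that notation expands (the helper name `expand_hosts` is hypothetical, and this handles only a single `[NN-MM]` range, not the full host-selection syntax of any real tool):

```python
import re


def expand_hosts(pattern: str) -> list[str]:
    """Expand one 'prefix[NN-MM]suffix' range into individual hostnames."""
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", pattern)
    if match is None:
        # No range present: treat the pattern as a single hostname.
        return [pattern]
    prefix, start, end, suffix = match.groups()
    width = len(start)  # keep zero padding, e.g. "09" stays two digits
    return [
        f"{prefix}{number:0{width}d}{suffix}"
        for number in range(int(start), int(end) + 1)
    ]


hosts = expand_hosts("druid10[09-11]")
print(hosts)
```

For the range in this task, this yields druid1009, druid1010 and druid1011, matching the three per-host checklists below.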

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

druid1009:
  • - receive in system on procurement task T311755 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
druid1010:
  • - receive in system on procurement task T311755 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
druid1011:
  • - receive in system on procurement task T311755 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.

Related Objects

Status: Resolved
Assigned: Cmjohnson

Event Timeline

RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
RobH mentioned this in Unknown Object (Task).
RobH updated the task description. (Show Details)
RobH added a parent task: Unknown Object (Task).

druid1009 E2 U38 Port43 Cableid 23000023
druid1010 E3 U38 Port43 Cableid 23000054
druid1011 F2 U38 Port43 Cableid 23000028
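
The racking lines above follow a fixed layout (host, rack, unit, switch port, cable ID), so they can be checked against the racking proposal mechanically. The sketch below is only an illustration: the `Racking` type and `parse_racking` helper are hypothetical, and it assumes that the first letter of a rack name such as `E2` is the row.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Racking:
    host: str
    rack: str
    unit: int
    port: int
    cable_id: str


def parse_racking(line: str) -> Racking:
    """Parse a line such as 'druid1009 E2 U38 Port43 Cableid 23000023'."""
    host, rack, unit, port, label, cable_id = line.split()
    if not (unit.startswith("U") and port.startswith("Port") and label == "Cableid"):
        raise ValueError(f"unexpected racking line: {line!r}")
    return Racking(host, rack, int(unit[1:]), int(port[4:]), cable_id)


LINES = [
    "druid1009 E2 U38 Port43 Cableid 23000023",
    "druid1010 E3 U38 Port43 Cableid 23000054",
    "druid1011 F2 U38 Port43 Cableid 23000028",
]

records = [parse_racking(line) for line in LINES]

# Assumption: the row is the leading letter of the rack name.
rows = {record.rack[0] for record in records}
racks = {record.rack for record in records}

print("rows used:", sorted(rows))
print("one host per rack:", len(racks) == len(records))
print("all rows within A-F:", rows <= set("ABCDEF"))
```

Under that assumption, the three hosts sit in three different racks across rows E and F, which is consistent with the request to distribute them anywhere between rows A-F.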

The DNS has been updated, but I am not getting any mgmt connection. I need to check that the mgmt cables are connected.

Verified the mgmt cables: they are connected and have link.

The mgmt links are still not working. The DNS is correct, but I am unable to ping the servers.

@Cmjohnson I looked at druid10[09-11]: the BIOS has not been configured yet, and no IP address is set for the iDRAC. Have you run the BIOS script yet?

@BTullis can you please specify the exact partman recipe to use?
Thanks

Hi @Papaul - apologies for the delay in getting back to you.

Please could we use the partman/raid10-8dev.cfg recipe?

I think that this will require setting all of the disks to be 'non-RAID' mode, from the RAID controller configuration in the BIOS, as described here.

We could configure RAID10 in hardware if we have any issues with this software RAID approach, but I would rather try this MD RAID approach first. Thanks.
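
Once a host is installed with MD RAID, the array state can be read from `/proc/mdstat`. The following is a rough sketch of such a check, not something taken from this task: the helper names are hypothetical, the sample text (device names, block count) is made up for illustration, and the task does not say how many arrays the recipe creates or what they are called.

```python
import re

# Array header, e.g. "md0 : active raid10 sda2[0] sdb2[1] ..."
HEAD = re.compile(r"^(md\d+) : (\S+) (raid\d+)\b")
# Device counts, e.g. "[8/8]": devices expected / devices currently in use
COUNTS = re.compile(r"\[(\d+)/(\d+)\]")


def parse_mdstat(text: str) -> dict:
    """Return {array: {state, level, expected, active}} from mdstat text."""
    arrays = {}
    current = None
    for line in text.splitlines():
        head = HEAD.match(line)
        if head:
            current = head.group(1)
            arrays[current] = {
                "state": head.group(2),
                "level": head.group(3),
                "expected": None,
                "active": None,
            }
            continue
        counts = COUNTS.search(line)
        if counts and current is not None:
            arrays[current]["expected"] = int(counts.group(1))
            arrays[current]["active"] = int(counts.group(2))
    return arrays


def is_healthy_raid10(info: dict, devices: int = 8) -> bool:
    """True if the array is RAID10 with all expected devices in use."""
    return (
        info["level"] == "raid10"
        and info["expected"] == devices
        and info["active"] == devices
    )


# Made-up sample for illustration only; not output from these hosts.
SAMPLE = """\
Personalities : [raid10]
md0 : active raid10 sdh2[7] sdg2[6] sdf2[5] sde2[4] sdd2[3] sdc2[2] sdb2[1] sda2[0]
      1000000 blocks super 1.2 512K chunks 2 near-copies [8/8] [UUUUUUUU]
unused devices: <none>
"""

arrays = parse_mdstat(SAMPLE)
for name, info in arrays.items():
    print(name, info["level"], "healthy" if is_healthy_raid10(info) else "check")
```

On a real host the same function could be fed the contents of `/proc/mdstat` instead of the sample string.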

Change 881924 had a related patch set uploaded (by Jclark-ctr; author: jclark):

[operations/puppet@production] new servers druid10[09-11] adding basic install info

https://gerrit.wikimedia.org/r/881924

Change 881924 merged by Papaul:

[operations/puppet@production] new servers druid10[09-11] adding basic install info

https://gerrit.wikimedia.org/r/881924

Change 881929 had a related patch set uploaded (by Jclark-ctr; author: jclark):

[operations/puppet@production] new servers druid10[09-11] adding basic install info

https://gerrit.wikimedia.org/r/881929

Change 881929 merged by Papaul:

[operations/puppet@production] new servers druid10[09-11] adding basic install info

https://gerrit.wikimedia.org/r/881929

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host druid1010.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host druid1010.eqiad.wmnet with OS bullseye completed:

  • druid1010 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202301241420_jclark_161590_druid1010.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host druid1011.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host druid1011.eqiad.wmnet with OS bullseye completed:

  • druid1011 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202301241906_jclark_304421_druid1011.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host druid1009.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host druid1009.eqiad.wmnet with OS bullseye completed:

  • druid1009 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202301242107_jclark_365952_druid1009.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)
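
The three reimage log paths above encode a timestamp, the user, and the host in the file name, which is enough to reconstruct the order in which the hosts were installed. A minimal sketch follows; the `parse_log_path` helper is hypothetical, the meaning of the third numeric field is not stated in the task (it is kept as an opaque `run_id`), and the timestamp's time zone is likewise not stated.

```python
import re
from datetime import datetime
from pathlib import PurePosixPath

# <YYYYMMDDHHMM>_<user>_<number>_<host>.out
NAME = re.compile(r"^(\d{12})_(.+)_(\d+)_([^_]+)\.out$")


def parse_log_path(path: str) -> dict:
    """Split a reimage log path into its file name components."""
    match = NAME.match(PurePosixPath(path).name)
    if match is None:
        raise ValueError(f"unexpected log file name: {path!r}")
    stamp, user, run_id, host = match.groups()
    return {
        "started": datetime.strptime(stamp, "%Y%m%d%H%M"),
        "user": user,
        "run_id": run_id,  # assumption: meaning not documented in the task
        "host": host,
    }


LOGS = [
    "/var/log/spicerack/sre/hosts/reimage/202301241420_jclark_161590_druid1010.out",
    "/var/log/spicerack/sre/hosts/reimage/202301241906_jclark_304421_druid1011.out",
    "/var/log/spicerack/sre/hosts/reimage/202301242107_jclark_365952_druid1009.out",
]

runs = sorted((parse_log_path(path) for path in LOGS), key=lambda run: run["started"])
order = [run["host"] for run in runs]

for run in runs:
    print(run["started"].isoformat(timespec="minutes"), run["user"], run["host"])
```

Read this way, all three reimages ran on 2023-01-24, in the order druid1010, druid1011, druid1009.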