Q3:(Need By: TBD) rack/setup/install ms-be20[66-69]
Closed, Resolved, Public

Description

This task will track the racking, setup, and OS installation of ms-be20[66-69]

Hostname / Racking / Installation Details

Hostnames: ms-be20[66-69]
Racking Proposal: One host per row
Networking/Subnet/VLAN/IP: 10G private VLAN
Partitioning/Raid: Same as existing ms-be
OS Distro: Stretch
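
The bracket shorthand ms-be20[66-69] stands for four hostnames. As a minimal sketch (the helper name expand_hosts is hypothetical and not part of any tooling mentioned in this task), the expansion could be written as:

```python
import re


def expand_hosts(pattern: str) -> list[str]:
    """Expand one 'prefix[START-END]suffix' range into individual hostnames."""
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", pattern)
    if match is None:
        # No bracket range present: treat the input as a literal hostname.
        return [pattern]
    prefix, start, end, suffix = match.groups()
    width = len(start)  # preserve zero padding such as [08-11]
    return [
        f"{prefix}{number:0{width}d}{suffix}"
        for number in range(int(start), int(end) + 1)
    ]


print(expand_hosts("ms-be20[66-69]"))
# prints ['ms-be2066', 'ms-be2067', 'ms-be2068', 'ms-be2069']
```

This only handles a single numeric range per pattern, which is all the task title needs.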

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

ms-be2066 A4 U7 xe-4/0/6
  • - receive in system on procurement task T297732 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
ms-be2067 B4 U9 xe-4/0/8
  • - receive in system on procurement task T297732 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
ms-be2068 C2 U3 xe-2/0/2
  • - receive in system on procurement task T297732 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
ms-be2069 D2 U5 xe-2/0/4
  • - receive in system on procurement task T297732 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.

Event Timeline

RobH mentioned this in Unknown Object (Task).
RobH added a parent task: Unknown Object (Task).

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be2067.codfw.wmnet with OS stretch

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be2067.codfw.wmnet with OS stretch executed with errors:

  • ms-be2067 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be2067.codfw.wmnet with OS stretch

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be2067.codfw.wmnet with OS stretch completed:

  • ms-be2067 (WARN)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh stretch OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202202230004_pt1979_598903_ms-be2067.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be2066.codfw.wmnet with OS stretch

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be2068.codfw.wmnet with OS stretch

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be2066.codfw.wmnet with OS stretch executed with errors:

  • ms-be2066 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh stretch OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202202230108_pt1979_609327_ms-be2066.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be2066.codfw.wmnet with OS stretch

@fgiunchedi Puppet is failing on ms-be2067 and ms-be2068 with the error below. If you are back online, could you please check? Thanks.

Error: 'parted --script --align optimal /dev/sdz mklabel gpt mkpart swift-sdz1 1M 100%' returned 1 instead of one of [0]
Error: /Stage[main]/Profile::Swift::Storage/Swift::Init_device[/dev/sdz]/Exec[parted-/dev/sdz]/returns: change from 'notrun' to ['0'] failed: 'parted --script --align optimal /dev/sdz mklabel gpt mkpart swift-sdz1 1M 100%' returned 1 instead of one of [0] (corrective)

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be2066.codfw.wmnet with OS stretch executed with errors:

  • ms-be2066 (FAIL)
    • Downtimed on Icinga
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh stretch OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202202230151_pt1979_616182_ms-be2066.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be2068.codfw.wmnet with OS stretch executed with errors:

  • ms-be2068 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh stretch OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202202230119_pt1979_609883_ms-be2068.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • The reimage failed, see the cookbook logs for the details

Thank you for the heads up, I'll check Puppet on 2067. Regarding 2068, it looks like not all disks are visible to Linux (drive 2 is missing from the list below), so sdz does not exist. 2066 has the same problem, and indeed it looks like one drive is showing up as unconfigured:

root@ms-be2068:~# megacli -PdList -aALL | grep state:
Firmware state: Unconfigured(good), Spun Up
root@ms-be2068:~# ls -la /dev/disk/by-path/ | grep -v part | sort -k11
drwxr-xr-x 2 root root 1160 Feb 23 01:51 .
drwxr-xr-x 8 root root  160 Feb 23 01:51 ..
total 0
lrwxrwxrwx 1 root root    9 Feb 23 01:51 pci-0000:18:00.0-scsi-0:2:0:0 -> ../../sda
lrwxrwxrwx 1 root root    9 Feb 23 01:51 pci-0000:18:00.0-scsi-0:2:1:0 -> ../../sdb
lrwxrwxrwx 1 root root    9 Feb 23 01:51 pci-0000:18:00.0-scsi-0:2:3:0 -> ../../sdc
lrwxrwxrwx 1 root root    9 Feb 23 01:51 pci-0000:18:00.0-scsi-0:2:4:0 -> ../../sdd
lrwxrwxrwx 1 root root    9 Feb 23 01:51 pci-0000:18:00.0-scsi-0:2:5:0 -> ../../sde
lrwxrwxrwx 1 root root    9 Feb 23 01:51 pci-0000:18:00.0-scsi-0:2:6:0 -> ../../sdf
lrwxrwxrwx 1 root root    9 Feb 23 01:51 pci-0000:18:00.0-scsi-0:2:7:0 -> ../../sdg
lrwxrwxrwx 1 root root    9 Feb 23 01:51 pci-0000:18:00.0-scsi-0:2:8:0 -> ../../sdh
lrwxrwxrwx 1 root root    9 Feb 23 01:51 pci-0000:18:00.0-scsi-0:2:9:0 -> ../../sdi
lrwxrwxrwx 1 root root    9 Feb 23 01:51 pci-0000:18:00.0-scsi-0:2:10:0 -> ../../sdj
lrwxrwxrwx 1 root root    9 Feb 23 01:51 pci-0000:18:00.0-scsi-0:2:11:0 -> ../../sdk
lrwxrwxrwx 1 root root    9 Feb 23 01:51 pci-0000:18:00.0-scsi-0:2:12:0 -> ../../sdl
lrwxrwxrwx 1 root root    9 Feb 23 01:51 pci-0000:18:00.0-scsi-0:2:13:0 -> ../../sdm
lrwxrwxrwx 1 root root    9 Feb 23 01:51 pci-0000:18:00.0-scsi-0:2:14:0 -> ../../sdn
lrwxrwxrwx 1 root root    9 Feb 23 01:51 pci-0000:18:00.0-scsi-0:2:15:0 -> ../../sdo
lrwxrwxrwx 1 root root    9 Feb 23 01:51 pci-0000:18:00.0-scsi-0:2:16:0 -> ../../sdp
lrwxrwxrwx 1 root root    9 Feb 23 01:51 pci-0000:18:00.0-scsi-0:2:17:0 -> ../../sdq
lrwxrwxrwx 1 root root    9 Feb 23 01:51 pci-0000:18:00.0-scsi-0:2:18:0 -> ../../sdr
lrwxrwxrwx 1 root root    9 Feb 23 01:51 pci-0000:18:00.0-scsi-0:2:19:0 -> ../../sds
lrwxrwxrwx 1 root root    9 Feb 23 01:51 pci-0000:18:00.0-scsi-0:2:20:0 -> ../../sdt
lrwxrwxrwx 1 root root    9 Feb 23 01:51 pci-0000:18:00.0-scsi-0:2:21:0 -> ../../sdu
lrwxrwxrwx 1 root root    9 Feb 23 01:51 pci-0000:18:00.0-scsi-0:2:22:0 -> ../../sdv
lrwxrwxrwx 1 root root    9 Feb 23 01:51 pci-0000:18:00.0-scsi-0:2:23:0 -> ../../sdw
lrwxrwxrwx 1 root root    9 Feb 23 01:51 pci-0000:18:00.0-scsi-0:2:24:0 -> ../../sdx
lrwxrwxrwx 1 root root    9 Feb 23 01:51 pci-0000:18:00.0-scsi-0:2:25:0 -> ../../sdy
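
The gap in the listing above (the SCSI address jumps from 0:2:1:0 to 0:2:3:0) can also be spotted mechanically. The following is a minimal sketch, not part of the actual tooling: the function names are hypothetical, the sample is an abbreviated copy of the listing, and the third address field is assumed to identify the drive.

```python
import re

# Abbreviated sample copied from the by-path listing in this task.
LISTING = """\
lrwxrwxrwx 1 root root    9 Feb 23 01:51 pci-0000:18:00.0-scsi-0:2:0:0 -> ../../sda
lrwxrwxrwx 1 root root    9 Feb 23 01:51 pci-0000:18:00.0-scsi-0:2:1:0 -> ../../sdb
lrwxrwxrwx 1 root root    9 Feb 23 01:51 pci-0000:18:00.0-scsi-0:2:3:0 -> ../../sdc
lrwxrwxrwx 1 root root    9 Feb 23 01:51 pci-0000:18:00.0-scsi-0:2:4:0 -> ../../sdd
"""

# Matches whole-disk entries only; "-partN" links do not match this pattern.
ENTRY_RE = re.compile(r"scsi-\d+:\d+:(\d+):\d+ -> \.\./\.\./(sd[a-z]+)$")


def parse_targets(listing):
    """Map the third SCSI address field to the kernel device name."""
    targets = {}
    for line in listing.splitlines():
        match = ENTRY_RE.search(line.strip())
        if match:
            targets[int(match.group(1))] = match.group(2)
    return targets


def missing_targets(targets):
    """Return IDs absent between 0 and the highest ID that was seen."""
    if not targets:
        return []
    return [t for t in range(max(targets) + 1) if t not in targets]


print(missing_targets(parse_targets(LISTING)))
# prints [2]
```

Because device letters are handed out in order to the disks that are present, every drive after the gap shifts down by one letter, which is why the full listing ends at sdy and /dev/sdz does not exist. Note that this check cannot detect a drive missing from the end of the range.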

@fgiunchedi thanks, I will check and see why the drive is missing.

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be2068.codfw.wmnet with OS stretch

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be2068.codfw.wmnet with OS stretch executed with errors:

  • ms-be2068 (FAIL)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh stretch OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202202231448_pt1979_707049_ms-be2068.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be2068.codfw.wmnet with OS stretch

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be2068.codfw.wmnet with OS stretch completed:

  • ms-be2068 (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh stretch OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202202231536_pt1979_714995_ms-be2068.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be2066.codfw.wmnet with OS stretch

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be2066.codfw.wmnet with OS stretch executed with errors:

  • ms-be2066 (FAIL)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be2066.codfw.wmnet with OS stretch

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be2066.codfw.wmnet with OS stretch completed:

  • ms-be2066 (WARN)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202202231721_pt1979_727327_ms-be2066.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be2069.codfw.wmnet with OS stretch

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be2069.codfw.wmnet with OS stretch executed with errors:

  • ms-be2069 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be2069.codfw.wmnet with OS stretch

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be2069.codfw.wmnet with OS stretch completed:

  • ms-be2069 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh stretch OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202202232250_pt1979_767134_ms-be2069.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

@fgiunchedi this is complete, after long hours of workarounds because Puppet was failing at

mkfs on /dev/sdc1

Hopefully we will have a fix for this in the future.

Thanks.

Thank you for your persistence on this @Papaul. Indeed, the disk ordering issue is known :( We don't have a good way to re-initialize all disks if some are missing before the first Puppet run. What I've done for now is strip the filesystems and let Puppet re-create them.

cc @MatthewVernon for visibility, as the hosts are otherwise good to go now