Q4: (Need By: TBD) rack/setup/install cloudcontrol2005-dev, clouddb2002-dev, cloudgw2003-dev
Closed, Resolved (Public)

Description

This task will track the racking, setup, and OS installation of cloudcontrol2005-dev, clouddb2002-dev, and cloudgw2003-dev.

Hostname / Racking / Installation Details

Hostnames: cloudcontrol2005-dev, clouddb2002-dev, cloudgw2003-dev
Racking Proposal: Use WMCS racks. Cloudcontrol must be placed in a row without other cloudcontrol nodes (which are in C1 and D1). Likewise, cloudgw must not share the same row with cloudgw2002-dev (which is in B5).
Networking/Subnet/VLAN/IP:
cloudcontrol2005-dev: 1 10G connection to public1
clouddb2002-dev: 1 10G connection to cloud-hosts1-codfw
cloudgw2003-dev: 2 10G connections. eth0/eno1 on cloud-hosts1-b-codfw (vlan 2118) 10.192.20.0/24 (untagged); eth1/eno2 on a trunk with cloud-instances-transport1-b-codfw (vlan 2120) and cloud-gw-transport-codfw (vlan 2107), with no address allocated.
Partitioning/Raid: RAID10 (hardware raid) and then 'partman/standard.cfg partman/hwraid-1dev.cfg'
OS Distro: Bullseye
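
The row constraints in the racking proposal above can be checked mechanically. This is a minimal sketch, assuming the first letter of a rack name such as C1 is its row and that the site has rows A to D; the function names and the row list are hypothetical and not part of any WMF tooling.

```python
def row_of(rack: str) -> str:
    """Return the row letter of a rack name such as 'C1' (assumed naming scheme)."""
    return rack[0].upper()


def allowed_rows(all_rows, occupied_racks):
    """Return the rows that hold no existing node of the same role."""
    taken = {row_of(rack) for rack in occupied_racks}
    return [row for row in all_rows if row not in taken]


# Existing placements quoted in the racking proposal.
cloudcontrol_racks = ["C1", "D1"]
cloudgw_racks = ["B5"]

# Assumption: rows A-D exist at this site; adjust to the real layout.
rows = ["A", "B", "C", "D"]

print(allowed_rows(rows, cloudcontrol_racks))  # ['A', 'B']
print(allowed_rows(rows, cloudgw_racks))       # ['A', 'C', 'D']
```

The sketch only models the "different row" rule; the requirement to use WMCS racks would still need to be applied to the resulting candidates.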

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

cloudcontrol2005-dev: ge-1/0/14
  • - receive in system on procurement task T303441 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
clouddb2002-dev: ge-1/0/15
  • - receive in system on procurement task T303441 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
cloudgw2003-dev: ge-1/0/[16-17]
  • - receive in system on procurement task T303441 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.

Event Timeline

RobH mentioned this in Unknown Object (Task). Apr 25 2022, 11:46 PM
RobH added a parent task: Unknown Object (Task).
RobH unsubscribed.

For these boxes, please make one big hardware raid10 out of all the drives, and then use partman/standard.cfg partman/hwraid-1dev.cfg.

Change 809239 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] Add new Cloud node to site.pp and to netboot.cfg

https://gerrit.wikimedia.org/r/809239

Change 809240 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] Add new cloud nodes to site.pp and netboot.cfg

https://gerrit.wikimedia.org/r/809240

Change 809239 merged by Papaul:

[operations/puppet@production] Add new Cloud node to site.pp and to netboot.cfg

https://gerrit.wikimedia.org/r/809239

Change 809240 merged by Papaul:

[operations/puppet@production] Add new cloud nodes to site.pp and netboot.cfg

https://gerrit.wikimedia.org/r/809240

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudcontrol2005-dev.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudcontrol2005-dev.wikimedia.org with OS bullseye completed:

  • cloudcontrol2005-dev (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206281922_pt1979_1421692_cloudcontrol2005-dev.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host clouddb2002-dev.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host clouddb2002-dev.codfw.wmnet with OS bullseye executed with errors:

  • clouddb2002-dev (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host clouddb2002-dev.codfw.wmnet with OS bullseye

Change 809304 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] Replae labs-hosts1-b-codfw with cloud-hosts1-b

https://gerrit.wikimedia.org/r/809304

Change 809304 merged by Papaul:

[operations/puppet@production] Replae labs-hosts1-b-codfw with cloud-hosts1-b

https://gerrit.wikimedia.org/r/809304

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host clouddb2002-dev.codfw.wmnet with OS bullseye executed with errors:

  • clouddb2002-dev (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

@ayounsi
On clouddb2002-dev I was getting the error message below:

  Failed to retrieve the preconfiguration file
  The file needed for preconfiguration could not be retrieved from
  http://apt.wikimedia.org/autoinstall/subnets/labs-hosts1-b-codfw.cfg.
  The installation will proceed in non-automated mode.

The file subnets/labs-hosts1-b-codfw.cfg was replaced with cloud-hosts1-codfw.cfg, but the reference in the netboot.cfg file was never changed:
        10.192.20.1) echo subnets/labs-hosts1-b-codfw.cfg ;; \

I submitted a patch to change it to use cloud-hosts1-codfw.cfg.
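
The failure above comes from a gateway entry in netboot.cfg pointing at a subnet file that no longer exists. A check along the following lines might catch such stale references before a reimage. It is only a sketch: the dictionary and the set of available files are hypothetical stand-ins built from the names quoted in this comment, not the real contents of netboot.cfg or the autoinstall directory.

```python
import ipaddress

# Gateway -> preseed include, mirroring the netboot.cfg entry quoted above.
# Hypothetical stand-in: the real file is a shell case statement, not a dict.
gateway_to_cfg = {"10.192.20.1": "subnets/labs-hosts1-b-codfw.cfg"}

# Assumption: files present under autoinstall/ after the rename.
available = {"subnets/cloud-hosts1-codfw.cfg"}


def stale_entries(mapping, existing):
    """Return the gateway entries whose preseed file is not in `existing`."""
    return {gw: cfg for gw, cfg in mapping.items() if cfg not in existing}


def gateway_in_subnet(gateway, subnet):
    """True if the gateway address lies inside the given subnet."""
    return ipaddress.ip_address(gateway) in ipaddress.ip_network(subnet)


print(stale_entries(gateway_to_cfg, available))
# {'10.192.20.1': 'subnets/labs-hosts1-b-codfw.cfg'}

# 10.192.20.0/24 is the subnet given for cloud-hosts1-b-codfw in the task description.
print(gateway_in_subnet("10.192.20.1", "10.192.20.0/24"))  # True
```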

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host clouddb2002-dev.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host clouddb2002-dev.codfw.wmnet with OS bullseye completed:

  • clouddb2002-dev (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206282227_pt1979_1460329_clouddb2002-dev.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudgw2003-dev.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudgw2003-dev.codfw.wmnet with OS bullseye completed:

  • cloudgw2003-dev (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206282320_pt1979_1468378_cloudgw2003-dev.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged
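
The cookbook reports above share one shape: a top-level bullet with the host and a PASS or FAIL marker, followed by indented step bullets. The sketch below pulls out the per-host result from text pasted in that shape. It assumes the bullet layout seen in this task; the function name and regular expression are illustrative and not part of spicerack or the cookbooks.

```python
import re

# Matches summary lines such as "  • clouddb2002-dev (FAIL)".
SUMMARY_RE = re.compile(r"^\s*•\s*(?P<host>\S+)\s+\((?P<status>PASS|FAIL)\)\s*$")


def reimage_results(text):
    """Return (host, status) tuples in the order they appear in the text."""
    results = []
    for line in text.splitlines():
        match = SUMMARY_RE.match(line)
        if match:
            results.append((match.group("host"), match.group("status")))
    return results


# Excerpt assembled from the reports in this task.
log = """\
  • cloudcontrol2005-dev (PASS)
    • Removed from Puppet and PuppetDB if present
  • clouddb2002-dev (FAIL)
    • The reimage failed, see the cookbook logs for the details
  • clouddb2002-dev (PASS)
    • Updated Netbox status planned -> staged
"""

for host, status in reimage_results(log):
    print(host, status)
```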
Papaul updated the task description.

@Andrew thanks for getting me the partman recipe info. This is complete.

Change 822432 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] clouddb2002-dev: make a db node

https://gerrit.wikimedia.org/r/822432

Change 822432 merged by Andrew Bogott:

[operations/puppet@production] clouddb2002-dev: make a db node

https://gerrit.wikimedia.org/r/822432

Change 822637 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Make cloudcontrol2005-dev a cloudcontrol node

https://gerrit.wikimedia.org/r/822637

Change 822638 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Replace cloudcontrol2001-dev with cloudcontrol2005-dev.

https://gerrit.wikimedia.org/r/822638

Change 822637 merged by Andrew Bogott:

[operations/puppet@production] Make cloudcontrol2005-dev a cloudcontrol node

https://gerrit.wikimedia.org/r/822637

Change 822638 merged by Andrew Bogott:

[operations/puppet@production] Replace cloudcontrol2001-dev with cloudcontrol2005-dev.

https://gerrit.wikimedia.org/r/822638

Change 822643 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/dns@master] wikimediacloud.org: replace cloudcontrol2001-dev with cloudcontrol2005-dev

https://gerrit.wikimedia.org/r/822643

Change 822643 merged by Andrew Bogott:

[operations/dns@master] wikimediacloud.org: replace cloudcontrol2001-dev with cloudcontrol2005-dev

https://gerrit.wikimedia.org/r/822643

Change 822646 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] acme_chief: permit access to cloudcontrol2005-dev

https://gerrit.wikimedia.org/r/822646

Change 822646 merged by Andrew Bogott:

[operations/puppet@production] acme_chief: permit access to cloudcontrol2005-dev

https://gerrit.wikimedia.org/r/822646

cloudcontrol2005-dev and clouddb2002-dev are now in service.

I don't feel confident setting up cloudgw2003-dev so will probably wait on Arturo for that one.