
Q3: (Need By: TBD) rack/setup/install contint2002, gerrit2002
Closed, Resolved, Public

Description

This task will track the racking, setup, and OS installation of gerrit2002 and contint2002.

Hostname / Racking / Installation Details

Hostnames: gerrit2002, contint2002
Racking Proposal: Any 1G rack with a public VLAN; these hosts replace the ones currently running these services in codfw.
Networking: Single 1G public VLAN with IPv4/IPv6
Partitioning: 2-device RAID1
OS: gerrit2002: bullseye, contint2002: buster

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

contint2002 B5 U19 ge-5/0/20
  • receive in system on procurement task 299081 & in Coupa
  • rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • add mgmt DNS (asset tag and hostname) and production DNS entries in Netbox, run cookbook sre.dns.netbox.
  • network port setup via Netbox, run homer to commit
  • BIOS/DRAC/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • firmware update (iDRAC, BIOS, network, RAID controller)
  • operations/puppet update: this should include updates to netboot.pp, and to site.pp with role(insetup) (cp systems use role(insetup::nofirm) instead).
  • OS installation & initial Puppet run via the sre.hosts.reimage cookbook.
gerrit2002 B5 U20 ge-5/0/24
  • receive in system on procurement task 299081 & in Coupa
  • rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • add mgmt DNS (asset tag and hostname) and production DNS entries in Netbox, run cookbook sre.dns.netbox.
  • network port setup via Netbox, run homer to commit
  • BIOS/DRAC/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • firmware update (iDRAC, BIOS, network, RAID controller)
  • operations/puppet update: this should include updates to netboot.pp, and to site.pp with role(insetup) (cp systems use role(insetup::nofirm) instead).
  • OS installation & initial Puppet run via the sre.hosts.reimage cookbook.

Event Timeline

RobH mentioned this in Unknown Object (Task).
RobH added a parent task: Unknown Object (Task).
RobH unsubscribed.

Change 763252 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] Add conting2002 and gerrit2002 to site.pp and netboot

https://gerrit.wikimedia.org/r/763252

Change 763252 merged by Papaul:

[operations/puppet@production] Add conting2002 and gerrit2002 to site.pp and netboot

https://gerrit.wikimedia.org/r/763252

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host contint2002.wikimedia.org with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host gerrit2002.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host contint2002.wikimedia.org with OS buster completed:

  • contint2002 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202202161640_pt1979_3692413_contint2002.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host gerrit2002.wikimedia.org with OS bullseye completed:

  • gerrit2002 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202202161658_pt1979_3693591_gerrit2002.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

Papaul updated the task description.

@Dzahn @akosiaris, this is complete.

Hi @Papaul, gerrit2001 is in D5 and the new server gerrit2002 is in B5. Because of the way we want to migrate (T243027#7732585), we need a DNS name, gerrit-replica, that moves between hosts. Apparently, because of the way DNS names are generated, we can only move it between hosts in the same row.

Would it be possible to move this host over to D5 please?

@Dzahn, is there any reason this information was not provided to us during the initial creation of this task and during the installation process?

@Dzahn, are you planning on re-imaging the server after the move, so that I know which approach to take for the IP change?

@Papaul Yea, reimaging is no problem. It's still in "insetup" and I can do it. Pick the easier option for you.

@Dzahn re-imaging will be the easier option.

Thanks

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host gerrit2002.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host gerrit2002.wikimedia.org with OS bullseye executed with errors:

  • gerrit2002 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host gerrit2002.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host gerrit2002.wikimedia.org with OS bullseye completed:

  • gerrit2002 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202205191557_pt1979_1432215_gerrit2002.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

This is complete.