Q2:rack/setup/install dbprov1004
Closed, Resolved (Public)

Description

This task will track the racking, setup, and OS installation of dbprov1004.

Please note this host will be one of the first PowerEdge R650s we receive, and may have some implementation hurdles due to that.

Hostname / Racking / Installation Details

Hostnames: dbprov1004
Racking Proposal: dbprov100[123] are in racks A7, B7, and C7. Place this host as redundantly as is reasonable: avoid those rows if possible; if not, any rack other than those three is fine, to facilitate maintenance later.
Networking Setup: 1 connection at 10G. Production network. AAAA records: Y. No additional IP records (other than mgmt).
Partitioning/Raid: HW RAID 6 for the HDDs (first disk) and RAID 0 for the SSDs (second disk). Recipe: partman/custom/db.cfg (assuming sda is the HDDs)
OS Distro: Bullseye
Sub-team Technical Contact: Jaime (this is a backups host); as a backup, anyone from data persistence
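
The racking preference above can be read as a two-tier rule: a rack in a row that holds no existing dbprov host is best, a different rack in an already-used row is acceptable, and a rack that already holds a dbprov host is ruled out. A minimal Python sketch of that reading follows. The function name, the tier labels, and the assumption that the first letter of a rack name is its row are illustrative only and not part of any Wikimedia tooling; A4 is a made-up candidate.

```python
# Illustrative sketch only: rank candidate racks against the stated preference.
# Assumption: a rack name is "<row letter><rack number>", e.g. "A7" is in row A.

OCCUPIED = {"A7", "B7", "C7"}  # racks holding dbprov100[123], per the proposal


def placement_tier(candidate, occupied):
    """Return 'preferred', 'acceptable', or 'rejected' for a candidate rack."""
    if candidate in occupied:
        # Same rack as an existing dbprov host: ruled out by the proposal.
        return "rejected"
    occupied_rows = {rack[0] for rack in occupied}
    if candidate[0] in occupied_rows:
        # Same row but a different rack: allowed as a fallback.
        return "acceptable"
    # A row with no existing dbprov host: the redundant option.
    return "preferred"


for rack in ("D7", "A4", "B7"):
    print(rack, placement_tier(rack, OCCUPIED))
```

Under this reading, a rack in row D lands in the preferred tier.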

Per host setup checklist

dbprov1004:
  • - receive in system on procurement task T319442 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update: this should include updates to netboot.pp and site.pp, using role(insetup), or role(insetup::nofirm) for cp systems.
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.

Related Objects

Status: Resolved
Assigned: Cmjohnson

Event Timeline

RobH mentioned this in Unknown Object (Task).
RobH added a parent task: Unknown Object (Task).
RobH updated the task description.
RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
RobH unsubscribed.
Jclark-ctr subscribed.

dbprov1004 D7 U31 cableID 4901 port19
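
For anyone transcribing a racking note like the one above into an inventory by hand, a short Python sketch of splitting it into fields may help. The field names and the regular expression are assumptions drawn from this single line, not a documented format.

```python
import re

# Assumed layout: "<host> <rack> U<unit> cableID <id> port<number>".
RACKING_NOTE = re.compile(
    r"(?P<host>\S+)\s+(?P<rack>[A-Z]\d+)\s+U(?P<unit>\d+)\s+"
    r"cableID\s+(?P<cable_id>\d+)\s+port\s*(?P<port>\d+)"
)


def parse_racking_note(line):
    """Split a racking note into named fields, or return None if it does not match."""
    match = RACKING_NOTE.fullmatch(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the numeric fields so they can be compared or sorted as integers.
    for key in ("unit", "cable_id", "port"):
        fields[key] = int(fields[key])
    return fields


print(parse_racking_note("dbprov1004 D7 U31 cableID 4901 port19"))
```

Returning None for a non-matching line keeps a malformed note from being silently recorded with missing fields.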

@Jclark-ctr FYI I pushed the config for this port to the switch with Homer now.

Change 859625 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] Add puppetdb1003 and dbprov1004 to site.pp and netboox.cfg

https://gerrit.wikimedia.org/r/859625

Change 859625 merged by Papaul:

[operations/puppet@production] Add puppetdb1003 and dbprov1004 to site.pp and netboox.cfg

https://gerrit.wikimedia.org/r/859625

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dbprov1004.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host dbprov1004.eqiad.wmnet with OS bullseye completed:

  • dbprov1004 (WARN)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202211222316_pt1979_172251_dbprov1004.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
Papaul updated the task description.
Papaul subscribed.

@jcrespo this is done