
Q1:rack/setup/install db1204, db1205
Closed, ResolvedPublic

Description

This task will track the racking, setup, and OS installation of db120[45]

Hostname / Racking / Installation Details

Hostnames: db1XXX, the next in sequence for eqiad (db120[45])
Racking Proposal: Anywhere. The only restriction is to place the 2 new hosts as redundantly as possible (in different rows), as they will be used for redundancy.
Networking Setup: the 10G card should be on the private LAN, on the same network as the ms-backup1 and dbprov1 hosts. Only that interface needs connectivity (aside from management). I believe connecting it to a 1G switch is OK and the norm for now, and we don't yet have the capacity to put all db hosts on 10G switches. @Marostegui/netops to confirm, as I am disconnected from progress on network + dbs.
Partitioning/RAID: HW RAID level 10, the usual for maximum disk performance + redundancy on dbs. Partman recipe: partman/custom/db.cfg, the same as the other newly set up dbs.
OS Distro: Bullseye
Sub-team Technical Contact: Jaime @jcrespo (as a backup, @Marostegui or anyone on data persistence should know the basics of db setup; this is a regular db that will just be used for backups)
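The hostnames above use a glob-style bracket shorthand, where db120[45] stands for db1204 and db1205 (and the earlier db119[67] for db1196 and db1197). The sketch below expands only that simple form, one bracketed set of single characters. The helper name expand_hostnames is hypothetical and does not reproduce the range syntax of any particular tool, several of which read brackets as numeric ranges instead.

```python
from __future__ import annotations

import re

# Matches a prefix, one bracketed character set, and an optional suffix,
# e.g. "db120[45]" -> ("db120", "45", "").
_BRACKET = re.compile(r"^([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)$")


def expand_hostnames(pattern: str) -> list[str]:
    """Expand a single glob-style character class into hostnames.

    Hypothetical helper: each character inside the brackets becomes one
    host. Patterns without brackets are returned unchanged.
    """
    match = _BRACKET.match(pattern)
    if match is None:
        if "[" in pattern or "]" in pattern:
            raise ValueError(f"unsupported pattern: {pattern!r}")
        return [pattern]
    prefix, chars, suffix = match.groups()
    return [f"{prefix}{c}{suffix}" for c in chars]


if __name__ == "__main__":
    print(expand_hostnames("db120[45]"))  # ['db1204', 'db1205']
    print(expand_hostnames("db119[67]"))  # ['db1196', 'db1197']
```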

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

db1204:
  • - receive in system on procurement task T311859 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update: this should include updates to netboot.pp, and to site.pp with role(insetup) (cp systems use role(insetup::nofirm)).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
db1205:
  • - receive in system on procurement task T311859 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update: this should include updates to netboot.pp, and to site.pp with role(insetup) (cp systems use role(insetup::nofirm)).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
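Because every host gets an identical copy of the checklist, the per-host lists can be generated from a single template instead of pasted by hand. This is a minimal sketch: the step texts are abbreviated from the checklist above, and the function name render_checklists is hypothetical, not part of any Wikimedia tooling.

```python
from __future__ import annotations

# Setup steps abbreviated from the checklist above; the same list applies to
# every host, only the hostname header changes.
SETUP_STEPS = [
    "receive in system on procurement task T311859 & in coupa",
    "rack system with proposed racking plan & update netbox (state planned)",
    "add mgmt dns and production dns entries in netbox, run cookbook sre.dns.netbox",
    "network port setup via netbox, run homer from an active cumin host to commit",
    "bios/drac/serial setup/testing",
    "firmware update (idrac, bios, network, raid controller)",
    "operations/puppet update (netboot.pp, site.pp role(insetup))",
    "OS installation & initial puppet run via sre.hosts.reimage cookbook",
]


def render_checklists(hosts: list[str], steps: list[str]) -> str:
    """Return one unchecked checklist per host, in the order given."""
    blocks = []
    for host in hosts:
        lines = [f"{host}:"] + [f"  [ ] {step}" for step in steps]
        blocks.append("\n".join(lines))
    return "\n".join(blocks)


if __name__ == "__main__":
    print(render_checklists(["db1204", "db1205"], SETUP_STEPS))
```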

Event Timeline

RobH added a parent task: Unknown Object (Task).
RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
RobH unsubscribed.

@jcrespo @Marostegui Those hostnames were already in use, so I have entered db1204 and db1205 into netbox instead. Please confirm those names will work.

db1204: rack E3, U24, port 20, cable ID 20220227
db1205: rack F3, U24, port 20, cable ID 20220228
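The racking proposal asks for the two hosts to sit in different rows. A quick check of the placement above can be sketched as follows, assuming (as the "row E/F" note in the reimage output suggests) that the leading letter of an eqiad rack name such as E3 is its row. Both function names are hypothetical.

```python
from __future__ import annotations


def row_of(rack: str) -> str:
    """Return the row letter of a rack name such as "E3" (assumed format)."""
    if not rack or not rack[0].isalpha():
        raise ValueError(f"unexpected rack name: {rack!r}")
    return rack[0].upper()


def in_distinct_rows(placements: dict[str, str]) -> bool:
    """True if no two hosts in the mapping share a row."""
    rows = [row_of(rack) for rack in placements.values()]
    return len(rows) == len(set(rows))


if __name__ == "__main__":
    placements = {"db1204": "E3", "db1205": "F3"}
    print(in_distinct_rows(placements))  # True
```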


Those names are ok.

I would suggest editing the task title and description to reflect the new names, as otherwise it can be confusing.

jcrespo renamed this task from Q1:rack/setup/install db119[67] to Q1:rack/setup/install db1204, db1205. (Sep 23 2022, 9:20 AM)
jcrespo updated the task description.

@Jclark-ctr can you please confirm that those servers are connected to a 10G interface?
@Marostegui @jcrespo I am trying to set up those servers and I don't know whether they should use an IPv6 address or not; it is not mentioned in the description.

10G is also not strictly required at the moment. I would personally like to eventually have all dbs on 10G for fast backup recovery, which is why we buy 10G cards, but there is no formal plan for it yet (I believe it will require the network upgrade).

Change 861893 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] install_server: Add db1204, db1205 to the config to wipe disks on 1st install

https://gerrit.wikimedia.org/r/861893

Change 861894 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] Add new db node to site.pp and netboot.cfg

https://gerrit.wikimedia.org/r/861894

Change 861894 merged by Papaul:

[operations/puppet@production] Add new db node to site.pp and netboot.cfg

https://gerrit.wikimedia.org/r/861894

Change 861893 abandoned by Jcrespo:

[operations/puppet@production] install_server: Add db1204, db1205 to the config to wipe disks on setup

Reason:

redundant to 393d32d

https://gerrit.wikimedia.org/r/861893

Waiting on John to connect those servers to 1G ports, since they are currently connected to 10G ports, so I can redo the switch configuration and start the OS install.

I have connected them to 1G; it is port 44 now for both servers. The switch will need to be configured for that block to be 1G. @Papaul

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db1204.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db1205.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db1204.eqiad.wmnet with OS bullseye completed:

  • db1204 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202211291714_pt1979_1815992_db1204.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)
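The first-Puppet-run logs referenced in the cookbook output follow a consistent file-name pattern. The sketch below parses it, assuming the layout <YYYYMMDDHHMM>_<user>_<number>_<host>.out inferred from the two log paths in this task. The meaning of the numeric field (possibly a process ID) and the timezone of the timestamp are not stated here, and parse_reimage_log is a hypothetical name.

```python
import re
from datetime import datetime
from pathlib import PurePosixPath

# Assumed file-name layout, inferred from the logs in this task:
# <YYYYMMDDHHMM>_<user>_<number>_<host>.out
_LOG_NAME = re.compile(r"^(\d{12})_([^_]+)_(\d+)_([^_]+)\.out$")


def parse_reimage_log(path: str) -> dict:
    """Split a reimage log path into timestamp, user, number and host."""
    name = PurePosixPath(path).name
    match = _LOG_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognised log name: {name!r}")
    stamp, user, number, host = match.groups()
    return {
        "timestamp": datetime.strptime(stamp, "%Y%m%d%H%M"),
        "user": user,
        "number": int(number),  # meaning not documented in the task
        "host": host,
    }


if __name__ == "__main__":
    base = "/var/log/spicerack/sre/hosts/reimage/"
    first = parse_reimage_log(base + "202211291714_pt1979_1815992_db1204.out")
    second = parse_reimage_log(base + "202211291720_pt1979_1816032_db1205.out")
    print(second["timestamp"] - first["timestamp"])  # 0:06:00
```

Applied to both logs, this shows the two first Puppet runs were logged six minutes apart.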

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db1205.eqiad.wmnet with OS bullseye completed:

  • db1205 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202211291720_pt1979_1816032_db1205.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)
Papaul updated the task description.

This is complete

@jcrespo do you want/have a tracking task to productionize these hosts?