Q1: rack/setup/install contint1002
Closed, Resolved, Public

Description

This task will track the racking, setup, and OS installation of contint1002.

Hostname / Racking / Installation Details

Hostnames: contint1002
Racking Proposal: Any 1G rack with a public vlan; this host replaces the one currently running these services in eqiad.
Networking Setup: single 1G public vlan with IPv4/IPv6
Partitioning/RAID: two-device RAID 1 (a verification sketch follows below)
OS Distro: Buster
Sub-team Technical Contact: @LSobanski
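
As a quick reference for the requested layout, a post-install check of the two-device RAID 1 might look like this (a sketch only; the commands and device names are assumptions, not part of the task):

    cat /proc/mdstat           # expect md arrays with two active members, e.g. "[UU]"
    lsblk -o NAME,SIZE,TYPE    # both physical disks should appear under the md devices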

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

contint1002:
  • - receive in system on procurement task T311856 & in Coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp and site.pp with role(insetup); cp systems use role(insetup::nofirm) instead.
  • - OS installation & initial puppet run via the sre.hosts.reimage cookbook (see the command sketch after this list).
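
For orientation, the cookbook and homer steps in the checklist above might be invoked from a cumin host roughly as follows (a hedged sketch: the exact flags, the switch selector, and the task ID are assumptions; check the current SRE documentation before running anything):

    sudo cookbook sre.dns.netbox "Add mgmt and production DNS for contint1002"
    sudo homer "asw2-b*" commit "Enable switch port for contint1002"       # switch selector assumed from rack B1
    sudo cookbook sre.hosts.reimage --os buster -t T<task-id> contint1002  # T<task-id> is a placeholder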

Event Timeline

RobH added a parent task: Unknown Object (Task).
RobH mentioned this in Unknown Object (Task).

contint1002 B1 U38 port38 cableid 23000029

Papaul edited projects, added ops-eqiad; removed ops-codfw.
Papaul subscribed.

I added these to Netbox, but when I ran the DNS script and homer, nothing changed.

@Cmjohnson what's the expected ETA for this host? Asking as contint1001 seems to be nearing the end of its life and we'd like to move ahead with the replacement as quickly as possible.

Note the contint machines require a public IPv4 address in order to be able to reach WMCS instances. Currently we have:

fqdn                       IPv4
contint1001.wikimedia.org  208.80.154.17
contint2001.wikimedia.org  208.80.153.15
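
These records can be double-checked from any host with dig (commands shown for illustration; the expected answers come from the table above):

    dig +short A contint1001.wikimedia.org   # 208.80.154.17
    dig +short A contint2001.wikimedia.org   # 208.80.153.15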

Given that this task replaces contint1001, its IPv4 address can be reclaimed once the migration has completed and the contint1001 host is decommissioned.

Change 860093 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] Add contint1002 to site.pp and netboot.cfg

https://gerrit.wikimedia.org/r/860093

Change 860093 merged by Papaul:

[operations/puppet@production] Add contint1002 to site.pp and netboot.cfg

https://gerrit.wikimedia.org/r/860093
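
To see what the merged change touched, one option is to grep a local checkout of operations/puppet (a sketch; the file paths are assumptions based on the checklist and the change title):

    git clone https://gerrit.wikimedia.org/r/operations/puppet
    grep -n contint1002 puppet/manifests/site.pp                                     # expect a role(insetup) node entry
    grep -n contint1002 puppet/modules/install_server/files/autoinstall/netboot.cfg  # partitioning recipe mapping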

Given that this task replaces contint1001, its IPv4 address can be reclaimed once the migration has completed and the contint1001 host is decommissioned.

Ah, so you are saying you don't need a new machine in eqiad in parallel while contint1001 still exists?

That means we can and should start decom'ing contint1001 .. now?

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host contint1002.wikimedia.org with OS buster

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host contint1002.wikimedia.org with OS buster completed:

  • contint1002 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202211231909_pt1979_401994_contint1002.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
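
A few sanity checks one might run on the freshly reimaged host (assumed commands; the expected values follow from the task details above):

    lsb_release -sc        # "buster", the requested OS distro
    cat /proc/mdstat       # the two-device RAID 1 from the partitioning plan
    sudo run-puppet-agent  # Wikimedia's wrapper script; plain `puppet agent -t` elsewhere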

Papaul updated the task description.

@LSobanski this is done

If this is done, I assume the IP address can't have stayed the same, as @hashar was asking. But given that Netbox assigns one automatically, that was probably never an option.

@Dzahn yes, the server has a public IP address.