Q1:rack/setup/install new eqiad memcached hosts
Closed, Resolved, Public

Description

This task will track the racking, setup, and OS installation of two new memcached hosts in eqiad.

Hostname / Racking / Installation Details

Hostnames: mc-wf100{1,2}. https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions has been updated.
Racking Proposal: The two servers need to be racked in different rows from one another. This is a new service cluster, so the placement of other memcached hosts/service groups is not relevant.
Networking Setup: # of Connections: 1, Speed: 1G, Vlan: Private, AAAA records: Y, Additional IP records (Cassandra): No
Partitioning/Raid: HW Raid: Y. Partman recipe and/or desired Raid Level: Raid 1
OS Distro: Bullseye (default unless otherwise specified)
Sub-team Technical Contact: @Joe
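
The hostname entry above uses shell-style brace notation. As a minimal sketch of how that pattern expands into the two hosts, the following Python snippet may help. The helper name `expand_braces` is hypothetical, and the `eqiad.wmnet` domain is taken from the FQDNs that appear in the reimage log lines of this task.

```python
import re
from itertools import product


def expand_braces(pattern: str) -> list[str]:
    """Expand shell-style {a,b} groups (no nesting, no numeric ranges)."""
    parts = re.split(r"\{([^{}]*)\}", pattern)
    # Even indices are literal text; odd indices are comma-separated alternatives.
    choices = [p.split(",") if i % 2 else [p] for i, p in enumerate(parts)]
    return ["".join(combo) for combo in product(*choices)]


# Domain as seen in the cookbook log lines of this task.
DOMAIN = "eqiad.wmnet"

hostnames = expand_braces("mc-wf100{1,2}")
fqdns = [f"{name}.{DOMAIN}" for name in hostnames]
print(fqdns)
```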

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

mc-wf1001:
  • - receive in system on procurement task T311855 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update: this should include updates to netboot.pp and site.pp, using role(insetup), or role(insetup::nofirm) for cp systems.
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
mc-wf1002:
  • - receive in system on procurement task T311855 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update: this should include updates to netboot.pp and site.pp, using role(insetup), or role(insetup::nofirm) for cp systems.
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.

Event Timeline

RobH moved this task from Backlog to Racking Tasks on the ops-codfw board.

@Joe,

Reassigning this to you per our IRC discussion. Pending needs from you/serviceops:

  • Please populate racking details section with hostname, OS and all other fields before these hosts arrive.
    • The order task is being escalated for placement this week (before July 30th) so expect to see these arrive before mid-August.
  • Once details are added into this task, please reassign it to @Jclark-ctr

You'll notice there is also an 'implementation task' linked in, which was a serviceops ask for any servers being handed off to the team. I don't have hostname info, so it is generic like this task.

RobH renamed this task from Q1:rack/setup/install new memcached hosts to Q1:rack/setup/install new codfw memcached hosts. Jul 27 2022, 6:28 PM
RobH mentioned this in Unknown Object (Task).
RobH added a parent task: Unknown Object (Task).
RobH renamed this task from Q1:rack/setup/install new codfw memcached hosts to Q1:rack/setup/install new eqiad memcached hosts. Jul 27 2022, 6:31 PM
RobH edited projects, added ops-eqiad; removed ops-codfw.
RobH updated the task description.
RobH removed a subscriber: Papaul.
RobH added a subscriber: Jclark-ctr.
Joe removed Joe as the assignee of this task. Aug 8 2022, 6:42 AM
Joe updated the task description.
Joe added a subscriber: RobH.

@RobH all info should be filled in now.

Joe updated the task description.

mc-wf1001 B8 U25 Port 27 Cableid 3286
mc-wf1002 D8 U26 Port 30 Cableid 2013339101803
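
The racking proposal asks for the two hosts to sit in different rows. A rough sketch of checking that against the two racking lines above follows. It assumes the rack label has the form "<row letter><rack number>" (so "B8" is row B), which is an interpretation rather than something this task states, and the `Placement` type and helper names are hypothetical.

```python
import re
from typing import NamedTuple


class Placement(NamedTuple):
    host: str
    rack: str
    unit: int
    port: int
    cable_id: str


# Matches lines such as "mc-wf1001 B8 U25 Port 27 Cableid 3286".
LINE_RE = re.compile(
    r"^(?P<host>\S+)\s+(?P<rack>[A-Z]\d+)\s+U(?P<unit>\d+)"
    r"\s+Port\s+(?P<port>\d+)\s+Cableid\s+(?P<cable>\S+)$"
)


def parse_placement(line: str) -> Placement:
    m = LINE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized racking line: {line!r}")
    return Placement(m["host"], m["rack"], int(m["unit"]), int(m["port"]), m["cable"])


def row_of(p: Placement) -> str:
    # Assumption: the first character of the rack label is the row.
    return p.rack[0]


lines = [
    "mc-wf1001 B8 U25 Port 27 Cableid 3286",
    "mc-wf1002 D8 U26 Port 30 Cableid 2013339101803",
]
placements = [parse_placement(line) for line in lines]
rows = [row_of(p) for p in placements]
different_rows = len(set(rows)) == len(rows)
print(rows, different_rows)
```

Under that assumption the two hosts land in rows B and D, which would satisfy the proposal.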

@Joe which partman recipe do you need for these?

Change 835701 had a related patch set uploaded (by Cmjohnson; author: Cmjohnson):

[operations/puppet@production] adding mc-wf to site.pp

https://gerrit.wikimedia.org/r/835701

Change 835701 merged by Cmjohnson:

[operations/puppet@production] adding mc-wf to site.pp

https://gerrit.wikimedia.org/r/835701

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host mc-wf1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host mc-wf1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host mc-wf1001.eqiad.wmnet with OS bullseye completed:

  • mc-wf1001 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202209272144_cmjohnson_1095412_mc-wf1001.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host mc-wf1002.eqiad.wmnet with OS bullseye completed:

  • mc-wf1002 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202209272147_cmjohnson_1095813_mc-wf1002.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged
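
Each reimage run above records a log path whose file name appears to encode when the run started, who ran it, and the target host. The sketch below splits such a name into fields. The field meanings are assumptions read off the two paths in this task: the first field is taken to be a YYYYMMDDHHMM timestamp (time zone unknown) and the third is an opaque identifier, possibly a process ID. The function name `parse_reimage_log` is hypothetical and not part of spicerack.

```python
from datetime import datetime
from pathlib import PurePosixPath


def parse_reimage_log(path: str) -> dict:
    """Split '<stamp>_<user>_<ident>_<host>.out' into its parts."""
    stem = PurePosixPath(path).stem  # drops the directory and the ".out" suffix
    stamp, user, ident, host = stem.split("_", 3)
    return {
        # Assumption: the leading field is YYYYMMDDHHMM.
        "started": datetime.strptime(stamp, "%Y%m%d%H%M"),
        "user": user,
        # Assumption: opaque run identifier (possibly a PID).
        "ident": ident,
        "host": host,
    }


info = parse_reimage_log(
    "/var/log/spicerack/sre/hosts/reimage/202209272144_cmjohnson_1095412_mc-wf1001.out"
)
print(info["host"], info["user"], info["started"].isoformat())
```
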
Cmjohnson updated the task description.

@Joe all yours. I assumed these should use the same partman recipe as the memcache hosts.