
Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3]
Closed, Resolved (Public)

Description

This task will track the racking, setup, and OS installation of ml-cache100[1-3]

Hostname / Racking / Installation Details

Hostnames: ml-cache100[1-3]
Racking Proposal: Spread across rows as much as possible, with bonus points if they share nothing (row, rack) with the ml-serve100* nodes. If Rows E/F are available, use them to accomplish this.
Networking/Subnet/VLAN/IP: Single 10G private1 vlan connection.
Partitioning/Raid: default raid1 2device
OS Distro: Bullseye (default unless otherwise specified)
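
The hostname shorthand ml-cache100[1-3] used throughout this task stands for three hosts. As a minimal sketch, assuming only a single numeric range in square brackets (the function name expand_hosts is made up for this example and is not part of any tooling mentioned here), it can be expanded like this:

```python
import re


def expand_hosts(pattern: str) -> list[str]:
    """Expand one bracketed numeric range, e.g. 'ml-cache100[1-3]'."""
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", pattern)
    if match is None:
        # No range in the pattern: treat it as a single hostname.
        return [pattern]
    prefix, start, end, suffix = match.groups()
    width = len(start)  # keep any zero padding used in the range bounds
    return [
        f"{prefix}{number:0{width}d}{suffix}"
        for number in range(int(start), int(end) + 1)
    ]


print(expand_hosts("ml-cache100[1-3]"))
# ['ml-cache1001', 'ml-cache1002', 'ml-cache1003']
```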

Per host setup checklist

ml-cache1001
  • - receive in system on procurement task T297638 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
ml-cache1002
  • - receive in system on procurement task T297638 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
ml-cache1003
  • - receive in system on procurement task T297638 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.

Related Objects

Status    Subtype  Assigned   Task
Resolved           Cmjohnson

Event Timeline

RobH created this task.
RobH mentioned this in Unknown Object (Task).
RobH added a parent task: Unknown Object (Task).
RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
RobH edited subscribers, added: calbon, elukey; removed: RobH.

ml-cache1001 E1 U23
ml-cache1002 E2 U23
ml-cache1003 F1 U23

name          rack_name  port  cableid

ml-cache1001  E1         23    20220147
ml-cache1002  E2         23    20220137
ml-cache1003  F1         23    20220125

Change 773558 had a related patch set uploaded (by Cmjohnson; author: Cmjohnson):

[operations/puppet@production] Adding ml-cache1001-3 to site.pp

https://gerrit.wikimedia.org/r/773558

Change 773558 merged by Cmjohnson:

[operations/puppet@production] Adding ml-cache1001-3 to site.pp

https://gerrit.wikimedia.org/r/773558

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ml-cache1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ml-cache1001.eqiad.wmnet with OS bullseye completed:

  • ml-cache1001 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202203241539_cmjohnson_3126180_ml-cache1001.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ml-cache1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ml-cache1003.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ml-cache1002.eqiad.wmnet with OS bullseye completed:

  • ml-cache1002 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202203241608_cmjohnson_3131163_ml-cache1002.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ml-cache1003.eqiad.wmnet with OS bullseye completed:

  • ml-cache1003 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202203241618_cmjohnson_3132050_ml-cache1003.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged
Cmjohnson updated the task description.

Completed

Hi Chris! I noticed that we have two nodes in the same row. Would it be possible to move one elsewhere? We are going to host a Cassandra cluster on these nodes, and having 2/3 of the cluster unavailable after a single row failure is a big risk. I can help with reimages etc. if needed, sorry for the trouble!
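
The concern above can be checked mechanically from the rack positions recorded earlier in this task (E1, E2, F1). The following is only a sketch: it assumes the first character of the rack name is the row letter, and the function names are invented for this example.

```python
from collections import defaultdict

# Rack positions as recorded in this task before any move.
RACKS = {"ml-cache1001": "E1", "ml-cache1002": "E2", "ml-cache1003": "F1"}


def hosts_by_row(racks: dict[str, str]) -> dict[str, list[str]]:
    """Group hosts by row, assuming the rack name starts with the row letter."""
    rows = defaultdict(list)
    for host, rack in sorted(racks.items()):
        rows[rack[0]].append(host)
    return dict(rows)


def shared_rows(racks: dict[str, str]) -> dict[str, list[str]]:
    """Return only the rows that hold more than one host."""
    return {
        row: hosts
        for row, hosts in hosts_by_row(racks).items()
        if len(hosts) > 1
    }


print(shared_rows(RACKS))
# {'E': ['ml-cache1001', 'ml-cache1002']}
```

An empty result would mean every host sits in its own row, which is the layout a three-node cluster needs to survive a single row failure with two nodes still up.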

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ml-cache1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ml-cache1002.eqiad.wmnet with OS bullseye executed with errors:

  • ml-cache1002 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ml-cache1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ml-cache1002.eqiad.wmnet with OS bullseye executed with errors:

  • ml-cache1002 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

@elukey Since moving the server, I cannot get it to install the OS correctly. Can you please take a look? Thanks

I tried to PXE boot and it didn't work, so I checked netbox and the interface listed looks weird: xe-4/0/010 https://netbox.wikimedia.org/dcim/interfaces/24986/

elukey@asw2-c-eqiad> show interfaces descriptions | match ml-cache 
xe-4/0/10       down  down ml-cache1002 {#4778}

So I have deleted the /10 interface in netbox, and renamed the /010 to /10. Now homer offers me this diff:

Changes for 1 devices: ['asw2-c-eqiad.mgmt.eqiad.wmnet']

[edit interfaces interface-range disabled]
-    member xe-4/0/10;
[edit interfaces interface-range vlan-private1-c-eqiad]
     member xe-4/0/7 { ... }
+    member xe-4/0/10;
     member xe-4/0/11 { ... }
[edit interfaces interface-range vlan-private1-c-eqiad]
-    member xe-4/0/010;

Not entirely sure if the following is removed though:

elukey@asw2-c-eqiad> show interfaces descriptions xe-4/0/010  
Interface       Admin Link Description
xe-4/0/010                 ml-cache1002 {#4778}
elukey@asw2-c-eqiad> show interfaces descriptions xe-4/0/10  
Interface       Admin Link Description
xe-4/0/10       up    up   ml-cache1002 {#4778}

Way better now!
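
The xe-4/0/010 entry above differed from the real port name only by a zero-padded last component. A small check along these lines could flag such names before they are committed. This is a sketch under assumptions: the regular expression covers only names shaped like type-fpc/pic/port, and treating a leading zero as a typo is this example's choice (it matches what was concluded above), not a rule taken from Netbox or homer.

```python
import re

# Matches names shaped like "xe-4/0/10"; other interface styles are rejected.
IFACE_RE = re.compile(
    r"^(?P<type>[a-z]+)-(?P<fpc>\d+)/(?P<pic>\d+)/(?P<port>\d+)$"
)


def canonical_interface(name: str) -> str:
    """Return the name with leading zeros stripped from each numeric part."""
    match = IFACE_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognised interface name: {name!r}")
    numbers = "/".join(
        str(int(match[group])) for group in ("fpc", "pic", "port")
    )
    return f"{match['type']}-{numbers}"


def is_suspicious(name: str) -> bool:
    """True when the name is not already in its canonical form."""
    return canonical_interface(name) != name


print(canonical_interface("xe-4/0/010"))  # xe-4/0/10
print(is_suspicious("xe-4/0/010"))        # True
print(is_suspicious("xe-4/0/10"))         # False
```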

@Cmjohnson I tried to reimage the node but I got spicerack.dhcp.DHCPError: target file ttyS1-115200/ml-cache1002.conf exists. I think you already have a cookbook open for ml-cache1002, so if you can close it we can complete the task :)
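
For readers unfamiliar with this kind of error, here is one way a "target file exists" failure can arise when two runs target the same host. This is not spicerack's code: the class and function names are hypothetical, and the mechanism (a per-host file created exclusively and removed when the run ends) is an assumption made for illustration.

```python
import os
import tempfile


class SnippetExistsError(Exception):
    """Hypothetical error raised when a per-host file is already present."""


def write_snippet(directory: str, filename: str, content: str) -> str:
    """Create the per-host file, refusing to overwrite an existing one."""
    path = os.path.join(directory, filename)
    try:
        # Mode "x" is exclusive creation: open() raises FileExistsError
        # if the path already exists.
        with open(path, "x") as handle:
            handle.write(content)
    except FileExistsError as exc:
        raise SnippetExistsError(f"target file {filename} exists") from exc
    return path


def remove_snippet(path: str) -> None:
    """Clean up the per-host file so that a later run can create it again."""
    os.remove(path)


with tempfile.TemporaryDirectory() as workdir:
    first = write_snippet(workdir, "ml-cache1002.conf", "# placeholder\n")
    try:
        # A second run for the same host while the first is still open.
        write_snippet(workdir, "ml-cache1002.conf", "# placeholder\n")
    except SnippetExistsError as err:
        print(f"second run refused: {err}")
    remove_snippet(first)  # the first run finishes and cleans up
    write_snippet(workdir, "ml-cache1002.conf", "# placeholder\n")
```

Under that assumption, closing the run that is still open removes the file and lets the next attempt proceed, which is consistent with the request above.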

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ml-cache1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ml-cache1002.eqiad.wmnet with OS bullseye executed with errors:

  • ml-cache1002 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details
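
The cookbook summaries in this task all start with a line such as "• ml-cache1002 (FAIL)". As a rough sketch, assuming that line shape stays as pasted here (the function name and the regular expression are made up for this example), the per-host history can be collected like this:

```python
import re

# Matches the per-host result line, e.g. "  • ml-cache1002 (FAIL)".
RESULT_RE = re.compile(
    r"^\s*•\s+(?P<host>\S+)\s+\((?P<status>PASS|FAIL)\)\s*$"
)


def summarize(lines: list[str]) -> dict[str, list[str]]:
    """Collect PASS/FAIL results per host, in the order they appear."""
    results: dict[str, list[str]] = {}
    for line in lines:
        match = RESULT_RE.match(line)
        if match:
            results.setdefault(match["host"], []).append(match["status"])
    return results


sample = [
    "  • ml-cache1001 (PASS)",
    "    • Removed from Puppet and PuppetDB if present",
    "  • ml-cache1002 (PASS)",
    "  • ml-cache1002 (FAIL)",
    "    • The reimage failed, see the cookbook logs for the details",
]
print(summarize(sample))
# {'ml-cache1001': ['PASS'], 'ml-cache1002': ['PASS', 'FAIL']}
```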

Host reimaged correctly, all done!