Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet
Closed, ResolvedPublic

Description

This task will track the racking, setup, and OS installation of an-db100[12].eqiad.wmnet

Hostname / Racking / Installation Details

Hostnames: an-db100[12].eqiad.wmnet
Racking Proposal: any rows are fine, but please rack each of these 2 servers in separate rows.
Networking/Subnet/VLAN/IP: Analytics VLAN, 1G NIC
Partitioning/Raid: standard, raid1-2dev (This originally read raid10-4dev, but config C-1G is a dual-disk config.)
OS Distro: Buster
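
For readers unfamiliar with the bracket shorthand in the hostname, here is a minimal Python sketch of how an-db100[12].eqiad.wmnet expands into two explicit names. It assumes each bracket holds a set of single characters, as in this task's title; it does not implement the range syntax used by tools such as Cumin, and the function name expand_hosts is hypothetical.

```python
import re
from itertools import product


def expand_hosts(pattern: str) -> list[str]:
    """Expand each [..] character set in a host pattern into explicit names."""
    # Split into literal text and bracketed sets, e.g.
    # "an-db100[12].eqiad.wmnet" -> ["an-db100", "[12]", ".eqiad.wmnet"]
    parts = re.split(r"(\[[^\]]+\])", pattern)
    # A bracketed part contributes one choice per character; a literal part
    # contributes itself as the only choice.
    choices = [list(p[1:-1]) if p.startswith("[") else [p] for p in parts]
    return ["".join(combo) for combo in product(*choices)]


if __name__ == "__main__":
    for host in expand_hosts("an-db100[12].eqiad.wmnet"):
        print(host)
```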

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

an-db1001.eqiad.wmnet:

  • - receive in system on procurement task T286517 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

an-db1002.eqiad.wmnet:

  • - receive in system on procurement task T286517 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

Once the system(s) above have had all checkbox steps completed, this task can be resolved.

Event Timeline

RobH mentioned this in Unknown Object (Task).
RobH updated the task description.
RobH added a parent task: Unknown Object (Task).
RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
RobH updated the task description.
RobH added a subscriber: Ottomata.
RobH renamed this task from (Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet to Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet. Aug 26 2021, 7:46 PM

Awesome! Servers in the DC! I can/would work on these boxes ASAP...in case that factors into the priority for this ticket. :)

an-db1001 A6 U26 cableid1951 port 28
an-db1002 C5 U21 cableid1842 port 13

Change 724483 had a related patch set uploaded (by Cmjohnson; author: Cmjohnson):

[operations/puppet@production] Adding new servers an-db1001-2 to site.pp, dhcpd and netboot.cfg

https://gerrit.wikimedia.org/r/724483

Change 724483 merged by Cmjohnson:

[operations/puppet@production] Adding new servers an-db1001-2 to site.pp, dhcpd and netboot.cfg

https://gerrit.wikimedia.org/r/724483

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

an-db1001.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202109281805_cmjohnson_20890_an-db1001_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['an-db1001.eqiad.wmnet']

Of which those FAILED:

['an-db1001.eqiad.wmnet']

@RobH confirmed, they only have 2 disks. I'm not sure what the next step is for them.

So I'm now reviewing the entire purchase history of this request.

T286517 was filed for config C-1G, which is only 2*960GB SSDs, yet the racking details list raid10-4dev. The order is actually for just 2 disks per host (per the quote put to order on T286517, F34618960).

So there was confusion caused by contradictory information in the racking details versus the actual order. These should be imaged as dual-disk systems, raid1-2dev.
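
The mismatch described above (a raid10-4dev recipe on a host with only 2 disks) is the kind of thing a quick pre-check could catch. Below is a minimal Python sketch of such a check. It assumes the recipe name follows the pattern "<raid level>-<N>dev", where N is the number of disks the recipe expects; that reading is inferred from the two names in this thread, and the function name recipe_fits is hypothetical.

```python
import re


def recipe_fits(recipe: str, disks_present: int) -> bool:
    """Return True if the disk count in the recipe name matches the host."""
    # Assumed naming scheme: e.g. "raid1-2dev" -> level "raid1", 2 devices.
    m = re.fullmatch(r"(raid\d+)-(\d+)dev", recipe)
    if m is None:
        raise ValueError(f"unrecognised recipe name: {recipe!r}")
    expected_disks = int(m.group(2))
    return expected_disks == disks_present


if __name__ == "__main__":
    # Config C-1G ships with 2 SSDs per host, per the comment above.
    for recipe in ("raid10-4dev", "raid1-2dev"):
        print(recipe, "fits 2 disks:", recipe_fits(recipe, 2))
```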

Change 725350 had a related patch set uploaded (by RobH; author: RobH):

[operations/puppet@production] fixing partition for an-db hosts

https://gerrit.wikimedia.org/r/725350

Change 725350 merged by RobH:

[operations/puppet@production] fixing partition for an-db hosts

https://gerrit.wikimedia.org/r/725350

Cookbook cookbooks.sre.experimental.reimage was started by robh@cumin1001 for host an-db1001.eqiad.wmnet

RobH changed the task status from Open to In Progress. Oct 1 2021, 6:06 PM

Cookbook cookbooks.sre.experimental.reimage started by robh@cumin1001 for host an-db1001.eqiad.wmnet executed with errors:

  • an-db1001 (FAIL)
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

['an-db1001.eqiad.wmnet', 'an-db1002.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202110011807_robh_23280.log.

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

['an-db1001.eqiad.wmnet', 'an-db1002.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202110011840_robh_27852.log.

Completed auto-reimage of hosts:

['an-db1001.eqiad.wmnet', 'an-db1002.eqiad.wmnet']

and were ALL successful.

RobH updated the task description.

These are now ready for use!

elukey subscribed.

Hi everybody, I noticed that the two hosts are in the private VLAN, meanwhile the task's description mentions Analytics vlan:

elukey@asw2-a-eqiad> show interfaces descriptions |match an-db  
ge-6/0/28       up    up   an-db1001 {#1951}

{master:7}
elukey@asw2-a-eqiad> show ethernet-switching interface ge-6/0/28                                    
[..]

Logical          Vlan          TAG     MAC         STP         Logical           Tagging 
interface        members               limit       state       interface flags  
ge-6/0/28.0                            294912                                     untagged   
                 private1-a-eqiad 1017 294912      Forwarding                     untagged

I'll wait for @Ottomata's confirmation but I think that we need to reimage the node with correct network settings.
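
As an illustration of the check done by hand above, here is a minimal Python sketch that pulls the VLAN name and tag out of the switch output and compares the name against an expected prefix. The regular expression is fitted only to the row layout pasted in this comment, and the "analytics" prefix is an assumption about how the target VLAN is named; the names vlan_members and SAMPLE are hypothetical.

```python
import re

# Matches an indented VLAN row: name, tag, MAC limit, STP state, tagging.
VLAN_ROW = re.compile(
    r"^[ \t]+(\S+)[ \t]+(\d+)[ \t]+\d+[ \t]+\S+[ \t]+(?:un)?tagged[ \t]*$",
    re.MULTILINE,
)

SAMPLE = (
    "ge-6/0/28.0                            294912"
    "                                     untagged\n"
    "                 private1-a-eqiad 1017 294912      Forwarding"
    "                     untagged\n"
)


def vlan_members(output: str) -> list[tuple[str, int]]:
    """Return (vlan_name, tag) pairs found in the pasted switch output."""
    return [(name, int(tag)) for name, tag in VLAN_ROW.findall(output)]


if __name__ == "__main__":
    expected_prefix = "analytics"  # assumed naming of the requested VLAN
    for name, tag in vlan_members(SAMPLE):
        ok = name.startswith(expected_prefix)
        print(f"{name} (tag {tag}): {'ok' if ok else 'unexpected VLAN'}")
```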

RobH changed the task status from Open to In Progress. Edited Nov 2 2021, 5:58 PM

IRC update: chatted with @elukey, and these do indeed need to shift to the analytics VLAN.

Will have to run the decom script and then redeploy with the network script.

an-db1001
1951 asw2-a6-eqiad ge-6/0/28
10.65.1.52/16

an-db1002
1842 asw2-c5-eqiad ge-5/0/13
10.65.1.53/16

cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: an-db1001.eqiad.wmnet

  • an-db1001.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host an-db1001.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host an-db1001.eqiad.wmnet with OS buster executed with errors:

  • an-db1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: an-db1002.eqiad.wmnet

  • an-db1002.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • COMMON_STEPS (FAIL)
    • Failed to run Homer on asw2-c-eqiad.mgmt.eqiad.wmnet: Command '['/usr/local/bin/homer', 'asw2-c-eqiad.mgmt.eqiad.wmnet', 'commit', 'Host decommission - robh@cumin1001 - T289632']' returned non-zero exit status 1.

ERROR: some step on some host failed, check the bolded items above
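
Note that in the summary above the host section itself passed; only COMMON_STEPS (the Homer commit) failed. A minimal Python sketch for picking the failed sections out of a summary like this follows. It assumes the text layout exactly as rendered in this thread (a bullet, a section name, then PASS or FAIL in parentheses); the function name failed_sections is hypothetical.

```python
import re

# A section header looks like "• an-db1002.eqiad.wmnet (PASS)".
STATUS_ROW = re.compile(r"^\s*•\s+(\S+)\s+\((PASS|FAIL)\)\s*$")


def failed_sections(summary: str) -> list[str]:
    """Return the names of sections marked FAIL in a cookbook summary."""
    failed = []
    for line in summary.splitlines():
        m = STATUS_ROW.match(line)
        if m and m.group(2) == "FAIL":
            failed.append(m.group(1))
    return failed


if __name__ == "__main__":
    summary = (
        "  • an-db1002.eqiad.wmnet (PASS)\n"
        "    • Powered off\n"
        "  • COMMON_STEPS (FAIL)\n"
        "    • Failed to run Homer on asw2-c-eqiad.mgmt.eqiad.wmnet\n"
    )
    print(failed_sections(summary))
```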

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host an-db1001.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host an-db1001.eqiad.wmnet with OS buster completed:

  • an-db1001 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202111022028_robh_16670_an-db1001.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host an-db1002.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host an-db1002.eqiad.wmnet with OS buster executed with errors:

  • an-db1002 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host an-db1002.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host an-db1002.eqiad.wmnet with OS buster completed:

  • an-db1002 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202111022150_robh_30319_an-db1002.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

Both hosts are now reimaged into the analytics VLAN as originally requested. Sorry for the confusion!