T289632 Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet

	Subject	Repo	Branch	Lines +/-
	fixing partition for an-db hosts	operations/puppet	production	+1 -1
	Adding new servers an-db1001-2 to site.pp, dhcpd and netboot.cfg	operations/puppet	production	+15 -0

RobH created this task. Aug 24 2021, 10:31 PM

RobH mentioned this in Unknown Object (Task).

RobH updated the task description.

RobH added a parent task: Unknown Object (Task).

RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.

RobH assigned this task to Jclark-ctr. Aug 24 2021, 10:33 PM

RobH updated the task description.

RobH added a subscriber: Ottomata.

Maintenance_bot added a project: SRE. Aug 24 2021, 10:45 PM

RobH unsubscribed. Aug 24 2021, 11:02 PM

RhinosF1 subscribed. Aug 25 2021, 5:55 AM

odimitrijevic edited projects, added Data-Engineering, Analytics-Clusters; removed Analytics. Aug 26 2021, 4:43 PM

RobH renamed this task from (Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet to Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet. Aug 26 2021, 7:46 PM

Awesome! Servers in the DC! I can/would work on these boxes ASAP...in case that factors into the priority for this ticket. :)

Jclark-ctr updated the task description. Sep 21 2021, 9:01 PM

an-db1001 A6 U26 cableid1951 port 28
an-db1002 C5 U21 cableid1842 port 13

Jclark-ctr reassigned this task from Jclark-ctr to Cmjohnson. Sep 22 2021, 10:11 PM

Jclark-ctr updated the task description.

Jclark-ctr subscribed.

updated dns and network

BIOS and iDrac setup

all firmware updated

Change 724483 had a related patch set uploaded (by Cmjohnson; author: Cmjohnson):

[operations/puppet@production] Adding new servers an-db1001-2 to site.pp, dhcpd and netboot.cfg

https://gerrit.wikimedia.org/r/724483

Change 724483 merged by Cmjohnson:

[operations/puppet@production] Adding new servers an-db1001-2 to site.pp, dhcpd and netboot.cfg

https://gerrit.wikimedia.org/r/724483

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

an-db1001.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202109281805_cmjohnson_20890_an-db1001_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['an-db1001.eqiad.wmnet']

Of which those FAILED:

['an-db1001.eqiad.wmnet']

Maintenance_bot removed a project: Patch-For-Review. Sep 28 2021, 6:11 PM

@RobH confirmed, they only have 2 disks. I'm not sure what the next step is for them.

So I'm now reviewing the entire purchase history of this request.

T286517 was filed for config C-1G, which is only 2*960GB SSDs, yet the racking details list raid10-4dev. The order is actually for just 2 disks per host (per the quote put to order on T286517, F34618960).

So there was confusion caused by contradictory information in the racking details versus the actual order. These should be imaged as dual-disk systems, raid1-2dev.
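The mismatch described above (a 4-device recipe requested for hosts that shipped with 2 disks) is the kind of thing a small pre-flight check could catch. The following is only a sketch: the recipe names `raid1-2dev` and `raid10-4dev` come from this task, but the mapping table and the helper names `pick_recipe` and `check_recipe` are hypothetical and are not part of the actual Puppet or reimage tooling.

```python
# Hypothetical mapping from physical disk count to partitioning recipe.
# Only the two recipe names are taken from the task; the table is an assumption.
RECIPE_BY_DISK_COUNT = {
    2: "raid1-2dev",
    4: "raid10-4dev",
}


def pick_recipe(disk_count: int) -> str:
    """Return the recipe name that matches the number of disks in the host."""
    try:
        return RECIPE_BY_DISK_COUNT[disk_count]
    except KeyError:
        raise ValueError(f"no recipe known for {disk_count} disks") from None


def check_recipe(configured: str, disk_count: int) -> bool:
    """True if the configured recipe is the one expected for this disk count."""
    return configured == pick_recipe(disk_count)


if __name__ == "__main__":
    # The an-db hosts have 2 disks, so the racking details' recipe does not fit.
    print(check_recipe("raid10-4dev", 2))
    print(pick_recipe(2))
```

Run against the details in this task, the check flags `raid10-4dev` as wrong for a 2-disk host and suggests `raid1-2dev`, which is what the follow-up patch switched to.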

thanks! @RobH

RobH updated the task description. Oct 1 2021, 5:49 PM

Change 725350 had a related patch set uploaded (by RobH; author: RobH):

[operations/puppet@production] fixing partition for an-db hosts

https://gerrit.wikimedia.org/r/725350

gerritbot added a project: Patch-For-Review. Oct 1 2021, 5:51 PM

Change 725350 merged by RobH:

[operations/puppet@production] fixing partition for an-db hosts

https://gerrit.wikimedia.org/r/725350

Cookbook cookbooks.sre.experimental.reimage was started by robh@cumin1001 for host an-db1001.eqiad.wmnet

RobH changed the task status from Open to In Progress. Oct 1 2021, 6:06 PM

Cookbook cookbooks.sre.experimental.reimage started by robh@cumin1001 for host an-db1001.eqiad.wmnet executed with errors:

an-db1001 (FAIL)
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

['an-db1001.eqiad.wmnet', 'an-db1002.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202110011807_robh_23280.log.

RobH removed a project: Patch-For-Review. Oct 1 2021, 6:09 PM

RobH updated the task description.

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

['an-db1001.eqiad.wmnet', 'an-db1002.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202110011840_robh_27852.log.

Completed auto-reimage of hosts:

['an-db1001.eqiad.wmnet', 'an-db1002.eqiad.wmnet']

and were ALL successful.

These are now ready for use!

Thank you!!!

Hi everybody, I noticed that the two hosts are in the private VLAN, meanwhile the task's description mentions Analytics vlan:

elukey@asw2-a-eqiad> show interfaces descriptions |match an-db  
ge-6/0/28       up    up   an-db1001 {#1951}

{master:7}
elukey@asw2-a-eqiad> show ethernet-switching interface ge-6/0/28                                    
[..]

Logical          Vlan          TAG     MAC         STP         Logical           Tagging 
interface        members               limit       state       interface flags  
ge-6/0/28.0                            294912                                     untagged   
                 private1-a-eqiad 1017 294912      Forwarding                     untagged

I'll wait for @Ottomata's confirmation but I think that we need to reimage the node with correct network settings.
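The manual check above (reading the VLAN membership off the switch output) could be scripted. This is a minimal sketch, assuming the output has the shape pasted in this comment; the `SAMPLE` text is copied from it, while the function names and the `"analytics"` prefix used for the expected VLAN are assumptions, since the actual Analytics VLAN name does not appear in this task.

```python
import re

# Output shape copied from the comment above (header plus one untagged member).
SAMPLE = """\
Logical          Vlan          TAG     MAC         STP         Logical           Tagging
interface        members               limit       state       interface flags
ge-6/0/28.0                            294912                                     untagged
                 private1-a-eqiad 1017 294912      Forwarding                     untagged
"""

# VLAN member rows are indented and start with the VLAN name followed by its tag.
VLAN_ROW = re.compile(r"^\s+(?P<name>[A-Za-z][\w-]*)\s+(?P<tag>\d+)\s")


def vlan_members(output: str) -> list[tuple[str, int]]:
    """Extract (vlan_name, tag) pairs from the pasted switch output."""
    members = []
    for line in output.splitlines():
        match = VLAN_ROW.match(line)
        if match:
            members.append((match.group("name"), int(match.group("tag"))))
    return members


def in_expected_vlan(output: str, expected_prefix: str) -> bool:
    """True if any VLAN on the port starts with the expected (assumed) prefix."""
    return any(name.startswith(expected_prefix) for name, _ in vlan_members(output))


if __name__ == "__main__":
    print(vlan_members(SAMPLE))
    print(in_expected_vlan(SAMPLE, "analytics"))
```

For the output shown here the port is a member of `private1-a-eqiad` only, so a check for an analytics-prefixed VLAN fails, which matches the observation that the hosts landed in the wrong VLAN.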

IRC update: chatted with @elukey, and these do indeed need to shift to the analytics VLAN.

Will have to run the decom script and then redeploy with the network script.

an-db1001
1951 asw2-a6-eqiad ge-6/0/28
10.65.1.52/16

an-db1002
1842 asw2-c5-eqiad ge-5/0/13
10.65.1.53/16

cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: an-db1001.eqiad.wmnet

an-db1001.eqiad.wmnet (PASS)
- Downtimed host on Icinga
- Found physical host
- Downtimed management interface on Icinga
- Wiped all swraid, partition-table and filesystem signatures
- Powered off
- Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host an-db1001.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host an-db1001.eqiad.wmnet with OS buster executed with errors:

an-db1001 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- The reimage failed, see the cookbook logs for the details

cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: an-db1002.eqiad.wmnet

an-db1002.eqiad.wmnet (PASS)
- Downtimed host on Icinga
- Found physical host
- Downtimed management interface on Icinga
- Wiped all swraid, partition-table and filesystem signatures
- Powered off
- Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB

COMMON_STEPS (FAIL)
- Failed to run Homer on asw2-c-eqiad.mgmt.eqiad.wmnet: Command '['/usr/local/bin/homer', 'asw2-c-eqiad.mgmt.eqiad.wmnet', 'commit', 'Host decommission - robh@cumin1001 - T289632']' returned non-zero exit status 1.

ERROR: some step on some host failed, check the bolded items above

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host an-db1001.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host an-db1001.eqiad.wmnet with OS buster completed:

an-db1001 (PASS)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh buster OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202111022028_robh_16670_an-db1001.out
- Checked BIOS boot parameters are back to normal
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
- Updated Netbox status planned -> staged

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host an-db1002.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host an-db1002.eqiad.wmnet with OS buster executed with errors:

an-db1002 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host an-db1002.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host an-db1002.eqiad.wmnet with OS buster completed:

an-db1002 (PASS)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh buster OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202111022150_robh_30319_an-db1002.out
- Checked BIOS boot parameters are back to normal
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
- Updated Netbox status planned -> staged
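The cookbook result blocks in this timeline share a simple shape: a `name (PASS)` or `name (FAIL)` header followed by `- step` lines. A small parser can show how far a failed run got. This is a sketch only; the sample text is copied from the failed an-db1002 run above, and `HostResult` and `parse_results` are hypothetical names, not part of Spicerack or the cookbooks.

```python
import re
from dataclasses import dataclass, field

# Header lines look like "an-db1002 (FAIL)" or "COMMON_STEPS (FAIL)".
HEADER = re.compile(r"^(?P<host>\S+) \((?P<status>PASS|FAIL)\)$")


@dataclass
class HostResult:
    host: str
    status: str
    steps: list[str] = field(default_factory=list)


def parse_results(text: str) -> list[HostResult]:
    """Group '- step' lines under the preceding 'host (STATUS)' header."""
    results: list[HostResult] = []
    for raw in text.splitlines():
        line = raw.strip()
        header = HEADER.match(line)
        if header:
            results.append(HostResult(header["host"], header["status"]))
        elif line.startswith("- ") and results:
            results[-1].steps.append(line[2:])
    return results


# Copied from the failed an-db1002 reimage in this task.
SAMPLE = """\
an-db1002 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- The reimage failed, see the cookbook logs for the details
"""

if __name__ == "__main__":
    for result in parse_results(SAMPLE):
        # The last step is the failure notice; the one before it is the last success.
        print(result.host, result.status, "last completed step:", result.steps[-2])
```

Applied to the failed runs in this task, the last completed step is "Host up (Debian installer)", so the failures happened after the installer booted rather than at the PXE or IPMI stage.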

In T289632#7474421, @elukey wrote:

Hi everybody, I noticed that the two hosts are in the private VLAN, meanwhile the task's description mentions Analytics vlan:

Both hosts now reimaged to analytics vlan as originally requested, sorry for the confusion!

RobH claimed this task. Nov 2 2021, 10:15 PM

RobH added a subscriber: Cmjohnson.

Thank you! Just in time! :)

Status	Assigned	Task
Resolved	BTullis	T280905 Analytics coordinator failover improvements
Resolved	BTullis	T284150 Bring an-mariadb100[12] into service
		Unknown Object (Task)
Resolved	RobH	T289632 Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet

Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet
Closed, ResolvedPublic

Description

Hostname / Racking / Installation Details

Per host setup checklist

Details

Related Objects

Event Timeline
