
(Need By: TBD) rack/setup/install an-worker11[02-17]
Closed, ResolvedPublic

Description

This task will track the racking, setup, and OS installation of an-worker11[02-17]

Hostname / Racking / Installation Details

Hostnames: an-worker11[02-17] (Please note these were originally listed as an-worker1096 - an-worker1111, but that range was partially used on racking task T254892, so the an-worker range was simply incremented to the next available hostnames.)
Racking Proposal: They should be racked spread as evenly as possible across rows. The current row allocation of Hadoop worker nodes is described here.
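
The "spread as evenly as possible" requirement can be illustrated with a small round-robin sketch. The row names and starting allocation below are assumptions for illustration only, not the actual racking plan:

```python
# Hypothetical sketch: spread the 16 new workers round-robin across four rows.
# Row names "A"-"D" and an empty starting allocation are assumed, not real data.
from itertools import cycle

hosts = [f"an-worker{n}" for n in range(1102, 1118)]  # an-worker1102..1117
rows = ["A", "B", "C", "D"]

allocation = {row: [] for row in rows}
for host, row in zip(hosts, cycle(rows)):
    allocation[row].append(host)

for row, assigned in allocation.items():
    print(row, assigned)
```

With 16 hosts and 4 rows this lands exactly 4 hosts per row; in practice the existing per-row counts would be factored in before assigning.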

Networking/Subnet/VLAN/IP:

  • These nodes use 10G NICs and as such require 10G switch ports.
  • These belong in the Analytics VLAN.

Partitioning/RAID:
Please use partman/analytics-flex.cfg.

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

an-worker1102:

  • - receive in system on procurement task T246784 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

an-worker1103:

  • - receive in system on procurement task T246784 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

an-worker1104:

  • - receive in system on procurement task T246784 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

an-worker1105:

  • - receive in system on procurement task T246784 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

an-worker1106:

  • - receive in system on procurement task T246784 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

an-worker1107:

  • - receive in system on procurement task T246784 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

an-worker1108:

  • - receive in system on procurement task T246784 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

an-worker1109:

  • - receive in system on procurement task T246784 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

an-worker1110:

  • - receive in system on procurement task T246784 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

an-worker1111:

  • - receive in system on procurement task T246784 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

an-worker1112:

  • - receive in system on procurement task T246784 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

an-worker1113:

  • - receive in system on procurement task T246784 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

an-worker1114:

  • - receive in system on procurement task T246784 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

an-worker1115:

  • - receive in system on procurement task T246784 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

an-worker1116:

  • - receive in system on procurement task T246784 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

an-worker1117:

  • - receive in system on procurement task T246784 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

Once the system(s) above have had all checkbox steps completed, this task can be resolved.

Event Timeline


Completed auto-reimage of hosts:

['an-worker1104.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['an-worker1115.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['an-worker1116.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['an-worker1113.eqiad.wmnet']

Of which those FAILED:

['an-worker1113.eqiad.wmnet']

Completed auto-reimage of hosts:

['an-worker1111.eqiad.wmnet']

Of which those FAILED:

['an-worker1111.eqiad.wmnet']

Completed auto-reimage of hosts:

['an-worker1114.eqiad.wmnet']

Of which those FAILED:

['an-worker1114.eqiad.wmnet']

Completed auto-reimage of hosts:

['an-worker1117.eqiad.wmnet']

Of which those FAILED:

['an-worker1117.eqiad.wmnet']

@Cmjohnson I'd need these to be on Stretch; I have updated DHCP accordingly and will try to reimage :)

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1102.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202009070947_elukey_7498.log.

Completed auto-reimage of hosts:

['an-worker1102.eqiad.wmnet']

and were ALL successful.

@Cmjohnson I think that the two SSDs in the flex bay are not configured with hardware RAID1 (like all the other Hadoop worker nodes that we have now); I see the following after the reimage:

elukey@an-worker1102:~$ sudo lsblk -f
NAME                          FSTYPE      LABEL UUID                                   MOUNTPOINT
sda
├─sda1                        ext4              5ddb20e4-be0d-42d8-80d8-4b0bff2cdff6   /boot
└─sda2                        LVM2_member       wZhrhR-tXT6-a0Rh-K5Zw-40lV-6Xi3-ytx2Xb
  ├─an--worker1102--vg-swap   swap              13f21929-ee86-49a2-bf0c-b9dcb25d0f32   [SWAP]
  ├─an--worker1102--vg-root   ext4              31caa858-03db-4351-9619-31ac5c351e49   /
  └─an--worker1102--vg-unused
sdb
sdc
sdd
sde
sdf
sdg
sdh
sdi
sdj
sdk
sdl

The sdb → sdl devices number 11 (instead of 12), and /dev/sda looks like a 4TB disk.

I went into the system config and found Physical Disk 00:01:12: SSD, SATA, 446.625GB, Ready, (512B), but in theory we have 2x220G SSDs (so in RAID 1 I'd expect 220G available). Moreover, the boot device is one of the 4TB disks.

@Cmjohnson is there a way to unblock this? I can work on all nodes with some guidance about how to make the disks in the flex bay appear as /dev/sda :)
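
A quick way to catch this kind of mismatch is to count the data disks and check the size of the boot device. The sketch below parses output in the shape of `lsblk -dn -b -o NAME,SIZE`; the sample sizes, thresholds, and the `check_layout` helper are all hypothetical, written for the expected layout on these hosts (SSD RAID1 as /dev/sda plus 12 HDDs):

```python
# Sketch: sanity-check the disk layout after imaging.
# Expectation on these hosts: /dev/sda is the SSD RAID1 (well under 1 TB)
# and there are 12 separate data HDDs. The sample output is fabricated.
sample = """\
sda 479559942144
sdb 4000787030016
sdc 4000787030016
"""

def check_layout(lsblk_output, expected_data_disks=12, ssd_max_bytes=1_000_000_000_000):
    devices = {}
    for line in lsblk_output.strip().splitlines():
        name, size = line.split()
        devices[name] = int(size)
    data_disks = [d for d in devices if d != "sda"]
    problems = []
    if devices.get("sda", 0) > ssd_max_bytes:
        problems.append("boot device sda looks like a big HDD, not the SSD array")
    if len(data_disks) != expected_data_disks:
        problems.append(f"found {len(data_disks)} data disks, expected {expected_data_disks}")
    return problems

print(check_layout(sample))
```

On the actual an-worker1102 output above, such a check would flag both issues at once: sda the size of a 4TB HDD, and 11 data disks instead of 12.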

Here is what the controller is showing:

[Attachment: Screen Shot 2020-09-11 at 11.10.29 AM.png, 349 KB]

elukey changed the task status from Open to Stalled. Sep 11 2020, 4:31 PM

Pending T262690

Cmjohnson closed subtask Restricted Task as Resolved.
Cmjohnson added a subscriber: RobH.

@RobH the new SSDs have been installed in these servers; I appreciate you fixing the RAID and doing the installs.

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

an-worker1102.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202009281649_robh_11370_an-worker1102_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['an-worker1102.eqiad.wmnet']

Of which those FAILED:

['an-worker1102.eqiad.wmnet']

Ok, I set up an-worker1102 with RAID1 on the two SSDs, and each HDD as its own RAID0.

Now it gets an "LVM label in use" error during reimaging/partitioning, which shouldn't happen, so it fails.

This is stalled until someone can figure it out, or I can circle back to it later this week; I've been pulled off this reimage for procurement items for the next couple of days.

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

an-worker1102.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202009281803_robh_24912_an-worker1102_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['an-worker1102.eqiad.wmnet']

and were ALL successful.

RobH updated the task description.

Ok, updates:

  • an-worker1102 is now staged and ready for service owners to take it over.
  • I am working through the other hosts, rebuilding all of the raid arrays properly, and then will attempt to batch reimage the rest.

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

['an-worker1103.eqiad.wmnet', 'an-worker1104.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202009281901_robh_4446.log.

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

an-worker1105.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202009281921_robh_11186_an-worker1105_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['an-worker1103.eqiad.wmnet']

Of which those FAILED:

['an-worker1104.eqiad.wmnet']

Completed auto-reimage of hosts:

['an-worker1105.eqiad.wmnet']

Of which those FAILED:

['an-worker1105.eqiad.wmnet']

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

['an-worker1105.eqiad.wmnet', 'an-worker1106.eqiad.wmnet', 'an-worker1107.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202009282033_robh_24408.log.

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

an-worker1105.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202009282039_robh_26679_an-worker1105_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['an-worker1106.eqiad.wmnet', 'an-worker1107.eqiad.wmnet']

Of which those FAILED:

['an-worker1105.eqiad.wmnet']

Completed auto-reimage of hosts:

['an-worker1105.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

['an-worker1108.eqiad.wmnet', 'an-worker1109.eqiad.wmnet', 'an-worker1110.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202009282109_robh_4742.log.

Completed auto-reimage of hosts:

['an-worker1109.eqiad.wmnet', 'an-worker1108.eqiad.wmnet']

Of which those FAILED:

['an-worker1110.eqiad.wmnet']

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

['an-worker1111.eqiad.wmnet', 'an-worker1112.eqiad.wmnet', 'an-worker1113.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202009282146_robh_14691.log.

Completed auto-reimage of hosts:

['an-worker1112.eqiad.wmnet']

Of which those FAILED:

['an-worker1111.eqiad.wmnet', 'an-worker1113.eqiad.wmnet']

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

an-worker1111.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202009282256_robh_27287_an-worker1111_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

an-worker1113.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202009282303_robh_28131_an-worker1113_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['an-worker1111.eqiad.wmnet']

Of which those FAILED:

['an-worker1111.eqiad.wmnet']

Completed auto-reimage of hosts:

['an-worker1113.eqiad.wmnet']

Of which those FAILED:

['an-worker1113.eqiad.wmnet']

@RobH I used the related Spicerack cookbook to init an-worker1102 (install all partitions with proper labels, etc.), and as far as I can see it is now in a good state. Not sure why the other reimages failed, but the config on 1102 is good!

Executed the cookbook up through an-worker1110 as well; all looks good!

elukey changed the task status from Stalled to Open.Sep 30 2020, 6:24 AM

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

an-worker1111.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202009302110_robh_19044_an-worker1111_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['an-worker1111.eqiad.wmnet']

Of which those FAILED:

['an-worker1111.eqiad.wmnet']

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

['an-worker1114.eqiad.wmnet', 'an-worker1115.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202009302156_robh_26664.log.

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

an-worker1114.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202009302204_robh_27774_an-worker1114_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['an-worker1114.eqiad.wmnet']

Of which those FAILED:

['an-worker1114.eqiad.wmnet']

Completed auto-reimage of hosts:

['an-worker1115.eqiad.wmnet']

Of which those FAILED:

['an-worker1114.eqiad.wmnet']

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

an-worker1116.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202009302237_robh_4732_an-worker1116_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

an-worker1117.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202009302237_robh_4802_an-worker1117_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['an-worker1117.eqiad.wmnet']

Of which those FAILED:

['an-worker1117.eqiad.wmnet']

Completed auto-reimage of hosts:

['an-worker1116.eqiad.wmnet']

and were ALL successful.

Bootstrapped an-worker1112/1115/1116; disks/partitions look good!

Two nodes are marked as failing to reach the DHCP server (please check cable/port): an-worker1111 and an-worker1114. @Cmjohnson, can you check when you have a moment?

@ayounsi can you please add the Analytics VLAN to cloudsw-d5, and the server below to cloudsw-c8, and put these two servers in that VLAN?

Those racks (and switches) are dedicated to WMCS.
No new non-WMCS servers should be racked there, and old servers should be phased out as they're renewed.
Please use racks 2/4/7 for our prod infra.

@Cmjohnson an-worker1111 seems to be in the wrong rack or connected to the wrong TOR: cloudsw1-c8-eqiad.mgmt.eqiad.wmnet https://librenms.wikimedia.org/device/device=184/tab=port/port=20239/
Same thing for an-worker1113: cloudsw1-d5-eqiad.mgmt.eqiad.wmnet
https://librenms.wikimedia.org/device/device=185/tab=port/port=20245/

and an-worker1114: cloudsw1-d5-eqiad.mgmt.eqiad.wmnet https://librenms.wikimedia.org/device/device=185/tab=port/port=20247/

elukey added a subscriber: ayounsi.
elukey removed a subscriber: ayounsi.

an-worker1117 is fixed; it was preferring to PXE boot rather than booting from disk, so it looped endlessly.

Physically moved an-worker1111 from C8 to C2; updated the network switch and Netbox. VLAN and IP stay the same.
Physically moved an-worker1113/1114 from D5 to D4; updated the network switch and Netbox. VLANs and IPs stay the same.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1114.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202010060708_elukey_29602.log.

Completed auto-reimage of hosts:

['an-worker1114.eqiad.wmnet']

Of which those FAILED:

['an-worker1114.eqiad.wmnet']

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1114.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202010060724_elukey_31788.log.

Change 632431 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Set an-worker111[13] as Hadoop workers

https://gerrit.wikimedia.org/r/632431

Change 632431 merged by Elukey:
[operations/puppet@production] Set an-worker111[13] as Hadoop workers

https://gerrit.wikimedia.org/r/632431

Completed auto-reimage of hosts:

['an-worker1114.eqiad.wmnet']

Of which those FAILED:

['an-worker1114.eqiad.wmnet']

an-worker1114's reimage fails with:

07:35:31 | an-worker1114.eqiad.wmnet | Unable to run wmf-auto-reimage-host: Unable to find certificate fingerprint in:
sh: puppet: not found
07:35:31 | an-worker1114.eqiad.wmnet | REIMAGE END | retcode=2

@elukey if I try to ssh with the install console key I get a BusyBox... I guess that's the reason.
Basically the reimage script polls the host waiting for the reboot after d-i, and expects to be able to ssh with the install console key. Apparently the host got rebooted into a BusyBox that has /proc/uptime, so the reimage thought it had rebooted into the new OS and tried to run puppet inside the BusyBox.

It also appears that the host is in some sort of reboot loop: I was able to connect, then failed to do so for a couple of minutes, then was able again.

BusyBox v1.22.1 (Debian 1:1.22.0-19+b3) built-in shell (ash)
Enter 'help' for a list of built-in commands.

~ # cat /proc/uptime
92.03 6472.18
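
The heuristic described above, and how the BusyBox shell fools it, can be sketched roughly as follows. The `rebooted` helper and the sample readings are hypothetical; only the 92.03-second BusyBox uptime comes from the console output above:

```python
# Hypothetical sketch of the reboot-detection heuristic: poll /proc/uptime
# and treat a drop in uptime as "the host rebooted into the new OS".
# A BusyBox rescue/installer shell also exposes /proc/uptime, so it can
# produce a false positive, as apparently happened on an-worker1114.
def rebooted(previous_uptime, current_uptime):
    """True if uptime went backwards, i.e. the machine restarted."""
    return current_uptime < previous_uptime

# Uptime (seconds) on successive polls; the last reading mimics the
# freshly booted BusyBox shell (~92 s), which looks like a reboot.
samples = [6400.0, 6472.18, 92.03]
events = [rebooted(a, b) for a, b in zip(samples, samples[1:])]
print(events)  # the final drop is indistinguishable from a real reboot
```

This is why the script then tried to run puppet: from uptime alone it cannot tell the new OS apart from a BusyBox environment that boots with a fresh uptime counter.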

The boot sequence was NIC then HD (as happened for 1117); just fixed it, thanks for the suggestion :)

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1114.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202010060813_elukey_9219.log.

Completed auto-reimage of hosts:

['an-worker1114.eqiad.wmnet']

and were ALL successful.

Change 632442 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Set an-worker1114 as Hadoop worker node

https://gerrit.wikimedia.org/r/632442

Change 632442 merged by Elukey:
[operations/puppet@production] Set an-worker1114 as Hadoop worker node

https://gerrit.wikimedia.org/r/632442

elukey updated the task description.

All nodes are in Hadoop now; looks good!