
(Need By: TBD) rack/setup/install an-worker11[02-17]
Closed, ResolvedPublic

Description

This task will track the racking, setup, and OS installation of an-worker11[02-17]

Hostname / Racking / Installation Details

Hostnames: an-worker11[02-17] (Please note these were originally listed as an-worker1096 - an-worker1111, but that range was partially used on racking task T254892, so the an-worker range was simply incremented to the next available hostnames.)
Racking Proposal: They should be racked spread as evenly as possible across rows. The current row allocation of Hadoop worker nodes is described here.
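
The "spread as evenly as possible" requirement can be illustrated with a small round-robin sketch. The row names and starting allocation below are assumptions for illustration only, not the actual racking plan:

```python
# Hypothetical sketch: spread the 16 new workers round-robin across four rows.
# Row names "A"-"D" and an empty starting allocation are assumed, not real data.
from itertools import cycle

hosts = [f"an-worker{n}" for n in range(1102, 1118)]  # an-worker1102..1117
rows = ["A", "B", "C", "D"]

allocation = {row: [] for row in rows}
for host, row in zip(hosts, cycle(rows)):
    allocation[row].append(host)

for row, assigned in allocation.items():
    print(row, assigned)
```

With 16 hosts and 4 rows this lands exactly 4 hosts per row; in practice the existing per-row counts would be factored in before assigning.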

Networking/Subnet/VLAN/IP:

  • These nodes use 10G NICs and as such require 10G switch ports.
  • These belong in the Analytics VLAN.

Partitioning/RAID:
Please use partman/analytics-flex.cfg.

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

an-worker1102:

  • - receive in system on procurement task T246784 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

an-worker1103:

  • - receive in system on procurement task T246784 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

an-worker1104:

  • - receive in system on procurement task T246784 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

an-worker1105:

  • - receive in system on procurement task T246784 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

an-worker1106:

  • - receive in system on procurement task T246784 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

an-worker1107:

  • - receive in system on procurement task T246784 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

an-worker1108:

  • - receive in system on procurement task T246784 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

an-worker1109:

  • - receive in system on procurement task T246784 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

an-worker1110:

  • - receive in system on procurement task T246784 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

an-worker1111:

  • - receive in system on procurement task T246784 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

an-worker1112:

  • - receive in system on procurement task T246784 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

an-worker1113:

  • - receive in system on procurement task T246784 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

an-worker1114:

  • - receive in system on procurement task T246784 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

an-worker1115:

  • - receive in system on procurement task T246784 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

an-worker1116:

  • - receive in system on procurement task T246784 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

an-worker1117:

  • - receive in system on procurement task T246784 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

Once the system(s) above have had all checkbox steps completed, this task can be resolved.

Event Timeline


Completed auto-reimage of hosts:

['an-worker1104.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['an-worker1115.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['an-worker1116.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['an-worker1113.eqiad.wmnet']

Of which those FAILED:

['an-worker1113.eqiad.wmnet']

Completed auto-reimage of hosts:

['an-worker1111.eqiad.wmnet']

Of which those FAILED:

['an-worker1111.eqiad.wmnet']

Completed auto-reimage of hosts:

['an-worker1114.eqiad.wmnet']

Of which those FAILED:

['an-worker1114.eqiad.wmnet']

Completed auto-reimage of hosts:

['an-worker1117.eqiad.wmnet']

Of which those FAILED:

['an-worker1117.eqiad.wmnet']

@Cmjohnson I'd need these to be on Stretch; I have updated DHCP accordingly and will try to reimage :)

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1102.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202009070947_elukey_7498.log.

Completed auto-reimage of hosts:

['an-worker1102.eqiad.wmnet']

and were ALL successful.

@Cmjohnson I think that the two SSDs in the flex bay are not configured with hardware RAID1 (like all the other Hadoop worker nodes that we have now); I see the following after the reimage:

elukey@an-worker1102:~$ sudo lsblk -f
NAME                          FSTYPE      LABEL UUID                                   MOUNTPOINT
sda
├─sda1                        ext4              5ddb20e4-be0d-42d8-80d8-4b0bff2cdff6   /boot
└─sda2                        LVM2_member       wZhrhR-tXT6-a0Rh-K5Zw-40lV-6Xi3-ytx2Xb
  ├─an--worker1102--vg-swap   swap              13f21929-ee86-49a2-bf0c-b9dcb25d0f32   [SWAP]
  ├─an--worker1102--vg-root   ext4              31caa858-03db-4351-9619-31ac5c351e49   /
  └─an--worker1102--vg-unused
sdb
sdc
sdd
sde
sdf
sdg
sdh
sdi
sdj
sdk
sdl

The sdb → sdl devices number 11 (instead of 12), and /dev/sda looks like a 4TB disk.

I went into the system config and found Physical Disk 00:01:12: SSD, SATA, 446.625GB, Ready, (512B), but in theory we have 2x220G SSDs (so in RAID 1 I'd expect 220G available). Moreover, the boot device is one of the 4TB disks.

@Cmjohnson is there a way to unblock this? I can work on all nodes with some guidance about how to make the disks in the flex bay appear as /dev/sda :)
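
A quick way to catch this kind of mismatch is to count the data disks and check the size of the boot device. The sketch below parses output in the shape of `lsblk -dn -b -o NAME,SIZE`; the sample sizes, thresholds, and the `check_layout` helper are all hypothetical, written for the expected layout on these hosts (SSD RAID1 as /dev/sda plus 12 HDDs):

```python
# Sketch: sanity-check the disk layout after imaging.
# Expectation on these hosts: /dev/sda is the SSD RAID1 (well under 1 TB)
# and there are 12 separate data HDDs. The sample output is fabricated.
sample = """\
sda 479559942144
sdb 4000787030016
sdc 4000787030016
"""

def check_layout(lsblk_output, expected_data_disks=12, ssd_max_bytes=1_000_000_000_000):
    devices = {}
    for line in lsblk_output.strip().splitlines():
        name, size = line.split()
        devices[name] = int(size)
    data_disks = [d for d in devices if d != "sda"]
    problems = []
    if devices.get("sda", 0) > ssd_max_bytes:
        problems.append("boot device sda looks like a big HDD, not the SSD array")
    if len(data_disks) != expected_data_disks:
        problems.append(f"found {len(data_disks)} data disks, expected {expected_data_disks}")
    return problems

print(check_layout(sample))
```

On the actual an-worker1102 output above, such a check would flag both issues at once: sda the size of a 4TB HDD, and 11 data disks instead of 12.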

Here is what the controller is showing:

[Attachment: Screen Shot 2020-09-11 at 11.10.29 AM.png, 349 KB]

elukey changed the task status from Open to Stalled. Sep 11 2020, 4:31 PM

Pending T262690

Cmjohnson closed subtask Restricted Task as Resolved.
Cmjohnson added a subscriber: RobH.

@RobH the new SSDs have been installed in these servers; I appreciate you fixing the RAID and doing the installs.

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

an-worker1102.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202009281649_robh_11370_an-worker1102_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['an-worker1102.eqiad.wmnet']

Of which those FAILED:

['an-worker1102.eqiad.wmnet']

Ok, I set up an-worker1102 with RAID1 on the two SSDs, and each HDD as its own RAID0.

Now it gets an "LVM label in use" error during reimaging/partitioning, which shouldn't happen, so it fails.

This is stalled until someone can figure it out, or I can circle back to it later this week; I've been pulled off this reimage for procurement items for the next couple of days.

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

an-worker1102.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202009281803_robh_24912_an-worker1102_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['an-worker1102.eqiad.wmnet']

and were ALL successful.

RobH updated the task description.

Ok, updates:

  • an-worker1102 is now staged and ready for service owners to take it over.
  • I am working through the other hosts, rebuilding all of the raid arrays properly, and then will attempt to batch reimage the rest.

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

['an-worker1103.eqiad.wmnet', 'an-worker1104.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202009281901_robh_4446.log.

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

an-worker1105.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202009281921_robh_11186_an-worker1105_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['an-worker1103.eqiad.wmnet']

Of which those FAILED:

['an-worker1104.eqiad.wmnet']

Completed auto-reimage of hosts:

['an-worker1105.eqiad.wmnet']

Of which those FAILED:

['an-worker1105.eqiad.wmnet']

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

['an-worker1105.eqiad.wmnet', 'an-worker1106.eqiad.wmnet', 'an-worker1107.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202009282033_robh_24408.log.

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

an-worker1105.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202009282039_robh_26679_an-worker1105_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['an-worker1106.eqiad.wmnet', 'an-worker1107.eqiad.wmnet']

Of which those FAILED:

['an-worker1105.eqiad.wmnet']

Completed auto-reimage of hosts:

['an-worker1105.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

['an-worker1108.eqiad.wmnet', 'an-worker1109.eqiad.wmnet', 'an-worker1110.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202009282109_robh_4742.log.

Completed auto-reimage of hosts:

['an-worker1109.eqiad.wmnet', 'an-worker1108.eqiad.wmnet']

Of which those FAILED:

['an-worker1110.eqiad.wmnet']

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

['an-worker1111.eqiad.wmnet', 'an-worker1112.eqiad.wmnet', 'an-worker1113.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202009282146_robh_14691.log.

Completed auto-reimage of hosts:

['an-worker1112.eqiad.wmnet']

Of which those FAILED:

['an-worker1111.eqiad.wmnet', 'an-worker1113.eqiad.wmnet']

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

an-worker1111.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202009282256_robh_27287_an-worker1111_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

an-worker1113.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202009282303_robh_28131_an-worker1113_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['an-worker1111.eqiad.wmnet']

Of which those FAILED:

['an-worker1111.eqiad.wmnet']

Completed auto-reimage of hosts:

['an-worker1113.eqiad.wmnet']

Of which those FAILED:

['an-worker1113.eqiad.wmnet']

@RobH I used the related Spicerack cookbook to init an-worker1102 (install all partitions with proper labels, etc.), and as far as I can see it is now in a good state. Not sure why the other reimages failed, but the config on 1102 is good!

Executed the cookbook up through an-worker1110 as well; all looks good!

elukey changed the task status from Stalled to Open.Sep 30 2020, 6:24 AM

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

an-worker1111.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202009302110_robh_19044_an-worker1111_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['an-worker1111.eqiad.wmnet']

Of which those FAILED:

['an-worker1111.eqiad.wmnet']

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

['an-worker1114.eqiad.wmnet', 'an-worker1115.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202009302156_robh_26664.log.

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

an-worker1114.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202009302204_robh_27774_an-worker1114_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['an-worker1114.eqiad.wmnet']

Of which those FAILED:

['an-worker1114.eqiad.wmnet']

Completed auto-reimage of hosts:

['an-worker1115.eqiad.wmnet']

Of which those FAILED:

['an-worker1114.eqiad.wmnet']

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

an-worker1116.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202009302237_robh_4732_an-worker1116_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

an-worker1117.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202009302237_robh_4802_an-worker1117_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['an-worker1117.eqiad.wmnet']

Of which those FAILED:

['an-worker1117.eqiad.wmnet']

Completed auto-reimage of hosts:

['an-worker1116.eqiad.wmnet']

and were ALL successful.

Bootstrapped an-worker1112/1115/1116; disks/partitions look good!

Two nodes are marked as failing to reach the DHCP server (please check cable/port): an-worker1111 and an-worker1114. @Cmjohnson, can you check when you have a moment?

@ayounsi can you please add the Analytics VLAN to cloudsw-d5, and the server below to cloudsw-c8, and put these two servers in that VLAN?

Those racks (and switches) are dedicated to WMCS.
No new non-WMCS servers should be racked there, and old servers should be phased out as they're renewed.
Please use racks 2/4/7 for our prod infra.

@Cmjohnson an-worker1111 seems to be in the wrong rack or connected to the wrong TOR: cloudsw1-c8-eqiad.mgmt.eqiad.wmnet https://librenms.wikimedia.org/device/device=184/tab=port/port=20239/
Same thing for an-worker1113: cloudsw1-d5-eqiad.mgmt.eqiad.wmnet
https://librenms.wikimedia.org/device/device=185/tab=port/port=20245/

and an-worker1114: cloudsw1-d5-eqiad.mgmt.eqiad.wmnet https://librenms.wikimedia.org/device/device=185/tab=port/port=20247/

elukey added a subscriber: ayounsi.
elukey removed a subscriber: ayounsi.

an-worker1117 is fixed; it was preferring to PXE boot rather than booting from disk, so it looped endlessly.

Physically moved an-worker1111 from C8 to C2; updated the network switch and Netbox. VLAN and IP stay the same.
Physically moved an-worker1113/1114 from D5 to D4; updated the network switch and Netbox. VLANs and IPs stay the same.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1114.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202010060708_elukey_29602.log.

Completed auto-reimage of hosts:

['an-worker1114.eqiad.wmnet']

Of which those FAILED:

['an-worker1114.eqiad.wmnet']

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1114.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202010060724_elukey_31788.log.

Change 632431 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Set an-worker111[13] as Hadoop workers

https://gerrit.wikimedia.org/r/632431

Change 632431 merged by Elukey:
[operations/puppet@production] Set an-worker111[13] as Hadoop workers

https://gerrit.wikimedia.org/r/632431

Completed auto-reimage of hosts:

['an-worker1114.eqiad.wmnet']

Of which those FAILED:

['an-worker1114.eqiad.wmnet']

an-worker1114's reimage fails with:

07:35:31 | an-worker1114.eqiad.wmnet | Unable to run wmf-auto-reimage-host: Unable to find certificate fingerprint in:
sh: puppet: not found
07:35:31 | an-worker1114.eqiad.wmnet | REIMAGE END | retcode=2

@elukey if I try to ssh with the install console key I get a BusyBox... I guess that's the reason.
Basically the reimage script polls the host waiting for the reboot after d-i, and expects to be able to ssh with the install console key. Apparently the host got rebooted into a BusyBox that has /proc/uptime, so the reimage thought it had rebooted into the new OS and tried to run puppet inside the BusyBox.

It also appears that the host is in some sort of reboot loop: I was able to connect, then failed to do so for a couple of minutes, then was able again.

BusyBox v1.22.1 (Debian 1:1.22.0-19+b3) built-in shell (ash)
Enter 'help' for a list of built-in commands.

~ # cat /proc/uptime
92.03 6472.18
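
The heuristic described above, and how the BusyBox shell fools it, can be sketched roughly as follows. The `rebooted` helper and the sample readings are hypothetical; only the 92.03-second BusyBox uptime comes from the console output above:

```python
# Hypothetical sketch of the reboot-detection heuristic: poll /proc/uptime
# and treat a drop in uptime as "the host rebooted into the new OS".
# A BusyBox rescue/installer shell also exposes /proc/uptime, so it can
# produce a false positive, as apparently happened on an-worker1114.
def rebooted(previous_uptime, current_uptime):
    """True if uptime went backwards, i.e. the machine restarted."""
    return current_uptime < previous_uptime

# Uptime (seconds) on successive polls; the last reading mimics the
# freshly booted BusyBox shell (~92 s), which looks like a reboot.
samples = [6400.0, 6472.18, 92.03]
events = [rebooted(a, b) for a, b in zip(samples, samples[1:])]
print(events)  # the final drop is indistinguishable from a real reboot
```

This is why the script then tried to run puppet: from uptime alone it cannot tell the new OS apart from a BusyBox environment that boots with a fresh uptime counter.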

The boot sequence was NIC then HD (as happened for 1117); just fixed it, thanks for the suggestion :)

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1114.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202010060813_elukey_9219.log.

Completed auto-reimage of hosts:

['an-worker1114.eqiad.wmnet']

and were ALL successful.

Change 632442 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Set an-worker1114 as Hadoop worker node

https://gerrit.wikimedia.org/r/632442

Change 632442 merged by Elukey:
[operations/puppet@production] Set an-worker1114 as Hadoop worker node

https://gerrit.wikimedia.org/r/632442

elukey updated the task description.

All nodes are in Hadoop now; looks good!