(Due By: 2020-07-17) rack/setup/install <an-test-worker1001-1003>
Closed, Resolved, Public

Description

This task will track the racking, setup, and OS installation of <enter the FQDN/hostname of the hosts being setup here>

Hostname / Racking / Installation Details

Hostnames: What are the hostnames, and have you updated https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions ?
Racking Proposal: Where should these systems be racked? Can they share with any existing systems or should they avoid any other systems sharing their rack or row?
Networking/Subnet/VLAN/IP: What are the network details? 1G or 10G? Only one network port connection, or more? Subnet/vlan and IP requirements per connect?
Partitioning/Raid: Is this hardware or software raid and what raid levels should be applied to each disk? What are the partitioning requirements and is there an existing partman recipe?

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

<an-test-worker1001>:

  • - receive in system on procurement task T242148
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

<an-test-worker1002>:

  • - receive in system on procurement task T242148
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

<an-test-worker1003>:

  • - receive in system on procurement task T242148
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

Once the system(s) above have had all checkbox steps completed, this task can be resolved.

Event Timeline

RobH added a parent task: Unknown Object (Task). Jun 16 2020, 1:33 AM
RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
RobH added subscribers: Jclark-ctr, elukey.

@elukey,

We need to know the hostname and racking details for these 3 new hadoop testing nodes. Please provide this info and reassign this task from you to @Jclark-ctr, thanks!

@RobH we ordered 6 nodes IIRC, 3 with more storage space and 3 called "lightweight", which batch is this?

I have in mind the following naming:

  • an-test-worker100X for the hosts with more storage
  • an-test-master100X for two of the "lightweight" nodes
  • an-test-coord1001 for the remaining "lightweight" node

Already updated https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions#Servers

@Cmjohnson will be bulk uploading to netbox after leaving data center
HOST                SWITCHPORT  RACK  UNIT  ASSET TAG
an-test-worker1001  25          A3    25    WMF4833
an-test-worker1002  5           C3    33    WMF4834
an-test-worker1003  43          D3    18    WMF4835

No preference; if possible one host per row, otherwise any arrangement that fits best for you!

wiki_willy renamed this task from (Need By: TBD) rack/setup/install <hadoop testing nodes> to (Due By: 2020-07-17) rack/setup/install <hadoop testing nodes>. Jun 26 2020, 7:00 PM

+an-test-worker1001 1H IN A 10.65.0.68
+an-test-worker1002 1H IN A 10.65.0.69
+an-test-worker1003 1H IN A 10.65.0.70
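
A minimal sketch for sanity-checking the three added records before a DNS change goes up for review. It only parses the lines quoted above and checks that each one is an IN A record with a valid, unique IPv4 address; the helper names (ttl_seconds, parse_a_records) are made up for illustration and are not part of any operations/dns tooling.

```python
import ipaddress
import re

# The three added lines, copied from the diff quoted above.
DIFF_LINES = """\
+an-test-worker1001 1H IN A 10.65.0.68
+an-test-worker1002 1H IN A 10.65.0.69
+an-test-worker1003 1H IN A 10.65.0.70
"""

# Zone-file style TTL suffixes, expressed in seconds.
TTL_UNITS = {"S": 1, "M": 60, "H": 3600, "D": 86400, "W": 604800}


def ttl_seconds(ttl):
    """Convert a TTL such as '1H' or '300' to seconds."""
    match = re.fullmatch(r"(\d+)([SMHDW]?)", ttl.upper())
    if not match:
        raise ValueError(f"unrecognised TTL: {ttl!r}")
    return int(match.group(1)) * TTL_UNITS.get(match.group(2), 1)


def parse_a_records(text):
    """Return a list of (name, ttl_in_seconds, address) tuples."""
    records = []
    for line in text.splitlines():
        line = line.lstrip("+").strip()  # drop the diff marker
        if not line:
            continue
        name, ttl, rclass, rtype, rdata = line.split()
        if (rclass, rtype) != ("IN", "A"):
            raise ValueError(f"not an IN A record: {line!r}")
        address = ipaddress.IPv4Address(rdata)  # raises on a malformed address
        records.append((name, ttl_seconds(ttl), str(address)))
    return records


records = parse_a_records(DIFF_LINES)
addresses = [address for _name, _ttl, address in records]
if len(set(addresses)) != len(addresses):
    raise ValueError("duplicate IP address in the added records")
for name, ttl, address in records:
    print(name, ttl, address)
```

A check like this catches typos in an address or a reused IP, but it cannot tell whether the address belongs to the right subnet for the host; that still needs to be confirmed against netbox.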

Cmjohnson renamed this task from (Due By: 2020-07-17) rack/setup/install <hadoop testing nodes> to (Due By: 2020-07-17) rack/setup/install <an-test-worker1001-1003>. Jul 8 2020, 4:33 PM
Cmjohnson updated the task description. (Show Details)

@Cmjohnson sorry, the spaces were flipped in netbox; netbox is correct now.
an-test-worker1001 A5 30 WMF4833
an-test-worker1002 C5 34 WMF4834
an-test-worker1003 D6 32 WMF4835
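
The corrected placement can be checked against the "one host per row" preference stated earlier in this task. The sketch below assumes that the first character of a rack name (A5, C5, D6) identifies the row; that convention and the helper names are assumptions for illustration, not something confirmed in this task.

```python
# Rack and unit per host, copied from the corrected list above.
PLACEMENT = {
    "an-test-worker1001": ("A5", 30),
    "an-test-worker1002": ("C5", 34),
    "an-test-worker1003": ("D6", 32),
}


def row_of(rack):
    """Assumption: the leading letter of the rack name is the row."""
    return rack[0]


def hosts_per_row(placement):
    """Group host names by the row their rack sits in."""
    rows = {}
    for host, (rack, _unit) in placement.items():
        rows.setdefault(row_of(rack), []).append(host)
    return rows


rows = hosts_per_row(PLACEMENT)
one_per_row = all(len(hosts) == 1 for hosts in rows.values())
print(rows)
print("one host per row:", one_per_row)
```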

Change 617148 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Adding all dns for an-test-worker hosts in eqiad

https://gerrit.wikimedia.org/r/617148

Change 617148 merged by Cmjohnson:
[operations/dns@master] Adding all dns for an-test-worker hosts in eqiad

https://gerrit.wikimedia.org/r/617148

Change 619316 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Updating an-test-worker1003 mgmt ip address - wrong asset tag number

https://gerrit.wikimedia.org/r/619316

Change 619316 merged by Cmjohnson:
[operations/dns@master] Updating an-test-worker1003 mgmt ip address - wrong asset tag number

https://gerrit.wikimedia.org/r/619316

asw2-c-eqiad:ge-5/0/36 has been flapping a lot. I disabled the port; feel free to re-enable it when you're working on it.

Change 622125 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/puppet@production] Add an-test-workers dhcp file

https://gerrit.wikimedia.org/r/622125

Change 622125 merged by Cmjohnson:
[operations/puppet@production] Add an-test-workers dhcp file

https://gerrit.wikimedia.org/r/622125

@elukey I do not know what partman recipe you need for these. Once that info is updated and the ports are enabled, the servers are ready for imaging.

Change 622130 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/puppet@production] Adding an-test-workers to site.pp insetup role

https://gerrit.wikimedia.org/r/622130

Change 622130 merged by Cmjohnson:
[operations/puppet@production] Adding an-test-workers to site.pp insetup role

https://gerrit.wikimedia.org/r/622130

@Cmjohnson we can use the standard recipe for misc nodes. These should have 4 disks, so I'd say partman/standard.cfg partman/raid10-4dev.cfg?

Change 622476 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add partman recipe for an-test-worker nodes

https://gerrit.wikimedia.org/r/622476

Change 622476 merged by Elukey:
[operations/puppet@production] Add partman recipe for an-test-worker nodes

https://gerrit.wikimedia.org/r/622476

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-test-worker1002.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202008260757_elukey_23276.log.

Change 622529 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] install_server: set stretch for an-test-* hosts

https://gerrit.wikimedia.org/r/622529

Completed auto-reimage of hosts:

['an-test-worker1002.eqiad.wmnet']

and were ALL successful.

Change 622529 merged by Elukey:
[operations/puppet@production] install_server: set stretch for an-test-* hosts

https://gerrit.wikimedia.org/r/622529

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-test-worker1002.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202008260858_elukey_24809.log.

Completed auto-reimage of hosts:

['an-test-worker1002.eqiad.wmnet']

and were ALL successful.

The an-test-worker1002 host is now running Stretch (not ready for Buster yet) with the following lvs volumes:

elukey@an-test-worker1002:~$ sudo lvs
  LV   VG  Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  root vg0 -wi-ao----  74.50g
  srv  vg0 -wi-ao----  14.48t
  swap vg0 -wi-ao---- 976.00m

I think this is acceptable: we'll get roughly 3x15T of total space (the current Hadoop test cluster uses only half a terabyte). We could consider other partition schemes, but that would likely complicate maintenance, so I'd say let's reimage 1001/1003 and see how it goes.
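
For reference, a rough sketch of the arithmetic behind the "3x15T" estimate, using the lvs output quoted above. It assumes that the lowercase size suffixes printed by lvs are 1024-based units (the LVM default), and that all three hosts end up with the same layout as an-test-worker1002; the host count and helper names are illustrative.

```python
# Sample copied from the `sudo lvs` output on an-test-worker1002.
LVS_OUTPUT = """\
  LV   VG  Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  root vg0 -wi-ao----  74.50g
  srv  vg0 -wi-ao----  14.48t
  swap vg0 -wi-ao---- 976.00m
"""

# Assumption: lowercase suffixes are binary units, converted here to GiB.
UNIT_TO_GIB = {"m": 1 / 1024, "g": 1.0, "t": 1024.0}

HOSTS = 3  # an-test-worker1001-1003, assuming identical layouts


def parse_lvs(text):
    """Return {lv_name: size_in_GiB} from plain `lvs` output."""
    sizes = {}
    for line in text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        name, size = fields[0], fields[3]
        sizes[name] = float(size[:-1]) * UNIT_TO_GIB[size[-1]]
    return sizes


sizes = parse_lvs(LVS_OUTPUT)
srv_tib = sizes["srv"] / 1024
cluster_tib = srv_tib * HOSTS
print(f"srv per host: {srv_tib:.2f} TiB")
print(f"srv across {HOSTS} hosts: {cluster_tib:.2f} TiB")
```

With 14.48t of /srv per host, three hosts give about 43.4 TiB in total, slightly under the rounded 3x15T figure but far above the half terabyte the current test cluster uses.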

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-test-worker1001.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202008260958_elukey_16720.log.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-test-worker1003.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202008261012_elukey_30327.log.

Completed auto-reimage of hosts:

['an-test-worker1001.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['an-test-worker1003.eqiad.wmnet']

and were ALL successful.

Hosts reimaged, and the status on netbox is "Staged". @Cmjohnson please check if there is anything left to do, if not let's close :)

Change 622619 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/puppet@production] Adding an-test-worker1001-3 to netboot.cfg

https://gerrit.wikimedia.org/r/622619

Change 622619 abandoned by Cmjohnson:
[operations/puppet@production] Adding an-test-worker1001-3 to netboot.cfg

Reason:

https://gerrit.wikimedia.org/r/622619