
(Due By: 2020-07-02) rack/setup/install 3 lightweight hadoop nodes
Closed, Resolved (Public)

Description

This task will track the racking, setup, and OS installation of <enter the FQDN/hostname of the hosts being setup here>

Hostname / Racking / Installation Details

Hostnames: What are the hostnames, and have you updated https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions ?
Racking Proposal: Where should these systems be racked? Can they share with any existing systems or should they avoid any other systems sharing their rack or row?
Networking/Subnet/VLAN/IP: What are the network details? 1G or 10G? Only one network port connection, or more? Subnet/vlan and IP requirements per connect?
Partitioning/Raid: Is this hardware or software raid and what raid levels should be applied to each disk? What are the partitioning requirements and is there an existing partman recipe?
OS Distro: Stretch or Buster?

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

<hostname#1>:

  • - receive in system on procurement task T251189
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

<hostname#2>:

  • - receive in system on procurement task T251189
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

<hostname#3>:

  • - receive in system on procurement task T251189
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

Once the system(s) above have had all checkbox steps completed, this task can be resolved.

Event Timeline

RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
RobH added a parent task: Unknown Object (Task).

@elukey:

This needs to have the hostname and racking info filled out by your team (for example, whether the hosts should be in different rows from one another). Usually this is required before order placement, but this was rushed through due to the end of the fiscal year.

Please go ahead and fill out the hostname/racking details and assign to @Jclark-ctr for followup, thanks.

@Cmjohnson will be bulk uploading to netbox after leaving data center
an-test-master1001 30 A5 30 WMF4836
an-test-master1002 36 C5 34 WMF4837
an-test-coord1001 32 D6 32 WMF4838

wiki_willy renamed this task from (Need By: TBD) rack/setup/install 3 lightweight hadoop nodes to (Due By: 2020-07-02) rack/setup/install 3 lightweight hadoop nodes. Jun 26 2020, 7:01 PM

+an-test-coord1001 1H IN A 10.65.0.73
+an-test-master1001 1H IN A 10.65.0.71
+an-test-master1002 1H IN A 10.65.0.72
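
As a quick sanity check on the three A records quoted above, a minimal sketch follows. It assumes the lines use the usual zone-file field order (name, TTL, class, type, address) and that the leading "+" is diff markup; the function name `parse_a_records` is hypothetical and not part of any WMF tooling.

```python
import ipaddress

# The three added records quoted above (the leading "+" is diff markup).
DIFF_LINES = """\
+an-test-coord1001 1H IN A 10.65.0.73
+an-test-master1001 1H IN A 10.65.0.71
+an-test-master1002 1H IN A 10.65.0.72
"""


def parse_a_records(diff_text):
    """Return {hostname: IPv4Address} for the added A records in a diff."""
    records = {}
    for line in diff_text.splitlines():
        if not line.startswith("+"):
            continue  # only look at added lines
        name, ttl, rclass, rtype, rdata = line[1:].split()
        if rclass != "IN" or rtype != "A":
            continue  # this sketch only handles IN A records
        # IPv4Address raises ValueError if the address is malformed.
        records[name] = ipaddress.IPv4Address(rdata)
    return records


records = parse_a_records(DIFF_LINES)

# Every address should be private, and no address should be assigned twice.
assert all(ip.is_private for ip in records.values())
assert len(set(records.values())) == len(records)
```

A check like this only catches typos and duplicate addresses within the pasted lines; it does not verify the records against the live zone.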

@Jclark-ctr I went to set up the iDRACs today and noticed that you have an-workers in the rack space that you said is for an-master or an-coord. Can you please verify that the correct servers are in the correct locations? There are two different types of servers: one has 2 disks and the other has 4.

@Cmjohnson correct racking
an-test-master1001 A3 25 WMF4836
an-test-master1002 C3 33 WMF4837
an-test-coord1001 D3 18 WMF4838

BIOS/IDRAC updated, switch ports labeled but disabled

Change 622132 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/puppet@production] Adding an-test-coord/test-masters to dhcp file

https://gerrit.wikimedia.org/r/622132

Change 622135 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/puppet@production] Adding an-test-masters/test-coord1001 to site.pp role insetup

https://gerrit.wikimedia.org/r/622135

Change 622132 merged by Cmjohnson:
[operations/puppet@production] Adding an-test-coord/test-masters to dhcp file

https://gerrit.wikimedia.org/r/622132

Change 622135 merged by Cmjohnson:
[operations/puppet@production] Adding an-test-masters/test-coord1001 to site.pp role insetup

https://gerrit.wikimedia.org/r/622135

@elukey Same thing for these: I need to know which partman recipe you want, or feel free to add it yourself. Once that is done I need to enable the network ports.

Change 622479 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add partman recipe for an-test-(master|coord) hosts

https://gerrit.wikimedia.org/r/622479

Change 622479 merged by Elukey:
[operations/puppet@production] Add partman recipe for an-test-(master|coord) hosts

https://gerrit.wikimedia.org/r/622479

Change 622529 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] install_server: set stretch for an-test-* hosts

https://gerrit.wikimedia.org/r/622529

@Cmjohnson I added the config for partman and the OS. I think we are only missing the DNS records, and then we can reimage.

Change 622622 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Adding production dns an-test-master100[12] & an-test-coord1001

https://gerrit.wikimedia.org/r/622622

Change 622622 merged by Cmjohnson:
[operations/dns@master] Adding production dns an-test-master100[12] & an-test-coord1001

https://gerrit.wikimedia.org/r/622622

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

an-test-master1001.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202008261750_cmjohnson_3075_an-test-master1001_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['an-test-master1001.eqiad.wmnet']

Of which those FAILED:

['an-test-master1001.eqiad.wmnet']

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

an-test-master1002.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202008261801_cmjohnson_23501_an-test-master1002_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

an-test-coord1001.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202008261803_cmjohnson_28320_an-test-coord1001_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['an-test-master1002.eqiad.wmnet']

Of which those FAILED:

['an-test-master1002.eqiad.wmnet']

Completed auto-reimage of hosts:

['an-test-coord1001.eqiad.wmnet']

Of which those FAILED:

['an-test-coord1001.eqiad.wmnet']

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-test-master1001.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202008270619_elukey_16739.log.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-test-master1002.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202008270633_elukey_9753.log.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-test-coord1001.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202008270634_elukey_24906.log.

Completed auto-reimage of hosts:

['an-test-master1001.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['an-test-master1002.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['an-test-coord1001.eqiad.wmnet']

and were ALL successful.
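
The completion messages above follow a regular shape, so the final state per host can be tallied mechanically. The sketch below is an illustration only: it assumes the message text is exactly as quoted in this task, and `summarize` is a hypothetical helper, not part of wmf-auto-reimage.

```python
import ast


def summarize(messages):
    """Map each host to 'failed' or 'ok'; later messages override earlier ones."""
    status = {}
    for msg in messages:
        lines = [l.strip() for l in msg.splitlines() if l.strip()]
        if not lines or not lines[0].startswith("Completed auto-reimage"):
            continue  # skip launch notices and unrelated comments
        hosts = ast.literal_eval(lines[1])  # e.g. ['host.eqiad.wmnet']
        failed = []
        if "Of which those FAILED:" in lines:
            idx = lines.index("Of which those FAILED:")
            failed = ast.literal_eval(lines[idx + 1])
        for host in hosts:
            status[host] = "failed" if host in failed else "ok"
    return status


# Two messages for the same host, copied from the timeline above:
# the first attempt failed, the later one succeeded.
messages = [
    "Completed auto-reimage of hosts:\n\n"
    "['an-test-master1001.eqiad.wmnet']\n\n"
    "Of which those FAILED:\n\n"
    "['an-test-master1001.eqiad.wmnet']",
    "Completed auto-reimage of hosts:\n\n"
    "['an-test-master1001.eqiad.wmnet']\n\n"
    "and were ALL successful.",
]

final = summarize(messages)
print(final)
```

Because later messages overwrite earlier ones, the result reflects the most recent attempt for each host, which matches how the thread reads: the first round failed and the second succeeded.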

Looks good!

elukey@an-test-coord1001:~$ df -h
Filesystem            Size  Used Avail Use% Mounted on
udev                   63G     0   63G   0% /dev
tmpfs                  13G  9.6M   13G   1% /run
/dev/mapper/vg0-root   73G  1.4G   68G   2% /
tmpfs                  63G     0   63G   0% /dev/shm
tmpfs                 5.0M     0  5.0M   0% /run/lock
tmpfs                  63G     0   63G   0% /sys/fs/cgroup
/dev/mapper/vg0-srv   365G   69M  346G   1% /srv
tmpfs                  13G     0   13G   0% /run/user/13926
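
To confirm that a partman recipe produced the expected layout, the `df -h` output above can be checked programmatically. This is a minimal sketch that parses the pasted text; `parse_df` is a hypothetical helper, and it assumes no mount point or filesystem name contains spaces.

```python
# The df -h output pasted above, from an-test-coord1001.
DF_OUTPUT = """\
Filesystem            Size  Used Avail Use% Mounted on
udev                   63G     0   63G   0% /dev
tmpfs                  13G  9.6M   13G   1% /run
/dev/mapper/vg0-root   73G  1.4G   68G   2% /
tmpfs                  63G     0   63G   0% /dev/shm
tmpfs                 5.0M     0  5.0M   0% /run/lock
tmpfs                  63G     0   63G   0% /sys/fs/cgroup
/dev/mapper/vg0-srv   365G   69M  346G   1% /srv
tmpfs                  13G     0   13G   0% /run/user/13926
"""


def parse_df(text):
    """Parse `df -h` output into a list of dicts, skipping the header row."""
    rows = []
    for line in text.splitlines()[1:]:
        fs, size, used, avail, use_pct, mount = line.split(None, 5)
        rows.append({
            "filesystem": fs,
            "size": size,
            "used": used,
            "avail": avail,
            "use_pct": int(use_pct.rstrip("%")),
            "mount": mount,
        })
    return rows


rows = parse_df(DF_OUTPUT)

# Keep only the device-mapper (LVM) volumes, keyed by mount point.
lvm = {r["mount"]: r["size"] for r in rows
       if r["filesystem"].startswith("/dev/mapper/")}
print(lvm)  # {'/': '73G', '/srv': '365G'}
```

Here the two logical volumes in `vg0` are mounted at `/` and `/srv`, as the output above shows.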

I have set all three hosts to "staged" in Netbox, so we should be good.