(Due By: 2020-07-11) rack/setup/install an-worker[1096-1101]
Closed, ResolvedPublic

Description

This task will track the racking, setup, and OS installation of an-worker1096+

Hostname / Racking / Installation Details

Hostnames: an-worker1XXX
Racking Proposal: For now we are going to install these into the regular Hadoop cluster and experiment with labels for direct job scheduling. As such, they can be considered regular analytics worker nodes and named accordingly: an-worker1XXX
They should be racked as evenly spread as possible alongside the other Hadoop worker nodes.

Networking/Subnet/VLAN/IP: 10G, single port, analytics vlan

Partitioning/Raid: Same as other analytics workers: partman/custom/analytics-flex.cfg

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

an-worker1096:

  • - receive in system on procurement task T242147
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

an-worker1097:

  • - receive in system on procurement task T242147
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

an-worker1098:

  • - receive in system on procurement task T242147
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

an-worker1099:

  • - receive in system on procurement task T242147
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

an-worker1100:

  • - receive in system on procurement task T242147
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

an-worker1101:

  • - receive in system on procurement task T242147
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

Once the system(s) above have had all checkbox steps completed, this task can be resolved.
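
The resolution rule above (every checkbox on every host must be done) can be sketched in a few lines of Python. This is only an illustrative model, not tooling used on this task: the names `CHECKLIST_STEPS`, `HostChecklist`, and `task_resolvable` are hypothetical, and the step labels are shortened from the checklist text.

```python
from dataclasses import dataclass, field

# Step labels abbreviated from the per-host checklist in this task.
CHECKLIST_STEPS = [
    "receive in system on procurement task",
    "rack system and update netbox",
    "bios/drac/serial setup/testing",
    "mgmt dns entries added",
    "network port setup",
    "production dns entries added",
    "operations/puppet update",
    "OS installation",
    "puppet accept/initial run (role:spare)",
    "host state in netbox set to staged",
]


@dataclass
class HostChecklist:
    """Hypothetical per-host record of completed checklist steps."""
    hostname: str
    done: set = field(default_factory=set)

    def complete(self, step: str) -> None:
        # Reject anything that is not one of the known steps.
        if step not in CHECKLIST_STEPS:
            raise ValueError(f"unknown step: {step}")
        self.done.add(step)

    def remaining(self) -> list:
        # Preserve checklist order for the steps still open.
        return [s for s in CHECKLIST_STEPS if s not in self.done]


def task_resolvable(hosts) -> bool:
    """The task can be resolved only when no host has open steps."""
    return all(not h.remaining() for h in hosts)


# an-worker1096 through an-worker1101, as in the task title.
hosts = [HostChecklist(f"an-worker{n}") for n in range(1096, 1102)]

# Mark every step done on all hosts except the last one.
for host in hosts[:-1]:
    for step in CHECKLIST_STEPS:
        host.complete(step)

print(task_resolvable(hosts))  # prints False: an-worker1101 is still open
```

Completing the remaining steps on the last host would flip the result to `True`.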

Event Timeline

RobH added a parent task: Unknown Object (Task). Jun 9 2020, 2:40 PM
RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
RobH removed a subscriber: RobH.
wiki_willy renamed this task from (Need By: TBD) rack/setup/install an-worker[1096-1101] to (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101]. Jun 26 2020, 7:00 PM
Jclark-ctr updated the task description.
Jclark-ctr added a subscriber: Jclark-ctr.

name           rack  position  asset_tag  switchport
an-worker1096  A2    39        WMF4839    27
an-worker1097  B4    36        WMF4840    47
an-worker1098  B7    35        WMF4841    39
an-worker1099  C2    32        WMF4842    31
an-worker1100  C4    28        WMF4843    39
an-worker1101  C7    37        WMF4844    37
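
To check how the racking table above matches the "spread as evenly as possible" request, the rows can be parsed and counted. The following is a minimal sketch, assuming the leading letter of the rack name (A, B, C) identifies the row; the function names `parse_racking` and `hosts_per_row` are made up for illustration.

```python
from collections import Counter

# Data copied from the racking table in this task:
# name, rack, position, asset_tag, switchport
RACKING = """\
an-worker1096 A2 39 WMF4839 27
an-worker1097 B4 36 WMF4840 47
an-worker1098 B7 35 WMF4841 39
an-worker1099 C2 32 WMF4842 31
an-worker1100 C4 28 WMF4843 39
an-worker1101 C7 37 WMF4844 37
"""


def parse_racking(text):
    """Turn whitespace-separated table rows into dictionaries."""
    rows = []
    for line in text.strip().splitlines():
        name, rack, position, asset_tag, switchport = line.split()
        rows.append({
            "name": name,
            "rack": rack,
            "position": int(position),
            "asset_tag": asset_tag,
            "switchport": int(switchport),
        })
    return rows


def hosts_per_row(rows):
    """Count hosts per row, taking the first letter of the rack as the row."""
    return Counter(row["rack"][0] for row in rows)


rows = parse_racking(RACKING)
for row_letter, count in sorted(hosts_per_row(rows).items()):
    print(row_letter, count)
# prints:
# A 1
# B 2
# C 3
```

The count covers only these six hosts; judging the overall balance would also need the placement of the existing Hadoop workers, which this task does not list.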

Cross-posting: we had a chat a while ago with dc-ops about 10g-enabled racks and availability, ending up in T243521#6005828. The config may be outdated, but we probably need to sync before adding all the nodes just to be sure?

@elukey @wiki_willy I am getting ready to do all of these and T259071 this week. What do you need? Do they no longer need 10G? Please let me know before any more work is done with them.

Yep all good, you can proceed with 10G :)

Change 622828 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Adding production dns for an-worker1096-1101 ipv4/ipv6

https://gerrit.wikimedia.org/r/622828

Change 622828 merged by Cmjohnson:
[operations/dns@master] Adding production dns for an-worker1096-1101 ipv4/ipv6

https://gerrit.wikimedia.org/r/622828

Change 623847 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/puppet@production] Adding nodes an-worker1096-1117 to site.pp

https://gerrit.wikimedia.org/r/623847

Change 623847 merged by Cmjohnson:
[operations/puppet@production] Adding nodes an-worker1096-1117 to site.pp

https://gerrit.wikimedia.org/r/623847

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

an-worker1096.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202009021946_cmjohnson_26117_an-worker1096_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['an-worker1096.eqiad.wmnet']

Of which those FAILED:

['an-worker1096.eqiad.wmnet']

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

an-worker1100.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202009031646_cmjohnson_21870_an-worker1100_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['an-worker1100.eqiad.wmnet']

Of which those FAILED:

['an-worker1100.eqiad.wmnet']

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

an-worker1100.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202009031703_cmjohnson_4494_an-worker1100_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['an-worker1100.eqiad.wmnet']

Of which those FAILED:

['an-worker1100.eqiad.wmnet']

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

an-worker1096.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202009031728_cmjohnson_27929_an-worker1096_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

an-worker1097.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202009031729_cmjohnson_29059_an-worker1097_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

an-worker1098.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202009031730_cmjohnson_30569_an-worker1098_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

an-worker1099.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202009031731_cmjohnson_31602_an-worker1099_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

an-worker1100.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202009031733_cmjohnson_608_an-worker1100_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

an-worker1101.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202009031737_cmjohnson_7040_an-worker1101_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['an-worker1096.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['an-worker1098.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['an-worker1097.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['an-worker1100.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['an-worker1099.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['an-worker1101.eqiad.wmnet']

Of which those FAILED:

['an-worker1101.eqiad.wmnet']

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

an-worker1101.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202009031959_cmjohnson_10210_an-worker1101_eqiad_wmnet.log.

Change 624258 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] updating an-worker1101 dns to reflect correct rack location

https://gerrit.wikimedia.org/r/624258

Change 624258 merged by Cmjohnson:
[operations/dns@master] updating an-worker1101 dns to reflect correct rack location

https://gerrit.wikimedia.org/r/624258

Completed auto-reimage of hosts:

['an-worker1101.eqiad.wmnet']

Of which those FAILED:

['an-worker1101.eqiad.wmnet']

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

an-worker1101.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202009032022_cmjohnson_31526_an-worker1101_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['an-worker1101.eqiad.wmnet']

and were ALL successful.

Re-opening since some things need attention (mostly from the Analytics team):

  • these hosts don't have the flex-bay 2 disk hw raid IIRC, but 24x1.8TB disks
  • these hosts need Stretch and not Buster (we are not ready to migrate yet)

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1096.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202009070853_elukey_28208.log.

Change 625608 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] install_server: change partition scheme for Hadoop workers with GPUs

https://gerrit.wikimedia.org/r/625608

Change 625608 merged by Elukey:
[operations/puppet@production] install_server: change partition scheme for Hadoop workers with GPUs

https://gerrit.wikimedia.org/r/625608

Completed auto-reimage of hosts:

['an-worker1096.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1096.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202009070938_elukey_6717.log.

Completed auto-reimage of hosts:

['an-worker1096.eqiad.wmnet']

Of which those FAILED:

['an-worker1096.eqiad.wmnet']

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1096.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202009071014_elukey_16133.log.

Completed auto-reimage of hosts:

['an-worker1096.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1097.eqiad.wmnet', 'an-worker1098.eqiad.wmnet', 'an-worker1099.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202009071300_elukey_2391.log.

Completed auto-reimage of hosts:

['an-worker1097.eqiad.wmnet']

Of which those FAILED:

['an-worker1099.eqiad.wmnet', 'an-worker1098.eqiad.wmnet']

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1110.eqiad.wmnet', 'an-worker1101.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202009071557_elukey_6414.log.

Completed auto-reimage of hosts:

['an-worker1110.eqiad.wmnet']

Of which those FAILED:

['an-worker1101.eqiad.wmnet']

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1101.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202009221541_elukey_30906.log.

Completed auto-reimage of hosts:

['an-worker1101.eqiad.wmnet']

Of which those FAILED:

['an-worker1101.eqiad.wmnet']

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1101.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202009221702_elukey_30132.log.

Completed auto-reimage of hosts:

['an-worker1101.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1099.eqiad.wmnet', 'an-worker1100.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202009281434_elukey_14594.log.

Completed auto-reimage of hosts:

['an-worker1099.eqiad.wmnet', 'an-worker1100.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1117.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202010020915_elukey_713.log.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1111.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202010051714_elukey_31106.log.

Completed auto-reimage of hosts:

['an-worker1111.eqiad.wmnet']

Of which those FAILED:

['an-worker1111.eqiad.wmnet']

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1113.eqiad.wmnet', 'an-worker1114.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202010051739_elukey_4928.log.

Completed auto-reimage of hosts:

['an-worker1113.eqiad.wmnet']

Of which those FAILED:

['an-worker1114.eqiad.wmnet']
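
The reimage log paths in this timeline follow two visible patterns: one with the hostname embedded in the file name and one without. A minimal sketch for pulling the fields apart is shown below. The field names are assumptions: the timeline does not say what the numeric component is, so it is kept as an opaque `run_id`, and `parse_reimage_log` is a hypothetical helper, not part of wmf-auto-reimage.

```python
import re
from datetime import datetime
from pathlib import PurePosixPath

# <YYYYmmddHHMM>_<user>_<number>[_<host with dots written as underscores>]
LOG_NAME = re.compile(
    r"(?P<ts>\d{12})_(?P<user>[^_]+)_(?P<run_id>\d+)(?:_(?P<host>.+))?"
)


def parse_reimage_log(path):
    """Split a reimage log path into start time, user, run id, and host."""
    stem = PurePosixPath(path).stem  # file name without the ".log" suffix
    match = LOG_NAME.fullmatch(stem)
    if match is None:
        raise ValueError(f"unrecognised log name: {path}")
    host = match.group("host")
    return {
        "started": datetime.strptime(match.group("ts"), "%Y%m%d%H%M"),
        "user": match.group("user"),
        # Meaning of this number is not stated in the timeline.
        "run_id": match.group("run_id"),
        # Hostnames here contain hyphens but no underscores, so
        # underscores can be mapped back to dots.
        "host": host.replace("_", ".") if host else None,
    }


first = parse_reimage_log(
    "/var/log/wmf-auto-reimage/"
    "202009021946_cmjohnson_26117_an-worker1096_eqiad_wmnet.log"
)
print(first["started"], first["user"], first["host"])
# prints: 2020-09-02 19:46:00 cmjohnson an-worker1096.eqiad.wmnet

multi = parse_reimage_log(
    "/var/log/wmf-auto-reimage/202009070853_elukey_28208.log"
)
print(multi["host"])  # prints: None
```

Sorting the parsed entries by `started` gives a quick chronological view of who reimaged which host.
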
RobH mentioned this in Unknown Object (Task). Oct 26 2020, 9:10 PM