(Aug 30th, 2019) rack/setup/install elastic10[53-67].eqiad.wmnet
Closed, Resolved | Public | 0 Estimated Story Points

Description

This task will track the racking of 15 new elastic systems, replacing elastic10[17-31].
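
The bracketed host ranges used throughout this task (elastic10[53-67], elastic10[17-31]) can be expanded into individual hostnames. The following is a minimal sketch; `expand_host_range` is a hypothetical helper written for illustration, not part of any tooling referenced in this task.

```python
import re

def expand_host_range(pattern):
    """Expand a 'prefix[start-end]suffix' pattern into a list of names.

    Hypothetical helper for illustration only; a pattern without a
    bracketed range is returned unchanged as a single-item list.
    """
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", pattern)
    if match is None:
        return [pattern]
    prefix, start, end, suffix = match.groups()
    width = len(start)  # keep zero padding if the range uses it
    return [
        f"{prefix}{n:0{width}d}{suffix}"
        for n in range(int(start), int(end) + 1)
    ]

new_hosts = expand_host_range("elastic10[53-67].eqiad.wmnet")
old_hosts = expand_host_range("elastic10[17-31]")
print(len(new_hosts), new_hosts[0], new_hosts[-1])
```

Both ranges expand to 15 names, which matches the 15 new systems replacing 15 old ones.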

Racking Proposal:

This was provided on procurement task T226843.

Current racking configuration (excluding hosts that will be decommissioned):

Row A: 1032(A3), 1033(A3), 1034(A3), 1035(A3), 1044(A6), 1045(A6), 1048(A6) (7 nodes)
Row B: 1036(B3), 1037(B3), 1038(B3), 1039(B3), 1046(B6), 1047(B6), 1049(B4), 1050(B4) (8 nodes)
Row C: 1040(C5), 1041(C5), 1042(C5), 1043(C5), 1051(C7), 1052(C7) (6 nodes)
Row D: none (0 nodes)

Proposal for new nodes:

Row A: 1053, 1054 (2 nodes, try to avoid A3, then avoid A6)
Row B: 1055 (1 node), Avoid B3, B4, B6 if possible.
Row C: 1056, 1057, 1058 (3 nodes) Avoid C5 first, then avoid C7 if possible
Row D: 1059, 1060, 1061, 1062, 1063, 1064, 1065, 1066, 1067 (9 nodes). Space evenly across 10G racks. No existing elastic nodes are staying in this row: the older ones here are all in the elastic10[17-31] range and will go away once these come online.

A minor imbalance (±1 node) between rows is not a major issue; feel free to propose another arrangement if it makes more sense to on-site staff. Try to space elastic nodes evenly across the 10G racks in each row.
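
As a quick check of the proposal above, the per-row totals after racking can be tallied. This is a small sketch using only the counts listed in this description; the variable and function names are illustrative.

```python
from collections import Counter

# Hosts staying in each row, from "Current racking configuration".
current = {"A": 7, "B": 8, "C": 6, "D": 0}
# New hosts per row, from "Proposal for new nodes".
proposed = {"A": 2, "B": 1, "C": 3, "D": 9}

def totals_after_racking(current, proposed):
    """Return per-row node counts once the new hosts are racked."""
    combined = Counter(current)
    combined.update(proposed)  # adds the proposed counts row by row
    return dict(combined)

totals = totals_after_racking(current, proposed)
# Difference between the fullest and emptiest row.
spread = max(totals.values()) - min(totals.values())
print(totals, "spread:", spread)
```

With these numbers every row ends up with 9 nodes, so the proposal sits well inside the ±1 node tolerance.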

Common specifications:

  • OS: Debian/Stretch
  • IP / Subnet: internal subnet for each row
  • Partitioning: software RAID0 (current systems use elasticsearch-raid0.cfg; if we stay with 2 SSDs per node, that recipe should work)

Installation Checklists

elastic1053:

  • - receive in system on procurement task T226843
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

elastic1054:

  • - receive in system on procurement task T226843
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

elastic1055:

  • - receive in system on procurement task T226843
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

elastic1056:

  • - receive in system on procurement task T226843
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

elastic1057:

  • - receive in system on procurement task T226843
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

elastic1058:

  • - receive in system on procurement task T226843
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

elastic1059:

  • - receive in system on procurement task T226843
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

elastic1060:

  • - receive in system on procurement task T226843
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

Event Timeline

RobH triaged this task as Medium priority. Aug 19 2019, 6:58 PM
RobH created this task.
RobH added a parent task: Unknown Object (Task). Aug 19 2019, 6:58 PM

Try to evenly space out elastic nodes in the row evenly in 1G racks.

All new elastic servers are coming in with 10G cards and should go into 10G racks.

fixed!

RobH renamed this task from rack/setup/install elastic10[53-67].eqiad.wmnet to (Aug 30th, 2019) rack/setup/install elastic10[53-67].eqiad.wmnet. Sep 9 2019, 4:54 PM

Change 538069 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Adding mgmt dns entries for new elastic servers

https://gerrit.wikimedia.org/r/538069

Change 538070 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] old mgmt entry that was never removed for wmf3151

https://gerrit.wikimedia.org/r/538070

Change 538070 abandoned by Cmjohnson:
old mgmt entry that was never removed for wmf3151

https://gerrit.wikimedia.org/r/538070

Change 538070 restored by Cmjohnson:
old mgmt entry that was never removed for wmf3151

https://gerrit.wikimedia.org/r/538070

Change 538069 abandoned by Cmjohnson:
Adding mgmt dns entries for new elastic servers

https://gerrit.wikimedia.org/r/538069

Change 538070 merged by Cmjohnson:
[operations/dns@master] old mgmt entry that was never removed for wmf3151

https://gerrit.wikimedia.org/r/538070

Change 538069 restored by Cmjohnson:
Adding mgmt dns entries for new elastic servers

https://gerrit.wikimedia.org/r/538069

Change 538069 merged by Cmjohnson:
[operations/dns@master] Adding mgmt dns entries for new elastic servers

https://gerrit.wikimedia.org/r/538069

Do we want 9 hosts in Row D? Row D only has two 10G racks.

Row A: 1053, 1054 (2 nodes, try to avoid A3, then avoid A6)
Row B: 1055 (1 node), Avoid B3, B4, B6 if possible.
Row C: 1056, 1057, 1058 (3 nodes) Avoid C5 first, then avoid C7 if possible
Row D: 1059, 1060, 1061, 1062, 1063, 1064, 1065, 1066, 1067 (9 nodes) evenly space in 10G racks, there are no elastic nodes staying in this row (the older ones in this row are all going away when these come online as they are in the elastic10[17-31] range.)

Cmjohnson subscribed.

I need some clarification on this, please. We do not have 10G space for all of these servers, especially if you want 9 servers between D2 and D7. There has been a miscommunication about the amount of 10G space we have available. Could someone please clarify whether these need 10G, or whether they can be evenly distributed in 1G racks? Thanks!

The servers today will not be able to utilize 10G, so they could go in 1G racks for the time being. The cluster can't take advantage of 10G until all the nodes are on 10G.

@wiki_willy any idea on a timeline to get those servers racked? At the moment, we have 4 servers down in the eqiad cluster (various issues; we won't fix 3 of them since the new servers have arrived).

The servers today will not be able to utilize 10G, so they could go in 1G racks for the time being. The cluster can't take advantage of 10G until all the nodes are on 10G.

To be clear, this means that we can rack those servers in 1G atm and we can revisit that when we have upgraded the full cluster.

All hosts racked and powered; console and network finished for all. Will update netbox and finish iDRAC settings tomorrow.

Updated netbox; finished iDRAC and BIOS setup.

Host Switchport
elastic1053 33
elastic1054 30
elastic1055 26
elastic1056 23
elastic1057 25
elastic1058 20
elastic1059 21
elastic1060 11
elastic1061 12
elastic1062 28
elastic1063 39
elastic1064 35
elastic1065 4
elastic1066 7
elastic1067 9

@Gehel The network switches are set up, and all the on-site work has been completed.

Gehel moved this task from Incoming to Blocked/Waiting on the Discovery-Search (Current work) board.

It looks like there are still a few steps before "handoff for service implementation".

Change 549891 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Adding dns entries for elastic1053-67

https://gerrit.wikimedia.org/r/549891

Change 549891 merged by Cmjohnson:
[operations/dns@master] Adding dns entries for elastic1053-67

https://gerrit.wikimedia.org/r/549891

Change 549893 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/puppet@production] Adding dhcpd file for elastic10[53-67]

https://gerrit.wikimedia.org/r/549893

Change 549893 merged by Cmjohnson:
[operations/puppet@production] Adding dhcpd file for elastic10[53-67]

https://gerrit.wikimedia.org/r/549893

Change 550116 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/puppet@production] Adding elastic10[53-67].eqiad.wmnet to site.pp role::spare

https://gerrit.wikimedia.org/r/550116

Change 550116 merged by Cmjohnson:
[operations/puppet@production] Adding elastic10[53-67].eqiad.wmnet to site.pp role::spare

https://gerrit.wikimedia.org/r/550116

Cmjohnson updated the task description.
Cmjohnson removed a project: ops-eqiad.

@Gehel The new elasticsearch servers are installed and ready for you. I have assigned you the task and removed my project tag. Please reassign and add the ops-eqiad tag back if you have any issues.

@Cmjohnson thanks! I'll have a look and let you know if there are any issues!

Change 550651 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] elasticsearch: initial configuration for new elasticsearch servers

https://gerrit.wikimedia.org/r/550651

Change 550651 merged by Gehel:
[operations/puppet@production] elasticsearch: initial configuration for new elasticsearch servers

https://gerrit.wikimedia.org/r/550651

Mentioned in SAL (#wikimedia-operations) [2019-11-13T10:27:19Z] <gehel> start configuration of new elasticsearch servers - T230746

Change 550658 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] elasticsearch: add node specific configuration for new servers

https://gerrit.wikimedia.org/r/550658

Change 550658 merged by Gehel:
[operations/puppet@production] elasticsearch: add node specific configuration for new servers

https://gerrit.wikimedia.org/r/550658

Mentioned in SAL (#wikimedia-operations) [2019-11-13T15:29:32Z] <gehel> configuration of new elasticsearch servers completed, all working and pooled - T230746

Mentioned in SAL (#wikimedia-operations) [2019-11-13T16:21:29Z] <gehel> draining elastic1017-1031 to prepare for decommission - T230746