
Q4:(Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet
Closed, Resolved, Public

Description

This task will track the racking, setup, and OS installation of elastic10[68-83].eqiad.wmnet

Hostname / Racking / Installation Details

Hostnames: elastic10[68-83].eqiad.wmnet
Racking Proposal:

This is a refresh so we'll keep the same configuration as before (evenly spread across rows in order to reduce fallout from row failure)
Current racking configuration:

Row A (9 servers): elastic10(32|33|34|35|44|45|48|53|54)
Row B (9 servers): elastic10(36|37|38|39|46|47|49|50|55)
Row C (9 servers): elastic10(40|41|42|43|51|52|56|57|58)
Row D (9 servers): elastic10(59|60|61|62|63|64|65|66|67)
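
The row listing above uses a compact `prefix(a|b|c)` notation. As a minimal sketch (the `expand` helper and the `ROWS` mapping are hypothetical names for illustration, not part of any existing tooling), the following expands each entry into full hostnames and tallies the servers per row:

```python
import re

# Row listing copied from the racking configuration above.
ROWS = {
    "A": "elastic10(32|33|34|35|44|45|48|53|54)",
    "B": "elastic10(36|37|38|39|46|47|49|50|55)",
    "C": "elastic10(40|41|42|43|51|52|56|57|58)",
    "D": "elastic10(59|60|61|62|63|64|65|66|67)",
}


def expand(pattern):
    """Expand 'prefix(a|b|c)' into ['prefixa', 'prefixb', 'prefixc']."""
    match = re.fullmatch(r"(\w+)\(([\d|]+)\)", pattern)
    if match is None:
        raise ValueError(f"unrecognised pattern: {pattern!r}")
    prefix, alternatives = match.groups()
    return [prefix + suffix for suffix in alternatives.split("|")]


hosts_by_row = {row: expand(pattern) for row, pattern in ROWS.items()}
counts = {row: len(hosts) for row, hosts in hosts_by_row.items()}
print(counts)  # {'A': 9, 'B': 9, 'C': 9, 'D': 9}
```

The same `hosts_by_row` mapping can be filtered by hostname to see which rows a given batch of hosts occupies.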

Networking/Subnet/VLAN/IP: 10G, same subnet as existing elastic10* servers (private VLAN)
Partitioning/Raid: RAID0 software (elasticsearch-raid0.cfg - already configured for elastic* in netboot.cfg)
OS Distro: Buster (default unless otherwise specified)

Per host setup checklist

elastic1068:

  • - receive in system on procurement task T279158 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

elastic1069:

  • - receive in system on procurement task T279158 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

elastic1070:

  • - receive in system on procurement task T279158 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

elastic1071:

  • - receive in system on procurement task T279158 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

elastic1072:

  • - receive in system on procurement task T279158 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

elastic1073:

  • - receive in system on procurement task T279158 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

elastic1074:

  • - receive in system on procurement task T279158 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

elastic1075:

  • - receive in system on procurement task T279158 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

elastic1076:

  • - receive in system on procurement task T279158 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

elastic1077:

  • - receive in system on procurement task T279158 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

elastic1078:

  • - receive in system on procurement task T279158 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

elastic1079:

  • - receive in system on procurement task T279158 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

elastic1080:

  • - receive in system on procurement task T279158 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

elastic1081:

  • - receive in system on procurement task T279158 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

elastic1082:

  • - receive in system on procurement task T279158 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

elastic1083:

  • - receive in system on procurement task T279158 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

Once the system(s) above have had all checkbox steps completed, this task can be resolved.

Event Timeline

RobH added a parent task: Unknown Object (Task).
RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
RobH removed a subscriber: RobH.
RobH mentioned this in Unknown Object (Task). May 5 2021, 2:48 PM

@Gehel I would like to confirm the racking. Should these be evenly spread across rows? They are replacing elastic[1032-1047], which are only in rows A, B, and C.

@RKemper can you confirm these need to be 10G? Space for 10G is limited in eqiad, and these would be the only elastic hosts on 10G.

Networking: Yes, we need these to be in 10G rows. We now have enough elastic hosts with 10G NICs to get the whole fleet on 10G networking, at which point we can increase our elasticsearch cluster settings to take advantage of the higher bandwidth.

Racking: Let's keep the same rows as the hosts being replaced (rows A-C) rather than evenly distributing the new hosts over (A-D). Ultimately every host will be on 10G, so we don't need to worry about row D not having 10G or anything like that.

Edit: And (following up a conversation with Jclark) some of the new hosts need to be put into D in order to find enough 10G racking space; that's okay. We ideally want to maintain a perfect 9 hosts per row, but if we need a bit more variance (7-8 in one row and 10-11 in another or so), while not ideal, that's an acceptable tradeoff if necessary to get the 10G networking.

Current state

*Before* accounting for new 10G switches opened up by https://phabricator.wikimedia.org/T280203, we've got the following 10G spots open:

(+4, +5, +9, 0) in (A, B, C, D)

So in order to satisfy the need for 10G with the racking of these replacement hosts, we'd have to "steal" 2 spots from C for A, and another 1 spot from C for B, giving us the following # of hosts in each row:
(7, 8, 12, 9) in (A, B, C, D)

In this configuration the worst row for us to lose would be Row C, which would contain 12 of the 36 hosts (about 33%). That's probably an acceptable amount of variance (if spots were unlimited it'd be 25% of total hosts in each row, so we're not losing too much).
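
The arithmetic above can be sketched as follows. This is only an illustration under stated assumptions: the per-row counts of hosts being retired (6, 6, 4, 0) are derived from where elastic[1032-1047] appear in the current racking configuration, and the greedy spill policy in the hypothetical `place_new_hosts` function is just one way to reproduce the figures, not how the racking was actually decided.

```python
ROWS = ("A", "B", "C", "D")

# Current hosts per row (9 each) and open 10G spots, as stated above.
current = {"A": 9, "B": 9, "C": 9, "D": 9}
open_10g = {"A": 4, "B": 5, "C": 9, "D": 0}
# Assumption: elastic[1032-1047] counted per row from the current racking list.
retiring = {"A": 6, "B": 6, "C": 4, "D": 0}


def place_new_hosts(retiring, open_spots):
    """Keep each replacement in its old row where a 10G spot exists,
    then spill the remainder into rows that still have spare spots."""
    placed = {row: min(retiring[row], open_spots[row]) for row in ROWS}
    overflow = sum(retiring.values()) - sum(placed.values())
    for row in ROWS:
        take = min(open_spots[row] - placed[row], overflow)
        placed[row] += take
        overflow -= take
    if overflow:
        raise ValueError(f"{overflow} host(s) cannot be placed on 10G")
    return placed


placed = place_new_hosts(retiring, open_10g)
after = {row: current[row] - retiring[row] + placed[row] for row in ROWS}
worst_row = max(after, key=after.get)
worst_share = after[worst_row] / sum(after.values())
print(after)                            # {'A': 7, 'B': 8, 'C': 12, 'D': 9}
print(worst_row, round(worst_share, 3))  # C 0.333
```

Changing `open_10g` to reflect newly freed spots shows how the distribution shifts once more 10G capacity becomes available.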

*After* accounting for https://phabricator.wikimedia.org/T280203, which will open up 7 new spots in Row A, we're no longer constrained in row A so it could look like this:

(9, 8, 9, 10) in (A, B, C, D)

So for now we'll have this ticket wait on the decom ticket, and then we'll be able to get these refresh hosts on 10G networking with only a slight hit to the evenness of our distribution across rows. Meanwhile I'll get confirmation from @Gehel that the above checks out.

Due to the limited open spaces for racking, I will hold until I get confirmation on the proposed racking. Currently available 10G spaces: 4 in Row A, 5 in Row B, 9 in Row C, 0 in Row D.
Possibly, after https://phabricator.wikimedia.org/T280203 is completed, the 7 additional spots it opens in row A will allow us to keep servers more balanced across the 4 rows.

RobH renamed this task from (Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet to Q1:(Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet. Aug 26 2021, 7:46 PM
wiki_willy renamed this task from Q1:(Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet to Q4:(Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet. Aug 26 2021, 10:28 PM

@Jclark-ctr We put this ticket on hold while https://phabricator.wikimedia.org/T280203 was getting closed out (to free up some more 10G slots so that we can fit all of these elastic* hosts). Is this ticket ready to move forward again, now that T280203 has been resolved?

@RKemper Started working on this again today, installing rails and racking. It should move along quickly now that more locations have been opened up.

elastic1068 A4 u6 cableid 11048 port 39
elastic1069 A4 u13 cableid 11049 port 40
elastic1070 A7 u16 cableid 11052 port 38
elastic1071 A7 u17 cableid 11056 port 39
elastic1072 A7 u18 cableid 11053 port 40
elastic1073 A7 u19 cableid 11057 port 41
elastic1074 B2 u33 cableid 11044 port 18
elastic1075 B2 u34 cableid 11047 port 19
elastic1076 B4 u7 cableid 11045 port 1
elastic1077 B4 u9 cableid 11046 port 2
elastic1078 B4 u40 cableid 11050 port 17
elastic1079 B4 u41 cableid 11051 port 22
elastic1080 C4 u17 cableid 11043 port 46
elastic1081 C4 u27 cableid 11055 port 47
elastic1082 C7 u24 cableid 11054 port 1
elastic1083 C7 u27 cableid 11058 port 3
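
For anyone wanting to sanity-check the placements above, here is a minimal sketch that parses the lines into records, tallies new hosts per row, and checks for duplicate cable IDs, rack units, or ports. The field names and the `parse` helper are hypothetical, and treating (rack, port) as unique assumes one switch per rack, which the ticket does not state. The regex tolerates an optional space after `port`.

```python
import re
from collections import Counter

PLACEMENTS = """\
elastic1068 A4 u6 cableid 11048 port 39
elastic1069 A4 u13 cableid 11049 port 40
elastic1070 A7 u16 cableid 11052 port 38
elastic1071 A7 u17 cableid 11056 port 39
elastic1072 A7 u18 cableid 11053 port 40
elastic1073 A7 u19 cableid 11057 port 41
elastic1074 B2 u33 cableid 11044 port 18
elastic1075 B2 u34 cableid 11047 port 19
elastic1076 B4 u7 cableid 11045 port 1
elastic1077 B4 u9 cableid 11046 port 2
elastic1078 B4 u40 cableid 11050 port 17
elastic1079 B4 u41 cableid 11051 port 22
elastic1080 C4 u17 cableid 11043 port 46
elastic1081 C4 u27 cableid 11055 port 47
elastic1082 C7 u24 cableid 11054 port 1
elastic1083 C7 u27 cableid 11058 port 3
"""

# host, row letter + rack number, rack unit, cable ID, switch port
LINE_RE = re.compile(
    r"(?P<host>\S+)\s+(?P<row>[A-D])(?P<rack>\d+)\s+u(?P<unit>\d+)"
    r"\s+cableid\s+(?P<cable>\d+)\s+port\s*(?P<port>\d+)"
)


def parse(text):
    """Return one dict per placement line; raise on anything unparseable."""
    records = []
    for line in text.splitlines():
        match = LINE_RE.fullmatch(line.strip())
        if match is None:
            raise ValueError(f"cannot parse line: {line!r}")
        records.append(match.groupdict())
    return records


records = parse(PLACEMENTS)
per_row = Counter(r["row"] for r in records)
cables = [r["cable"] for r in records]
rack_units = [(r["row"], r["rack"], r["unit"]) for r in records]
rack_ports = [(r["row"], r["rack"], r["port"]) for r in records]

print(dict(per_row))  # {'A': 6, 'B': 6, 'C': 4}
for name, values in [("cable", cables), ("unit", rack_units), ("port", rack_ports)]:
    duplicates = [v for v, n in Counter(values).items() if n > 1]
    print(name, "duplicates:", duplicates)
```

The tally of 6, 6, and 4 new hosts in rows A, B, and C matches the number of elastic[1032-1047] hosts listed in each of those rows in the task description.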

Jclark-ctr updated the task description.
Jclark-ctr added a subscriber: Jclark-ctr.

dns and port descriptions updated

Change 724133 had a related patch set uploaded (by Cmjohnson; author: Cmjohnson):

[operations/puppet@production] Adding new elastic servers to dhcp and site.pp

https://gerrit.wikimedia.org/r/724133

Change 724133 merged by Cmjohnson:

[operations/puppet@production] Adding new elastic servers to dhcp and site.pp

https://gerrit.wikimedia.org/r/724133

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

elastic1068.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202109271730_cmjohnson_19258_elastic1068_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

elastic1068.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202109271820_cmjohnson_26890_elastic1068_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

elastic1069.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202109271824_cmjohnson_27372_elastic1069_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

elastic1070.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202109271825_cmjohnson_27475_elastic1070_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

elastic1071.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202109271826_cmjohnson_27574_elastic1071_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

elastic1072.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202109271834_cmjohnson_28178_elastic1072_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

elastic1073.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202109271839_cmjohnson_30826_elastic1073_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

elastic1074.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202109271840_cmjohnson_31304_elastic1074_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['elastic1068.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

elastic1075.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202109271845_cmjohnson_1038_elastic1075_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

elastic1076.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202109271846_cmjohnson_3337_elastic1076_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

elastic1077.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202109271849_cmjohnson_4715_elastic1077_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['elastic1071.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['elastic1070.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

elastic1078.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202109271854_cmjohnson_8834_elastic1078_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

elastic1079.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202109271854_cmjohnson_8901_elastic1079_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

elastic1080.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202109271855_cmjohnson_8977_elastic1080_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

elastic1081.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202109271856_cmjohnson_9052_elastic1081_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

elastic1082.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202109271857_cmjohnson_9152_elastic1082_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['elastic1069.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

elastic1083.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202109271857_cmjohnson_9249_elastic1083_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['elastic1072.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['elastic1073.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['elastic1074.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['elastic1076.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['elastic1075.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['elastic1077.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['elastic1078.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['elastic1079.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['elastic1080.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['elastic1083.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['elastic1082.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['elastic1081.eqiad.wmnet']

and were ALL successful.

Cmjohnson updated the task description.