
(Need by: TBD) codfw: rack/setup/install elastic20{55,56,57,58,59,60}.codfw.wmnet
Closed, Resolved (Public)

Description

This task will track the racking, setup, and installation of 6 new elastic servers for codfw.

Shared Info

Hostnames: From elastic2055.wikimedia.org to elastic2060.wikimedia.org
Racking Proposal:
Row A (2 servers): elastic20(55|56)
Row B (2 servers): elastic20(57|58)
Row C (1 server): elastic20(59)
Row D (1 server): elastic20(60)
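
For anyone scripting against this plan, the racking proposal above can be sketched as a small mapping from row to hostname suffix. This is only an illustration: `RACKING_PROPOSAL` and `expand_hosts` are hypothetical names, not part of any existing tooling, and the `codfw.wmnet` domain is taken from the task title.

```python
# Hypothetical sketch: row -> hostname suffixes, copied from the racking proposal.
RACKING_PROPOSAL = {
    "A": [55, 56],
    "B": [57, 58],
    "C": [59],
    "D": [60],
}


def expand_hosts(proposal, prefix="elastic20", domain="codfw.wmnet"):
    """Return a {fqdn: row} dict for every server in the proposal."""
    return {
        f"{prefix}{suffix}.{domain}": row
        for row, suffixes in proposal.items()
        for suffix in suffixes
    }


hosts = expand_hosts(RACKING_PROPOSAL)
for fqdn, row in sorted(hosts.items()):
    print(f"{fqdn}: row {row}")
```

Keeping the proposal as data makes it easy to check that all six servers are accounted for and that no row receives more than two of them.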

Networking/Subnet/VLAN/IP: 10G single NIC (but we won't be able to use the 10G until the whole cluster is upgraded, so 1G is fine for now, provided we can move to 10G in the future). Same VLAN as other elastic2* servers (not sure what the nomenclature is).

Partitioning/Raid: software RAID, RAID0 (elasticsearch-raid0.cfg - already configured for elastic* in netboot.cfg)

Individual Server Checklists

elastic2055: Row A rack A2 xe-2/0/12

  • - receive in system on procurement task T237571
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, private1-row-vlan)
    • end on-site specific steps
  • - production dns entries added (private subnet for its row)
  • - operations/puppet update (install_server at minimum, other files if possible):
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

elastic2056: Row A rack A7 xe-7/0/10

  • - receive in system on procurement task T237571
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, private1-row-vlan)
    • end on-site specific steps
  • - production dns entries added (private subnet for its row)
  • - operations/puppet update (install_server at minimum, other files if possible):
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

elastic2057: Row B rack B2 xe-2/0/11

  • - receive in system on procurement task T237571
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, private1-row-vlan)
    • end on-site specific steps
  • - production dns entries added (private subnet for its row)
  • - operations/puppet update (install_server at minimum, other files if possible):
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

elastic2058: Row B rack B4 xe-4/0/7

  • - receive in system on procurement task T237571
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, private1-row-vlan)
    • end on-site specific steps
  • - production dns entries added (private subnet for its row)
  • - operations/puppet update (install_server at minimum, other files if possible):
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

elastic2059: Row C rack C7 xe-7/0/12

  • - receive in system on procurement task T237571
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, private1-row-vlan)
    • end on-site specific steps
  • - production dns entries added (private subnet for its row)
  • - operations/puppet update (install_server at minimum, other files if possible):
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

elastic2060: Row D rack D7 xe-7/0/13

  • - receive in system on procurement task T237571
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, private1-row-vlan)
    • end on-site specific steps
  • - production dns entries added (private subnet for its row)
  • - operations/puppet update (install_server at minimum, other files if possible):
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox
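
Each checklist header above follows the same shape: hostname, row, rack, and switch port. A minimal sketch of parsing those headers is shown below. `HEADER_RE` and `parse_header` are hypothetical names, and splitting the port as `xe-<fpc>/<pic>/<port>` assumes the usual Juniper interface naming. In the six headers here the first number of the port happens to match the rack number, which the sketch exposes but does not rely on.

```python
import re

# Hypothetical pattern for headers such as "elastic2055: Row A rack A2 xe-2/0/12".
HEADER_RE = re.compile(
    r"^(?P<host>elastic\d{4}): Row (?P<row>[A-D]) rack (?P<rack>[A-D]\d+) "
    r"(?P<port>xe-(?P<fpc>\d+)/(?P<pic>\d+)/(?P<num>\d+))$"
)


def parse_header(line):
    """Split one checklist header into host, row, rack, and port fields."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised checklist header: {line!r}")
    return match.groupdict()


info = parse_header("elastic2055: Row A rack A2 xe-2/0/12")
print(info["host"], info["rack"], info["port"])
```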

Event Timeline

Restricted Application added a subscriber: Aklapper.
Papaul triaged this task as Medium priority. Dec 22 2019, 11:33 PM
Papaul added a subscriber: Mathew.onipe.

@Gehel you mentioned in the procurement task: "but we won't be able to use the 10G until the whole cluster is upgraded". How close are you to doing this? What is blocking us from putting the 6 new servers directly into a 10G rack now, instead of putting them into a 1G rack and moving them to 10G later?

Thanks.

wiki_willy renamed this task from codfw: rack/setup/install elastic20{55,56,57,58,59,60}.wikimedia.org to (No Need By Date Provided) codfw: rack/setup/install elastic20{55,56,57,58,59,60}.wikimedia.org. Jan 3 2020, 6:53 PM

In terms of "how close", I would have to check, but it is probably 2 years' worth of server replacements before the full cluster is 10G.

Nothing is blocking putting them into 10G racks now. The context here is that eqiad didn't have enough 10G rack space, so some servers had to be put in 1G racks. This is acceptable since the servers won't actually use 10G today. If there is 10G space available in codfw, it is likely preferable to rack them there.

Change 566576 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/dns@master] DNS: Add mgmt and production DNS for elastic205[5-9] and elastic2060

https://gerrit.wikimedia.org/r/566576

Change 566576 abandoned by Papaul:
DNS: Add mgmt and production DNS for elastic205[5-9] and elastic2060

https://gerrit.wikimedia.org/r/566576

@EBernhardson @Gehel are we using the public VLAN for these servers? Just double-checking, since the other elastic servers are in the private VLAN.

Change 566623 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/dns@master] DNS: Add mgmt DNS for elastic205[5-9],elastic2060

https://gerrit.wikimedia.org/r/566623

Change 566623 merged by Dzahn:
[operations/dns@master] DNS: Add mgmt DNS for elastic205[5-9],elastic2060

https://gerrit.wikimedia.org/r/566623

Talked to @Gehel on IRC: those servers will be in the private VLAN, not the public VLAN, with Stretch as the OS.

Papaul renamed this task from (No Need By Date Provided) codfw: rack/setup/install elastic20{55,56,57,58,59,60}.wikimedia.org to (No Need By Date Provided) codfw: rack/setup/install elastic20{55,56,57,58,59,60}.codfw.wmnet. Feb 13 2020, 11:50 PM

Change 572120 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/dns@master] DNS: Add production DNS elastic2055 to elastic2060

https://gerrit.wikimedia.org/r/572120

Change 572120 merged by Papaul:
[operations/dns@master] DNS: Add production DNS elastic2055 to elastic2060

https://gerrit.wikimedia.org/r/572120

Change 572128 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/puppet@production] DHCP: Add MAC address for elastic2055 to elastic2060

https://gerrit.wikimedia.org/r/572128

Change 572128 merged by Papaul:
[operations/puppet@production] DHCP: Add MAC address for elastic2055 to elastic2060

https://gerrit.wikimedia.org/r/572128

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

elastic2055.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202002140204_pt1979_533_elastic2055_codfw_wmnet.log.

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

elastic2056.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202002140207_pt1979_1054_elastic2056_codfw_wmnet.log.

Completed auto-reimage of hosts:

['elastic2055.codfw.wmnet']

Of which those FAILED:

['elastic2055.codfw.wmnet']

Completed auto-reimage of hosts:

['elastic2056.codfw.wmnet']

Of which those FAILED:

['elastic2056.codfw.wmnet']

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

elastic2057.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202002181543_pt1979_23146_elastic2057_codfw_wmnet.log.

Completed auto-reimage of hosts:

['elastic2057.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

elastic2058.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202002182256_pt1979_1154_elastic2058_codfw_wmnet.log.

Completed auto-reimage of hosts:

['elastic2058.codfw.wmnet']

and were ALL successful.

Change 573020 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/puppet@production] DHCP: Change MAC address of elastic205[5-6], elastic2059 and elastic2060 from 1G MAC to 10G MAC

https://gerrit.wikimedia.org/r/573020

Change 573020 merged by Dzahn:
[operations/puppet@production] DHCP: Change MAC address of some elastic servers to 10G interface

https://gerrit.wikimedia.org/r/573020

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

elastic2055.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202002190030_pt1979_18701_elastic2055_codfw_wmnet.log.

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

elastic2056.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202002190031_pt1979_18996_elastic2056_codfw_wmnet.log.

Completed auto-reimage of hosts:

['elastic2055.codfw.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['elastic2056.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

elastic2059.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202002190109_pt1979_27068_elastic2059_codfw_wmnet.log.

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

elastic2060.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202002190117_pt1979_29506_elastic2060_codfw_wmnet.log.

Completed auto-reimage of hosts:

['elastic2059.codfw.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['elastic2060.codfw.wmnet']

and were ALL successful.
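
The reimage log paths quoted above share one naming pattern, which makes them easy to index when reconstructing the order of runs. The sketch below is a rough illustration only: `parse_reimage_log` is a hypothetical helper, the meaning of the third field (called `number` here) is not stated in this task, and the split assumes usernames contain no underscores.

```python
from datetime import datetime
from pathlib import PurePosixPath


def parse_reimage_log(path):
    """Split a wmf-auto-reimage log path into timestamp, user, number, and host."""
    stem = PurePosixPath(path).stem  # drop the directory and the ".log" suffix
    stamp, user, number, *host_parts = stem.split("_")
    return {
        "started": datetime.strptime(stamp, "%Y%m%d%H%M"),
        "user": user,
        "number": int(number),  # assumption: purpose of this field is unknown
        "host": ".".join(host_parts),  # underscores stand in for dots in the FQDN
    }


entry = parse_reimage_log(
    "/var/log/wmf-auto-reimage/202002140204_pt1979_533_elastic2055_codfw_wmnet.log"
)
print(entry["host"], entry["started"].isoformat())
```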

Papaul updated the task description.

@Gehel All yours; let me know if you have any questions.

wiki_willy renamed this task from (No Need By Date Provided) codfw: rack/setup/install elastic20{55,56,57,58,59,60}.codfw.wmnet to (Need by: TBD) codfw: rack/setup/install elastic20{55,56,57,58,59,60}.codfw.wmnet. Feb 26 2020, 1:43 AM

@Gehel can you please create another task for service implementation and resolve this task?

Thanks