(Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet
Closed, Resolved (Public)

Description

This task will track the racking, setup, and installation of 6 new external store servers for eqiad (T235659).

Shared Info

Hostnames: From es1020.eqiad.wmnet to es1025.eqiad.wmnet
Racking Proposal:
es1020.eqiad.wmnet A3
es1021.eqiad.wmnet B3
es1022.eqiad.wmnet C5
es1023.eqiad.wmnet D6
es1024.eqiad.wmnet A5
es1025.eqiad.wmnet B5

Networking/Subnet/VLAN/IP: 1G private vlan, same vlan as the normal databases.
Partitioning/Raid: RAID10 strip size 256k (@Marostegui will take care of adding them to the correct recipe on puppet)

Individual Server Checklists

es1020: Row A rack A3

  • - receive in system on procurement task T235659
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - RAID 10 strip size 256k
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, private1-row-vlan)
    • end on-site specific steps
  • - production dns entries added (private subnet for its row)
  • - operations/puppet update (install_server at minimum, other files if possible):
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

es1021: Row B rack B3

  • - receive in system on procurement task T235659
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - RAID 10 strip size 256k
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, private1-row-vlan)
    • end on-site specific steps
  • - production dns entries added (private subnet for its row)
  • - operations/puppet update (install_server at minimum, other files if possible):
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

es1022: Row C rack C5

  • - receive in system on procurement task T235659
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - RAID 10 strip size 256k
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, private1-row-vlan)
    • end on-site specific steps
  • - production dns entries added (private subnet for its row)
  • - operations/puppet update (install_server at minimum, other files if possible):
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

es1023: Row D rack D6

  • - receive in system on procurement task T235659
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - RAID 10 strip size 256k
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, private1-row-vlan)
    • end on-site specific steps
  • - production dns entries added (private subnet for its row)
  • - operations/puppet update (install_server at minimum, other files if possible):
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

es1024: Row A rack A5

  • - receive in system on procurement task T235659
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - RAID 10 strip size 256k
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, private1-row-vlan)
    • end on-site specific steps
  • - production dns entries added (private subnet for its row)
  • - operations/puppet update (install_server at minimum, other files if possible):
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

es1025: Row B rack B5

  • - receive in system on procurement task T235659
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - RAID 10 strip size 256k
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, private1-row-vlan)
    • end on-site specific steps
  • - production dns entries added (private subnet for its row)
  • - operations/puppet update (install_server at minimum, other files if possible):
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

Event Timeline

Task has been created. @Marostegui will take care of creating the racking strategy in January, but in case you are in a hurry (we are not): the hosts should be distributed among rows (as much as reasonable) for redundancy, with no two of them in the same rack and, as far as possible and reasonable, no other es10* hosts in the same rack.

I believe I already defined the racking proposal at T235659

wiki_willy added a parent task: Unknown Object (Task).
wiki_willy moved this task from Backlog to Racking Tasks on the ops-eqiad board.

Partitioning recipe was also already defined in puppet some days ago so we are good on that front too.

Thanks guys

wiki_willy renamed this task from eqiad: rack/setup/install es102[0-5].eqiad.wmnet to (No Need By Date) eqiad: rack/setup/install es102[0-5].eqiad.wmnet. (Jan 2 2020, 11:36 PM)
Marostegui renamed this task from (No Need By Date) eqiad: rack/setup/install es102[0-5].eqiad.wmnet to (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet. (Jan 3 2020, 6:27 AM)

You guys think this will be ready by 31st Jan?
Thanks.

When adding the MAC addresses to the DHCP file, make sure to add the following line:

option pxelinux.pathprefix "http://apt.wikimedia.org/tftpboot/stretch-installer-bootif/";

Example of how it is done in codfw:

host es2023 {
    hardware ethernet B0:26:28:F5:16:C8;
    fixed-address es2023.codfw.wmnet;
    option pxelinux.pathprefix "http://apt.wikimedia.org/tftpboot/stretch-installer-bootif/";
}
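As a rough illustration of the stanza above, the following Python sketch renders one such host entry per new server. The function name `dhcp_stanza` and the MAC addresses are placeholders invented for this example (the real MACs are not listed in this task), and the actual dhcpd file in operations/puppet may be laid out differently.

```python
# Path prefix quoted in the comment above.
PXE_PREFIX = "http://apt.wikimedia.org/tftpboot/stretch-installer-bootif/"


def dhcp_stanza(host: str, mac: str, site: str = "eqiad") -> str:
    """Render one dhcpd host entry in the same layout as the es2023 example."""
    return (
        f"host {host} {{\n"
        f"    hardware ethernet {mac};\n"
        f"    fixed-address {host}.{site}.wmnet;\n"
        f'    option pxelinux.pathprefix "{PXE_PREFIX}";\n'
        f"}}"
    )


# Placeholder MAC addresses, NOT the real ones for es102[0-5].
macs = {
    f"es{number}": f"00:00:5E:00:53:{index:02X}"
    for index, number in enumerate(range(1020, 1026))
}

stanzas = [dhcp_stanza(host, mac) for host, mac in macs.items()]
print("\n\n".join(stanzas))
```

Calling `dhcp_stanza("es2023", "B0:26:28:F5:16:C8", site="codfw")` reproduces the codfw example shown above.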

Any ETA on when we can expect these hosts to be ready?

@Marostegui I do not know yet, there are several racking/setup tasks that I am trying to get through. I need to check with @Jclark-ctr and see if they're even in racks yet.

If it helps, I can do the OS installation myself if DC-Ops do the switches, RAIDs, DNS and send me the MAC addresses.
Thanks!

These hosts are not in racks yet. I can rack them today, but I do not have IPs yet, so I cannot set them up. @Cmjohnson, if you can add the IPs to this ticket I can configure the hosts. Once I get those, the turnaround should be quick.

I see that the racking recommendation spans both 10G and 1G racks, @Marostegui, and A1 is a network rack.

Would these racks work for you?
es1020.eqiad.wmnet A3
es1021.eqiad.wmnet B3
es1022.eqiad.wmnet C5
es1023.eqiad.wmnet D6
es1024.eqiad.wmnet A5
es1025.eqiad.wmnet B5

Those racks work for me.
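The redundancy constraints stated earlier (spread across rows, no two of these hosts in the same rack) can be checked against the agreed rack list with a small Python sketch. It assumes the row is the first letter of the rack name, as in the checklists (e.g. "Row A rack A3"); the helper names are made up for this example, and it does not check for other es10* hosts already in those racks.

```python
from collections import Counter

# Rack assignments agreed in this task.
proposal = {
    "es1020": "A3",
    "es1021": "B3",
    "es1022": "C5",
    "es1023": "D6",
    "es1024": "A5",
    "es1025": "B5",
}


def shared_racks(plan):
    """Return racks that would hold more than one of the new hosts."""
    counts = Counter(plan.values())
    return sorted(rack for rack, count in counts.items() if count > 1)


def hosts_per_row(plan):
    """Count new hosts per row (assumption: row == first letter of the rack)."""
    return dict(sorted(Counter(rack[0] for rack in plan.values()).items()))


print(shared_racks(proposal))   # []
print(hosts_per_row(proposal))  # {'A': 2, 'B': 2, 'C': 1, 'D': 1}
```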

These hosts have 1G and 10G interfaces, but we will only be using 1G for the time being.

Change 568577 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] adding mgmt dns for es1019-1025

https://gerrit.wikimedia.org/r/568577

Change 568577 merged by Cmjohnson:
[operations/dns@master] adding mgmt dns for es1019-1025

https://gerrit.wikimedia.org/r/568577

updated mgmt dns

+es1020 1H IN A 10.65.4.144
+es1021 1H IN A 10.65.4.145
+es1022 1H IN A 10.65.4.146
+es1023 1H IN A 10.65.4.147
+es1024 1H IN A 10.65.4.148
+es1025 1H IN A 10.65.4.149
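For illustration only, the six records above follow a simple pattern (consecutive hostnames mapped to consecutive addresses), which a short Python sketch can reproduce. The helper `mgmt_records` is hypothetical and is not how change 568577 was produced; it merely regenerates the lines quoted in this comment.

```python
import ipaddress


def mgmt_records(first_host, first_ip, count, ttl="1H"):
    """Build zone-file A records for consecutive hosts and consecutive IPs."""
    base = ipaddress.IPv4Address(first_ip)
    return [
        f"es{first_host + offset} {ttl} IN A {base + offset}"
        for offset in range(count)
    ]


records = mgmt_records(1020, "10.65.4.144", 6)
print("\n".join(records))
```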

@Jclark-ctr @wiki_willy @Cmjohnson any rough estimation on when we can expect these hosts to be online?
As I said, I am happy to do the OS installation myself once you've done the racking/BIOS/RAID + iDRAC + switches steps. Just send me the MAC addresses and I will take care of the puppet changes and OS installs.
Reminder: RAID10 (256k strip size)

Thanks!

Below are the switch ports for the hosts. They are racked and cabled, and netbox is updated. Handing over to Chris to configure BIOS/RAID.

es1020.eqiad.wmnet A3; 35
es1021.eqiad.wmnet B3; 16
es1022.eqiad.wmnet C5; 4
es1023.eqiad.wmnet D6; 28
es1024.eqiad.wmnet A5; 34
es1025.eqiad.wmnet B5; 26

Change 572327 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Adding production dns for new es servers

https://gerrit.wikimedia.org/r/572327

DNS has been updated; I still need to verify the network ports. I have a conflict with es1025: that port shows as utilized (@Jclark-ctr).

Change 572327 merged by Cmjohnson:
[operations/dns@master] Adding production dns for new es servers

https://gerrit.wikimedia.org/r/572327

Change 572953 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/puppet@production] Adding macs for es102[0-5] to dhcpd file

https://gerrit.wikimedia.org/r/572953

Change 572953 merged by Cmjohnson:
[operations/puppet@production] Adding macs for es102[0-5] to dhcpd file

https://gerrit.wikimedia.org/r/572953

Change 572970 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/puppet@production] updating dhpd file for es102[0-5]. used the wrong eth port

https://gerrit.wikimedia.org/r/572970

Change 572970 merged by Cmjohnson:
[operations/puppet@production] updating dhpd file for es102[0-5]. used the wrong eth port

https://gerrit.wikimedia.org/r/572970

Change 573069 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] install_server: Pass bootif installer to new ES hosts

https://gerrit.wikimedia.org/r/573069

Change 573069 merged by Marostegui:
[operations/puppet@production] install_server: Pass bootif installer to new ES hosts

https://gerrit.wikimedia.org/r/573069

Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts:

['es1020.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202002190823_marostegui_56932.log.

Completed auto-reimage of hosts:

['es1020.eqiad.wmnet']

Of which those FAILED:

['es1020.eqiad.wmnet']

This is due to the new Debian 9.12 point release, Moritz will rebuild the image.

Mentioned in SAL (#wikimedia-operations) [2020-02-19T10:09:57Z] <moritzm> updated tftpboot environment for stretch-bootif for the 9.12 point release T241359

Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts:

['es1020.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202002191011_marostegui_76403.log.

Completed auto-reimage of hosts:

['es1020.eqiad.wmnet']

and were ALL successful.

es1020 installed correctly: RAID10, 256k strip size, BBU and cache policy, and the right disk space, memory and CPUs.

Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts:

['es1021.eqiad.wmnet', 'es1022.eqiad.wmnet', 'es1023.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202002191045_marostegui_82897.log.

Completed auto-reimage of hosts:

['es1022.eqiad.wmnet', 'es1023.eqiad.wmnet', 'es1021.eqiad.wmnet']

and were ALL successful.

RAID10, 256k strip size, BBU and cache policy, disk space, memory and CPUs all look good.

Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts:

['es1024.eqiad.wmnet', 'es1025.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202002191128_marostegui_94337.log.

Completed auto-reimage of hosts:

['es1024.eqiad.wmnet']

Of which those FAILED:

['es1024.eqiad.wmnet']

Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts:

['es1024.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202002191402_marostegui_125718.log.

es1025: RAID10, 256k strip size, BBU and cache policy, disk space, memory and CPUs all look good.

@Cmjohnson can you double check es1024's link?
It cannot PXE boot:

Booting from PXE Device 1: Integrated NIC 1 Port 1 Partition 1
PXE: No media detected.
Boot Failed: PXE Device 1: Integrated NIC 1 Port 1 Partition 1

Booting from PXE Device 1: Integrated NIC 1 Port 1 Partition 1
PXE: No media detected.
Boot Failed: PXE Device 1: Integrated NIC 1 Port 1 Partition 1

No boot device available or Operating System detected.
Please ensure a compatible bootable media is available.

Available Actions:
F1 to Continue and Retry Boot Order
F2 for System Setup (BIOS)
F10 for Lifecycle Controller
- Enable/Configure iDRAC
- Update or Backup/Restore Server Firmware
- Help Install an Operating System
F11 for Boot Manager

Completed auto-reimage of hosts:

['es1024.eqiad.wmnet']

and were ALL successful.

es1024: RAID10, 256k strip size, BBU and cache policy, disk space, memory and CPUs all look good.

Marostegui updated the task description.

All hosts have been installed successfully.
Thanks!