(Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet
Closed, Resolved (Public)

Description

This task will track the racking, setup, and OS installation of es10[26-34].eqiad.wmnet

Hostname / Racking / Installation Details

Hostnames: es1026, es1027, es1028, es1029, es1030, es1031, es1032, es1033, es1034
Racking Proposal:
2 hosts at A1
2 hosts at B1
2 hosts at C3
2 hosts at D8
1 host at A3

Networking/Subnet/VLAN/IP: 1G. Same internal VLAN as the rest of the existing es hosts (es1019, for instance)
Partitioning/Raid: RAID10 strip size 256k (@Marostegui will take care of adding them to the correct recipe on puppet)
OS Distro: Buster
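
The hostname range and racking proposal above can be cross-checked with a short script. This is only a minimal sketch and not part of any existing tooling: the helper name `expand_range` and the `racking` dictionary are hypothetical, and the numbers simply mirror what is listed in this task description.

```python
import re


def expand_range(pattern: str) -> list[str]:
    """Expand a pattern such as 'es10[26-34]' into individual hostnames."""
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", pattern)
    if match is None:
        # No bracketed range present: return the name unchanged.
        return [pattern]
    prefix, start, end, suffix = match.groups()
    width = len(start)  # keep zero padding if the range uses it
    return [
        f"{prefix}{n:0{width}d}{suffix}"
        for n in range(int(start), int(end) + 1)
    ]


# Racking proposal copied from the task description (rack -> host count).
racking = {"A1": 2, "B1": 2, "C3": 2, "D8": 2, "A3": 1}

hosts = expand_range("es10[26-34].eqiad.wmnet")
print(hosts)

# Every host should have a slot in the racking proposal.
print("racking covers all hosts:", sum(racking.values()) == len(hosts))
```

If the proposal changes (as it did later in this task, from A2 to A1 and from C2 to C3), only the `racking` dictionary needs updating.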

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

es1026:

  • - receive in system on procurement task T257785 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible) https://gerrit.wikimedia.org/r/c/operations/puppet/+/620881
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

es1027:

  • - receive in system on procurement task T257785 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible) https://gerrit.wikimedia.org/r/c/operations/puppet/+/620881
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

es1028:

  • - receive in system on procurement task T257785 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible) https://gerrit.wikimedia.org/r/c/operations/puppet/+/620881
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

es1029:

  • - receive in system on procurement task T257785 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible) https://gerrit.wikimedia.org/r/c/operations/puppet/+/620881
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

es1030:

  • - receive in system on procurement task T257785 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible) https://gerrit.wikimedia.org/r/c/operations/puppet/+/620881
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

es1031:

  • - receive in system on procurement task T257785 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible) https://gerrit.wikimedia.org/r/c/operations/puppet/+/620881
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

es1032:

  • - receive in system on procurement task T257785 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible) https://gerrit.wikimedia.org/r/c/operations/puppet/+/620881
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

es1033:

  • - receive in system on procurement task T257785 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible) https://gerrit.wikimedia.org/r/c/operations/puppet/+/620881
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

es1034:

  • - receive in system on procurement task T257785 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible) https://gerrit.wikimedia.org/r/c/operations/puppet/+/620881
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

Once the system(s) above have had all checkbox steps completed, this task can be resolved.

Event Timeline

RobH added a parent task: Unknown Object (Task). Aug 13 2020, 4:24 PM
RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
RobH moved this task from Backlog to Acknowledged on the SRE board.
RobH removed a subscriber: RobH.

Change 620881 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Allow install new es hosts in eqiad/codfw

https://gerrit.wikimedia.org/r/620881

Change 620881 merged by Marostegui:
[operations/puppet@production] mariadb: Allow install new es hosts in eqiad/codfw

https://gerrit.wikimedia.org/r/620881

I have merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/620881 so the hosts will get installed with RAID10, notifications disabled and spare role.
Pending from DC-Ops are the usual DNS and DHCP commits.

@wiki_willy do you think it is feasible to have these hosts racked&installed by 30th Oct?

Hi @Marostegui - I think that should be doable. During my sync up with @Cmjohnson and @RobH tomorrow, we'll discuss and see if we can get a solid ETA for you. Thanks, Willy

wiki_willy added a subscriber: Jclark-ctr.

Confirmed with @Cmjohnson and @RobH today, that these es1026-1034 hosts will be ready for you by end of October. Thanks, Willy

@Jclark-ctr I see that the task has been resolved in coupa but I don't see the servers anywhere and they're not in netbox. Where are they?

@Cmjohnson - I just asked John and he says these are in shipping. Thanks Willy

Change 636467 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/puppet@production] Add new es servers to site.pp setup role and mac addresses to dhcp

https://gerrit.wikimedia.org/r/636467

Change 636467 merged by Cmjohnson:
[operations/puppet@production] Add new es servers to site.pp setup role and mac addresses to dhcp

https://gerrit.wikimedia.org/r/636467

Per my chat with Chris, updating the rack location from A2 to A1 and from C2 to C3

Cmjohnson updated the task description.
Cmjohnson added a subscriber: RobH.

@RobH These still need the raid setup, you mentioned you could do that. If not please let me know and I will take care of it. Other than that they are ready for install.

Change 636972 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/puppet@production] updating site.pp entry for new ES servers in eqiad

https://gerrit.wikimedia.org/r/636972

Change 636972 merged by Cmjohnson:
[operations/puppet@production] updating site.pp entry for new ES servers in eqiad

https://gerrit.wikimedia.org/r/636972

Change 636974 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] site.pp: Remove duplicate external store entries.

https://gerrit.wikimedia.org/r/636974

Change 636974 merged by Marostegui:
[operations/puppet@production] site.pp: Remove duplicate external store entries.

https://gerrit.wikimedia.org/r/636974

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

es1026.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202010301901_robh_31694_es1026_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['es1026.eqiad.wmnet']

Of which those FAILED:

['es1026.eqiad.wmnet']

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

es1026.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202010301950_robh_13320_es1026_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['es1026.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

['es1027.eqiad.wmnet', 'es1028.eqiad.wmnet', 'es1029.eqiad.wmnet', 'es1030.eqiad.wmnet', 'es1031.eqiad.wmnet', 'es1032.eqiad.wmnet', 'es1033.eqiad.wmnet', 'es1034.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202010302017_robh_8109.log.

Completed auto-reimage of hosts:

['es1027.eqiad.wmnet', 'es1029.eqiad.wmnet', 'es1028.eqiad.wmnet', 'es1033.eqiad.wmnet', 'es1031.eqiad.wmnet', 'es1030.eqiad.wmnet', 'es1032.eqiad.wmnet', 'es1034.eqiad.wmnet']

and were ALL successful.

RobH updated the task description.

All installations complete and hosts are calling into puppet. They've all been set to staged in netbox, and the DBA team can set them to active when they deploy them to service.

RobH removed RobH as the assignee of this task. Oct 30 2020, 8:51 PM

Thanks @Cmjohnson and @RobH for prioritizing this one. Nice work getting it turned over in time. Thanks, Willy

es1032 has RAID0 instead of RAID10.
Can we get that one re-done with RAID10 and strip size 256?

Thanks!

root@es1032:~# megacli -LDPDInfo -aAll | grep "RAID Level"
RAID Level          : Primary-0, Secondary-0, RAID Level Qualifier-0
root@es1032:~# pvs
  PV         VG   Fmt  Attr PSize  PFree
  /dev/sda3  tank lvm2 a--  21.78t <12.69t

All the other hosts look good from a disk, memory, strip size, and CPU point of view.
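
The check above greps the "RAID Level" line out of the megacli output and reads the Primary value by eye. A minimal sketch of doing the same thing programmatically follows. It assumes the line format is exactly what is pasted in this task, and `parse_raid_level` is a hypothetical helper name. Note that the Primary value alone probably cannot tell RAID1 from RAID10; that would need further fields that are not shown here.

```python
import re

# Matches the "RAID Level" line as it appears in the megacli output above.
RAID_LINE = re.compile(
    r"RAID Level\s*:\s*Primary-(\d+),\s*Secondary-(\d+),"
    r"\s*RAID Level Qualifier-(\d+)"
)


def parse_raid_level(line: str) -> tuple[int, int, int]:
    """Return (primary, secondary, qualifier) from a 'RAID Level' line."""
    match = RAID_LINE.search(line)
    if match is None:
        raise ValueError(f"not a RAID Level line: {line!r}")
    primary, secondary, qualifier = (int(g) for g in match.groups())
    return primary, secondary, qualifier


line = "RAID Level          : Primary-0, Secondary-0, RAID Level Qualifier-0"
primary, secondary, qualifier = parse_raid_level(line)

if primary == 0:
    # Primary-0 is what es1032 reported while it was wrongly set to RAID0.
    print("striping only, no redundancy: virtual disk needs to be rebuilt")
else:
    print("primary level", primary, "reported: check span layout as well")
```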

Apologies! I set them ALL wrong and I thought I had fixed them.

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

es1032.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202011030007_robh_23724_es1032_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['es1032.eqiad.wmnet']

Of which those FAILED:

['es1032.eqiad.wmnet']

This gets to the Debian loader and halts on 'Probing EDD', which caused no issues on the other hosts. I'm still investigating what is different about this host compared to the other es hosts in this batch.

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

es1032.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202011030044_robh_30404_es1032_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['es1032.eqiad.wmnet']

Of which those FAILED:

['es1032.eqiad.wmnet']

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

es1032.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202011030045_robh_31476_es1032_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['es1032.eqiad.wmnet']

and were ALL successful.

Done!

Thanks Rob, es1032 looks good now:

Name                :Virtual Disk 0
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
Size                : 10.913 TB
Sector Size         : 512
Is VD emulated      : No
Mirror Data         : 10.913 TB

root@es1032:~# pvs
  PV         VG   Fmt  Attr PSize   PFree
  /dev/sda3  tank lvm2 a--  <10.87t 1.77t
root@es1032:~# df -hT /srv
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   9.1T   11G  9.1T   1% /srv
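
As a quick sanity check, the physical volume sizes reported before and after the rebuild are consistent with mirroring halving the usable capacity. The sketch below only redoes that arithmetic with the two `pvs` figures from this task; the 2% tolerance is an arbitrary assumption to allow for rounding and partition overhead.

```python
# PV sizes for /dev/sda3 on es1032 as printed by `pvs` in this task.
pv_before = 21.78  # while the virtual disk was RAID0
pv_after = 10.87   # after the rebuild (pvs printed "<10.87t")

ratio = pv_after / pv_before
print(f"usable capacity after rebuild / before: {ratio:.3f}")

# Mirroring stores every block twice, so roughly half is expected.
# The tolerance is an assumption, not a documented threshold.
TOLERANCE = 0.02
looks_mirrored = abs(ratio - 0.5) < TOLERANCE
print("consistent with a mirrored layout:", looks_mirrored)
```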