(Needed By 31st January) codfw: rack/setup/install es202[0-5].codfw.wmnet
Closed, ResolvedPublic

Description

This task will track the racking, setup, and installation of 6 new external store servers for codfw.

Shared Info

Hostnames: From es2020.codfw.wmnet to es2025.codfw.wmnet
Racking Proposal:
es2020.codfw.wmnet A3?
es2021.codfw.wmnet B3
es2022.codfw.wmnet C6?
es2023.codfw.wmnet D3
es2024.codfw.wmnet A6?
es2025.codfw.wmnet B8
Networking/Subnet/VLAN/IP: 1G private vlan, same vlan as the normal databases.
Partitioning/Raid: RAID10, strip size 256k (@Marostegui will take care of adding them to the correct recipe in puppet)

Individual Server Checklists

es2020: row A rack A3/ ge-3/0/27

  • - receive in system on procurement task T235820
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, private1-row-vlan)
    • end on-site specific steps
  • - production dns entries added (private subnet for its row)
  • - operations/puppet update (install_server at minimum, other files if possible):
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

es2021: row B rack B3/ ge-3/0/20

  • - receive in system on procurement task T235820
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, private1-row-vlan)
    • end on-site specific steps
  • - production dns entries added (private subnet for its row)
  • - operations/puppet update (install_server at minimum, other files if possible):
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

es2022: row C rack C6/ ge-6/0/5

  • - receive in system on procurement task T235820
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, private1-row-vlan)
    • end on-site specific steps
  • - production dns entries added (private subnet for its row)
  • - operations/puppet update (install_server at minimum, other files if possible):
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

es2023: row D rack D3/ ge-3/0/15

  • - receive in system on procurement task T235820
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, private1-row-vlan)
    • end on-site specific steps
  • - production dns entries added (private subnet for its row)
  • - operations/puppet update (install_server at minimum, other files if possible):
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

es2024: row A rack A6/ ge-6/0/12

  • - receive in system on procurement task T235820
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, private1-row-vlan)
    • end on-site specific steps
  • - production dns entries added (private subnet for its row)
  • - operations/puppet update (install_server at minimum, other files if possible):
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

es2025: row B rack B8/ ge-8/0/3

  • - receive in system on procurement task T235820
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, private1-row-vlan)
    • end on-site specific steps
  • - production dns entries added (private subnet for its row)
  • - operations/puppet update (install_server at minimum, other files if possible):
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

Event Timeline

@Marostegui in the racking proposal, you requested that es2020 be racked in A2 and es2022 in C2. A2 and C2 are 10G racks, so we need to move those 2 servers into 1G racks.

Thanks

Papaul triaged this task as Medium priority. Dec 22 2019, 11:17 PM
Papaul moved this task from Backlog to Racking Tasks on the ops-codfw board.

@Papaul sorry, he will be away. I don't have a clear solution, but I believe any other place in the same row, as long as it is not in the same rack as the others, will do.

Please note the servers are called es20XX, NOT ex20XX (External Store). I will edit the summary with an alternative proposal and the fixed names.

Correct, any other place within the same row and if possible avoiding colliding with any other es20XX host in the same rack would be ideal.

@jcrespo we will have to move es2024: Row A rack A4 too since A4 is also a 10G rack.

Thanks.

Anywhere within row A is fine as long as they are in separate racks.

Btw, the partitioning recipe was already defined in puppet a few days ago, so that is also good to go.

Jaime's proposal of A3, C6 and A5 looks good to me (thanks!)
@Papaul let us know if that also works for you.
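
The two constraints discussed above (1G racks only, and no two es20XX hosts sharing a rack) can be sketched as a small check. This is only an illustration: the function name and data layout are hypothetical, the 10G rack list contains just the racks named in this thread (A2, C2, A4), and the placement is the one recorded in the per-server checklists of this task.

```python
from collections import Counter

# Racks called out as 10G in this thread (assumption: not a complete list).
TEN_G_RACKS = {"A2", "A4", "C2"}

# Placement as recorded in the per-server checklists of this task.
PLACEMENT = {
    "es2020": "A3",
    "es2021": "B3",
    "es2022": "C6",
    "es2023": "D3",
    "es2024": "A6",
    "es2025": "B8",
}


def check_placement(placement, ten_g_racks):
    """Return a list of problems; an empty list means the plan passes."""
    problems = []
    # Rule 1: these hosts use 1G links, so 10G racks are not usable.
    for host, rack in sorted(placement.items()):
        if rack in ten_g_racks:
            problems.append(f"{host}: rack {rack} is a 10G rack")
    # Rule 2: no two es hosts should share a rack.
    rack_counts = Counter(placement.values())
    for rack, count in sorted(rack_counts.items()):
        if count > 1:
            hosts = sorted(h for h, r in placement.items() if r == rack)
            problems.append(f"rack {rack} holds several es hosts: {', '.join(hosts)}")
    return problems


if __name__ == "__main__":
    for line in check_placement(PLACEMENT, TEN_G_RACKS) or ["placement OK"]:
        print(line)
```

Feeding it the original proposal (es2020 in A2, es2022 in C2) would report both hosts as sitting in 10G racks.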

wiki_willy renamed this task from codfw: rack/setup/install es202[0-5].codfw.wmnet to (No Need By Date Provided) codfw: rack/setup/install es202[0-5].codfw.wmnet. Jan 2 2020, 11:41 PM
Marostegui renamed this task from (No Need By Date Provided) codfw: rack/setup/install es202[0-5].codfw.wmnet to (Needed By 31st January) codfw: rack/setup/install es202[0-5].codfw.wmnet. Jan 3 2020, 6:26 AM

Change 561750 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] site.pp: Add new external store hosts as spare

https://gerrit.wikimedia.org/r/561750

Change 561750 merged by Marostegui:
[operations/puppet@production] site.pp: Add new external store hosts as spare

https://gerrit.wikimedia.org/r/561750

@Papaul - a reminder that the RAID10 should use a 256K stripe (the reminder is only because we recently found some old servers with the RAID stripe set to 64K)

Marostegui added a parent task: Unknown Object (Task). Jan 9 2020, 2:20 PM

Change 563323 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/dns@master] DNS: Add mgmt and production DNS for es202[0-5]

https://gerrit.wikimedia.org/r/563323

es2021 has some issues; I will look into them later:

Critical Thu 09 Jan 2020 23:24:33 System BIOS has halted.
Normal Thu 09 Jan 2020 23:22:14 An OEM diagnostic event occurred.
Normal Thu 09 Jan 2020 23:22:13 An OEM diagnostic event occurred.
Critical Thu 09 Jan 2020 23:22:13 System BIOS has halted.
Normal Thu 09 Jan 2020 23:22:13 An OEM diagnostic event occurred.
Normal Thu 09 Jan 2020 23:22:13 An OEM diagnostic event occurred.
Normal Thu 09 Jan 2020 23:22:13 An OEM diagnostic event occurred.
Normal Thu 09 Jan 2020 23:22:13 An OEM diagnostic event occurred.
Normal Thu 09 Jan 2020 23:22:13 An OEM diagnostic event occurred.
Normal Thu 09 Jan 2020 23:22:13 An OEM diagnostic event occurred.
Normal Thu 09 Jan 2020 23:22:13 An OEM diagnostic event occurred.
Normal Thu 09 Jan 2020 23:22:12 An OEM diagnostic event occurred.
Normal Thu 09 Jan 2020 23:22:12 An OEM diagnostic event occurred.
Normal Thu 09 Jan 2020 23:22:12 An OEM diagnostic event occurred.
Normal Thu 09 Jan 2020 23:22:12 An OEM diagnostic event occurred.
Normal Thu 09 Jan 2020 23:22:12 An OEM diagnostic event occurred.
Normal Thu 09 Jan 2020 23:22:12 An OEM diagnostic event occurred.
Normal Thu 09 Jan 2020 23:22:12 An OEM diagnostic event occurred.
Normal Thu 09 Jan 2020 23:22:12 An OEM diagnostic event occurred.
Normal Thu 09 Jan 2020 23:22:12 An OEM diagnostic event occurred.
Normal Thu 09 Jan 2020 23:22:12 An OEM diagnostic event occurred.
Normal Thu 09 Jan 2020 23:22:12 An OEM diagnostic event occurred.
Normal Thu 09 Jan 2020 23:22:11 An OEM diagnostic event occurred.
Normal Thu 09 Jan 2020 23:22:11 An OEM diagnostic event occurred.
Normal Thu 09 Jan 2020 23:22:11 An OEM diagnostic event occurred.
Critical Thu 09 Jan 2020 23:22:11 CPU 1 machine check error detected.
Critical Thu 09 Jan 2020 23:21:01 Correctable memory error logging disabled for a memory device at location DIMM_A5.
Normal Thu 09 Jan 2020 23:21:01 An OEM diagnostic event occurred.
Normal Thu 09 Jan 2020 23:21:01 An OEM diagnostic event occurred.
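
Most of the log above is repeated "Normal" OEM diagnostic events, so the critical entries are easy to miss. A minimal sketch for pulling them out follows. It assumes each line has the shape `<severity> <weekday> <day> <month> <year> <time> <message>` as pasted here; the regex and function names are hypothetical, and the sample is a shortened copy of the log.

```python
import re

# One event per line: severity, timestamp, free-text message.
LOG_LINE = re.compile(
    r"^(?P<severity>Critical|Normal)\s+"
    r"(?P<timestamp>\w{3} \d{2} \w{3} \d{4} \d{2}:\d{2}:\d{2})\s+"
    r"(?P<message>.+)$"
)

# Shortened copy of the log pasted in this task.
SAMPLE = """\
Critical Thu 09 Jan 2020 23:24:33 System BIOS has halted.
Normal Thu 09 Jan 2020 23:22:14 An OEM diagnostic event occurred.
Critical Thu 09 Jan 2020 23:22:11 CPU 1 machine check error detected.
Critical Thu 09 Jan 2020 23:21:01 Correctable memory error logging disabled for a memory device at location DIMM_A5.
"""


def parse_events(text):
    """Turn pasted log text into a list of dicts, skipping unparseable lines."""
    events = []
    for line in text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match:
            events.append(match.groupdict())
    return events


def critical_messages(events):
    """Keep only the messages of events logged with Critical severity."""
    return [e["message"] for e in events if e["severity"] == "Critical"]


if __name__ == "__main__":
    for message in critical_messages(parse_events(SAMPLE)):
        print(message)
```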

Change 563323 merged by Marostegui:
[operations/dns@master] DNS: Add mgmt and production DNS for es202[0-5]

https://gerrit.wikimedia.org/r/563323

papaul@asw-a-codfw# show | compare 
[edit interfaces interface-range vlan-private1-a-codfw]
     member xe-4/0/20 { ... }
+    member ge-3/0/27;
+    member ge-6/0/12;
[edit interfaces interface-range disabled]
-    member ge-6/0/12;
-    member ge-3/0/27;
[edit interfaces]
+   ge-3/0/27 {
+       description es2020;
+   }
+   ge-6/0/12 {
+       description es2024;
+   }
papaul@asw-b-codfw# show | compare 
[edit interfaces interface-range vlan-private1-b-codfw]
     member xe-4/0/5 { ... }
+    member ge-3/0/20;
+    member ge-8/0/3;
[edit interfaces interface-range disabled]
-    member ge-8/0/3;
-    member ge-3/0/20;
[edit interfaces]
+   ge-3/0/20 {
+       description es2021;
+   }
+   ge-8/0/3 {
+       description es2025;
+   }
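
A quick way to cross-check a diff like the ones above against the per-server checklists is to extract which ports were added to a VLAN interface-range and which description each port received. The sketch below is illustrative only: it handles just the line shapes visible in this paste, is not a general Junos configuration parser, and its function name is hypothetical.

```python
import re

# Copy of the asw-a-codfw diff pasted above.
SAMPLE_DIFF = """\
[edit interfaces interface-range vlan-private1-a-codfw]
     member xe-4/0/20 { ... }
+    member ge-3/0/27;
+    member ge-6/0/12;
[edit interfaces interface-range disabled]
-    member ge-6/0/12;
-    member ge-3/0/27;
[edit interfaces]
+   ge-3/0/27 {
+       description es2020;
+   }
+   ge-6/0/12 {
+       description es2024;
+   }
"""


def summarize(diff_text):
    """Return (port -> interface-range added to, port -> new description)."""
    added_members = {}
    descriptions = {}
    section = None
    current_port = None
    for raw in diff_text.splitlines():
        line = raw.strip()
        header = re.match(r"^\[edit (.+)\]$", line)
        if header:
            # A new "[edit ...]" header starts a new section of the diff.
            section = header.group(1)
            current_port = None
            continue
        member = re.match(r"^\+\s+member (\S+);$", line)
        if member and section and section.startswith("interfaces interface-range "):
            added_members[member.group(1)] = section.split()[-1]
            continue
        port = re.match(r"^\+\s+(\S+) \{$", line)
        if port and section == "interfaces":
            current_port = port.group(1)
            continue
        desc = re.match(r"^\+\s+description (.+);$", line)
        if desc and current_port:
            descriptions[current_port] = desc.group(1)
    return added_members, descriptions


if __name__ == "__main__":
    members, descs = summarize(SAMPLE_DIFF)
    for port, vlan_range in members.items():
        print(port, vlan_range, descs.get(port, "<no description>"))
```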

Change 563496 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/puppet@production] DHCP: Add MAC address entries for es202[02345]

https://gerrit.wikimedia.org/r/563496

Change 563496 merged by Dzahn:
[operations/puppet@production] DHCP: Add MAC address entries for es202[02345]

https://gerrit.wikimedia.org/r/563496

Change 563600 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/puppet@production] DHCP: Testing Buster on es2024

https://gerrit.wikimedia.org/r/563600

Change 563600 merged by Papaul:
[operations/puppet@production] DHCP: Testing Buster on es2024

https://gerrit.wikimedia.org/r/563600

Change 564723 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: es20[0-5]

https://gerrit.wikimedia.org/r/564723

Change 564723 merged by Marostegui:
[operations/puppet@production] mariadb: es20[0-5]

https://gerrit.wikimedia.org/r/564723

@Papaul you can proceed with the installation of es2021, es2022, es2023, es2024 and es2025
es2020 is installed already

Thanks!

Thank you Papaul!

Memory and disk space look good:

[06:06:33] marostegui@cumin1001:~$ sudo cumin 'es202*.codfw.wmnet' 'free -g ; df -hT /srv/'
6 hosts will be targeted:
es[2020-2025].codfw.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====
(5) es[2021-2025].codfw.wmnet
----- OUTPUT of 'free -g ; df -hT /srv/' -----
              total        used        free      shared  buff/cache   available
Mem:            251           0         249           0           0         249
Swap:             7           0           7
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs    11T   12G   11T   1% /srv
===== NODE GROUP =====
(1) es2020.codfw.wmnet
----- OUTPUT of 'free -g ; df -hT /srv/' -----
              total        used        free      shared  buff/cache   available
Mem:            251           0         250           0           0         249
Swap:             7           0           7
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs    11T   12G   11T   1% /srv

RAID level and stripe size also look good:

[06:06:44] marostegui@cumin1001:~$ sudo cumin 'es202*.codfw.wmnet' 'megacli -LDPDInfo -aAll  | egrep -i "RAID LEVEL|Strip"'
6 hosts will be targeted:
es[2020-2025].codfw.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====
(6) es[2020-2025].codfw.wmnet
----- OUTPUT of 'megacli -LDPDInf...AID LEVEL|Strip"' -----
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
Strip Size          : 256 KB
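
The same comparison can be scripted rather than read by eye. The sketch below parses the two fields shown above and compares them with the values this task accepts as good (Primary-1, Secondary-0 and a 256 KB strip). The function names are hypothetical and only the two-line output format pasted here is assumed.

```python
import re

# The filtered megacli output pasted above.
SAMPLE_OUTPUT = """\
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
Strip Size          : 256 KB
"""


def parse_fields(text):
    """Split 'Key : Value' lines into a dict, ignoring lines without a colon."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def raid_matches(fields, expected_level="Primary-1, Secondary-0", expected_strip_kb=256):
    """True when both the RAID level prefix and the strip size match."""
    level_ok = fields.get("RAID Level", "").startswith(expected_level)
    strip = re.match(r"(\d+)\s*KB$", fields.get("Strip Size", ""))
    strip_ok = bool(strip) and int(strip.group(1)) == expected_strip_kb
    return level_ok and strip_ok


if __name__ == "__main__":
    print(raid_matches(parse_fields(SAMPLE_OUTPUT)))
```

A host whose strip size was left at 64 KB, as mentioned earlier in this task, would fail the check.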
Marostegui reassigned this task from Marostegui to Papaul.
Marostegui updated the task description.