
rack/setup/install db2[103-120].codfw.wmnet (18 hosts)
Closed, Resolved (Public)

Description

This task will track the racking, setup, and installation of the 18 new db hosts in codfw.
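The `db2[103-120]` shorthand in the title covers 18 host names. As a minimal sketch (the helper name `expand_host_range` is hypothetical and not part of any tooling referenced in this task), the range can be expanded like this:

```python
import re


def expand_host_range(pattern: str) -> list[str]:
    """Expand a pattern such as 'db2[103-120].codfw.wmnet' into host names."""
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", pattern)
    if match is None:
        # No bracketed range present: treat the pattern as a single host.
        return [pattern]
    prefix, start, end, suffix = match.groups()
    width = len(start)  # keep zero padding if the range uses it
    return [
        f"{prefix}{number:0{width}d}{suffix}"
        for number in range(int(start), int(end) + 1)
    ]


hosts = expand_host_range("db2[103-120].codfw.wmnet")
print(len(hosts), hosts[0], hosts[-1])
# prints: 18 db2103.codfw.wmnet db2120.codfw.wmnet
```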

RAID: 10
RAID Stripe Size: 256kB

Rack proposal:

  • 4 hosts on row A (different racks if possible)
  • 5 hosts on row B (different racks if possible)
  • 5 hosts on row C (different racks if possible)
  • 4 hosts on row D (different racks if possible)
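The proposal can be checked against the rack positions given in the per-host headings of this task (db2103: A3 through db2120: D5). The following is only a sketch; `ASSIGNMENTS`, `hosts_per_row` and `shared_racks` are hypothetical names, and the data is copied by hand from those headings:

```python
from collections import Counter

# Rack positions copied from the per-host headings in this task.
ASSIGNMENTS = {
    "db2103": "A3", "db2104": "A5", "db2105": "A6", "db2106": "A8",
    "db2107": "B1", "db2108": "B3", "db2109": "B5", "db2110": "B6",
    "db2111": "B6", "db2112": "C1", "db2113": "C3", "db2114": "C5",
    "db2115": "C6", "db2116": "C6", "db2117": "D1", "db2118": "D3",
    "db2119": "D3", "db2120": "D5",
}
EXPECTED_PER_ROW = {"A": 4, "B": 5, "C": 5, "D": 4}


def hosts_per_row(assignments):
    """Count hosts per row; the row is the first letter of the rack name."""
    return Counter(rack[0] for rack in assignments.values())


def shared_racks(assignments):
    """Return racks that hold more than one of the new hosts."""
    by_rack = {}
    for host, rack in sorted(assignments.items()):
        by_rack.setdefault(rack, []).append(host)
    return {rack: names for rack, names in by_rack.items() if len(names) > 1}


print(hosts_per_row(ASSIGNMENTS) == EXPECTED_PER_ROW)
print(shared_racks(ASSIGNMENTS))
```

The row counts match the proposal. Three racks (B6, C6 and D3) each hold two of the new hosts, which the "different racks if possible" wording allows.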

db2103: A3 ge-3/0/23

  • - receive in system on procurement task T220431
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - RAID10 stripsize 256kb
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup
    • end on-site specific steps
  • - production dns entries
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation (stretch) - update netbox status to staged
  • - puppet accept/initial run
  • - handoff for service implementation - service implementation team must change status from staged to active

db2104: A5 ge-5/0/14

  • - receive in system on procurement task T220431
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - RAID10 stripsize 256kb
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup
    • end on-site specific steps
  • - production dns entries
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation (stretch) - update netbox status to staged
  • - puppet accept/initial run
  • - handoff for service implementation - service implementation team must change status from staged to active

db2105: A6 ge-6/0/7

  • - receive in system on procurement task T220431
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - RAID10 stripsize 256kb
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup
    • end on-site specific steps
  • - production dns entries
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation (stretch) - update netbox status to staged
  • - puppet accept/initial run
  • - handoff for service implementation - service implementation team must change status from staged to active

db2106: A8 ge-8/0/3

  • - receive in system on procurement task T220431
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - RAID10 stripsize 256kb
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup
    • end on-site specific steps
  • - production dns entries
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation (stretch) - update netbox status to staged
  • - puppet accept/initial run
  • - handoff for service implementation - service implementation team must change status from staged to active

db2107: B1 ge-1/0/19

  • - receive in system on procurement task T220431
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - RAID10 stripsize 256kb
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup
    • end on-site specific steps
  • - production dns entries
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation (stretch) - update netbox status to staged
  • - puppet accept/initial run
  • - handoff for service implementation - service implementation team must change status from staged to active

db2108: B3 ge-3/0/24

  • - receive in system on procurement task T220431
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - RAID10 stripsize 256kb
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup
    • end on-site specific steps
  • - production dns entries
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation (stretch) - update netbox status to staged
  • - puppet accept/initial run
  • - handoff for service implementation - service implementation team must change status from staged to active

db2109: B5 ge-5/0/23

  • - receive in system on procurement task T220431
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - RAID10 stripsize 256kb
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup
    • end on-site specific steps
  • - production dns entries
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation (stretch) - update netbox status to staged
  • - puppet accept/initial run
  • - handoff for service implementation - service implementation team must change status from staged to active

db2110: B6 ge-6/0/1

  • - receive in system on procurement task T220431
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - RAID10 stripsize 256kb
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup
    • end on-site specific steps
  • - production dns entries
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation (stretch) - update netbox status to staged
  • - puppet accept/initial run
  • - handoff for service implementation - service implementation team must change status from staged to active

db2111: B6 ge-6/0/2

  • - receive in system on procurement task T220431
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - RAID10 stripsize 256kb
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup
    • end on-site specific steps
  • - production dns entries
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation (stretch) - update netbox status to staged
  • - puppet accept/initial run
  • - handoff for service implementation - service implementation team must change status from staged to active

db2112: C1 ge-1/0/10

  • - receive in system on procurement task T220431
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - RAID10 stripsize 256kb
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup
    • end on-site specific steps
  • - production dns entries
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation (stretch) - update netbox status to staged
  • - puppet accept/initial run
  • - handoff for service implementation - service implementation team must change status from staged to active

db2113: C3 ge-3/0/0

  • - receive in system on procurement task T220431
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - RAID10 stripsize 256kb
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup
    • end on-site specific steps
  • - production dns entries
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation (stretch) - update netbox status to staged
  • - puppet accept/initial run
  • - handoff for service implementation - service implementation team must change status from staged to active

db2114: C5 ge-5/0/30

  • - receive in system on procurement task T220431
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - RAID10 stripsize 256kb
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup
    • end on-site specific steps
  • - production dns entries
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation (stretch) - update netbox status to staged
  • - puppet accept/initial run
  • - handoff for service implementation - service implementation team must change status from staged to active

db2115: C6 ge-6/0/0

  • - receive in system on procurement task T220431
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - RAID10 stripsize 256kb
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup
    • end on-site specific steps
  • - production dns entries
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation (stretch) - update netbox status to staged
  • - puppet accept/initial run
  • - handoff for service implementation - service implementation team must change status from staged to active

db2116: C6 ge-6/0/7

  • - receive in system on procurement task T220431
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - RAID10 stripsize 256kb
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup
    • end on-site specific steps
  • - production dns entries
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation (stretch) - update netbox status to staged
  • - puppet accept/initial run
  • - handoff for service implementation - service implementation team must change status from staged to active

db2117: D1 ge-1/0/1

  • - receive in system on procurement task T220431
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - RAID10 stripsize 256kb
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup
    • end on-site specific steps
  • - production dns entries
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation (stretch) - update netbox status to staged
  • - puppet accept/initial run
  • - handoff for service implementation - service implementation team must change status from staged to active

db2118: D3 ge-3/0/12

  • - receive in system on procurement task T220431
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - RAID10 stripsize 256kb
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup
    • end on-site specific steps
  • - production dns entries
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation (stretch) - update netbox status to staged
  • - puppet accept/initial run
  • - handoff for service implementation - service implementation team must change status from staged to active

db2119: D3 ge-3/0/13

  • - receive in system on procurement task T220431
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - RAID10 stripsize 256kb
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup
    • end on-site specific steps
  • - production dns entries
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation (stretch) - update netbox status to staged
  • - puppet accept/initial run
  • - handoff for service implementation - service implementation team must change status from staged to active

db2120: D5 ge-5/0/0

  • - receive in system on procurement task T220431
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - RAID10 stripsize 256kb
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup
    • end on-site specific steps
  • - production dns entries
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation (stretch) - update netbox status to staged
  • - puppet accept/initial run
  • - handoff for service implementation - service implementation team must change status from staged to active

Event Timeline

Marostegui triaged this task as Medium priority. Apr 22 2019, 5:53 AM
Marostegui moved this task from Triage to Blocked external/Not db team on the DBA board.

Keep in mind that db2033 can be decommissioned (it is in C6), see T220070

Change 505695 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Allow install from db2103 to db2120

https://gerrit.wikimedia.org/r/505695

Change 505695 merged by Marostegui:
[operations/puppet@production] mariadb: Allow install from db2103 to db2120

https://gerrit.wikimedia.org/r/505695

Papaul updated the task description.

Change 507613 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/dns@master] DNS: Add mgmt and production DNS for db2[103-120]

https://gerrit.wikimedia.org/r/507613

Change 507613 merged by Marostegui:
[operations/dns@master] DNS: Add mgmt and production DNS for db2[103-120]

https://gerrit.wikimedia.org/r/507613

Change 508472 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/puppet@production] DHCP: Add MAC address entries for db2[103-120]

https://gerrit.wikimedia.org/r/508472

Change 508472 merged by Dzahn:
[operations/puppet@production] DHCP: Add MAC address entries for db2103 through db2120

https://gerrit.wikimedia.org/r/508472

Mentioned in SAL (#wikimedia-operations) [2019-05-07T00:53:16Z] <mutante> install2002 - disabling puppet, live hacking DHCP config for db2103 to not serve installer via http to debug install issue for T221532 which seems like T190424#4548003

I tried to PXE boot the first server. On the switch side everything looks good: the switch learned the server's MAC address, and the log file on install2002 shows that the server is getting DHCP. However, the server boots, stops at "Serving lpxelinx.0 to 10.192.0.118:1414", and keeps rebooting.

@ayounsi @RobH These servers have an install issue where they get a DHCP ACK followed by "Serving stretch-installer/debian-installer/amd64/pxelinux.0 " but then it stops and nothing else happens.

We found T190424#4548003 -> T190424#4564126 and that eventually ended with T190424#4567081

As you can see in the log entry above, we have already tried what was first tried in that ticket: serving the installer via tftp instead of http. That did not fix the issue.

Since that ticket was eventually resolved after finding an "old/weird firewall filter term denying all packets with a TTL != 64 out of that vlan", we are wondering if this could also be the case here. The behaviour seems the same. Could you take a look?

The issue was a BIOS setting: the boot mode was set to UEFI. After changing it to BIOS, the install works.

I can confirm db2103 looks good.

root@db2103:~#  free -g
              total        used        free      shared  buff/cache   available
Mem:            502           1         500           0           0         498
Swap:             7           0           7
root@db2103:~# megacli -LDInfo -Lall -aALL


Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
Size                : 5.237 TB
Sector Size         : 512
Is VD emulated      : No
Mirror Data         : 5.237 TB
State               : Optimal
Strip Size          : 256 KB
Number Of Drives    : 6
Span Depth          : 1
Default Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy   : Disk's Default
Encryption Type     : None
Default Power Savings Policy: Controller Defined
Current Power Savings Policy: None
Can spin up in 1 minute: No
LD has drives that support T10 power conditions: No
LD's IO profile supports MAX power savings with cached writes: No
Bad Blocks Exist: No
Is VD Cached: No



Exit Code: 0x00
root@db2103:~# df -hT /srv
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   5.2T  6.2G  5.2T   1% /srv
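The values checked by eye above (State and Strip Size) could also be verified by a script. This is a minimal sketch that assumes megacli prints `Key : Value` lines as in the output above; `parse_ldinfo` and `check_virtual_drive` are hypothetical helper names, and the sample is a subset of that output:

```python
SAMPLE = """\
Virtual Drive: 0 (Target Id: 0)
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
Size                : 5.237 TB
State               : Optimal
Strip Size          : 256 KB
Number Of Drives    : 6
"""


def parse_ldinfo(text):
    """Turn 'Key : Value' lines into a dict, keeping the first value per key."""
    fields = {}
    for line in text.splitlines():
        key, separator, value = line.partition(":")
        if separator and key.strip():
            fields.setdefault(key.strip(), value.strip())
    return fields


def check_virtual_drive(fields, strip_size="256 KB"):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    if fields.get("State") != "Optimal":
        problems.append(f"unexpected state: {fields.get('State')}")
    if fields.get("Strip Size") != strip_size:
        problems.append(f"unexpected strip size: {fields.get('Strip Size')}")
    return problems


fields = parse_ldinfo(SAMPLE)
print(check_virtual_drive(fields))
```

Only the first value per key is kept, so output covering several virtual drives would need to be split per drive before parsing.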

Change 508492 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db2103: Disable notifications

https://gerrit.wikimedia.org/r/508492

Change 508492 merged by Marostegui:
[operations/puppet@production] db2103: Disable notifications

https://gerrit.wikimedia.org/r/508492

@Marostegui @jcrespo please hold off on db2114. It looks like the system has some hardware issues; I am investigating.

Thanks Papaul
We won't do anything to any of the hosts until we've got the green light from you in this ticket
Thanks!

db2114

Critical,Tue 07 May 2019 10:04:30,Fan redundancy is lost.,
Normal,Tue 07 May 2019 10:03:33,The fans are redundant.,
Critical,Tue 07 May 2019 10:00:35,Fan redundancy is lost.,
Normal,Tue 07 May 2019 09:59:35,The fans are redundant.,
Critical,Mon 06 May 2019 10:59:53,Fan redundancy is lost.,
Critical,Mon 06 May 2019 10:59:51,Fan 3B RPM is less than the lower critical threshold.,
Warning,Mon 06 May 2019 10:59:51,Fan 3B RPM is less than the lower warning threshold.,
Critical,Mon 06 May 2019 10:35:11,Fan redundancy is lost.,
Normal,Mon 06 May 2019 10:34:04,The fans are redundant.,
Critical,Mon 06 May 2019 05:51:45,Fan redundancy is lost.,
Normal,Mon 06 May 2019 05:51:02,The fans are redundant.,
Critical,Mon 06 May 2019 05:49:50,Fan 3B RPM is less than the lower critical threshold.,
Warning,Mon 06 May 2019 05:49:50,Fan 3B RPM is less than the lower warning threshold.,
Critical,Mon 06 May 2019 05:49:46,Fan redundancy is lost.,
Normal,Mon 06 May 2019 05:48:54,The fans are redundant.,
Normal,Mon 06 May 2019 05:48:52,Fan 3B RPM is within range.,
Warning,Mon 06 May 2019 05:48:52,Fan 3B RPM is less than the lower warning threshold.,
Critical,Mon 06 May 2019 05:46:50,Fan 3B RPM is less than the lower critical threshold.,
Warning,Mon 06 May 2019 05:46:49,Fan 3B RPM is less than the lower warning threshold.,
Critical,Mon 06 May 2019 05:46:46,Fan redundancy is lost.,
Normal,Mon 06 May 2019 05:42:41,The fans are redundant.,
Normal,Mon 06 May 2019 05:42:37,Fan 3B RPM is within range.,
Warning,Mon 06 May 2019 05:42:36,Fan 3B RPM is less than the lower warning threshold.,
Critical,Mon 06 May 2019 05:42:36,Fan redundancy is lost.,
Normal,Mon 06 May 2019 05:41:55,The fans are redundant.,
Critical,Mon 06 May 2019 05:39:35,Fan 3B RPM is less than the lower critical threshold.,
Warning,Mon 06 May 2019 05:39:35,Fan 3B RPM is less than the lower warning threshold.,
Critical,Mon 06 May 2019 05:39:29,Fan redundancy is lost.,
Normal,Mon 06 May 2019 05:38:50,Fan 3B RPM is within range.,
Warning,Mon 06 May 2019 05:38:49,Fan 3B RPM is less than the lower warning threshold.,
Normal,Mon 06 May 2019 05:38:26,The fans are redundant.,
Critical,Mon 06 May 2019 05:32:43,Fan 3B RPM is less than the lower critical threshold.,
Warning,Mon 06 May 2019 05:32:42,Fan 3B RPM is less than the lower warning threshold.,
Critical,Mon 06 May 2019 05:32:36,Fan redundancy is lost.,
Normal,Mon 06 May 2019 05:30:46,The fans are redundant.,
Normal,Mon 06 May 2019 05:30:36,Fan 3B RPM is within range.,
Warning,Mon 06 May 2019 05:30:36,Fan 3B RPM is less than the lower warning threshold.,
Critical,Mon 06 May 2019 05:30:20,Fan 3B RPM is less than the lower critical threshold.,
Warning,Mon 06 May 2019 05:30:20,Fan 3B RPM is less than the lower warning threshold.,
Critical,Mon 06 May 2019 05:30:15,Fan redundancy is lost.,
Normal,Mon 06 May 2019 05:27:11,Fan 3B RPM is within range.,
Warning,Mon 06 May 2019 05:27:11,Fan 3B RPM is less than the lower warning threshold.,
Normal,Mon 06 May 2019 05:27:05,The fans are redundant.,
Critical,Mon 06 May 2019 05:25:28,Fan 3B RPM is less than the lower critical threshold.,
Warning,Mon 06 May 2019 05:25:27,Fan 3B RPM is less than the lower warning threshold.,
Critical,Mon 06 May 2019 05:25:26,Fan redundancy is lost.,
Normal,Mon 06 May 2019 05:21:19,The fans are redundant.,
Normal,Mon 06 May 2019 05:21:12,Fan 3B RPM is within range.,
Warning,Mon 06 May 2019 05:21:12,Fan 3B RPM is less than the lower warning threshold.,
Critical,Mon 06 May 2019 05:21:02,Fan redundancy is lost.,
Normal,Mon 06 May 2019 05:20:29,The fans are redundant.,
Critical,Mon 06 May 2019 05:19:42,Fan redundancy is lost.,
Normal,Mon 06 May 2019 05:19:09,The fans are redundant.,
Critical,Mon 06 May 2019 05:18:18,Fan 3B RPM is less than the lower critical threshold.,
Warning,Mon 06 May 2019 05:18:18,Fan 3B RPM is less than the lower warning threshold.,
Critical,Mon 06 May 2019 05:18:15,Fan redundancy is lost.,
Normal,Tue 23 Apr 2019 14:49:58,Log cleared.,
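A log like this is easier to read when summarised. The sketch below assumes the comma-separated `severity,timestamp,message,` layout shown above; `parse_events`, `count_by_severity` and `affected_fans` are hypothetical names, and the sample holds only the five newest fan entries:

```python
import csv
import io
import re
from collections import Counter

SAMPLE = """\
Critical,Tue 07 May 2019 10:04:30,Fan redundancy is lost.,
Normal,Tue 07 May 2019 10:03:33,The fans are redundant.,
Critical,Mon 06 May 2019 10:59:53,Fan redundancy is lost.,
Critical,Mon 06 May 2019 10:59:51,Fan 3B RPM is less than the lower critical threshold.,
Warning,Mon 06 May 2019 10:59:51,Fan 3B RPM is less than the lower warning threshold.,
"""


def parse_events(text):
    """Return (severity, timestamp, message) tuples, skipping malformed rows."""
    events = []
    for row in csv.reader(io.StringIO(text)):
        if len(row) >= 3:
            events.append((row[0], row[1], row[2]))
    return events


def count_by_severity(events):
    """Count events per severity label."""
    return Counter(severity for severity, _, _ in events)


def affected_fans(events):
    """Collect fan identifiers such as '3B' named in the messages."""
    fans = set()
    for _, _, message in events:
        fans.update(re.findall(r"Fan (\d+[AB])", message))
    return fans


events = parse_events(SAMPLE)
print(count_by_severity(events))
print(affected_fans(events))
```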

It looks like we are missing fan 3B. I am going to open the server and double-check.

Name                Type                  PWM (% of Max)  RPM
System Board Fan1A  Standard Performance  100%            15960
System Board Fan1B  Standard Performance  100%            17040
System Board Fan2A  Standard Performance  100%            14760
System Board Fan2B  Standard Performance  100%            17520
System Board Fan3A  Standard Performance  100%            14760
System Board Fan3B  Standard Performance  N/A             0
System Board Fan4A  Standard Performance  100%            14760
System Board Fan4B  Standard Performance  100%            18240
System Board Fan5A  Standard Performance  100%            14880
System Board Fan5B  Standard Performance  100%            18000
System Board Fan6A  Standard Performance  100%            15840
System Board Fan6B  Standard Performance  100%            17040
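The stopped fan can also be picked out of a listing like this programmatically. This is only a sketch, assuming each data row names the fan (for example `Fan3B`) and ends with its RPM; `stopped_fans` is a hypothetical helper and the sample is three rows from the listing above:

```python
import re

SAMPLE = """\
System Board Fan3A Standard Performance 100% 14760
System Board Fan3B Standard Performance N/A 0
System Board Fan4A Standard Performance 100% 14760
"""


def stopped_fans(text):
    """Return the names of fans whose last column (RPM) is 0."""
    stopped = []
    for line in text.splitlines():
        # Capture the fan name and the trailing integer RPM value.
        match = re.search(r"(Fan\d+[AB])\s.*\s(\d+)\s*$", line)
        if match and int(match.group(2)) == 0:
            stopped.append(match.group(1))
    return stopped


print(stopped_fans(SAMPLE))
# prints: ['Fan3B']
```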

I swapped fan 3 with fan 5 and still see the same issue, so the problem is not the fan; it has to be on the main board. I will contact Dell and open a separate task to track the issue.

@Marostegui @jcrespo please feel free to take this task. I opened T222753 to track the problem on db2114.

I have checked that all the hosts have been installed correctly

Marostegui updated the task description.

I have changed the status to Active on netbox.
I will close this task and create a new one for the productionization of these hosts, and will also link it to the specific one for db2114, T222753: db2114 hardware problem.
Thanks, Papaul, for being so fast racking and installing these.