
rack/setup/install pc1007-pc1010
Closed, Resolved, Public

Description

This task will track the racking and setup of the 4 hosts pc1007, pc1008, pc1009, pc1010, purchased on T195876.

Racking Plan: We have no preference for specific racks, but please make sure the hosts all go in different rows.
Important: These hosts need RAID 5

pc1007:

  • - receive in system on procurement task T195876
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - RAID 5
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, private vlan)
    • end on-site specific steps
  • - production dns entries added (private subnet)
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation (stretch)
  • - puppet accept/initial run
  • - handoff for service implementation T208383

pc1008:

  • - receive in system on procurement task T195876
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - RAID 5
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, private vlan)
    • end on-site specific steps
  • - production dns entries added (private subnet)
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation (stretch)
  • - puppet accept/initial run
  • - handoff for service implementation T208383 - related task updated with note this server is ready for DBA use.

pc1009:

  • - receive in system on procurement task T195876
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - RAID 5
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, private vlan)
    • end on-site specific steps
  • - production dns entries added (private subnet)
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation (stretch)
  • - puppet accept/initial run
  • - handoff for service implementation T208383 - related task updated with note this server is ready for DBA use.

pc1010:

  • - receive in system on procurement task T195876
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - RAID 5
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, private vlan)
    • end on-site specific steps
  • - production dns entries added (private subnet)
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation (stretch)
  • - puppet accept/initial run
  • - handoff for service implementation T208383 - related task updated with note this server is ready for DBA use.

Event Timeline

Marostegui triaged this task as Medium priority. Oct 17 2018, 7:45 AM
Marostegui moved this task from Triage to Blocked external/Not db team on the DBA board.
Marostegui added a subtask: Unknown Object (Task).

Change 468221 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] install_server: Allow install the new pc hosts

https://gerrit.wikimedia.org/r/468221

Change 468221 merged by Marostegui:
[operations/puppet@production] install_server: Allow install the new pc hosts

https://gerrit.wikimedia.org/r/468221

Cmjohnson closed subtask Unknown Object (Task) as Resolved. Oct 25 2018, 4:59 PM

Change 470849 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Adding mgmt dns for new servers p1007-10

https://gerrit.wikimedia.org/r/470849

@Cmjohnson any ETA for getting these racked and installed?

Thanks

@Cmjohnson reminder: this is RAID5 instead of 10 as noted on top of the task.

Change 470849 merged by Cmjohnson:
[operations/dns@master] Adding mgmt dns for new servers p1007-10

https://gerrit.wikimedia.org/r/470849

Cmjohnson subscribed.

Rob,

Can you complete the installs of pc1008-pc1010? The server used for pc1007 arrived DOA and a ticket with Dell needs to be submitted. Thanks!

Marostegui raised the priority of this task from Medium to High. Nov 26 2018, 5:08 PM

Raising the priority to High, as we need to get the old hosts out before the leasing deadline.

Change 475810 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] pc1009-pc1010 install params

https://gerrit.wikimedia.org/r/475810

Change 475810 merged by RobH:
[operations/puppet@production] pc1009-pc1010 install params

https://gerrit.wikimedia.org/r/475810

Change 475820 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] production dns for pc1008-1010

https://gerrit.wikimedia.org/r/475820

Change 475820 merged by RobH:
[operations/dns@master] production dns for pc1008-1010

https://gerrit.wikimedia.org/r/475820

Assigning back to Chris for the follow-up to repair pc1007. The other servers have been handed off to the DBA team for use via task T208383.

pc1008, pc1009 and pc1010 look good!

@Cmjohnson Actually, pc1008 needs to get its RAID rebuilt: it has a strip size of 64.
The other two, pc1009 and pc1010, are OK and have 256.

Thanks for the fast response @Cmjohnson! Will you re-install it or should I?
Thanks!

@Marostegui if you don't mind, can you do the reinstall? Thanks

Will do! Thank you!

Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts:

pc1008.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201811271357_marostegui_5996_pc1008_eqiad_wmnet.log.

Thanks @Cmjohnson it looks good now!

RAID Level          : Primary-5, Secondary-0, RAID Level Qualifier-3
Size                : 4.364 TB
Sector Size         : 512
Is VD emulated      : No
Parity Size         : 1.454 TB
State               : Optimal
Strip Size          : 256 KB
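
For anyone repeating this check on other hosts, the RAID level and strip size can be read out of the `megacli` virtual drive report with a few lines of Python. This is only a minimal sketch and not tooling that was used on this task: the helper names `parse_fields` and `raid_matches` are hypothetical, the sample text is copied from the output above, and it assumes one virtual drive per report (later drives would overwrite earlier fields).

```python
import re

# Sample copied from the pc1008 output pasted in this task.
SAMPLE = """\
RAID Level          : Primary-5, Secondary-0, RAID Level Qualifier-3
Size                : 4.364 TB
Sector Size         : 512
Is VD emulated      : No
Parity Size         : 1.454 TB
State               : Optimal
Strip Size          : 256 KB
"""


def parse_fields(text):
    """Split 'Key : Value' lines into a dict (hypothetical helper)."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def raid_matches(text, level=5, strip="256 KB"):
    """Return True if the report shows the wanted RAID level and strip size."""
    fields = parse_fields(text)
    match = re.match(r"Primary-(\d+)", fields.get("RAID Level", ""))
    if match is None:
        return False
    return int(match.group(1)) == level and fields.get("Strip Size") == strip


if __name__ == "__main__":
    print("RAID config OK" if raid_matches(SAMPLE) else "RAID config mismatch")
```

The defaults encode the requirement from this task (RAID 5 with a 256 KB strip); a host built with the usual configuration would need different arguments.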

Completed auto-reimage of hosts:

['pc1008.eqiad.wmnet']

and were ALL successful.

Dell ticket information for pc1007: You have successfully submitted request SR983104667.

Waiting on a technician to swap out the motherboard. Our request was approved.

Awesome! Thank you @Cmjohnson! :)
If you get it online today, a reminder: RAID5 with a 256 stripe! (I mention it because it is not the usual config.)

Thanks a lot

@Marostegui and all,

The replacement system board installed yesterday was faulty, showing errors on DIMM slots B4 and B1. After swapping the DIMMs in B with the DIMMs in A, the errors remained on B4 and B1. I also tried swapping the CPUs, but the errors remained the same. A new board has been ordered, along with a couple of replacement DIMMs just in case.

Hey Chris!
Any ETA for the new board?

The tech came today to swap pc1007's system board, and the new (refurbed) board is bad again. This will require another call to Dell and will not be fixed until after the holiday break.

Terrible!
Thanks for the heads up Chris!

I would like to push on this issue now that the holidays are over. While the service (parsercache) is not affected at the moment, we are running with no hardware redundancy in eqiad, and after all it was the vendor that sent faulty hardware in the first place. Please escalate to us or a manager if you need help "fighting". Happy 2019 and thanks!

@jcrespo An email was sent to Dell requesting a new board. I have not received a response.

@faidon @RobH can we follow up with Dell in a more formal way to see what's going on? This server has been unusable since it arrived and it is brand new :-|

I've gone ahead and sent Chris the contact list for our entire Dell team. He can then email them (CCing myself and Faidon), explaining the issue and what has been done so far.

@RobH @Marostegui I went through the very long and painful Dell troubleshooting and it's one of those cases where it actually worked. The server is ready to install.

Steps taken:
Brought the server down to the bare minimum needed to operate: 1 CPU, 1 DIMM

Next steps:
Add the DIMMs and CPU 2 back one at a time, rebooting several times to see if the error returns

The issue turned out to be a bad connection on the RAID card (the very last step). The problem has been fixed and the system is ready to install.

Good job!! Thank you!
Is the RAID 5 already set up? If only the OS install is pending, I can take it from there.

@Cmjohnson you are the best; the worse Dell is, the more superb you are at covering for their mess. How many beers do I owe you already? XD Thanks again.

Ah no, I think all the mgmt entries, vlan and all those steps are pending so I cannot proceed until those are set up.
(Just tried to access mgmt, which was not successful).

mgmt works now :-) I will wait for the green light from @Cmjohnson to proceed with the install
Thank you for getting this almost done!

*Update: not ready for install. I set the wrong RAID level. I am updating the driver now and will fix it to RAID 5 once the update is complete. @Marostegui odd... it may have something to do with the f/w update in progress.

return shipping info for parts

USPS 9202 3946 5301 2440 4873 91
Fedex 9611918 2393026 77237414

@RobH I can do the OS installation once you give me the green light for it.

production dns updated

NIC.Embedded.1-1-1 Ethernet = D0:94:66:75:D1:63

Change 483854 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] install_server: Install pc1007

https://gerrit.wikimedia.org/r/483854

Change 483854 merged by Marostegui:
[operations/puppet@production] install_server: Install pc1007

https://gerrit.wikimedia.org/r/483854

Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts:

['pc1007.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201901140729_marostegui_133064.log.

Completed auto-reimage of hosts:

['pc1007.eqiad.wmnet']

and were ALL successful.

pc1007 got installed and looks good:

root@pc1007:~# megacli -LDPDInfo -aAll

Adapter #0

Number of Virtual Disks: 1
Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-5, Secondary-0, RAID Level Qualifier-3
Size                : 4.364 TB
Sector Size         : 512
Is VD emulated      : No
Parity Size         : 1.454 TB
State               : Optimal
Strip Size          : 256 KB
Number Of Drives    : 4
Span Depth          : 1
Default Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU

root@pc1007:~# free -g
              total        used        free      shared  buff/cache   available
Mem:            251          14         232           0           4         235
Swap:             7           0           7

root@pc1007:~# df -hT
Filesystem            Type      Size  Used Avail Use% Mounted on
udev                  devtmpfs  126G     0  126G   0% /dev
tmpfs                 tmpfs      26G  9.7M   26G   1% /run
/dev/sda1             ext4       37G  2.5G   33G   8% /
tmpfs                 tmpfs     126G     0  126G   0% /dev/shm
tmpfs                 tmpfs     5.0M     0  5.0M   0% /run/lock
tmpfs                 tmpfs     126G     0  126G   0% /sys/fs/cgroup
/dev/mapper/tank-data xfs       4.4T  9.2G  4.4T   1% /srv
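
As a sanity check on the numbers above: RAID 5 gives up one drive's worth of capacity to parity, so with 4 drives the usable size should be about three times the per-drive size. The sketch below works that out under one assumption, namely that the `Parity Size` reported by `megacli` (1.454 TB) equals the capacity of a single drive; the function name `raid5_usable_tb` is hypothetical.

```python
def raid5_usable_tb(drives, per_drive_tb):
    """Usable capacity of a RAID 5 array: (n - 1) drives hold data."""
    if drives < 3:
        raise ValueError("RAID 5 needs at least three drives")
    return (drives - 1) * per_drive_tb


if __name__ == "__main__":
    # Values taken from the pc1007 megacli output in this task.
    reported_size_tb = 4.364
    parity_size_tb = 1.454  # assumed to equal one drive's capacity
    number_of_drives = 4

    expected = raid5_usable_tb(number_of_drives, parity_size_tb)
    print(f"expected {expected:.3f} TB, reported {reported_size_tb:.3f} TB")
```

Three times 1.454 TB is 4.362 TB, which is within rounding distance of the reported 4.364 TB and in line with the 4.4T that `df` shows for /srv.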
Marostegui updated the task description.