
codfw: Next Gen test rack
Closed, Resolved (Public)

Description

This task tracks the redesign of one rack in codfw. The goal is to standardize the cables (power, network) and the management devices we use.

Which rack to use?
It would be good practice to use a full rack. We have the option to use rack B3 or rack C3. I will leave it to the service owners to decide which one is ideal, since we will have to depool all servers in that rack.

Which rack?
C3

Which devices to move?

  • Move the msw from U23 to U24.
  • Move the cable management from U24 to U23.
  • Replace the old mgmt switch with a new one.

What else?

  • Servers will now have colored power cords: the first power supply, connected to PDU 1, gets a blue power cord, and the second power supply, connected to PDU 2, gets a red power cord.
  • Mgmt cables are replaced: the 5ft and 7ft cables become 4ft and 6ft. The same applies to the production cables.
  • Replacing the existing horizontal cable management with another type
  • Putting in a new vertical cable management
  • Putting in new PDUs

What date?
June 9th or June 11th; both options need to be confirmed with the service owners.

What time?
9:30am to 12pm CT

Service owners:
Manuel for db2113: OK
Daniel for mw servers: OK
Alex for thumbor servers: OK

Event Timeline

Papaul triaged this task as Medium priority. Apr 30 2020, 8:22 PM
Papaul created this task.
Papaul moved this task from Backlog to Racking Tasks on the ops-codfw board. Apr 30 2020, 11:53 PM
Peachey88 updated the task description. May 1 2020, 10:53 PM
Papaul updated the task description. May 20 2020, 4:41 PM

Chatted with @wkandek today on the proposed B3 or C3 racks, along with the June 9 or 11th dates/times for the mw servers. He'll check with his team on Monday, and confirm with us afterwards. Thanks, Willy

Papaul updated the task description. May 29 2020, 8:12 PM

As we discussed a few days ago on IRC, I will have db2113 depooled 24h before the day you pick. Any day works for me.

Dzahn added a subscriber: Dzahn. Jun 3 2020, 8:23 AM

All the mw servers in rack C3 have been decom'ed.

Also see T247018#6187845

Dzahn updated the task description. Jun 3 2020, 8:24 AM

Mentioned in SAL (#wikimedia-operations) [2020-06-09T06:51:25Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db2113 for on-site maintenance T251570', diff saved to https://phabricator.wikimedia.org/P11419 and previous config saved to /var/cache/conftool/dbconfig/20200609-065125-marostegui.json

Change 603807 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db2113: Disable notifications

https://gerrit.wikimedia.org/r/603807

Change 603807 merged by Marostegui:
[operations/puppet@production] db2113: Disable notifications

https://gerrit.wikimedia.org/r/603807

Mentioned in SAL (#wikimedia-operations) [2020-06-09T06:53:13Z] <marostegui> Stop MySQL on db2113 for maintenance - T251570

I have powered off db2113. Once you are done with the maintenance, please power it back on.
Thank you and good luck!

Mentioned in SAL (#wikimedia-operations) [2020-06-09T09:57:38Z] <akosiaris> correction: depool and set as inactive thumbor200{1,2} for T251570

Icinga downtime for 2 days, 0:00:00 set by akosiaris@cumin1001 on 2 host(s) and their services with reason: poweroff for T251750

thumbor[2001-2002].codfw.wmnet

thumbor2001 and thumbor2002 have been set as inactive, downtimed for 2 days in Icinga, and powered off. Once you are done, please power them back on and I'll take it from there. Thanks!

Papaul added a comment. Jun 9 2020, 5:02 PM

It took 3.5 hours to rebuild the whole rack. Pictures coming soon.

Papaul updated the task description. Jun 9 2020, 7:52 PM
Papaul updated the task description. Jun 9 2020, 8:55 PM

Change 604240 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db2113: Enable notifications

https://gerrit.wikimedia.org/r/604240

Change 604240 merged by Marostegui:
[operations/puppet@production] db2113: Enable notifications

https://gerrit.wikimedia.org/r/604240

Mentioned in SAL (#wikimedia-operations) [2020-06-10T07:05:08Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Repool db2113 after on-site maintenance T251570', diff saved to https://phabricator.wikimedia.org/P11438 and previous config saved to /var/cache/conftool/dbconfig/20200610-070508-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-06-10T12:13:06Z] <akosiaris> pool thumbor2002, thumbor2001. T251570

Papaul added a comment. Edited Sun, Jun 14, 6:50 PM

@wiki_willy Please see below for the purchase information on the vertical duct used in C3.
https://www.altex.com/6-black-slotted-wall-wiring-duct-3-x-3-w-cover

wiki_willy added a comment. Edited Mon, Jun 15, 4:48 PM

Thanks @Papaul - we'll use the link provided for future purchases.

Thanks,
Willy

I opened ticket #1674336 with CY1 to disconnect the old PDUs and connect the new ones tomorrow at 9:30am CT.

Papaul closed this task as Resolved. Tue, Jun 30, 7:56 PM

Both new PDUs in rack C3 are installed and configured.

Problem: I moved the power for all network devices to PS1 before disconnecting PS2. When the tech was ready to disconnect PS2 below the floor, he accidentally disconnected PS1, causing all the network devices to go down.
Lesson learned: next time I will move one of the power feeds for the asw to a PDU in another rack.
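
The lesson above can be turned into a quick pre-maintenance check. The following is only a minimal sketch, not tooling that exists for this rack: the function name, the device names `asw-c3-codfw` and `msw-c3-codfw`, and the PDU names `ps2-c3-codfw` and `ps1-c2-codfw` are hypothetical, and the feed map would have to be filled in by hand from the actual cabling. It lists the devices that would lose all power if a given PDU were unplugged, whether on purpose or by mistake.

```python
from typing import Dict, List, Set


def devices_at_risk(feeds: Dict[str, Set[str]], pdu: str) -> List[str]:
    """Return devices whose only power feed(s) come from `pdu`.

    `feeds` maps a device name to the set of PDUs it is plugged into.
    A device with no feeds at all is also reported.
    """
    return sorted(dev for dev, pdus in feeds.items() if pdus <= {pdu})


# Hypothetical state during the swap: everything was moved to PS1.
during_swap = {
    "asw-c3-codfw": {"ps1-c3-codfw"},
    "msw-c3-codfw": {"ps1-c3-codfw"},
}

# Hypothetical safer state: the asw keeps one feed in a neighbouring rack.
safer = {
    "asw-c3-codfw": {"ps1-c3-codfw", "ps1-c2-codfw"},
    "msw-c3-codfw": {"ps1-c3-codfw"},
}

for label, feeds in (("during swap", during_swap), ("safer", safer)):
    for pdu in ("ps1-c3-codfw", "ps2-c3-codfw"):
        print(label, pdu, devices_at_risk(feeds, pdu))
```

Running the check against every PDU the tech could touch, not only the one scheduled for disconnection, is what would have flagged the PS1 mix-up as an outage risk for the asw.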

The task is now complete.

Papaul updated the task description. Tue, Jun 30, 7:57 PM

Change 609174 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] facilities: update model for ps1-c3-codfw

https://gerrit.wikimedia.org/r/c/operations/puppet/+/609174

Change 609174 merged by Filippo Giunchedi:
[operations/puppet@production] facilities: update model for ps1-c3-codfw

https://gerrit.wikimedia.org/r/c/operations/puppet/+/609174