
codfw: Master PDU rack/setup row A, row B, row C and row D task
Open, High, Public

Description

This task will track the racking, setup, and configuration of 31 new PDU sets ordered via T303460 for installation into the racks in rows A, B, C, and D. These will replace the existing PDUs and require direct coordination and scheduling by codfw DC Ops for the work in each rack.

Schedule ROW A

| Rack | Date | Time | Comments |
| --- | --- | --- | --- |
| A1 | | | Waiting for PDUs |
| A2 | June 21st | 9:30 am CT / 2:30 pm UTC | ~2 hours to complete |
| A3 | June 23rd | 9:30 am CT / 2:30 pm UTC | ~2 hours 15 minutes to complete |
| A4 | June 30th | 9:30 am CT / 2:30 pm UTC | ~1 hour 15 minutes to complete |
| A5 | July 12th | 9:30 am CT / 2:30 pm UTC | CY1 disconnected the whole rack by mistake; more than 2 hours |
| A6 | July 14th | 9:30 am CT / 2:30 pm UTC | ~1 hour 45 minutes |
| A7 | August 2nd | 9:30 am CT / 2:30 pm UTC | |
| A8 | | | Waiting for PDUs |

Schedule ROW B

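| Rack | Date | Time | Comments |
| --- | --- | --- | --- |
| B1 | August 2nd | 10:00 am CT / 3:00 pm UTC | |
| B2 | August 2nd | 10:30 am CT / 3:30 pm UTC | |
| B3 | August 3rd | 9:30 am CT / 2:30 pm UTC | |
| B4 | August 2nd | 11:00 am CT / 4:00 pm UTC | |
| B5 | August 2nd | 11:30 am CT / 4:30 pm UTC | |
| B6 | August 3rd | 10:00 am CT / 3:00 pm UTC | |
| B7 | August 3rd | 10:30 am CT / 3:30 pm UTC | |
| B8 | August 3rd | 11:00 am CT / 4:00 pm UTC | |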

Schedule ROW C

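| Rack | Date | Time | Comments |
| --- | --- | --- | --- |
| C1 | August 3rd | 11:30 am CT / 4:30 pm UTC | |
| C2 | August 3rd | 12:00 pm CT / 5:00 pm UTC | |
| C3 | | | Already replaced |
| C4 | August 4th | 9:30 am CT / 2:30 pm UTC | |
| C5 | August 4th | 10:00 am CT / 3:00 pm UTC | |
| C6 | August 4th | 10:30 am CT / 3:30 pm UTC | |
| C7 | August 4th | 11:00 am CT / 4:00 pm UTC | |
| C8 | August 11th | 9:30 am CT / 2:30 pm UTC | |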

Schedule ROW D

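| Rack | Date | Time | Comments |
| --- | --- | --- | --- |
| D1 | June 28th | 9:30 am CT / 2:30 pm UTC | ~1 hour 30 minutes |
| D2 | August 4th | 11:30 am CT / 4:30 pm UTC | |
| D3 | August 4th | 12:00 pm CT / 5:00 pm UTC | |
| D4 | August 10th | 9:30 am CT / 2:30 pm UTC | |
| D5 | August 10th | 10:00 am CT / 3:00 pm UTC | |
| D6 | August 10th | 10:30 am CT / 3:30 pm UTC | |
| D7 | August 10th | 11:00 am CT / 4:00 pm UTC | |
| D8 | August 10th | 11:30 am CT / 4:30 pm UTC | |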
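
For anyone double-checking a slot before scheduling downtime, the CT/UTC pairs in the schedules above can be verified with a small script. This is only a sketch: it assumes "CT" means US Central time (the `America/Chicago` zone) and that all dates fall in 2022, as the event timeline suggests. The helper name `ct_to_utc` is made up for illustration.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Assumption: "CT" in the schedule is US Central time, DST-aware.
CENTRAL = ZoneInfo("America/Chicago")


def ct_to_utc(year, month, day, hour, minute):
    """Convert a Central-time wall-clock slot to the matching UTC datetime."""
    local = datetime(year, month, day, hour, minute, tzinfo=CENTRAL)
    return local.astimezone(timezone.utc)


# Rack D4: August 10th, 9:30 am CT (year 2022 assumed)
slot = ct_to_utc(2022, 8, 10, 9, 30)
print(slot.strftime("%Y-%m-%d %H:%M UTC"))  # 2022-08-10 14:30 UTC
```

Because the zone is DST-aware, the same 9:30 am CT slot would map to 3:30 pm UTC outside of daylight saving time, so the fixed 5-hour offset in the tables only holds for the summer dates listed here.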

Event Timeline

Papaul triaged this task as Medium priority. (Jun 6 2022, 4:29 AM)
Papaul updated the task description.
Papaul added a subscriber: Jgreen.
Papaul updated the task description.
ayounsi raised the priority of this task from Medium to High. (Thu, Aug 4, 7:45 AM)
ayounsi added a subscriber: ayounsi.

There are currently 3 Icinga alerts for servers with a failed PSU:
https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=kafka-main2002&service=IPMI+Sensor+Status - since Tuesday
https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=ml-serve2006&service=IPMI+Sensor+Status
https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=mw2322&service=IPMI+Sensor+Status

These should be fixed before continuing the maintenance.
Monitoring should also be checked at the end of each maintenance to ensure there are no remaining issues.

Similarly, this host has been alerting for a failed PSU for 1d15h: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=es2021&service=IPMI+Sensor+Status

For the record: asw-c7-codfw (which is a row's spine) rebooted yesterday (System booted: 2022-08-04 16:38:08 UTC).

Mentioned in SAL (#wikimedia-operations) [2022-08-10T09:31:52Z] <jelto> depool services in codfw for upcoming PDU replacement - T309956