⚓ T227542 b7-eqiad pdu refresh (Tuesday 11/5 @12pm UTC)

	Subject	Repo	Branch	Lines +/-
	ps1-b7-eqiad model setting	operations/puppet	production	+4 -3

		Status	Subtype	Assigned	Task
		Resolved		• Cmjohnson	T226778 Install new PDUs in rows A/B (Top level tracking task)
		Resolved		Jclark-ctr	T227542 b7-eqiad pdu refresh (Tuesday 11/5 @12pm UTC)

RobH created this task. Jul 8 2019, 10:48 PM

RobH mentioned this in T226778: Install new PDUs in rows A/B (Top level tracking task). Jul 8 2019, 10:50 PM

• Cmjohnson moved this task from Backlog to High Priority Task on the ops-eqiad board. Jul 22 2019, 2:41 PM

lvs1014 here will need special care: Traffic should stop puppet and pybal and monitor the failover to lvs1016 ahead of the work, then revert afterwards. cp1081 and cp1082 here can be depooled as normal.
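
The ordering described above for lvs1014 can be sketched as a small plan: the steps taken before the work, and their mirror image afterwards. This is only an illustration. The command strings (`puppet agent --disable`, `systemctl stop pybal`, and their counterparts) are assumptions about how the host is managed, not commands quoted from this task, and the failover check is left as a manual step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    description: str
    command: str  # assumed command; empty string means a manual check
    undo: str     # assumed command that reverts the step; empty if nothing to undo


# Hypothetical pre-maintenance steps for lvs1014, in the order the comment gives.
PRE_STEPS = [
    Step("disable puppet so it does not restart pybal",
         "puppet agent --disable 'T227542 PDU refresh'",
         "puppet agent --enable"),
    Step("stop pybal on lvs1014", "systemctl stop pybal", "systemctl start pybal"),
    Step("confirm traffic has failed over to lvs1016", "", ""),
]


def revert_steps(steps):
    """Return the undo commands in reverse order, skipping manual checks."""
    return [s.undo for s in reversed(steps) if s.undo]


if __name__ == "__main__":
    for step in PRE_STEPS:
        print("before:", step.description, "->", step.command or "(manual check)")
    for command in revert_steps(PRE_STEPS):
        print("after: ", command)
```

Reverting in reverse order means pybal is started again before puppet is re-enabled, mirroring the way the steps were applied.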

RobH moved this task from High Priority Task to Blocked on the ops-eqiad board. Jul 26 2019, 1:37 PM

wiki_willy renamed this task from b7-eqiad pdu refresh to b7-eqiad pdu refresh (Tuesday 11/5 @11am UTC). Aug 15 2019, 5:38 PM

RobH triaged this task as High priority. Aug 28 2019, 6:31 PM

RobH updated the task description.

RobH removed RobH as the assignee of this task. Aug 28 2019, 6:39 PM

RobH updated the task description.

RobH set Due Date to Nov 15 2019, 12:00 AM.

RobH changed Due Date from Nov 15 2019, 12:00 AM to Nov 15 2019, 11:00 AM.

RobH added subscribers: ayounsi, • Nuria, Joe.

elukey updated the task description. Sep 17 2019, 6:03 AM

elukey added a subscriber: herron.

RobH updated the task description. Oct 11 2019, 8:42 PM

• JHedden updated the task description. Oct 21 2019, 6:57 PM

• JHedden subscribed.

herron updated the task description. Oct 28 2019, 1:47 PM

wiki_willy renamed this task from b7-eqiad pdu refresh (Tuesday 11/5 @11am UTC) to b7-eqiad pdu refresh (Tuesday 11/5 @10am UTC). Oct 30 2019, 12:26 AM

wiki_willy renamed this task from b7-eqiad pdu refresh (Tuesday 11/5 @10am UTC) to b7-eqiad pdu refresh (Tuesday 11/5 @12pm UTC). Nov 4 2019, 4:19 PM

RobH updated the task description. Nov 4 2019, 5:26 PM

• jcrespo subscribed. Nov 4 2019, 5:27 PM

I don't want to conflict-edit the task description, but as far as the MW* and WTP* servers are concerned, no action is needed.

• jcrespo updated the task description. Nov 4 2019, 5:28 PM

wiki_willy assigned this task to • Cmjohnson. Nov 4 2019, 6:28 PM

Mentioned in SAL (#wikimedia-cloud) [2019-11-05T11:59:38Z] <arturo> icinga downtime for 1h cloudcontrol1004, cloudnet1003, cloudvirt1017/1020/1022 for PDU operations in the rack T227542
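
The SAL entry above abbreviates three hosts as `cloudvirt1017/1020/1022`. A minimal sketch of expanding that shorthand into full host names follows, for example to feed a downtime script. The function name `expand_hosts` is hypothetical, and the reading of the slash notation (each later part replaces the trailing digits of the first name) is an assumption based on the host list elsewhere in this task.

```python
import re


def expand_hosts(shorthand: str) -> list[str]:
    """Expand 'cloudvirt1017/1020/1022' into full host names.

    Assumes each '/'-separated part after the first is a numeric suffix that
    replaces the trailing digits of the first host name.
    """
    first, *rest = shorthand.split("/")
    match = re.fullmatch(r"(.*?)(\d+)", first)
    if match is None or not rest:
        return [shorthand]  # nothing to expand
    prefix = match.group(1)
    return [first] + [prefix + suffix for suffix in rest]


if __name__ == "__main__":
    entry = "cloudcontrol1004, cloudnet1003, cloudvirt1017/1020/1022"
    hosts = [h for part in entry.split(", ") for h in expand_hosts(part)]
    print(hosts)
```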

starting pdu refresh.

finished pdu refresh

# pmshell

 1: ps1-a1-eqiad    2: ps1-a2-eqiad    3: ps1-a3-eqiad    4: ps1-a4-eqiad   
 5: ps1-a5-eqiad    6: ps1-a6-eqiad    7: ps1-a7-eqiad    8: ps1-a8-eqiad   
 9: ps1-b1-eqiad   10: ps1-b2-eqiad   11: ps1-b3-eqiad   12: ps1-b4-eqiad   
13: ps1-b5-eqiad   14: ps1-b6-eqiad   15: ps1-b7-eqiad   16: ps1-b8-eqiad   
17: asw-a1-eqiad   18: asw-a2-eqiad   19: asw-a3-eqiad   20: asw-a4-eqiad   
21: asw-a5-eqiad   22: asw-a6-eqiad   23: asw-a7-eqiad   24: asw-a8-eqiad   
25: asw-b1-eqiad   26: asw-b2-eqiad   27: asw-b3-eqiad   28: asw-b4-eqiad   
29: asw-b5-eqiad   30: asw-b6-eqiad   31: asw-b7-eqiad   32: asw-b8-eqiad   
33: re0.cr1-eqiad  34: re1.cr1-eqiad  35: re0.cr2-eqiad  36: re1.cr2-eqiad  
37: mr1-eqiad      40: msw1-eqiad     41: asw2-a5-eqiad  45: asw2-a3-eqiad  

Connect to port > 15  

Sentry Smart PDU Version 8.0n

Username:
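
The port chosen at the `Connect to port >` prompt (15) comes straight from the menu listing. As an illustration only, the menu text can be parsed into a name-to-port mapping so the number does not have to be read off by eye; the helper name `parse_ports` is hypothetical, and the sample is an excerpt of the listing above.

```python
import re

# Excerpt of the pmshell menu shown above.
MENU = """\
 9: ps1-b1-eqiad   10: ps1-b2-eqiad   11: ps1-b3-eqiad   12: ps1-b4-eqiad
13: ps1-b5-eqiad   14: ps1-b6-eqiad   15: ps1-b7-eqiad   16: ps1-b8-eqiad
37: mr1-eqiad      40: msw1-eqiad     41: asw2-a5-eqiad  45: asw2-a3-eqiad
"""


def parse_ports(menu: str) -> dict[str, int]:
    """Map each device name in an 'N: name' menu to its port number."""
    return {name: int(port) for port, name in re.findall(r"(\d+):\s+(\S+)", menu)}


ports = parse_ports(MENU)
print(ports["ps1-b7-eqiad"])  # 15
```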

Change 548769 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] ps1-b7-eqiad model setting

https://gerrit.wikimedia.org/r/548769

gerritbot added a project: Patch-For-Review. Nov 5 2019, 3:50 PM

Change 548769 merged by RobH:
[operations/puppet@production] ps1-b7-eqiad model setting

https://gerrit.wikimedia.org/r/548769

Maintenance_bot removed a project: Patch-For-Review. Nov 5 2019, 4:10 PM

- Clear the icinga errors for the missing ps2 input by connecting (or checking) the RJ11 cable between ps1 and ps2 in b7-eqiad. Once it is connected, the icinga errors for the tower B infeed will clear up.

@Jclark-ctr: Please see the update above and address, thanks!

Once the towers are linked, the errors on https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=ps1-b7-eqiad should clear up and go green for tower B.

confirmed link and errors cleared from icinga

device	role	SRE team coordination	recommended action during maintenance
asw-b7-eqiad	asw	@ayounsi	ensure this doesn't go offline, as that would take the entire rack's network offline
wtp1033
wtp1032
wtp1031
kafka-main1002		@herron	To avoid alert noise from adjacent kafka-main hosts, schedule icinga downtime for the "Kafka Broker Under Replicated Partitions" service on kafka-main100[123] as well. Perform a graceful shutdown of the server before maintenance, and ensure it is powered up when the work is completed.
dbprov1002	db provisioning/backup generation host	DBA	Really nothing to do, but @jcrespo will keep an eye on it
cloudvirtan1005
cloudvirtan1004
an-worker1087		@Nuria
an-worker1086		@Nuria
cp1082	cp system	Traffic	T227542#5355289
cp1081	cp system	Traffic	T227542#5355289
ms-be1041	ms-be system	fillipo	gracefully shut down the host just before rack maintenance, and power it back on post-maintenance.
cloudvirt1022	cloudvirt host	cloud-services-team	@JHedden No running VMs, can happen anytime
analytics1073		Analytics	fine to do any time
lvs1014	lvs system	@BBlack	T227542#5355289
cloudvirt1020	cloudvirt host	cloud-services-team	@JHedden has running VMs please handle with care
druid1005		Analytics	fine to do any time
ores1003
cloudnet1003		cloud-services-team	@JHedden is active but it has a redundant peer
restbase-dev1005
cloudcontrol1004		cloud-services-team	@JHedden is active but it has a redundant peer
cloudvirt1017	cloudvirt	cloud-services-team	@JHedden has a large number of running VMs, please handle with care
mw1318	mw server	@Joe
mw1317	mw server	@Joe
mw1316	mw server	@Joe
mw1315	mw server	@Joe
mw1314	mw server	@Joe
mw1313	mw server	@Joe
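
Because the list above is tab-separated and several rows carry only a device name, it can help to group devices by their coordination contact and see which ones have none recorded. The sketch below does this for a few rows copied from the table (action text abbreviated); the function name `group_by_contact` and the `(unassigned)` label are illustrative choices, not part of the task.

```python
import csv
import io
from collections import defaultdict

# A few rows from the table above; columns are tab-separated.
TABLE = (
    "device\trole\tSRE team coordination\trecommended action\n"
    "asw-b7-eqiad\tasw\t@ayounsi\tensure this doesn't go offline\n"
    "wtp1033\n"
    "cp1082\tcp system\tTraffic\tT227542#5355289\n"
    "cp1081\tcp system\tTraffic\tT227542#5355289\n"
    "analytics1073\t\tAnalytics\tfine to do any time\n"
    "druid1005\t\tAnalytics\tfine to do any time\n"
    "ores1003\n"
)


def group_by_contact(table: str) -> dict[str, list[str]]:
    """Group device names by the coordination column; blank means unassigned."""
    reader = csv.reader(io.StringIO(table), delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader)  # skip the header row
    groups = defaultdict(list)
    for row in reader:
        if not row:
            continue
        row += [""] * (4 - len(row))  # pad short rows such as "wtp1033"
        device, _role, contact, _action = row[:4]
        groups[contact or "(unassigned)"].append(device)
    return dict(groups)


if __name__ == "__main__":
    for contact, devices in group_by_contact(TABLE).items():
        print(contact, "->", ", ".join(devices))
```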

Closed, Resolved (Public)

Description

List of routers, switches, and servers
