
a3-eqiad pdu refresh
Closed, ResolvedPublic

Description

This task tracks the replacement of the ps1 and ps2 PDUs with new PDUs in rack A3-eqiad.

Downtime Window: 2019-07-23 @ 14:05 GMT. Expected window of 1.5 hours maximum. (first PDU swap took less than an hour.)

Each server & switch will need to have potential downtime scheduled, since this will be a live power change of the PDU towers.

These racks have a single tower for the old PDU (with an A and B side), while the new PDUs have independent A and B towers.

  • schedule downtime for the entire list of switches and servers
  • carefully unmount the existing PDU, KEEPING SYSTEMS PLUGGED IN AND POWERED ON UNLESS STATED OTHERWISE
  • set the old PDU aside in the rack, still energized, and remove the old mounting brackets
  • install the new mounting brackets and mount BOTH new PDU towers
  • wire up the inner of the two towers, energize it, and move power connections over to it from the existing/old PDU tower; using the tower closest to the servers first makes re-wiring power easier
  • confirm the entire list of switches, routers, and servers has had its power restored from the new PDU tower
  • once the new PDU tower is confirmed online, move on to the next steps
  • wire up the remaining tower, energize it, and move the remaining power connections over to it from the existing/old PDU tower (now de-energized)
  • confirm the entire list of switches, routers, and servers has had its power restored from the new PDU towers
  • issue with elastic1031; @Cmjohnson is making a follow-up task
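
The downtime step above could be scripted roughly as follows. This is a hedged sketch: `downtime_host` is a hypothetical stand-in for whatever monitoring tooling is actually used (e.g. an Icinga downtime call), and here it only records each request so the loop is self-contained; the duration matches the announced 1.5 hour window.

```shell
# Hypothetical stand-in for the real downtime tooling; it just records
# the request so this sketch runs without a monitoring server.
downtime_host() {
    echo "downtime: $1 for $2"
}

# A few of the hosts from the rack list below, as an illustration.
hosts="asw2-a3-eqiad analytics1060 elastic1031 restbase1016 ganeti1007"
for h in $hosts; do
    downtime_host "$h" "90m"
done
```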

List of routers, switches, and servers

| device | role | SRE team coordination | notes |
| --- | --- | --- | --- |
| asw2-a3-eqiad | asw | @ayounsi | |
| analytics1060 | analytics | Analytics | |
| analytics1059 | analytics | Analytics | |
| analytics1057 | analytics | Analytics | |
| analytics1056 | analytics | Analytics | |
| analytics1055 | analytics | Analytics | |
| analytics1054 | analytics | Analytics | |
| analytics1052 | analytics | Analytics | |
| elastic1031 | elastic | Discovery-Search | |
| elastic1030 | elastic | Discovery-Search | |
| logstash1010 | | observability | ok with power loss, nice to have: disable es replication |
| cloudservices1004 | | cloud-services-team | |
| restbase1016 | | @fgiunchedi | ok with power loss |
| kubernetes1001 | kubernetes | serviceops | |
| rdb1005 | misc redis | serviceops | ok with power loss |
| restbase1019 | | @fgiunchedi | ok with power loss |
| restbase1011 | | @fgiunchedi | ok with power loss |
| restbase1010 | | @fgiunchedi | ok with power loss |
| graphite1003 | | | awaiting decom |
| relforge1001 | | | |
| db1103 | db | DBA | |
| dbproxy1003 | dbproxy | DBA | |
| elastic1035 | elastic | Discovery-Search | |
| elastic1034 | elastic | Discovery-Search | |
| elastic1033 | elastic | Discovery-Search | |
| elastic1032 | elastic | Discovery-Search | |
| cp1008 | cp | Traffic | |
| dbstore1003 | dbstore | Analytics | |
| prometheus1003 | | observability | ok with power loss |
| ganeti1007 | ganeti host | @akosiaris | host will need to be emptied in advance |
| dbproxy1001 | dbproxy | DBA | |
| dbproxy1002 | dbproxy | DBA | |
| db1127 | db | DBA | |
| radium | | | |

Event Timeline

RobH created this task.Jul 2 2019, 7:59 PM
RobH updated the task description. (Show Details)Jul 2 2019, 8:37 PM
RobH added subscribers: ayounsi, akosiaris, fgiunchedi.
RobH updated the task description. (Show Details)Jul 9 2019, 12:16 AM
elukey added a subscriber: elukey.Jul 16 2019, 10:00 AM

All the analytics nodes are Hadoop workers, so it is not a big deal if they lose power.

Mentioned in SAL (#wikimedia-operations) [2019-07-23T04:43:46Z] <marostegui> Failover m1 from dbproxy1001 to dbproxy1006 T227139

@RobH I have failed over dbproxy1001 to dbproxy1006 so this rack is good to go from the DB point of view.

akosiaris updated the task description. (Show Details)Jul 23 2019, 6:44 AM
akosiaris added a subscriber: MoritzMuehlenhoff. (Edited) Jul 23 2019, 6:49 AM

sudo gnt-node migrate -f ganeti1007
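
For context, `gnt-node migrate -f` live-migrates every instance whose primary node is the given host over to its secondary, emptying the node ahead of the power work. A hedged sketch of the drain plus a follow-up check (the `gnt-*` lines are shown as comments since they need a live Ganeti cluster; the `pinst_cnt` field name is assumed from stock Ganeti):

```shell
NODE=ganeti1007
# sudo gnt-node migrate -f "$NODE"              # live-migrate all primary instances away
# sudo gnt-node list -o name,pinst_cnt "$NODE"  # pinst_cnt should read 0 once the node is empty
echo "drain sketch for $NODE"
```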

fgiunchedi updated the task description. (Show Details)Jul 23 2019, 9:14 AM
Marostegui updated the task description. (Show Details)Jul 23 2019, 9:16 AM

restbase / logstash / graphite / prometheus hosts should be fine in the event of power loss; if feeling nice, restbase and prometheus should be depooled. For the logstash host we could disable es replication beforehand and re-enable it afterwards, to avoid shuffling data around on power loss.
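
The "disable es replication" idea could be sketched against the stock Elasticsearch cluster-settings API; the endpoint and exact procedure here are assumptions for illustration, not the commands actually run. Restricting shard allocation to primaries before the window avoids a full shard reshuffle if the node reboots, and setting it back to "all" afterwards restores normal replication:

```shell
ES=http://localhost:9200   # placeholder endpoint, not the real logstash ES cluster address
disable_payload='{"transient":{"cluster.routing.allocation.enable":"primaries"}}'
enable_payload='{"transient":{"cluster.routing.allocation.enable":"all"}}'

# Before the PDU work (shown as comments; needs a live cluster):
#   curl -XPUT "$ES/_cluster/settings" -H 'Content-Type: application/json' -d "$disable_payload"
# ... power work ...
#   curl -XPUT "$ES/_cluster/settings" -H 'Content-Type: application/json' -d "$enable_payload"
echo "$disable_payload"
```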

fgiunchedi updated the task description. (Show Details)Jul 23 2019, 9:27 AM

Mentioned in SAL (#wikimedia-operations) [2019-07-23T12:01:07Z] <akosiaris> empty ganeti1007 from running instances. T227139

Mentioned in SAL (#wikimedia-operations) [2019-07-23T12:02:11Z] <akosiaris> drain kubernetes1001. T227139
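
Draining kubernetes1001 would look roughly like the stock upstream kubectl workflow; this is a sketch under that assumption, with the `kubectl` calls shown as comments since they need a live cluster:

```shell
NODE=kubernetes1001
# kubectl drain "$NODE" --ignore-daemonsets   # cordon the node and evict its pods
# ... PDU work ...
# kubectl uncordon "$NODE"                    # let the scheduler place pods on it again
echo "drain/uncordon sketch for $NODE"
```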

RobH added a comment.Tue, Jul 23, 12:04 PM

FYI: I pinged both Alex and Filippo to drain the respective servers they mention above in anticipation of swapping the PDUs in this rack at 10:00 Eastern time.

A3 was originally a DB rack and has an older PDU model with fewer plugs than the other remaining PDUs (with the exception of networking racks) in rows A/B. It is also fairly sparsely populated at this time, so it is ideal to swap.

RobH updated the task description. (Show Details)Tue, Jul 23, 12:05 PM
RobH triaged this task as High priority.Tue, Jul 23, 12:08 PM
RobH updated the task description. (Show Details)

restbase / logstash / graphite / prometheus hosts should be fine in the event of power loss,

This is graphite1003, the old server pending decommission; the currently active one in eqiad (graphite1004) is in a different rack.

jijiki updated the task description. (Show Details)Tue, Jul 23, 1:09 PM
jijiki updated the task description. (Show Details)
jijiki added a subscriber: jijiki.

Mentioned in SAL (#wikimedia-operations) [2019-07-23T13:45:39Z] <godog> depool restbase1016 restbase1019 restbase1011 restbase1010 prometheus1003 ahead of PDU work - T227139

Mentioned in SAL (#wikimedia-operations) [2019-07-23T14:14:28Z] <robh> a3-eqiad pdu swap taking place now via T227139

RobH added a comment.Tue, Jul 23, 3:02 PM

All of the power has been migrated, and we are now setting up the networking for the new PDUs.

RobH closed this task as Resolved.Tue, Jul 23, 3:16 PM
RobH updated the task description. (Show Details)

All done. Elastic1031 has a PSU issue, and we lost power to dbproxy1003 (it was not in service) during this migration.

Mentioned in SAL (#wikimedia-operations) [2019-07-23T16:22:11Z] <godog> pool prometheus1003 - T227139

Mentioned in SAL (#wikimedia-operations) [2019-07-25T09:21:25Z] <marostegui> Failover m1 from dbproxy1006 to dbproxy1001 - T227139