b1-eqiad pdu refresh (Thursday 10/10 @11am UTC)
Closed, Resolved (Public)

Description

This task tracks the replacement of ps1 and ps2 with new PDUs in rack B1-eqiad.

Each server & switch will need to have potential downtime scheduled, since this will be a live power change of the PDU towers.

These racks have a single tower for the old PDU (with an A and a B side), while the new PDUs have independent A and B towers.

  • Schedule downtime for the entire list of switches and servers.
  • Wire up one of the two towers, energize, and relocate power to it from existing/old PDU tower (now de-energized).
  • Confirm entire list of switches, routers, and servers have had their power restored from the new PDU tower.
  • Once new PDU tower is confirmed online, move on to next steps.
  • Wire up remaining tower, energize, and relocate power to it from existing/old PDU tower (now de-energized).
  • Confirm entire list of switches, routers, and servers have had their power restored from the new PDU tower.
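
The steps above can be sketched as a small checklist model. This is only an illustration, not tooling used for this task: the `Device` class, the function names, and the assumption that every device has both an A and a B feed are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    name: str
    # Assumption: each device has an A and a B feed, both starting on the old PDU tower.
    feeds: dict = field(default_factory=lambda: {"A": "old", "B": "old"})


def move_side(devices, side):
    """Relocate one side (A or B) of every device to the new tower."""
    for d in devices:
        d.feeds[side] = "new"


def unconfirmed(devices, side):
    """Return names of devices whose given side is not yet on the new tower."""
    return [d.name for d in devices if d.feeds[side] != "new"]


def refresh(devices):
    """Move side A, confirm it, then move side B and confirm it."""
    for side in ("A", "B"):
        move_side(devices, side)
        missing = unconfirmed(devices, side)
        if missing:
            # Do not continue to the next tower until this one is confirmed.
            raise RuntimeError(f"side {side} not confirmed for: {missing}")
    return devices
```

The point of the model is the ordering: the second tower is only touched after every device is confirmed on the first.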

List of routers, switches, and servers

| device | role | SRE team coordination | notes |
| --- | --- | --- | --- |
| asw2-b1-eqiad | asw | @ayounsi | |
| es1014 | es | DBA | @Marostegui to depool it before the maintenance |
| es1013 | es | DBA | @Marostegui to depool it before the maintenance |
| ms-be1022 | ms-be | @fgiunchedi | poweroff / poweron |
| db1084 | db | DBA | @Marostegui to depool it before the maintenance |
| db1083 | db | DBA | @Marostegui to depool it before the maintenance |
| kafka-jumbo1003 | | Analytics | fine to do any time |
| db1077 | db | DBA | test host, nothing to be done |
| db1076 | db | DBA | @Marostegui to depool it before the maintenance |
| db1112 | db | DBA | @Marostegui to depool it before the maintenance |
| logstash1011 | | @fgiunchedi | ok with power loss, nice to have: disable es replication |
| cloudvirt1026 | cloudvirt | cloud-services-team | @aborrero: running 39 VMs, please handle with care |
| cloudvirt1025 | cloudvirt | cloud-services-team | @aborrero: running 46 VMs, please handle with care |
| dbstore1004 | dbstore | Analytics | |
| cloudvirt1023 | cloudvirt | cloud-services-team | @aborrero: OK, not running any VM. |
| an-coord1001 | | Analytics | fine to do any time but please ping Analytics first |
| dbproxy1014 | dbproxy | DBA | nothing to be done, not active |
| authdna1001 | authdns | Traffic | |
| db1124 | db | DBA | sanitarium host, nothing to be done |
| snapshot1008 | | @ArielGlenn | |
| db1118 | db | DBA | @Marostegui to depool it before the maintenance |
| wdqs1007 | wdqs | Discovery-Search | @Gehel good to go |

Event Timeline

RobH updated the task description.
RobH added subscribers: ayounsi, ArielGlenn, fgiunchedi.

Adding @hoo because wikidata entity dumps will be impacted.

elukey subscribed.

A heads-up would be good so that I can gracefully stop daemons on an-coord1001. For kafka-jumbo1003 it is fine as long as it does not risk losing power together with other kafka-jumbo nodes (2 down are tolerable, more probably not).
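
The kafka-jumbo constraint in this comment can be written as a simple pre-check. This is a minimal sketch: the function name is hypothetical, and the limit of 2 comes only from the comment's wording.

```python
MAX_DOWN = 2  # from the comment: two kafka-jumbo nodes down are tolerable


def safe_to_take_down(currently_down, candidate, max_down=MAX_DOWN):
    """Return True if taking `candidate` down keeps the down count within max_down."""
    down = set(currently_down) | {candidate}
    return len(down) <= max_down
```

For example, `safe_to_take_down([], "kafka-jumbo1003")` passes, while the same call with two other nodes already down does not.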

Marostegui subscribed.

From the DB side this can be done anytime

RobH removed RobH as the assignee of this task. (Aug 14 2019, 4:51 PM)
wiki_willy renamed this task from b1-eqiad pdu refresh to b1-eqiad pdu refresh (Thursday 10/10 @11am UTC). (Aug 15 2019, 5:33 PM)
CDanis triaged this task as Medium priority. (Aug 16 2019, 1:01 PM)

Change 541707 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool es1013, es1014

https://gerrit.wikimedia.org/r/541707

Change 541707 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool es1013, es1014

https://gerrit.wikimedia.org/r/541707

Mentioned in SAL (#wikimedia-operations) [2019-10-09T05:45:10Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Depool es1013, es1014 T227536 (duration: 01m 00s)

Mentioned in SAL (#wikimedia-operations) [2019-10-10T05:47:40Z] <marostegui> Depool db1084 db1083 db1076 db1118 for PDU maintenance - T227536

Mentioned in SAL (#wikimedia-operations) [2019-10-10T05:51:55Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1112 for PDU maintenance T227536', diff saved to https://phabricator.wikimedia.org/P9294 and previous config saved to /var/cache/conftool/dbconfig/20191010-055153-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2019-10-10T07:55:12Z] <marostegui> Stop MySQL on es1014 es1013 db1084 db1083 db1077 db1076 db1112 db1124 db1118 for on-site PDU maintenance (this will generate lag on labsdb hosts) - T227536

All the databases, es1013, es1014 and dbproxy1014 are good to go.

dbstore1004 needs to be handled by @elukey, so please coordinate with him before working on that one.

Mentioned in SAL (#wikimedia-operations) [2019-10-10T11:35:56Z] <arturo> icinga downtime cloudvirt1026 for 2h (T227536)

Mentioned in SAL (#wikimedia-operations) [2019-10-10T11:36:33Z] <arturo> icinga downtime cloudvirt1025 for 2h (T227536)

Mentioned in SAL (#wikimedia-operations) [2019-10-10T11:37:16Z] <arturo> icinga downtime cloudvirt1023 for 2h (T227536)

Mentioned in SAL (#wikimedia-cloud) [2019-10-10T11:59:53Z] <arturo> network switch hardware is down affecting cloudvirt1025/1026 (T227536) VMs are supposed to be online but unreachable

Swap is finished; everything is back up with redundant power.
Updated Netbox for the old and new PDUs.

Jclark-ctr updated the task description.
Jclark-ctr subscribed.

Change 542321 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] PDUs: add model sentry 4 to eqiad b1 and a2

https://gerrit.wikimedia.org/r/542321

Change 542321 merged by Ayounsi:
[operations/puppet@production] PDUs: add model sentry 4 to eqiad b1 and a2

https://gerrit.wikimedia.org/r/542321

RobH removed RobH as the assignee of this task.