b1-eqiad pdu refresh (Thursday 10/10 @11am UTC)
Closed, Resolved, Public

Description

This task tracks the replacement of ps1 and ps2 with new PDUs in rack B1-eqiad.

Each server & switch will need to have potential downtime scheduled, since this will be a live power change of the PDU towers.

These racks have a single tower for the old PDU (with an A and a B side), while the new PDUs have independent A and B towers.

  • Schedule downtime for the entire list of switches and servers.
  • Wire up one of the two towers, energize, and relocate power to it from existing/old pdu tower (now de-energized).
  • Confirm entire list of switches, routers, and servers have had their power restored from the new pdu tower.
  • Once new PDU tower is confirmed online, move on to next steps.
  • Wire up remaining tower, energize, and relocate power to it from existing/old pdu tower (now de-energized).
  • Confirm entire list of switches, routers, and servers have had their power restored from the new pdu tower.
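The two-phase order above can be sketched as a small simulation. This is only an illustration of why the second tower is not touched until the first is confirmed: the `Device` class, the `move_side` and `unpowered` helpers, and the `energized` map are hypothetical names invented for this sketch, not part of any tooling referenced in this task.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    name: str
    # Which tower feeds each power supply side: "old" or "new" (hypothetical model).
    feeds: dict = field(default_factory=lambda: {"A": "old", "B": "old"})


def move_side(devices, side, energized):
    """Relocate one power supply side of every device to the new tower.

    `energized` maps (tower, side) -> bool. The new tower for that side
    must already be energized before any cable is moved.
    """
    if not energized.get(("new", side), False):
        raise RuntimeError(f"new tower {side} is not energized yet")
    for dev in devices:
        dev.feeds[side] = "new"


def unpowered(devices, energized):
    """Return the names of devices that have no energized feed at all."""
    return [
        dev.name
        for dev in devices
        if not any(
            energized.get((tower, side), False)
            for side, tower in dev.feeds.items()
        )
    ]


# Phase 1: energize new tower A, move side A, de-energize the old A side.
devices = [Device("asw2-b1-eqiad"), Device("db1084")]
energized = {("old", "A"): True, ("old", "B"): True,
             ("new", "A"): True, ("new", "B"): False}
move_side(devices, "A", energized)
energized[("old", "A")] = False
phase1_unpowered = unpowered(devices, energized)

# Phase 2 only starts once phase 1 left nothing unpowered.
if not phase1_unpowered:
    energized[("new", "B")] = True
    move_side(devices, "B", energized)
    energized[("old", "B")] = False
phase2_unpowered = unpowered(devices, energized)
```

In this model a device with two power supplies stays up throughout, because at every point at least one of its sides is on an energized tower; the confirmation step between the phases is what catches a device that did not come back.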

List of routers, switches, and servers

| device | role | SRE team coordination | notes |
| --- | --- | --- | --- |
| asw2-b1-eqiad | asw | @ayounsi | |
| es1014 | es | DBA | @Marostegui to depool it before the maintenance |
| es1013 | es | DBA | @Marostegui to depool it before the maintenance |
| ms-be1022 | ms-be | @fgiunchedi | poweroff / poweron |
| db1084 | db | DBA | @Marostegui to depool it before the maintenance |
| db1083 | db | DBA | @Marostegui to depool it before the maintenance |
| kafka-jumbo1003 | | Analytics | fine to do any time |
| db1077 | db | DBA | test host, nothing to be done |
| db1076 | db | DBA | @Marostegui to depool it before the maintenance |
| db1112 | db | DBA | @Marostegui to depool it before the maintenance |
| logstash1011 | | @fgiunchedi | ok with power loss, nice to have: disable es replication |
| cloudvirt1026 | cloudvirt | cloud-services-team | @aborrero: running 39 VMs, please handle with care |
| cloudvirt1025 | cloudvirt | cloud-services-team | @aborrero: running 46 VMs, please handle with care |
| dbstore1004 | dbstore | Analytics | |
| cloudvirt1023 | cloudvirt | cloud-services-team | @aborrero: OK, not running any VM. |
| an-coord1001 | | Analytics | fine to do any time but please ping Analytics first |
| dbproxy1014 | dbproxy | DBA | nothing to be done, not active |
| authdna1001 | authdns | Traffic | |
| db1124 | db | DBA | sanitarium host, nothing to be done |
| snapshot1008 | | @ArielGlenn | |
| db1118 | db | DBA | @Marostegui to depool it before the maintenance |
| wdqs1007 | wdqs | Discovery-Search | @Gehel good to go |

Event Timeline

RobH updated the task description.
RobH added subscribers: ayounsi, ArielGlenn, fgiunchedi.

Adding @hoo because wikidata entity dumps will be impacted.

elukey added a subscriber: elukey.

A heads-up would be good so that I can gracefully stop the daemons on an-coord1001. kafka-jumbo1003 is fine as long as it does not risk losing power together with other kafka-jumbo nodes (2 down are tolerable, more probably not).
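The kafka-jumbo constraint above can be expressed as a simple pre-check. This is a minimal sketch: the helper names, the `kafka-jumbo` prefix match, and every host name other than kafka-jumbo1003 are assumptions made for illustration, and the limit of 2 comes only from the comment above, not from any cluster configuration.

```python
def kafka_brokers_at_risk(affected_hosts, already_down, prefix="kafka-jumbo"):
    """Brokers that would be down: those in the affected rack plus those already down."""
    in_rack = {h for h in affected_hosts if h.startswith(prefix)}
    return in_rack | set(already_down)


def within_tolerance(affected_hosts, already_down, max_down=2):
    """True if the maintenance would leave at most `max_down` brokers down."""
    return len(kafka_brokers_at_risk(affected_hosts, already_down)) <= max_down


# Hosts from this rack; only one of them is a kafka-jumbo broker.
rack_b1 = ["kafka-jumbo1003", "db1084", "an-coord1001"]

ok_alone = within_tolerance(rack_b1, already_down=[])
# Hypothetical broker names, used only to show the failing case.
ok_with_two_down = within_tolerance(
    rack_b1, already_down=["kafka-jumbo-x", "kafka-jumbo-y"]
)
```

With no other broker down the check passes; with two hypothetical brokers already down, powering off this rack would take a third one offline and the check fails.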

Marostegui added a subscriber: Marostegui.

From the DB side, this can be done anytime.

RobH removed RobH as the assignee of this task. (Aug 14 2019, 4:51 PM)
wiki_willy renamed this task from b1-eqiad pdu refresh to b1-eqiad pdu refresh (Thursday 10/10 @11am UTC). (Aug 15 2019, 5:33 PM)
CDanis triaged this task as Medium priority. (Aug 16 2019, 1:01 PM)
Gehel added a subscriber: Gehel.

Change 541707 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool es1013, es1014

https://gerrit.wikimedia.org/r/541707

Change 541707 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool es1013, es1014

https://gerrit.wikimedia.org/r/541707

Mentioned in SAL (#wikimedia-operations) [2019-10-09T05:45:10Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Depool es1013, es1014 T227536 (duration: 01m 00s)

Mentioned in SAL (#wikimedia-operations) [2019-10-10T05:47:40Z] <marostegui> Depool db1084 db1083 db1076 db1118 for PDU maintenance - T227536

Mentioned in SAL (#wikimedia-operations) [2019-10-10T05:51:55Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1112 for PDU maintenance T227536', diff saved to https://phabricator.wikimedia.org/P9294 and previous config saved to /var/cache/conftool/dbconfig/20191010-055153-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2019-10-10T07:55:12Z] <marostegui> Stop MySQL on es1014 es1013 db1084 db1083 db1077 db1076 db1112 db1124 db1118 for on-site PDU maintenance (this will generate lag on labsdb hosts) - T227536

All the databases, es1013, es1014 and dbproxy1014 are good to go.

dbstore1004 needs to be handled by @elukey, so please coordinate with him before working on that one.

Mentioned in SAL (#wikimedia-operations) [2019-10-10T11:35:56Z] <arturo> icinga downtime cloudvirt1026 for 2h (T227536)

Mentioned in SAL (#wikimedia-operations) [2019-10-10T11:36:33Z] <arturo> icinga downtime cloudvirt1025 for 2h (T227536)

Mentioned in SAL (#wikimedia-operations) [2019-10-10T11:37:16Z] <arturo> icinga downtime cloudvirt1023 for 2h (T227536)

Mentioned in SAL (#wikimedia-cloud) [2019-10-10T11:59:53Z] <arturo> network switch hardware is down affecting cloudvirt1025/1026 (T227536) VMs are supposed to be online but unreachable

The swap is finished; everything is back up with redundant power.
Updated Netbox for the old and new PDU.

Jclark-ctr updated the task description.
Jclark-ctr added a subscriber: Jclark-ctr.

Change 542321 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] PDUs: add model sentry 4 to eqiad b1 and a2

https://gerrit.wikimedia.org/r/542321

Change 542321 merged by Ayounsi:
[operations/puppet@production] PDUs: add model sentry 4 to eqiad b1 and a2

https://gerrit.wikimedia.org/r/542321

RobH removed RobH as the assignee of this task.