b3-eqiad pdu refresh (Tuesday 9/17 @11am UTC)
Closed, Resolved (Public)

Description

This task tracks the replacement of ps1 and ps2 with new PDUs in rack B3-eqiad.

Each server & switch will need to have potential downtime scheduled, since this will be a live power change of the PDU towers.

These racks have a single tower for the old PDU (with an A and a B side), while the new PDUs have independent A and B towers.

  • Schedule downtime for the entire list of switches and servers.
  • Wire up one of the two towers, energize, and relocate power to it from the existing/old PDU tower (now de-energized).
  • Confirm the entire list of switches, routers, and servers have had their power restored from the new PDU tower.
  • Once the new PDU tower is confirmed online, move on to the next steps.
  • Wire up the remaining tower, energize, and relocate power to it from the existing/old PDU tower (now de-energized).
  • Confirm the entire list of switches, routers, and servers have had their power restored from the new PDU tower.
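The two-phase swap above moves one side at a time, so a device is only exposed while the side feeding its sole power supply is being relocated. A minimal sketch of that reasoning follows; the device names and their feed assignments are hypothetical, since the task does not record which hosts are single-fed.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    feeds: set  # PDU sides the device is plugged into, e.g. {"A", "B"}


def devices_at_risk(devices, side_being_moved):
    """Names of devices left with no power feed while one side is de-energized."""
    return sorted(d.name for d in devices if not (d.feeds - {side_being_moved}))


def plan_swap(devices):
    """Return (side, at_risk) for each phase: side A first, then side B."""
    return [(side, devices_at_risk(devices, side)) for side in ("A", "B")]


# Hypothetical rack contents, for illustration only.
rack = [
    Device("dual-psu-host", {"A", "B"}),
    Device("single-psu-host", {"A"}),
]

for side, at_risk in plan_swap(rack):
    print(f"moving side {side}: at risk = {at_risk}")
```

Any device that shows up in an at-risk list is one that would need downtime or a clean shutdown before that phase starts.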

List of routers, switches, and servers

device | role | SRE team coordination
asw2-b3-eqad | asw | @ayounsi
labnet1001 | spare | @Bstorm - it's a spare; do your worst
promethium | decom | Server is getting decommed in T191362
ms-be1031 | ms-be | @fgiunchedi - can shutdown and power up cleanly for maint window
labnodepool1001 | spare | @Bstorm - it's a spare; do your worst
db1104 | db | @Marostegui to depool it before the maintenance
db1073 | db | Host to be decommissioned T231892
restbase1022 | restbase | serviceops will depool it
db1086 | db | @Marostegui to depool it before the maintenance
db1085 | db | @Marostegui to depool it before the maintenance
elastic1039 | cirrus-search | @Gehel good to go
elastic1038 | cirrus-search | @Gehel good to go
elastic1037 | cirrus-search | @Gehel good to go
elastic1036 | cirrus-search | @Gehel good to go
analytics1051 | Analytics | fine to do any time
analytics1050 | Analytics | fine to do any time
analytics1049 | Analytics | fine to do any time
WMF5174 | |
analytics1048 | Analytics | fine to do any time
analytics1047 | Analytics | fine to do any time
analytics1046 | Analytics | fine to do any time
cloudvirt1027 | hypervisor | @aborrero - this is ready to go, important VMs reallocated
db1130 | db | @Marostegui to depool it before the maintenance
stat1007 | Analytics | fine to do any time
puppetmaster1003 | puppetmaster | @jbond @colewhite @herron may need depooled: https://wikitech.wikimedia.org/wiki/Puppet#Pool_/_depool_a_backend

Event Timeline

Mentioned in SAL (#wikimedia-cloud) [2019-07-24T10:12:55Z] <arturo> reallocating tools-docker-registry-04 from cloudvirt1027 to cloudvirt1028 (T227539)

Mentioned in SAL (#wikimedia-cloud) [2019-07-24T10:14:31Z] <arturo> reallocating tools-puppetmaster-01 from cloudvirt1027 to cloudvirt1028 (T227539)

Mentioned in SAL (#wikimedia-cloud) [2019-07-24T10:15:49Z] <arturo> reallocating proxy-02 from cloudvirt1027 to cloudvirt1028 (T227539)

aborrero added a subscriber: aborrero.

This is ready to go on our side, hopefully today :-)

Marostegui added a subscriber: Marostegui.

From the DBA side, it is good to go. db1073 is a master for m5 (wikitech, nova...); cloud-services-team needs to decide if they can afford a downtime there.

Yes, since this DB is redundant, we are ready to go regarding db1073.

Marostegui added subscribers: jcrespo, mark, faidon.

db1104 is the s8 primary master; we'd probably need to fail over this host if we are not confident it can be swapped over without downtime.
@mark @faidon what do you guys think? Another possibility is to try to do it without a switchover but be ready from the DB point of view to fail it over in case it goes down (cc @jcrespo)

wiki_willy renamed this task from "b3-eqiad pdu refresh" to "b3-eqiad pdu refresh (Tuesday 9/17 @11am UTC)". (Aug 15 2019, 5:35 PM)

@wiki_willy any advice on this comment?

Gehel added a subscriber: Gehel.

@Marostegui - I'll defer to Faidon or Mark for their opinion, but my suggestion is to go ahead and fail over in advance if it's not too much of a hassle. Our success rate at upgrading PDUs without any issues is pretty good, but unexpected accidents can occur, and master DBs are very critical to the infrastructure.

Thanks - I will try to get them scheduled.
It is not a hassle, but it is something that requires some planning ahead, mediawiki read-only time, coordination with other teams etc.

RobH removed RobH as the assignee of this task. (Aug 28 2019, 6:42 PM)
jbond triaged this task as Medium priority. (Sep 9 2019, 9:15 AM)
Bstorm added a subscriber: Bstorm.

@Cmjohnson - good to go for tomorrow's PDU upgrade, but please confirm with @Marostegui before you start that DBs have been depooled. Thanks, Willy

I will comment here once the DBs have been depooled tomorrow. I will do it a bit before the scheduled maintenance time.

Thanks!

elukey updated the task description.

Mentioned in SAL (#wikimedia-operations) [2019-09-17T09:29:02Z] <marostegui> Downtime db1073 db1130 db1104 db1085 db1086 for the PDU maintenance T227539

Mentioned in SAL (#wikimedia-operations) [2019-09-17T09:46:09Z] <marostegui> Depool and stop replication on db1130 db1104 db1085 db1086 (lag will appear on s6 on labsdb) for PDU maintenance - T227539

Mentioned in SAL (#wikimedia-operations) [2019-09-17T09:48:28Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool and stop replication on db1130 db1104 db1085 db1086 (lag will appear on s6 on labsdb) for PDU maintenance - T227539', diff saved to https://phabricator.wikimedia.org/P9116 and previous config saved to /var/cache/conftool/dbconfig/20190917-094827-marostegui.json

All the DBs have been downtimed and depooled, and replication has been stopped. From the DBAs' point of view, this maintenance is good to go.
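A quick cross-check of the SAL entries above can be sketched as a set comparison. The host lists are copied from the 09:29 downtime entry and the 09:46 depool entry; the function name is hypothetical, and the sketch only surfaces differences for a person to confirm, it does not query Icinga or dbctl.

```python
def downtimed_but_not_depooled(downtimed, depooled):
    """Hosts that appear in the downtime entry but not in the depool entry."""
    return sorted(set(downtimed) - set(depooled))


# Host lists as written in the two SAL entries for T227539.
downtimed = ["db1073", "db1130", "db1104", "db1085", "db1086"]
depooled = ["db1130", "db1104", "db1085", "db1086"]

print(downtimed_but_not_depooled(downtimed, depooled))
```

Here the only difference is db1073, which the thread discusses separately as the m5 master.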

Mentioned in SAL (#wikimedia-operations) [2019-09-17T11:24:00Z] <cmjohnson1> commencing pdu swap rack b3 eqiad T227539

Mentioned in SAL (#wikimedia-operations) [2019-09-17T13:02:55Z] <marostegui> Start replication on db1130 db1104 db1085 db1086 after PDU maintenance is completed - T227539

Mentioned in SAL (#wikimedia-operations) [2019-09-17T13:21:05Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Repool db1130 db1104 db1085 db1086 after PDU maintenance - T227539', diff saved to https://phabricator.wikimedia.org/P9117 and previous config saved to /var/cache/conftool/dbconfig/20190917-132102-marostegui.json

Jclark-ctr updated the task description.
Jclark-ctr added a subscriber: Jclark-ctr.

Finished swapping the PDU; reassigned to @RobH.

I've gone ahead and set up remote access and settings identical to the other new PDUs. It is now online and accessible via ping/ssh/syslog.

The Netbox/LibreNMS check is not happy: https://netbox.wikimedia.org/extras/reports/librenms.LibreNMS/
Did Netbox get updated with the new serial?

Someone onsite needs to enter the new device into netbox still, as the old devices are all that are in netbox at this time.

Clarification: https://netbox.wikimedia.org/dcim/devices/1394/ is the OLD ps1-b3-eqiad, which should have its hostname set to its asset tag and then be set to offline state, as it's unracked.

Then the new PDUs (likely already in netbox) need to be moved to this rack, given their hostnames, and set to active.

The on-sites need to do this, since the on-sites can tell what the asset tag and serials are for both towers most easily.
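The Netbox changes described above can be written down as a small plan before anyone touches the UI. This is only a sketch: it builds a list of intended field updates and does not call Netbox. Device ID 1394 comes from the thread; the asset tag, the new device IDs, and the assumption that the new towers take the names ps1-b3-eqiad and ps2-b3-eqiad are placeholders.

```python
def plan_netbox_changes(old_device, new_towers, rack):
    """Return the field updates described in the thread, without applying them."""
    changes = [
        # Old tower: rename to its asset tag and mark it offline (it is unracked).
        {"id": old_device["id"], "name": old_device["asset_tag"], "status": "offline"},
    ]
    for tower in new_towers:
        # New towers: place in the rack, assign the hostname, and mark active.
        changes.append(
            {"id": tower["id"], "name": tower["hostname"], "rack": rack, "status": "active"}
        )
    return changes


old = {"id": 1394, "asset_tag": "WMF0000"}  # asset tag is a placeholder
new = [
    {"id": 9001, "hostname": "ps1-b3-eqiad"},  # hypothetical ID and name
    {"id": 9002, "hostname": "ps2-b3-eqiad"},  # hypothetical ID and name
]

for change in plan_netbox_changes(old, new, rack="B3"):
    print(change)
```

Renaming the old device first matters if the new tower is to reuse the old hostname, since two devices sharing a name would be ambiguous.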

@Jclark-ctr - can you wrap up the netbox entries on this one, and then close out the task? Thanks, Willy

Updated netbox with the new PDUs.