a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC)
Closed, Resolved · Public

Description

This task tracks the replacement of ps1 and ps2 in rack A6-eqiad with new PDUs.

Each server & switch will need to have potential downtime scheduled, since this will be a live power change of the PDU towers.

This rack has a single tower for the old PDU (with an A and a B side), while the new PDUs have independent A and B towers.

  • - Schedule downtime for the entire list of switches and servers (a downtime sketch follows this checklist).
  • - Wire up one of the two new towers, energize it, and relocate power to it from the existing/old PDU tower (now de-energized).
  • - Confirm the entire list of switches, routers, and servers have had their power restored from the new PDU tower (a reachability check over the device list is sketched after the table below).
  • - Once the new PDU tower is confirmed online, move on to the next steps.
  • - Wire up the remaining tower, energize it, and relocate power to it from the existing/old PDU tower (now de-energized).
  • - Confirm the entire list of switches, routers, and servers have had their power restored from the new PDU tower.
  • - Confirm serial works to the new PDU (it does not as of 2019-10-22 @ 17:08 GMT).
  • - Set up the PDU following the directions on https://wikitech.wikimedia.org/wiki/Platform-specific_documentation/ServerTech#Initial_Setup
  • - Update the PDU model in puppet per T233129.
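
For the downtime step above, here is a minimal sketch of scheduling fixed Icinga downtime for the affected hosts via Icinga's external command file. The command file path, the 4-hour window, and the (truncated) host list are assumptions; the actual tooling used in production may be a wrapper or cookbook instead.

    #!/bin/bash
    # Sketch only: schedule fixed Icinga downtime for rack A6 hosts ahead of the PDU swap.
    CMDFILE=/var/lib/icinga/rw/icinga.cmd   # assumed path to Icinga's external command file
    HOSTS="pc1007 wtp1025 wtp1026 wtp1027 mc1019 mc1020 mc1021 mc1022 mc1023"
    NOW=$(date +%s)
    END=$((NOW + 4 * 3600))
    for h in $HOSTS; do
      # Downtime the host itself and all of its services.
      printf '[%d] SCHEDULE_HOST_DOWNTIME;%s;%d;%d;1;0;0;sre;a6-eqiad PDU refresh T227142\n' \
        "$NOW" "$h" "$NOW" "$END" >> "$CMDFILE"
      printf '[%d] SCHEDULE_HOST_SVC_DOWNTIME;%s;%d;%d;1;0;0;sre;a6-eqiad PDU refresh T227142\n' \
        "$NOW" "$h" "$NOW" "$END" >> "$CMDFILE"
    done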

List of routers, switches, and servers

device | role | SRE team / coordination | notes
asw2-a6-eqiad | asw | @ayounsi |
pc1007 | parsercache | DBA | can be failed over easily; @Marostegui to depool this host
wtp1027 | parsoid | serviceops | fine to do at any time
wtp1026 | parsoid | serviceops | fine to do at any time
wtp1025 | parsoid | serviceops | fine to do at any time
an-master1001 | | Analytics | fine to do any time
dbproxy1013 | dbproxy | DBA | not active
elastic1045 | cirrus-search | Discovery-Search | @Gehel good to go
elastic1044 | cirrus-search | Discovery-Search | @Gehel good to go
elastic1048 | cirrus-search | Discovery-Search | @Gehel good to go
mc1023 | mc | serviceops @elukey | fine to do at any time outside of deployment windows
mc1022 | mc | serviceops @elukey | fine to do at any time outside of deployment windows
mc1021 | mc | serviceops @elukey | fine to do at any time outside of deployment windows
mc1020 | mc | serviceops @elukey | fine to do at any time outside of deployment windows
mc1019 | mc | serviceops @elukey | fine to do at any time outside of deployment windows
aqs1007 | | Analytics | fine to do any time
weblog1001 | | | fine to do any time but it may disrupt some webrequest monitoring that we rely on, Cc: @godog
restbase1021 | restbase | @jijiki | ok with power loss
labsdb1012 | labsdb | Analytics | Analytics to confirm if MySQL can be stopped
db1066 | db | DBA | Host powered off, DO NOT POWER ON - pending on-site decommissioning steps T233071
db1116 | db | DBA | backup source, nothing to be done
db1115 | db | DBA | tendril host, nothing to be done
labmon1002 | labmon | cloud-services-team | can be done anytime
druid1004 | | Analytics | fine to do any time
wdqs1004 | wdqs | Discovery-Search | @Gehel good to go
ores1001 | ores | @akosiaris | fine to do at any time
restbase-dev1004 | | | can be done at any time
cloudcontrol1003 | openstack control node | cloud-services-team | can be done at any time
mw1312 | mw | serviceops | fine to do at any time outside of deployment windows
mw1311 | mw | serviceops | fine to do at any time outside of deployment windows
mw1310 | mw | serviceops | fine to do at any time outside of deployment windows
mw1309 | mw | serviceops | fine to do at any time outside of deployment windows
mw1308 | mw | serviceops | fine to do at any time outside of deployment windows
mw1307 | mw | serviceops | fine to do at any time outside of deployment windows
ganeti1006 | ganeti node | @akosiaris | will need to be emptied in advance
db1096 | db | DBA | @Marostegui to depool this host
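
After each tower move, the "confirm power restored" checklist steps can be rough-checked with a simple reachability loop over the servers listed above. This is a sketch only: the .eqiad.wmnet domain and ICMP reachability are assumptions, the switch and PDU need their own checks on the management network, and db1066 is left out because it is intentionally powered off.

    # Report which of the listed hosts answer ping again after a tower move.
    for h in pc1007 wtp1025 wtp1026 wtp1027 an-master1001 dbproxy1013 \
             elastic1044 elastic1045 elastic1048 mc1019 mc1020 mc1021 mc1022 mc1023 \
             aqs1007 weblog1001 restbase1021 labsdb1012 db1116 db1115 labmon1002 \
             druid1004 wdqs1004 ores1001 restbase-dev1004 cloudcontrol1003 \
             mw1307 mw1308 mw1309 mw1310 mw1311 mw1312 ganeti1006 db1096; do
      if ping -c1 -W2 "${h}.eqiad.wmnet" >/dev/null 2>&1; then
        echo "OK   ${h}"
      else
        echo "DOWN ${h}"
      fi
    done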

Event Timeline

Analytics side: if possible I'd need some heads up to force a failover for an-master1001.

Memcached side: we have 5 mc10XX shards in the same rack, and losing all of them could be a big problem with the current configuration of mcrouter. Explicitly adding @Joe and @jijiki to understand how to handle this.

akosiaris added a subscriber: MoritzMuehlenhoff.

ganeti1006 can be emptied ahead of the maintenance by live-migrating its instances off the node:

sudo gnt-node migrate -f ganeti1006
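
Once that migration completes, a quick sanity check that ganeti1006 no longer carries primary instances (a sketch; Pinst is the primary-instance count column in standard Ganeti output):

    # Pinst should read 0 for ganeti1006 before the power work starts.
    sudo gnt-node list ganeti1006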

This rack contains an active primary db master, db1066; it would need to be failed over if we are not confident about not losing power.

RobH removed RobH as the assignee of this task.Aug 14 2019, 4:53 PM
wiki_willy renamed this task from a6-eqiad pdu refresh to a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC).Aug 15 2019, 5:31 PM
CDanis triaged this task as Medium priority.Aug 16 2019, 1:02 PM

@Marostegui - I would say just go for it and fail out in advance, if it's not too much trouble. Master DBs are very critical, so my opinion is to just take the extra precautionary measures. Thanks, Willy

I will get them scheduled, planned etc. Thanks

Marostegui updated the task description.
Marostegui updated the task description.

@elukey for labsdb1012 your team would need to let us know if MySQL can be stopped for this maintenance (just in case there is power loss, it is better to have MySQL stopped, as labs hosts do not have GTID enabled and the risk of corruption can be higher).

We can definitely stop MySQL on it; we need labsdb up and running for jobs at the beginning of the month :)

I also added the info about analytics hosts and flipped the requirement of depooling for memcached to "no", since we should do it only if things go on fire :)

Excellent, thank you. Let's stop replication + mysql then.
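
For reference, a minimal sketch of that sequence on labsdb1012, assuming a standard systemd-managed MariaDB service (the actual procedure may go through local wrapper scripts):

    # Stop replication first so the replica position is recorded cleanly, then stop the server.
    sudo mysql -e "STOP SLAVE;"
    sudo systemctl stop mariadb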

Change 542890 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Temporary pool pc1010 in pc1

https://gerrit.wikimedia.org/r/542890

Change 542890 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Temporary pool pc1010 in pc1

https://gerrit.wikimedia.org/r/542890

Mentioned in SAL (#wikimedia-operations) [2019-10-22T06:43:11Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Depool pc1010 T227142 (duration: 00m 52s)

Mentioned in SAL (#wikimedia-operations) [2019-10-22T07:53:49Z] <marostegui> Stop MySQL on db1116 pc1007 db1096:3315, db1096:3316 for PDU maintenance T227142

Mentioned in SAL (#wikimedia-operations) [2019-10-22T08:05:40Z] <marostegui> Stop MySQL on labsdb1012 for PDU work T227142

The following hosts are ready for this maintenance:

  • pc1007
  • labsdb1012
  • db1116
  • db1096
  • dbproxy1013
  • db1066 (note: this host is powered OFF as it is ready to be decommissioned; do not power it back on)

Pending: db1115 which will be confirmed by @jcrespo when ready to proceed.

Mentioned in SAL (#wikimedia-operations) [2019-10-22T10:32:26Z] <jynus> shutting down db1115 in preparation for PDU maintenance, this will make tendril and dbtree unavailable for 2 hours T227142

db1115 is now down, I took the opportunity to upgrade all its system packages, but didn't touch mariadb.

Finished PDU maintenance. Netbox updated with the new PDU.

Mentioned in SAL (#wikimedia-operations) [2019-10-22T12:25:56Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Repool pc1007 after PDU maintenance T227142 (duration: 00m 50s)

Change 545337 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] setting new pdu models

https://gerrit.wikimedia.org/r/545337

Change 545337 merged by RobH:
[operations/puppet@production] setting new pdu models

https://gerrit.wikimedia.org/r/545337

RobH removed a project: Patch-For-Review.
RobH updated the task description.

@wiki_willy requested I step in and set up the software side of things, but I cannot do so as serial to this PDU isn't currently working.

Can you troubleshoot the serial connection please? (You should be able to log in to the scs console and see if it works; you can ping me and I can teach you how to do this if you like!)

The Icinga downtime was set to expire in less than an hour, so I've extended it until 23:00 GMT.

ps1-a6-eqiad is shown as down in Icinga; I believe that is expected?

Hi @jijiki - I think there are a couple of things that @Jclark-ctr needs to check and resolve before @RobH can configure it. After that, the alert should go away. Thanks, Willy

Mentioned in SAL (#wikimedia-operations) [2019-10-24T18:03:39Z] <robh> setting ip info for ps1-a6-eqiad, it is rebooting. T227142

Mentioned in SAL (#wikimedia-operations) [2019-10-24T18:20:04Z] <robh> ps1-a6-eqiad setup complete, icinga errors should clear up T227142

RobH removed RobH as the assignee of this task.Oct 24 2019, 6:44 PM