
b8-eqiad pdu refresh (Thursday 10/31 @11am UTC)
Closed, Resolved (Public)

Description

This task tracks the replacement of ps1 and ps2 with new PDUs in rack B8-eqiad.

Each server & switch will need to have potential downtime scheduled, since this will be a live power change of the PDU towers.

These racks have a single tower for the old PDU (with an A and a B side), while the new PDUs have independent A and B towers.

  • Schedule downtime for the entire list of switches and servers.
  • Wire up one of the two towers, energize it, and relocate power to it from the existing/old PDU tower (now de-energized).
  • Confirm the entire list of switches, routers, and servers have had their power restored from the new PDU tower.
  • Once the new PDU tower is confirmed online, move on to the next steps.
  • Wire up the remaining tower, energize it, and relocate power to it from the existing/old PDU tower (now de-energized).
  • Confirm the entire list of switches, routers, and servers have had their power restored from the new PDU tower.
  • Connect via serial / confirm the serial connection works.
  • Set up the PDU following the directions at https://wikitech.wikimedia.org/wiki/Platform-specific_documentation/ServerTech#Initial_Setup
  • Update the PDU model in puppet per T233129.
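
The ordering above works because a device stays up only while at least one of its power feeds is live. A minimal Python sketch of that reasoning, assuming each device is cabled to an A-side feed, a B-side feed, or both. The `Device` class, the feed labels, and the single-fed example host are hypothetical and not taken from the rack inventory; it also simplifies the swap to "one whole side at a time".

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    name: str
    # Feeds the device is cabled to; dual-fed devices have both.
    feeds: set = field(default_factory=lambda: {"A", "B"})


def devices_losing_power(devices, live_feeds):
    """Return names of devices with no cabled feed currently energized."""
    return [d.name for d in devices if not (d.feeds & live_feeds)]


def simulate_swap(devices):
    """Move side A, then side B, recording which devices would drop."""
    report = {}
    # Step 1: side A is being moved to the new tower; only B is live.
    report["swap A"] = devices_losing_power(devices, {"B"})
    # Step 2: A is live on the new tower; side B is being moved.
    report["swap B"] = devices_losing_power(devices, {"A"})
    return report


if __name__ == "__main__":
    rack = [
        Device("asw-b8-eqiad"),                     # assumed dual-fed
        Device("example-single-psu", feeds={"A"}),  # hypothetical host
    ]
    print(simulate_swap(rack))
```

Any device the sketch reports for a step is one that would need downtime or a pre-emptive failover before that step starts.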

List of routers, switches, and servers

| device | role | SRE team coordination | recommended action during maintenance |
| --- | --- | --- | --- |
| asw-b8-eqiad | asw | @ayounsi | ensure this doesn't go offline, as it will take the entire rack network offline |
| ganeti1018 | ganeti host | serviceops | needs to be emptied of VMs before |
| gerrit1001 | spare | | fine to do at any time |
| cloudvirt1030 | hypervisor | cloud-services-team | Lots of VMs, please handle with care. |
| db1132 | m2 master | DBA | This host is the m2 master, which holds some internal services; ensure it doesn't go offline. If it does, there is an automatic failover via proxies. |
| pc1008 | parsercache host | DBA | DBA to depool it |
| restbase1024 | restbase | serviceops, Services | fine to do at any time |
| an-master1002 | | Analytics | fine to do any time |
| dbproxy1015 | db proxy | DBA | Not in use |
| graphite1004 | | @fgiunchedi | no action needed; if power is lost and can't be restored quickly we'll switch to codfw |
| rdb1009 | redis master | serviceops | this will need coordination? |
| notebook1003 | | | |
| db1119 | db host | DBA | DBA to depool it |
| db1113 | db host | DBA | DBA to depool it |
| cloudservices1003 | DNS | cloud-services-team | fine to do at any time |
| mwmaint1002 | | | This is the primary mw maint system in eqiad; perhaps we should halt deployments during this time? |
| labpuppetmaster1001 | spare | cloud-services-team | Good to go. Host is being decommissioned. |
| ores1004 | ORES | serviceops | fine to do at any time |
| wtp1036 | parsoid | serviceops | fine to do at any time |
| wtp1035 | parsoid | serviceops | fine to do at any time |
| wtp1034 | parsoid | serviceops | fine to do at any time |
| dumpsdata1001 | dumps data server | @ArielGlenn | coordinate please |
| analytics1063 | | Analytics | fine to do any time |
| analytics1062 | | Analytics | fine to do any time |
| analytics1061 | | Analytics | fine to do any time |
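
The recommended actions in the list above fall into a few groups that determine what has to happen before the power swap. A rough Python sketch of that grouping, using a handful of hosts and notes from the list. The category names and keyword rules are hypothetical conveniences, not an existing tool or policy.

```python
# (device -> recommended action) pairs taken from the list above.
ACTIONS = {
    "ganeti1018": "needs to be emptied of VMs before",
    "pc1008": "DBA to depool it",
    "db1119": "DBA to depool it",
    "db1113": "DBA to depool it",
    "gerrit1001": "fine to do at any time",
    "dumpsdata1001": "coordinate please",
    "notebook1003": "",  # no action recorded in the list
}


def classify(note):
    """Map a free-text note to a coarse preparation category (assumed rules)."""
    text = note.lower()
    if "depool" in text:
        return "depool first"
    if "emptied" in text:
        return "drain first"
    if "fine to do" in text:
        return "no prep"
    # Anything unclear or blank goes back to the owning team.
    return "ask owner"


def prep_plan(actions):
    """Group hosts by preparation category, keeping list order."""
    plan = {}
    for host, note in actions.items():
        plan.setdefault(classify(note), []).append(host)
    return plan


if __name__ == "__main__":
    for category, hosts in prep_plan(ACTIONS).items():
        print(f"{category}: {', '.join(hosts)}")
```

Hosts with a blank or ambiguous note end up under "ask owner", which matches how entries such as notebook1003 would need a follow-up before the maintenance window.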

Event Timeline

wiki_willy renamed this task from b8-eqiad pdu refresh to b8-eqiad pdu refresh (Thursday 10/31 @11am UTC). Aug 15 2019, 5:39 PM
RobH updated the task description.
RobH added subscribers: ayounsi, Nuria, ArielGlenn.
RobH added a subscriber: akosiaris.
RobH removed RobH as the assignee of this task. Aug 28 2019, 6:30 PM
RobH triaged this task as High priority.
RobH updated the task description.

Mentioned in SAL (#wikimedia-cloud) [2019-10-31T11:01:01Z] <arturo> icinga-downtimed cloudvirt1030 and cloudservices1003 for 1h due to PDU upgrade operations T227543

Change 547508 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: depool pc1008 temporarily

https://gerrit.wikimedia.org/r/547508

Mentioned in SAL (#wikimedia-operations) [2019-10-31T11:37:01Z] <jynus@cumin1001> dbctl commit (dc=all): 'Depool db1119, db1113 T227543', diff saved to https://phabricator.wikimedia.org/P9507 and previous config saved to /var/cache/conftool/dbconfig/20191031-113659-jynus.json

Change 547508 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: depool pc1008 temporarily

https://gerrit.wikimedia.org/r/547508

Mentioned in SAL (#wikimedia-operations) [2019-10-31T11:43:33Z] <jynus@deploy1001> Synchronized wmf-config/db-eqiad.php: depooling pc1008 T227543 (duration: 01m 01s)

Finished PDU refresh; Netbox updated.

Mentioned in SAL (#wikimedia-operations) [2019-10-31T13:16:07Z] <jynus@cumin1001> dbctl commit (dc=all): 'Repool db1119, db1113 at 10% T227543', diff saved to https://phabricator.wikimedia.org/P9509 and previous config saved to /var/cache/conftool/dbconfig/20191031-131606-jynus.json

Mentioned in SAL (#wikimedia-operations) [2019-10-31T13:21:19Z] <jynus@deploy1001> Synchronized wmf-config/db-eqiad.php: repool pc1008 T227543 (duration: 01m 02s)
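
The SAL timestamps above are enough to work out how long each host was out of rotation. A small Python sketch of that arithmetic; the helper names are hypothetical, and the db1119/db1113 window ends at the 10% repool, not at a full repool.

```python
from datetime import datetime


def parse_sal_time(stamp):
    """Parse a SAL timestamp such as 2019-10-31T11:37:01Z."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")


def window(start, end):
    """Return the elapsed time between two SAL timestamps."""
    return parse_sal_time(end) - parse_sal_time(start)


# Depool and repool entries for db1119/db1113 (dbctl commits).
db_window = window("2019-10-31T11:37:01Z", "2019-10-31T13:16:07Z")
# Depool and repool entries for pc1008 (db-eqiad.php syncs).
pc_window = window("2019-10-31T11:43:33Z", "2019-10-31T13:21:19Z")

if __name__ == "__main__":
    print("db1119/db1113 depooled for", db_window)
    print("pc1008 depooled for", pc_window)
```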

Please note the serial connection for ps1-b8-eqiad is non-functional at this time.

7 $> ssh root@scs-a8-eqiad.mgmt.eqiad.wmnet
Password: 
# pmshell

 1: ps1-a1-eqiad    2: ps1-a2-eqiad    3: ps1-a3-eqiad    4: ps1-a4-eqiad
 5: ps1-a5-eqiad    6: ps1-a6-eqiad    7: ps1-a7-eqiad    8: ps1-a8-eqiad
 9: ps1-b1-eqiad   10: ps1-b2-eqiad   11: ps1-b3-eqiad   12: ps1-b4-eqiad
13: ps1-b5-eqiad   14: ps1-b6-eqiad   15: ps1-b7-eqiad   16: ps1-b8-eqiad
17: asw-a1-eqiad   18: asw-a2-eqiad   19: asw-a3-eqiad   20: asw-a4-eqiad
21: asw-a5-eqiad   22: asw-a6-eqiad   23: asw-a7-eqiad   24: asw-a8-eqiad
25: asw-b1-eqiad   26: asw-b2-eqiad   27: asw-b3-eqiad   28: asw-b4-eqiad
29: asw-b5-eqiad   30: asw-b6-eqiad   31: asw-b7-eqiad   32: asw-b8-eqiad
33: re0.cr1-eqiad  34: re1.cr1-eqiad  35: re0.cr2-eqiad  36: re1.cr2-eqiad
37: mr1-eqiad      40: msw1-eqiad     41: asw2-a5-eqiad  45: asw2-a3-eqiad

Connect to port > 16

When I hit enter, it should prompt for the login, but does not.

This needs to be fixed by on-sites.

Cable reseated (clip was bent) by @Jclark-ctr - reassigning back to @RobH for configuration.

Reseated cable; fixed bent clip.

# pmshell

 1: ps1-a1-eqiad    2: ps1-a2-eqiad    3: ps1-a3-eqiad    4: ps1-a4-eqiad
 5: ps1-a5-eqiad    6: ps1-a6-eqiad    7: ps1-a7-eqiad    8: ps1-a8-eqiad
 9: ps1-b1-eqiad   10: ps1-b2-eqiad   11: ps1-b3-eqiad   12: ps1-b4-eqiad
13: ps1-b5-eqiad   14: ps1-b6-eqiad   15: ps1-b7-eqiad   16: ps1-b8-eqiad
17: asw-a1-eqiad   18: asw-a2-eqiad   19: asw-a3-eqiad   20: asw-a4-eqiad
21: asw-a5-eqiad   22: asw-a6-eqiad   23: asw-a7-eqiad   24: asw-a8-eqiad
25: asw-b1-eqiad   26: asw-b2-eqiad   27: asw-b3-eqiad   28: asw-b4-eqiad
29: asw-b5-eqiad   30: asw-b6-eqiad   31: asw-b7-eqiad   32: asw-b8-eqiad
33: re0.cr1-eqiad  34: re1.cr1-eqiad  35: re0.cr2-eqiad  36: re1.cr2-eqiad
37: mr1-eqiad      40: msw1-eqiad     41: asw2-a5-eqiad  45: asw2-a3-eqiad

Connect to port > 16

Sentry Smart PDU Version 8.0n

Username:
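
The two sessions above differ only in whether the PDU answers with a login prompt after the port is selected. A minimal Python sketch of both checks, working purely on pasted console text: it looks up a device's port in the `pmshell` menu and tests whether the output reached the `Username:` prompt. The function names are hypothetical, and the sketch does not open any connection itself.

```python
import re

# A few menu rows copied from the pmshell output above.
MENU = """\
13: ps1-b5-eqiad   14: ps1-b6-eqiad   15: ps1-b7-eqiad   16: ps1-b8-eqiad
29: asw-b5-eqiad   30: asw-b6-eqiad   31: asw-b7-eqiad   32: asw-b8-eqiad
37: mr1-eqiad      40: msw1-eqiad     41: asw2-a5-eqiad  45: asw2-a3-eqiad
"""


def parse_menu(text):
    """Map device name -> port number from 'NN: name' menu entries."""
    return {name: int(port) for port, name in re.findall(r"(\d+):\s+(\S+)", text)}


def port_for(text, device):
    """Return the console port for a device, or None if it is not listed."""
    return parse_menu(text).get(device)


def has_login_prompt(console_output):
    """True if the serial console got as far as the PDU login prompt."""
    return "Username:" in console_output


if __name__ == "__main__":
    print(port_for(MENU, "ps1-b8-eqiad"))
    print(has_login_prompt("Sentry Smart PDU Version 8.0n\n\nUsername:"))
```

An empty response after selecting the port, as in the first session, would make `has_login_prompt` return `False`, which is the condition that was handed to on-site staff.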

Mentioned in SAL (#wikimedia-operations) [2019-10-31T21:25:13Z] <robh> setting up ps1-b8-eqiad per T227543. it will reboot twice in the next 15 minutes, and then should start to clear up in icinga

Change 547657 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] ps1-b8-eqiad update for monitoring

https://gerrit.wikimedia.org/r/547657

Change 547657 merged by RobH:
[operations/puppet@production] ps1-b8-eqiad update for monitoring

https://gerrit.wikimedia.org/r/547657

RobH removed RobH as the assignee of this task.
RobH updated the task description.

All green in icinga and calling in normally, resolved.