
b6-eqiad pdu refresh (Tuesday 9/10 @11am UTC)
Closed, Resolved (Public)

Description

This task tracks the replacement of ps1 and ps2 with new PDUs in rack B6-eqiad.

Each server & switch will need to have potential downtime scheduled, since this will be a live power change of the PDU towers.

These racks have a single tower for the old PDU (with an A and a B side), while the new PDUs have independent A and B towers.

  • Schedule downtime for the entire list of switches and servers.
  • Wire up one of the two new towers, energize it, and relocate power to it from the existing/old PDU tower (now de-energized).
  • Confirm that every switch, router, and server on the list has had its power restored from the new PDU tower.
  • Once the new PDU tower is confirmed online, move on to the next steps.
  • Wire up the remaining tower, energize it, and relocate power to it from the existing/old PDU tower (now de-energized).
  • Confirm that every switch, router, and server on the list has had its power restored from the new PDU tower.
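The two-phase swap above can be sketched as a small simulation. This is only an illustration under stated assumptions: the `Device` class, the `swap_side` and `confirm_restored` helpers, and the two host names are hypothetical and do not describe the actual cabling in B6. It moves one side at a time and reports which devices would be without power while their cable is moved, which is why downtime is scheduled for everything beforehand.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Device:
    name: str
    # Side ("A" or "B") -> True while that power feed is live.
    feeds: Dict[str, bool] = field(default_factory=lambda: {"A": True, "B": True})

    def has_power(self) -> bool:
        return any(self.feeds.values())


def swap_side(devices: List[Device], side: str) -> List[str]:
    """Move every feed on `side` from the old PDU to the new tower.

    Returns the names of devices that had no live feed while their
    cable was being moved (these are the ones that take an outage).
    """
    went_dark = []
    for dev in devices:
        if side not in dev.feeds:
            continue
        dev.feeds[side] = False   # unplugged from the old PDU
        if not dev.has_power():
            went_dark.append(dev.name)
        dev.feeds[side] = True    # plugged into the new, energized tower
    return went_dark


def confirm_restored(devices: List[Device]) -> List[str]:
    """Names of devices that still have a feed down after a swap."""
    return [d.name for d in devices if not all(d.feeds.values())]


# Hypothetical rack contents, not the real B6 inventory.
rack = [
    Device("dual-fed-host"),
    Device("single-fed-host", feeds={"A": True}),
]

for side in ("A", "B"):
    dark = swap_side(rack, side)
    pending = confirm_restored(rack)
    print(f"side {side}: lost power={dark} still down={pending}")
```

In this model a dual-fed device rides through both phases on its other feed, while a single-fed device goes dark during the phase that touches its only side. The real procedure de-energizes a whole side at once rather than one cable at a time, so the per-device loop is a simplification.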

List of routers, switches, and servers

| device | role | SRE team coordination |
| --- | --- | --- |
| asw2-b6-eqiad | | |
| mc1027 | mc host | serviceops ok to go with patch ready for easy removal |
| mc1026 | mc host | serviceops ok to go with patch ready for easy removal |
| mc1025 | mc host | serviceops ok to go with patch ready for easy removal |
| mc1024 | mc host | serviceops ok to go with patch ready for easy removal |
| cloudvirt1029 | hypervisor | @aborrero - good to go. |
| scb1002 | scb cluster | serviceops - if poweroff, run depool as root first |
| wdqs1009 | | @Gehel good to go |
| aqs1008 | Analytics | fine to do any time |
| kubernetes1002 | kubernetes node | poweroff/poweron serviceops node must be put out of rotation |
| puppetmaster1001 | puppetmaster frontend | https://wikitech.wikimedia.org/wiki/Puppet#Pool_/_depool_a_frontend @jbond @herron @colewhite |
| elastic1028 | elastic | @Gehel good to go |
| elastic1047 | elastic | @Gehel good to go |
| elastic1046 | elastic | @Gehel good to go |
| mw1306 | mw jobrunner | serviceops - good to go |
| mw1305 | mw jobrunner | serviceops - good to go |
| mw1304 | mw jobrunner | serviceops - good to go |
| mw1303 | mw jobrunner | serviceops - good to go |
| mw1302 | mw jobrunner | serviceops - good to go |
| mw1301 | mw jobrunner | serviceops - good to go |
| mw1300 | mw jobrunner | serviceops - good to go |
| mw1299 | mw jobrunner | serviceops - good to go |
| mw1298 | mw jobrunner | serviceops - good to go |
| mw1297 | mw jobrunner | serviceops - good to go |
| mw1296 | mw jobrunner | serviceops - good to go |
| mw1295 | mw jobrunner | serviceops - good to go |
| mw1294 | mw jobrunner | serviceops - good to go |
| mw1293 | mw jobrunner | serviceops - good to go |
| thumbor1002 | thumbor host | depool setting pooled=inactive serviceops |
| thumbor1001 | thumbor host | depool setting pooled=inactive serviceops |
| mw1290 | mw API host | serviceops - good to go |
| mw1289 | mw API host | serviceops - good to go |
| mw1288 | mw API host | serviceops - good to go |
| mw1287 | mw API host | serviceops - good to go |
| mw1286 | mw API host | serviceops - good to go |
| mw1285 | mw API host | serviceops - good to go |
| mw1284 | mw API host | serviceops - good to go |

Event Timeline

RobH renamed this task from b5-eqiad pdu refresh to b6-eqiad pdu refresh. Jul 8 2019, 10:46 PM
RobH created this task.
RobH updated the task description.
wiki_willy renamed this task from b6-eqiad pdu refresh to b6-eqiad pdu refresh (Tuesday 9/10 @11am UTC). Aug 15 2019, 5:37 PM
colewhite updated the task description.
colewhite added subscribers: jbond, herron.
colewhite subscribed.
RobH removed RobH as the assignee of this task. Aug 28 2019, 6:17 PM
RobH added a subscriber: Joe.

Removing myself as assignee since this has all the servers populated in the task description.

For the mw and thumbor hosts, @Joe stated in last week's meeting that he would add a follow-up comment.

Joe updated the task description.
jbond triaged this task as Medium priority. Sep 9 2019, 9:15 AM

Mentioned in SAL (#wikimedia-operations) [2019-09-10T11:20:54Z] <cmjohnson1> swapping the PDU in rack B6 eqiad T227541

The PDU has been swapped and the new PDUs are in Netbox. @RobH, can you help with the serial console setup, please?

Change 536140 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] facilities: ps1-b6-eqiad replaced with newer PDU

https://gerrit.wikimedia.org/r/536140

Change 536140 merged by Filippo Giunchedi:
[operations/puppet@production] facilities: ps1-b6-eqiad replaced with newer PDU

https://gerrit.wikimedia.org/r/536140

Trying to figure out why this is failing: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=ps1-b6-eqiad
The error is:

External command error: Error in packet
Reason: (noSuchName) There is no such variable name in this MIB.
Failed object: iso.3.6.1.4.1.1718.3.2.2.1.7.1.1
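For anyone triaging the same check, a minimal sketch of pulling the failed OID out of the error text and splitting it into its enterprise number and vendor-specific subtree. The helper names `parse_failed_oid` and `split_enterprise` are hypothetical, and the sketch only decomposes the OID; which vendor MIB object the subtree maps to is not resolved here and would need to be checked against the PDU vendor's MIB files.

```python
import re

# iso.org.dod.internet.private.enterprises
PRIVATE_ENTERPRISES = (1, 3, 6, 1, 4, 1)


def parse_failed_oid(error_text):
    """Return the 'Failed object' OID from the error text as a tuple of ints."""
    match = re.search(r"Failed object:\s*(\S+)", error_text)
    if match is None:
        raise ValueError("no 'Failed object' line found")
    parts = match.group(1).split(".")
    if parts[0] == "iso":  # the leading arc 1 is printed by name
        parts[0] = "1"
    return tuple(int(p) for p in parts)


def split_enterprise(oid):
    """Split an OID into (enterprise number, vendor-specific subtree)."""
    prefix_len = len(PRIVATE_ENTERPRISES)
    if oid[:prefix_len] != PRIVATE_ENTERPRISES:
        raise ValueError("OID is not under the private enterprises arc")
    rest = oid[prefix_len:]
    return rest[0], rest[1:]


error = """External command error: Error in packet
Reason: (noSuchName) There is no such variable name in this MIB.
Failed object: iso.3.6.1.4.1.1718.3.2.2.1.7.1.1"""

oid = parse_failed_oid(error)
enterprise, subtree = split_enterprise(oid)
print("enterprise:", enterprise)
print("subtree:", ".".join(str(n) for n in subtree))
```

The `noSuchName` reason means the agent on the device does not expose the object that was requested, so the useful comparison is between the subtree the check asks for and the subtrees the new PDU actually answers on.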

My first question is: is there a tower B?

I don't see one on https://librenms.wikimedia.org/device/device=50/, while, for example, https://librenms.wikimedia.org/device/device=40/ does have one.
Netbox says that there is a ps2.

Or is it a bug or a misconfiguration?

Checked with @Cmjohnson , who says he'll follow up to check the connections.

15:15 <@RobH> : So, I can confirm in librenms it sees both towers
15:15 <@RobH> : so, this seems to me to be an icinga issue
15:15 <@RobH> : Does this seem reasonable? If so, we need to likely involve someone with some icinga knowledge.

Please note this is an issue that is happening on ALL the new PDUs. I'll update the parent task.

Please note that when I compare librenms output it seems like it sees both towers right now:

ps1-b6-eqiad:
https://librenms.wikimedia.org/device/device=50/
ps1-a4-eqiad:
https://librenms.wikimedia.org/device/device=40/

Both show voltages for AA phases and BB phases (that is the 3 phases per tower).

I think this is OK to resolve, but I'm not 100% sure that I am looking at the same thing as @ayounsi.

It was a PDU misconfiguration and a monitoring issue. It was solved in https://phabricator.wikimedia.org/T229328

Thanks for confirming, @ayounsi. Resolving task.