
codfw pdu phase imbalances: audit and correct
Closed, Resolved (Public)

Description

With the recent migration to codfw, the datacenter is now under normal load conditions. As a result, some of the PDUs are reporting a phase imbalance.

With 3-phase power, power comes in on three phases (X, Y, Z) and is split into three banks: XY, YZ, and XZ. The load (which is not the same as the number of servers, but is closely related) must be spread evenly across these three banks on each PDU tower.
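As a minimal sketch of that arithmetic (the per-bank figures below are hypothetical, purely for illustration): each phase feeds the two banks that name it, so its load is the sum of those two bank loads.

```python
# Hypothetical per-bank loads in amps, for illustration only.
bank_load = {"xy": 2.4, "yz": 2.2, "xz": 2.6}

# Each phase feeds the two banks that name it, so its load is their sum.
phase_load = {
    "x": bank_load["xy"] + bank_load["xz"],
    "y": bank_load["xy"] + bank_load["yz"],
    "z": bank_load["yz"] + bank_load["xz"],
}

print(phase_load)  # {'x': 5.0, 'y': 4.6, 'z': 4.8}
```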

ps1-a3-codfw
SNMP WARNING - ps1-a3-codfw-infeed-load-tower-A-phase-Z *1588*

The tower A load breakdown is: X: 4.59 Amps, Y: 3.74 Amps, Z: 5.70 Amps
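(For reference, a rough sketch of pulling one of these per-phase readings over SNMP, assuming pysnmp's classic hlapi. The OID shown is the Sentry3 infeedLoadValue and is an assumption here, as are the index layout and community string; verify all three against the actual PDU's MIB before relying on this.)

```python
from pysnmp.hlapi import (
    getCmd, SnmpEngine, CommunityData, UdpTransportTarget,
    ContextData, ObjectType, ObjectIdentity,
)

# Assumed OID: Sentry3 infeedLoadValue, reported in hundredths of amps.
INFEED_LOAD_OID = "1.3.6.1.4.1.1718.3.2.2.1.7"

def infeed_amps(host, tower, phase, community="public"):
    """Read one infeed's load and return it in amps (index scheme assumed)."""
    error_indication, error_status, _, var_binds = next(getCmd(
        SnmpEngine(),
        CommunityData(community),
        UdpTransportTarget((host, 161)),
        ContextData(),
        ObjectType(ObjectIdentity(f"{INFEED_LOAD_OID}.{tower}.{phase}")),
    ))
    if error_indication or error_status:
        raise RuntimeError(str(error_indication or error_status))
    return int(var_binds[0][1]) / 100.0

# e.g. infeed_amps("ps1-a3-codfw", 1, 3)  # tower A, phase Z, if indexed that way
```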

So we need to pick a single server on bank XZ and move it to bank XY. The loads are fairly close, so one server may be enough. Once it's moved, we can check the balance again. Please list the ideal server for the move, and we'll review before the actual power cable move.
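To make the effect of such a move concrete, here is a small sketch (the ~1 A server draw is hypothetical): an XZ to XY move takes the server's draw off phase Z and adds it to phase Y, leaving X untouched.

```python
def move(phase_load, server_amps, src_bank, dst_bank):
    """Return the phase loads after moving one server between banks.

    The move subtracts the server's draw from the phase that is only in
    the source bank and adds it to the phase that is only in the
    destination bank; the phase shared by both banks is unaffected.
    """
    loads = dict(phase_load)
    for p in set(src_bank) - set(dst_bank):
        loads[p] -= server_amps
    for p in set(dst_bank) - set(src_bank):
        loads[p] += server_amps
    return loads

# Tower A on ps1-a3-codfw, moving a hypothetical ~1 A server from XZ to XY:
print(move({"x": 4.59, "y": 3.74, "z": 5.70}, 1.0, "xz", "xy"))
# -> {'x': 4.59, 'y': 4.74, 'z': 4.7}
```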

ps1-c6-codfw
SNMP WARNING - ps1-c6-codfw-infeed-load-tower-A-phase-X *1275*
SNMP WARNING - ps1-c6-codfw-infeed-load-tower-B-phase-X *1362*

Tower A Loads: X 12.71 Y 9.00 Z 8.23
Tower B Loads: X 13.58 Y 8.97 Z 8.40

So these have a very high load on X, with Y slightly higher than Z. It looks like a large server should be relocated off XY and onto YZ for each tower (the same server). The imbalance is quite large, so check whether we have new servers in this rack that aren't yet pulling load; that would partly explain it. Please list the ideal server for the move, and we'll review before the actual power cable move.

ps1-d6-codfw
SNMP WARNING - ps1-d6-codfw-infeed-load-tower-A-phase-X *1452*
SNMP WARNING - ps1-d6-codfw-infeed-load-tower-B-phase-X *1461*

Tower A Loads: X 14.50 Y 8.41 Z 9.42
Tower B Loads: X 14 Y 7 Z 9

So X and Z are high, while Y is lower. I'd pick a single server on XZ and move it to YZ. The imbalance is quite large, so check whether we have new servers in this rack that aren't yet pulling load; that would partly explain it. Please list the ideal server for the move, and we'll review before the actual power cable move.

These are ideally fixed while the racks are still under load, as that yields the most accurate power balance within the racks. However, moving systems under load is risky and should be done carefully:

  • Always ensure both power supply units are working in the system before moving power plugs.
  • Move one power supply at a time, allowing time between plug moves for the power supply to recover and resume providing power.
  • Make sure each power supply for a system plugs into the same bank on each tower. Hypothetical example: if mw2001 is plugged into the XZ bank on the tower A PDU, it should also be plugged into the XZ bank on the tower B PDU (see the sketch after this list).
  • Do NOT move the server, just add longer power cables to route to the proper bank.
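A minimal sketch of that same-bank consistency check, assuming a hand-maintained inventory mapping each host to the bank its plug lands on per tower (the hostnames and data layout here are illustrative, not from any real inventory system):

```python
# Hypothetical inventory: host -> (bank on tower A PDU, bank on tower B PDU).
inventory = {
    "mw2001": ("xz", "zx"),  # same bank, written in either letter order
    "mw2002": ("xy", "yz"),  # mismatched on purpose, to show the check firing
}

def normalize(bank):
    """'zx' and 'xz' name the same bank, so compare with sorted letters."""
    return "".join(sorted(bank))

for host, (tower_a, tower_b) in inventory.items():
    if normalize(tower_a) != normalize(tower_b):
        print(f"{host}: tower A bank {tower_a} != tower B bank {tower_b}")
```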

Please announce in IRC each system you are moving before you move it, and work carefully, one power supply plug at a time, so the system stays online during its power plug migration.

Event Timeline

Since these should be moved while the systems are under load, there is an inherent risk involved.

Please do not move a system's power plugs without coordination in IRC with ops =]

ps1-a3-codfw
moving mw2215

ps1-c6-codfw
moving db2083

ps1-d6-codfw
moving db2063

> ps1-a3-codfw
> moving mw2215

This is on bank XZ and we'll move it to bank XY?

> ps1-c6-codfw
> moving db2083

This is on bank XY and we'll move it onto YZ?

> ps1-d6-codfw
> moving db2063

This is on bank XZ and we'll move it to YZ?

mw2215 is on zx
db2083 is on zx
db2063 is on xy

> mw2215 is on zx

This one should work. Don't move it until we confirm with opsen for that service group.

> db2083 is on zx

We need to move a system off XY and onto YZ. Please pick a system on bank XY.

> db2063 is on xy

We need to move a system off XZ and onto YZ. Please pick a system on bank XZ.

mw2215 is on zx (a3)
db2043 is on xy (c6)
db2061 is on xz (d6)

> mw2215 is on zx (a3)

I've just checked with @Joe; you can move the power plugs for mw2215 now. Just move them slowly, so one PSU has time to come back online before you unplug the other, and the system can remain online.

The db systems we still need to clear with the DBA team.

Once mw2215 is moved, check and see if icinga clears, and we can both check the PDU load directly.

> db2043 is on xy (c6)
> db2061 is on xz (d6)

Both of those db hosts are slaves in the s3 and s7 shards, not masters. So it should be ok to move them.

I've added the DBA tag to this task for DBA input on when we can move the power cables for those two systems. They shouldn't result in downtime, but since touching power cables is risky, better safe than sorry.

Don't move these until either @jcrespo or @Marostegui approve.

Ok, mw2215 has been moved, but a3 is still unhappy:

X 14.5, Y 8.3, Z 9.6

So now X is quite high, while Z is back to a more normal rate.

> Ok, mw2215 has been moved, but a3 is still unhappy:
>
> X 14.5, Y 8.3, Z 9.6
>
> So now X is quite high, while Z is back to a more normal rate.

Let's try moving another system on a3-codfw from bank XZ to bank YZ.

Since X is high but the others are close, lower-power systems are the best candidates. Ideally, if there is a 1U system with fewer disks on XZ, pick it for the move to YZ and comment here so I can clear the host for the move.

moving msw-a3-codfw from yz to xy

Ok, things shifted drastically on ps1-a3-codfw

X/Y/Z are at 9/9/14.

I've suggested to @Papaul we pick one mw system off xz, and one off yz, and move them both onto xy.

They should be mw appservers or api, not imagescaler or jobrunners.

> db2043 is on xy (c6)
> db2061 is on xz (d6)
>
> Both of those db hosts are slaves in the s3 and s7 shards, not masters. So it should be ok to move them.
>
> I've added the DBA tag to this task for DBA input on when we can move the power cables for those two systems. They shouldn't result in downtime, but since touching power cables is risky, better safe than sorry.
>
> Don't move these until either @jcrespo or @Marostegui approve.

Let's do this next week as things get more stable and we are sure about everything working fine load-wise on codfw.
@Papaul I will ping you here to decide a day when we can do it
From our side we only need to depool and downtime them.

@Papaul which day works for you to do those two servers? Monday? As @RobH said they shouldn't have downtime, but I would prefer to depool them just in case.

@Marostegui Anytime Monday at 9:30am works for me.

> @Marostegui Anytime Monday at 9:30am works for me.

Let's do it Monday at 9:30AM then.

Change 349882 had a related patch set uploaded (by Marostegui):
[operations/mediawiki-config@master] db-codfw.php: Depool db2043, db2061

https://gerrit.wikimedia.org/r/349882

Change 349882 merged by jenkins-bot:
[operations/mediawiki-config@master] db-codfw.php: Depool db2043, db2061

https://gerrit.wikimedia.org/r/349882

Mentioned in SAL (#wikimedia-operations) [2017-04-24T14:14:58Z] <marostegui@naos> Synchronized wmf-config/db-codfw.php: Depool db2043 and db2061 - T163339 (duration: 01m 08s)

Mentioned in SAL (#wikimedia-operations) [2017-04-24T14:17:02Z] <marostegui> Stop MySQL db2043 and db2061 for maintenance - https://phabricator.wikimedia.org/T163339

@Papaul you can do the maintenance on db2043 and db2061 now. They have been depooled. Please let me know when it is done, so I can start MySQL and repool them
Thanks!

Change 349970 had a related patch set uploaded (by Marostegui):
[operations/mediawiki-config@master] db-codfw.php: Repool db2043 and db2061

https://gerrit.wikimedia.org/r/349970

@Marostegui we are clear for db2061

Tower A Loads: X 11.16 Y 8.61 Z 10.46
Tower B Loads: X 11.03 Y 7.93 Z 10.66

no more warnings in Icinga

> @Marostegui we are clear for db2061
>
> Tower A Loads: X 11.16 Y 8.61 Z 10.46
> Tower B Loads: X 11.03 Y 7.93 Z 10.66
>
> no more warnings in Icinga

Thanks, going to bring db2061 back up.
db2043 will remain stopped until you give me the go-ahead.

Tower A Loads: X 10.27 Y 8.14 Z 10.08
Tower B Loads: X 10.97 Y 7.91 Z 10.81

no more warnings in Icinga

Change 349970 merged by Marostegui:
[operations/mediawiki-config@master] db-codfw.php: Repool db2043 and db2061

https://gerrit.wikimedia.org/r/349970

Mentioned in SAL (#wikimedia-operations) [2017-04-24T16:58:16Z] <marostegui@naos> Synchronized wmf-config/db-codfw.php: Repool db2043 and db2061 with less weight - T163339 (duration: 01m 16s)

Mentioned in SAL (#wikimedia-operations) [2017-04-24T17:28:36Z] <marostegui@naos> Synchronized wmf-config/db-codfw.php: Increase db2043 and db2061 weight - T163339 (duration: 00m 58s)

@Papaul is this done? I still see an SNMP WARNING - ps1-a3-codfw-infeed-load-tower-A-phase-Z *1233* alert right now.

@faidon a3 is not done yet, only c6 and d6.

mw2017, plugged into ps1-a3, has just one PSU, and the reading on ps1-a3 is higher than on ps2-a3. I would like to power that server down and move it from ps1-a3 to ps2-a3, since ps2-a3 has lower readings than ps1-a3.

The readings on both PDUs show:

ps1-a3
X = 9.73, Y = 9.65, Z = 12.84
ps2-a3
X = 1.96, Y = 9.65, Z = 1.97

ps1-a3 is pulling more power than ps2-a3. As @RobH suggested on T163362, we need to check the BIOS settings on each server, because the readings are way too low on ps2-a3.

@faidon

We should be able to balance it even while tower 2 isn't being used; it will be more difficult, but it should be possible. I'm just not sure it's worth the trouble while codfw is active, when we can fix the BIOS issue and rebalance more easily once that's fixed.

But if codfw were remaining active for multiple weeks, we should fix it while live. Since we aren't staying in codfw for that long, this can wait for the failover (but either works!).

I moved the PSUs that are pulling less power to ps1-a3 and the ones pulling more power to ps2-a3. We should be good for now.

@Papaul: That will need to be swapped back once we fix the BIOS on the machines. Ideally all PSU1s pull from PDU1 (tower 1) and all PSU2s from PDU2.

So once we fail back to eqiad and work on fixing the BIOS (since fixing it causes downtime), we can simply re-balance the power properly.

Papaul lowered the priority of this task from High to Medium. Apr 27 2017, 2:43 PM

@RobH let me know when you want to start working on this. Next week works for me.

Papaul lowered the priority of this task from Medium to Low. Jul 26 2017, 5:31 PM

@RobH @faidon if this is no longer an issue, can we resolve it? It has been open for more than a year now.

Thanks.

@Papaul

I'll need you to review the phases on every PDU to know whether this is resolved. You can do this by pulling up each one's web interface, or simply by walking the rows and auditing the phases.

Can you audit the phases and just paste the output here?

Example:
A1-PS1: X at 5 amps, Y at 5 amps, Z at 4.5 amps
A1-PS2: X at 5 amps, Y at 5 amps, Z at 4.7 amps
A2-PS1: X at 7 amps, Y at 6.9 amps, Z at 7.5 amps
A2-PS2: X at 7.2 amps, Y at 6.8 amps, Z at 7.3 amps

To clarify, this is something that all on-sites should be doing regularly to ensure we don't have any imbalances. Please go ahead and audit all 20 racks (40 PDUs). This should be done at least annually, and we should all (all on-sites) regularly check these readouts.

Anytime there is an imbalance of more than 10% overall between phases, it's grossly out of balance. I try to get them as close as possible, depending on the contents of the rack. (It's harder to fully balance a rack of heavy-power-draw 2U machines and disk shelves than a rack full of 1U machines.)
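A minimal sketch of that 10% check, assuming the audit readings have already been collected into a dict (the figures below are illustrative, one balanced row and rack A2 from the audit further down):

```python
def spread(x, y, z):
    """Imbalance: spread between highest and lowest phase, relative to the lowest."""
    lo, hi = min(x, y, z), max(x, y, z)
    return (hi - lo) / lo if lo else float("inf")

readings = {
    ("A1", "PS1"): (5.0, 5.0, 4.9),  # within tolerance
    ("A2", "PS1"): (6.5, 1.4, 6.5),  # grossly out of balance
}

for (rack, pdu), phases in sorted(readings.items()):
    if spread(*phases) > 0.10:
        print(f"{rack} {pdu}: X/Y/Z {phases} out of balance")
```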

Papaul added a subscriber: Papaul.

@RobH here is the audit you requested.

Rack  PDU  X    Y    Z
A1    PS1  4.9  2.8  4.8
A1    PS2  4.3  2.4  4.7
A2    PS1  6.5  1.4  6.5
A2    PS2  3.0  1.4  2.4
A3    PS1  4.5  6.0  6.0
A3    PS2  2.3  2.0  2.7
A4    PS1  5.0  2.4  6.7
A4    PS2  1.4  1.0  1.4
A5    PS1  7.3  6.5  6.4
A5    PS2  3.6  3.3  3.6
A6    PS1  5.6  4.0  3.3
A6    PS2  1.4  1.5  0.6
A7    PS1  3.9  0.5  4.3
A7    PS2  1.6  0.4  1.8
A8    PS1  5.0  3.0  4.5
A8    PS2  5.5  2.5  4.5
B1    PS1  6.1  3.4  5.8
B1    PS2  4.4  2.0  4.8
B2    PS1  6.0  1.1  6.0
B2    PS2  2.6  0.9  2.6
B3    PS1  4.9  2.9  7.4
B3    PS2  1.5  1.6  1.7
B4    PS1  2.7  6.1  4.1
B4    PS2  2.6  3.0  1.4
B5    PS1  7.0  6.5  4.7
B5    PS2  4.4  3.2  2.9
B6    PS1  6.0  3.2  4.7
B6    PS2  4.5  2.4  3.9
B7    PS1  3.7  0.7  4.3
B7    PS2  2.9  0.7  3.5
B8    PS1  4.7  1.5  5.3
B8    PS2  3.5  0.9  3.9
C1    PS1  4.6  2.6  5.1
C1    PS2  2.6  2.2  2.1
C2    PS1  5.9  0.8  6.0
C2    PS2  2.4  0.3  2.7
C3    PS1  5.0  4.9  4.7
C3    PS2  4.2  4.6  4.3
C4    PS1  6.3  4.7  7.1
C4    PS2  4.6  4.6  3.8
C5    PS1  6.6  4.9  8.2
C5    PS2  2.4  3.5  2.2
C6    PS1  5.1  5.4  5.9
C6    PS2  6.3  4.8  7.0
C7    PS1  3.9  0.4  4.2
C7    PS2  1.6  0.4  1.8
C8    PS1  3.9  1.9  5.2
C8    PS2  4.3  2.0  5.6
D1    PS1  5.4  1.3  6.0
D1    PS2  2.2  0.9  2.5
D2    PS1  8.3  1.0  8.8
D2    PS2  4.3  1.0  4.7
D3    PS1  4.4  0.4  4.5
D3    PS2  1.5  0.4  1.2
D4    PS1  4.9  0.4  5.0
D4    PS2  1.4  0.3  1.1
D5    PS1  4.2  3.7  6.9
D5    PS2  2.1  1.5  3.4
D6    PS1  8.2  6.3  8.0
D6    PS2  8.7  5.8  7.9
D7    PS1  5.9  0.5  6.2
D7    PS2  2.5  0.4  2.8
D8    PS1  1.3  0.7  1.1
D8    PS2  0.8  0.3  0.6

@Papaul: Anywhere the phases are more than 4% out of balance relative to the lowest phase, they need to be rebalanced. This is something that you, as the on-site engineer for codfw, will need to review and correct every time you add or remove anything from a rack, as well as ensuring the racks stay balanced overall.

Since the majority of our devices (other than the msws and mr1) have redundant power, this should be achievable with no downtime. Let's take rack A2 as an example:

Rack  PSU  X    Y    Z
A2    PS1  6.5  1.4  6.5
A2    PS2  3.0  1.4  2.4

Those phases are quite out of balance; they should be within 4-5% of one another (and of the lowest power draw). In this case, phase Y is underutilized compared to the other phases. You need to move items off the XZ bank and split them evenly across the XY and YZ banks, for both PS1 and PS2.
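As a rough illustration of the arithmetic behind that recommendation (a sketch, not a measured breakdown): the per-bank loads can be inferred from the three phase readings, which shows how much draw sits on XZ and roughly how much of it needs to move.

```python
# Rack A2, PS1 phase readings from the audit above.
x, y, z = 6.5, 1.4, 6.5

# Each phase is the sum of the two banks it feeds:
#   x = xy + xz,  y = xy + yz,  z = yz + xz
# so the per-bank loads can be solved directly:
xy = (x + y - z) / 2  # 0.7 A
yz = (y + z - x) / 2  # 0.7 A
xz = (x + z - y) / 2  # 5.8 A

# A balanced rack carries roughly a third of the total on each bank.
target = (xy + yz + xz) / 3  # ~2.4 A
print(f"move ~{xz - target:.1f} A off XZ, split evenly between XY and YZ")
```

With these numbers that is about 3.4 A to move off XZ, roughly 1.7 A to each of the other two banks, which would bring all three phases to about 4.8 A each.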

Does this make sense? Please let me know if not. If it does, please go ahead and attempt to re-balance A2 to start! Once that one is fixed, we can review, and you can move on to the other racks.

If the phases are this far out of balance and we end up losing or cycling power on an entire rack, the imbalance can lead to load spikes on the overloaded phases.

@RobH
Better to do this when I am back from vacation. Thanks.

> @RobH
> Better to do this when I am back from vacation. Thanks.

Absolutely, no reason to go mucking about for a day before leaving! This is already set to low priority, as it is more 'general housekeeping' for on-site engineers than an urgent task requiring immediate completion.

Once you are back, you can start balancing these. Let me know when, and I'm happy to assist remotely as best I can (though most of this is a matter of moving a single cable and waiting a few minutes for the power states to normalize across phases).

RobH renamed this task from "pdu phase imbalances: ps1-a3-codfw, ps1-c6-codfw, & ps1-d6-codfw" to "codfw pdu phase imbalances: audit and correct". Jul 17 2018, 4:01 PM