
LoadMonitor connection weighting reimagined
Closed, Resolved · Public

Description

When a DB server becomes slow for some reason (e.g. T313197, T311106) MediaWiki continues to give it the configured proportion of new connections. This continues until it has 10,000 connections waiting for it. Then connection errors start to occur, and they continue until the LoadMonitor weight (see dashboard) declines. It's unclear if LoadMonitor weight has ever declined sufficiently during an incident to make a difference.

Effectively, a single slow server will suck up all PHP-FPM workers, and if the number of workers is not sufficient, the site will go down.

I propose adjusting the load by looking at the number of connections from the current client host to each potential DB server. We can store connection counts in a local data store such as APCu.

Objectives:

  • If all servers have zero connections, the rate of new connections should reflect the configured loads.
  • When there are a large number of connections, new connections should be allocated so as to make the connection count ratios match the configured loads.
  • A small deviation should cause a small response. MW shouldn't completely depool a server with one connection because the other servers have zero connections.
  • If all connection attempts to a given server fail fast with an error, so that the number of open connections is approximately zero, MW shouldn't send all traffic to that server.

Proposal:

Phabricator doesn't do maths, so I put an idea at mw:User:Tim Starling (WMF)/LoadBalancer connection metric. Basically you have a load adjustment which is an absolute percentage, e.g. if the original load is 10% and the adjustment is 10% then you end up with 20%, and then rescale to make all the loads add up to 100%. You take the connection count difference between reality and the model, and scale it down by a tunable parameter, and then scale it down again as the total number of connections increases.
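A minimal sketch of that adjustment, with made-up parameter names and a made-up gain; the exact formula is the one on the wiki page above:

```php
<?php
/**
 * Sketch only: rescale configured loads using local connection counts.
 * $configLoads: configured relative loads, e.g. [ 'db1' => 0.1, 'db2' => 0.9 ]
 * $connCounts:  open connections from this host, keyed the same way
 * $gain:        hypothetical tunable parameter
 */
function adjustLoads( array $configLoads, array $connCounts, float $gain = 0.05 ): array {
	$totalConns = array_sum( $connCounts );
	$adjusted = [];
	foreach ( $configLoads as $server => $load ) {
		// The "model" count this server would have if reality matched the config
		$modelCount = $load * $totalConns;
		// Difference between model and reality, scaled by the gain and then
		// scaled down again as the total connection count grows, so a single
		// stray connection at low traffic only causes a small response.
		$delta = $gain * ( $modelCount - $connCounts[$server] ) / ( 1 + $totalConns );
		// Absolute adjustment: a 10% load with a +10% adjustment becomes 20%
		$adjusted[$server] = max( 0.0, $load + $delta );
	}
	// Rescale so the loads sum to 1 again
	$sum = array_sum( $adjusted ) ?: 1.0;
	return array_map( fn ( $w ) => $w / $sum, $adjusted );
}
```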

Treat a connection failure the same as a connection held for a long time. Use WRStats with an APCu backend to store the connection failure count over a sliding time window.

Increment and decrement active connection counts in APCu. Apply a sliding time window to that too so that any counter drift is rectified after a few minutes.
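As a rough illustration of the sliding-window counters (the actual patches go through the WRStats library rather than raw apcu_* calls; the key names, bucket size and TTL below are invented):

```php
<?php
// Sketch only: time-bucketed APCu counters, e.g. for connection failures.
// Old buckets expire, so any drift is rectified after a few minutes.
function incrWindowCounter( string $dbHost, int $delta = 1 ): void {
	$bucket = intdiv( time(), 10 ); // 10-second buckets
	$key = "rdbms-connstats:$dbHost:$bucket";
	apcu_add( $key, 0, 300 ); // create the bucket with a TTL if it's new
	apcu_inc( $key, $delta );
}

function sumWindowCounter( string $dbHost, int $buckets = 6 ): int {
	$bucket = intdiv( time(), 10 );
	$total = 0;
	for ( $i = 0; $i < $buckets; $i++ ) {
		$total += (int)apcu_fetch( "rdbms-connstats:$dbHost:" . ( $bucket - $i ) );
	}
	return $total; // events over the last ~60 seconds
}
```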

Event Timeline

Restricted Application added a subscriber: Aklapper.

I definitely think it's a good idea. To my knowledge, LoadMonitor has never prevented an outage, and in its current shape it is pointless.

Maybe we can also rewrite it to simplify it?

Also just worth considering: instead of a sliding window, it could use a PID controller (or an approximation of one), e.g. store the total connection count over the last two minutes, differentiate it to approximate the derivative term, and sum it to approximate the integral term.
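A minimal sketch of such an approximated PID term (the coefficients and state handling here are entirely made up):

```php
<?php
// Sketch only: discrete PID controller on the connection-count error.
class PidAdjuster {
	private float $integral = 0.0;
	private float $prevError = 0.0;

	public function __construct(
		private float $kp = 0.05, // proportional gain
		private float $ki = 0.01, // integral gain
		private float $kd = 0.02  // derivative gain
	) {
	}

	/**
	 * @param float $error Target connection count minus observed count
	 * @param float $dt Seconds since the previous sample
	 * @return float Load adjustment to apply before rescaling
	 */
	public function step( float $error, float $dt ): float {
		$this->integral += $error * $dt;
		$derivative = $dt > 0 ? ( $error - $this->prevError ) / $dt : 0.0;
		$this->prevError = $error;
		return $this->kp * $error + $this->ki * $this->integral + $this->kd * $derivative;
	}
}
```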

We talked about PID controllers at T252091, they certainly have their uses. I'm proposing a P controller here because the control loop is tighter. There's not so much scope for the connection count running away and needing drastic corrective action. The I and D terms model a delay in the response to control input.

It would be interesting to write a system simulator to test a few scenarios: a discrete-event simulation of a cluster of database servers.

Note that I abandoned https://gerrit.wikimedia.org/r/c/mediawiki/core/+/394430 since our approach seemed to be in favor of using loadmonitors and automatic depooling in etcd. More logic in MW seemed to be discouraged at the time.

If CPT/SRE is OK with more tracking logic in MW, something like this proposal seems fine.

> We talked about PID controllers at T252091, they certainly have their uses. I'm proposing a P controller here because the control loop is tighter. There's not so much scope for the connection count running away and needing drastic corrective action. The I and D terms model a delay in the response to control input.

There are always spikes that can cause a DB to be depooled by mistake. The I and D parts can help reduce their impact, but a sliding window would work as well.

> Note that I abandoned https://gerrit.wikimedia.org/r/c/mediawiki/core/+/394430 since our approach seemed to be in favor of using loadmonitors and automatic depooling in etcd. More logic in MW seemed to be discouraged at the time.
>
> If CPT/SRE is OK with more tracking logic in MW, something like this proposal seems fine.

If there has been a discussion before, I need to ask about the reasoning in that case before being able to rubber-stamp this on behalf of SRE. Give me a bit of time.

> There are always spikes that can cause a DB to be depooled by mistake. The I and D parts can help reduce their impact, but a sliding window would work as well.

What do you mean by a spike? If there's a stall, i.e. the server is not completing any requests, I think load should drop fast, and when the stall finishes (even if it's a short time later), load should rise fast. That should help to improve end-user latency. The D term encourages rapid control responses to rapid measured changes. The I term slows down the control change, but also causes the long-term average to match the desired value, so the connection count will undershoot the target for a while when a stall finishes.

I fleshed out my idea for a simulator in private notes. The idea is that we could plug in some numbers from production (server count, connection rate, average latency) and then graph the response to a step change in latency. That would help us to tune the constants.

>> There are always spikes that can cause a DB to be depooled by mistake. The I and D parts can help reduce their impact, but a sliding window would work as well.
>
> What do you mean by a spike? If there's a stall, i.e. the server is not completing any requests, I think load should drop fast, and when the stall finishes (even if it's a short time later), load should rise fast. That should help to improve end-user latency. The D term encourages rapid control responses to rapid measured changes. The I term slows down the control change, but also causes the long-term average to match the desired value, so the connection count will undershoot the target for a while when a stall finishes.
>
> I fleshed out my idea for a simulator in private notes. The idea is that we could plug in some numbers from production (server count, connection rate, average latency) and then graph the response to a step change in latency. That would help us to tune the constants.

So the thing is that the current weights are chosen quite arbitrarily, and that's actually fragile in large sections (it has caused incidents from simply depooling a host). My thinking in general is something along the lines of eventually giving them all the same weight and letting MediaWiki adjust the actual weights based on metrics like the number of open connections, average query response time, etc. It would help especially when we depool a host and the system becomes unstable as a result. It could also help when DBs are multi-instance and the load on one section goes up daily while going down in another (e.g. kowiki and zhwiki vs. eswiki and frwiki). Generally it's recommended to use PID controllers for load balancers. Shopify seems to be happy with only the P part of PID. A quick search brought up that and two other applications of PID for load balancing (one in parallel computation and one in SDN), though I admit I didn't look very closely at them and obviously their systems are different from ours. Also, it's partially a wish rather than something we'd be able to do in the short term, but I was wondering if we can keep this long-term idea in mind.

That being said, all of my notes are basically speculation and you're right that we need a simulator. How can I help to set it up?

> If CPT/SRE is OK with more tracking logic in MW, something like this proposal seems fine.

I asked around about this; it seems one big concern is that SREs must be able to override MW's adjusted values when there is an outage. The original case was repooling a host gradually and automatically, and the concern was that in an outage caused by reduced capacity or a general slowdown, we need to be able to repool a DB ASAP to shorten the outage. I'm not sure if that's relevant here, but such cases are important to keep in mind.

Basically I think we should continue working on this, just keeping several things in mind:

  • Ability to turn it off in etcd at least temporarily in case of an outage.
  • Simplifying the LoadMonitor code and logic.
  • Testing and simulating it

Change 636465 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/core@master] rdbms: make LoadMonitor use per-DB cache keys and clean up load scaling

https://gerrit.wikimedia.org/r/636465

Krinkle renamed this task from LoadBalancer connection count weighting to LoadMonitor connection count weighting.Nov 8 2022, 10:35 PM

Change 851777 had a related patch set uploaded (by Krinkle; author: Aaron Schulz):

[mediawiki/core@master] rdbms: add Database method to get connection counts and lag

https://gerrit.wikimedia.org/r/851777


Krinkle renamed this task from LoadMonitor connection count weighting to LoadMonitor connection weighting reimagined.Nov 24 2022, 7:09 AM

It looks like APCu contention only matters if some other code using APCu does a slow write. Throwing massive numbers of apcu_inc() calls at APCu by itself shouldn't matter. WRStats only calls getMulti() and incrWithInit(). The incrWithInit() calls use $step = $init, and initialization of new APCu keys for time buckets should be comparatively infrequent. apcu_inc() only needs a write lock for initialization; otherwise it just needs a read lock and relies on __sync_add_and_fetch() to do the actual increment.

The downside is that the counts would not be shared across hosts, and Kubernetes would fragment them further. Due to distinctions like index vs. API appservers, one cannot assume that connection activity is roughly equal on all appserver hosts in a DC. Also, CLI mode would have to rely on something else.

Clustered memcached has the downside of higher and less predictable latency. Using hash stops in keys could reduce the number of memcached servers involved (e.g. "<part of wrstats key without db host>|#|<db host>"). It would, however, offer more complete counts of connection activity. Use of WRITE_BACKGROUND could mitigate latency on the write path.

As part of the work in this task, we're considering php-apcu and Memcached. A noted downside of php-apcu is that it leaves the question of what to do for CLI. As of Jan 2023, with https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/874899, LoadMonitor is disabled in prod for CLI.

I suggest that, if we end up favouring a solution that involves php-apcu as part of this task, we also make it so in core that it naturally no-ops for CLI, so that the prod setup becomes the de facto default in MW and removes the concern about not having APCu in CLI.

FWIW, there is a rather simpler way instead of scaling weights: it's called "Power of Two Random Choices" load balancing, where you pick two hosts using the weights and actually use the one that has the lower load (connection count, etc.). I think that could simplify things, and it's proven to be quite good at handling spikes, degraded replicas, and so on. Would that be useful?
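Roughly, in code (the helper names and the connection-count source are hypothetical):

```php
<?php
// Sketch only: "power of two random choices" on top of the configured weights.
function pickTwoChoices( array $weights, array $connCounts ): string {
	$a = pickWeighted( $weights );
	$b = pickWeighted( $weights );
	// Of the two weighted-random candidates, use the less loaded one
	return $connCounts[$a] <= $connCounts[$b] ? $a : $b;
}

function pickWeighted( array $weights ): string {
	$r = mt_rand() / mt_getrandmax() * array_sum( $weights );
	foreach ( $weights as $server => $weight ) {
		$r -= $weight;
		if ( $r <= 0 ) {
			return $server;
		}
	}
	return array_key_last( $weights );
}
```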

Change 636465 merged by jenkins-bot:

[mediawiki/core@master] rdbms: improve caching and state convergence in LoadMonitor

https://gerrit.wikimedia.org/r/636465

Change 851777 abandoned by Aaron Schulz:

[mediawiki/core@master] rdbms: make LoadMonitor detect servers with too many clients

Reason:

Still too complex

https://gerrit.wikimedia.org/r/851777

Change 929781 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/core@master] [WIP] rdbms: use apcu-based WRStats counters to balance connection counts in LoadMonitor

https://gerrit.wikimedia.org/r/929781

Change 957272 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] rdbms: Avoid lazy loading $loadMonitor in LoadBalancer

https://gerrit.wikimedia.org/r/957272

Change 957272 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Avoid lazy loading $loadMonitor in LoadBalancer

https://gerrit.wikimedia.org/r/957272

I'm tasked with seeing this through by the end of this quarter. I read through the code, the proposal and the WIP patches. I have some comments, especially from a DBA/operations point of view. Sorry, it's quite long:

  • The biggest problem with the status quo is that LM uses replication lag as the only metric for the load on a replica. That doesn't make any sense to me. Almost all replag we have in production is due to a large write or lots of writes, and is in no way an indicator of read pressure. It can indicate read pressure when the replica basically gives up and goes down, which defeats the point. In almost all cases, LM currently treats a replica that is going through a large write the same as a replica that has too many read connections, while it treats a db that's getting a lot of reads and is on the verge of going down as a fully functioning "everything is fine" replica. No wonder this never helped in any outage so far.
    • Maybe it made sense in HDD times. Regardless, it's not the case anymore.
  • That means I fully support the idea of using connections or connection rates as a proxy metric for load. My worry is that building a WRStats system for this is quite costly, especially since putting it in APCu will make it much noisier with mw-on-k8s.
    • Another issue with APCu is that if a replica is overloaded because of extra load from the API cluster, the web cluster appservers will happily continue to send more connections to that replica because they don't see any extra load there.
    • I think there might be a simpler solution here: just count the processlist on the replica instead of getting replag. I don't know if grants need to be changed or if we can find some other way to gather metrics on the number of connections, but I generally think that MySQL should provide a usable metric rather than MW trying to guess based on some internal counters. There are lots of options; I'm going to research them and come back to you with some ideas. Maybe it's not possible, then we can revisit.
  • We should look at failure scenarios. I have a couple in mind.
    • One replica is overloaded with connections: it should stop sending requests there. It currently doesn't.
    • One replica is inaccessible: it logs an error (it shouldn't: T357824: Ignore one lagging replica in waitForReplication) and doesn't send requests there (good).
    • One replica has a cold cache and is slow to respond: it should give it fewer connections but slowly ramp them up. It currently doesn't, forcing DBAs to repool a host in four steps over 45 minutes.
    • A whole section is overloaded (a common case actually): it happily sends connections to the replicas, which leads to all the appservers running out of CPU. This has happened so many times I've lost count. LM needs to become a circuit breaker for whole sections too, and this is not too hard to implement in the class, which makes me happy. Even 40% of user requests failing is much better than 100% of them failing (a rough sketch follows after this list).
  • As part of T343098: [epic] Data Persistence Hypothesis WE 3.2.1, this class needs a lot of simplification and clean-up. I don't understand the point of ::isStateRefreshDue(); you can add jitter, no need for this. And more stuff like this.
  • Instead of a moving average, it could take the previous value and basically turn this into a PID controller. They have been heavily used in designing many large-scale load balancers. The biggest complication is figuring out the coefficients; maybe it's not needed, I need to check this in more depth. Ideas are welcome.
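Regarding the section-overload case above, a very rough sketch of what a section-level circuit breaker could look like (the function, threshold and connection-count source are all hypothetical):

```php
<?php
// Sketch only: probabilistically reject new connections to an overloaded section,
// so some user requests still succeed while the section recovers.
function shouldRejectConnection( int $sectionConns, int $softLimit = 5000 ): bool {
	if ( $sectionConns <= $softLimit ) {
		return false; // below the limit, connect normally
	}
	// Shed an increasing fraction of requests as the overload grows
	$shedFraction = min( 1.0, ( $sectionConns - $softLimit ) / $softLimit );
	return ( mt_rand() / mt_getrandmax() ) < $shedFraction;
}
```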

That's for now.

Oh, and another thing I remembered: due to T327852: MediaWiki replication lags are inflated with average of half a second, 80 to 90% of the replag being measured is just randomness (assuming the lag is usually between 50-100ms), and that means LM has mostly just been moving weights around at random for the past couple of years.

Change 1004792 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] [WIP] rdbms: Rework LoadMonitor

https://gerrit.wikimedia.org/r/1004792

Change 1013159 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@master] rdbms: In LoadMonitor, use a loop of getWithSetCallback

https://gerrit.wikimedia.org/r/1013159

Change 1013160 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/extensions/EventSimulator@master] Various improvements to support LoadMonitor rework

https://gerrit.wikimedia.org/r/1013160

Change 1013160 merged by jenkins-bot:

[mediawiki/extensions/EventSimulator@master] Various improvements to support LoadMonitor rework

https://gerrit.wikimedia.org/r/1013160

EventSimulator output -- PS14 of 1004792 plus 1013159.

Example command line:

mwscript bin/runEventSimulation.php --class '\MediaWiki\Extension\EventSimulator\Model\LoadBalancer\LBModel' --progress --time-step=0.1 --duration=100 -p scenario=plain-s1

Plain scenario

The only memory this version of LoadMonitor has is the exponentially decaying connection count average. It tends to overshoot and undershoot in response to transient conditions, especially at startup where there isn't any history. It would oscillate if it weren't for the connection count time average, but thanks to that average it tends to converge to reasonable values.

plain-ps14.png (740×1 px, 221 KB)
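For reference, the exponentially decaying average mentioned above amounts to this update rule (sketch only; the time constant used in the actual patch may differ):

```php
<?php
// Sketch only: exponentially decaying average of the connection count.
function decayedAverage( float $prevAvg, float $sample, float $dtSeconds, float $tau = 10.0 ): float {
	$alpha = 1 - exp( -$dtSeconds / $tau ); // weight of the new sample
	return $prevAvg + $alpha * ( $sample - $prevAvg );
}
```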

Slow scenario

Given these startup effects, I patched the other two scenarios to have an event at t=30 rather than t=5.

In the slow scenario, db1119 takes 10x longer to do its "work" query after t=30. We find it gathers about 700 connections before LoadMonitor reacts, completely depooling it. It then oscillates with a period of about 12 seconds. The oscillations seem to be getting smaller so maybe it would converge eventually. Fair to say, compensating for a 10x slowdown is asking a lot of this system.

slow-ps14.png (740×1 px, 161 KB)

Connect timeout scenario

In the connect timeout scenario, db1119 fails all connection attempts after t=30. When a connection is attempted, a delay is imposed equivalent to the configured connect timeout, which is 1s for probe connections or 3s for normal connections. We see a peak of about 2500 connections waiting for the timeout. Then it depools the server, and we see only occasional probe connections after that, an average of 1 connection per second cluster-wide.

timeout-ps14.png (740×1 px, 94 KB)

Next time I work on EventSimulator, I think I should update the model for Kubernetes. I'm assuming 139 MW hosts each with APCu, and a single global cache. With Kubernetes we probably have more pods, so a proportionally smaller APCu cache domain, but we have the on-host memcached tier to compensate.

Source data:

Change #1004792 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Rework LoadMonitor

https://gerrit.wikimedia.org/r/1004792

Change #1013159 merged by jenkins-bot:

[mediawiki/core@master] rdbms: In LoadMonitor, use a loop of getWithSetCallback

https://gerrit.wikimedia.org/r/1013159

Change #929781 abandoned by Ladsgroup:

[mediawiki/core@master] [WIP] rdbms: use apcu-based WRStats counters to balance connection counts in LoadMonitor

Reason:

Done in a different way (I579fb4559a537cf)

https://gerrit.wikimedia.org/r/929781