When a DB server becomes slow for some reason (e.g. T313197, T311106), MediaWiki continues to give it its configured proportion of new connections. This continues until the server has 10,000 connections waiting for it. Then connection errors start to occur, and they continue until the LoadMonitor weight (see dashboard) declines. It's unclear whether the LoadMonitor weight has ever declined enough during an incident to make a difference.
Effectively, a single slow server will suck up all PHP-FPM workers, and if the number of workers is not sufficient, the site will go down.
I propose adjusting the load by looking at the number of connections from the current client host to each potential DB server. We can store connection counts in a local data store such as APCu.
Objectives:
- If all servers have zero connections, the rate of new connections should reflect the configured loads.
- When there is a large number of connections, new connections should be allocated so as to make the connection count ratios match the configured loads.
- A small deviation should cause a small response. MW shouldn't completely depool a server with one connection because the other servers have zero connections.
- If all connection attempts to a given server fail fast with an error, so that the number of open connections is approximately zero, MW shouldn't send all traffic to that server.
Proposal:
Phabricator doesn't do maths, so I put an idea at mw:User:Tim Starling (WMF)/LoadBalancer connection metric. Basically, each server gets a load adjustment which is an absolute percentage, e.g. if the original load is 10% and the adjustment is 10% then you end up with 20%. The adjusted loads are then rescaled so that they add up to 100%. To get the adjustment, you take the connection count difference between reality and the model, scale it down by a tunable parameter, and then scale it down again as the total number of connections increases.
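To make the shape of the calculation concrete, here is a rough Python sketch. It is not the formula from the wiki page: the function name `adjusted_loads`, the `damping` parameter, and the choice to divide the count difference by `damping + total connections` are all assumptions made for illustration. They are one way to get an adjustment that is zero with no connections, small for small deviations, and that approaches the share difference when there are many connections.

```python
def adjusted_loads(configured, connections, damping=10.0):
    """Return per-server loads (summing to 1) adjusted by connection counts.

    configured:  dict of server name -> configured load weight
    connections: dict of server name -> open connections from this host
    damping:     hypothetical tunable; larger values mean a weaker response
    """
    total_weight = sum(configured.values())
    if total_weight <= 0:
        raise ValueError("configured loads must sum to a positive value")

    # Configured loads as fractions of 1.
    base = {name: weight / total_weight for name, weight in configured.items()}
    total_conns = sum(connections.get(name, 0) for name in base)

    raw = {}
    for name, share in base.items():
        # Connection count the model expects if reality matched the config.
        expected = share * total_conns
        difference = expected - connections.get(name, 0)
        # Assumed scaling: damped by the tunable parameter and by the total.
        adjustment = difference / (damping + total_conns)
        # Absolute adjustment: a 10% load plus a 10% adjustment gives 20%.
        raw[name] = max(0.0, share + adjustment)

    # Rescale so that the loads add up to 100% again.
    norm = sum(raw.values())
    return {name: value / norm for name, value in raw.items()}


if __name__ == "__main__":
    config = {"db1": 1, "db2": 1}
    print(adjusted_loads(config, {"db1": 0, "db2": 0}))
    print(adjusted_loads(config, {"db1": 1, "db2": 0}))
    print(adjusted_loads(config, {"db1": 900, "db2": 100}))
```

With this scaling, one connection on db1 and none on db2 only nudges db1 slightly below 50%, while a 900/100 split pushes most new connections to db2. If failures are counted like long-held connections, a server that fails fast also accumulates a count and is not treated as idle.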
Treat a connection failure the same as a connection held for a long time. Use WRStats with an APCu backend to store the connection failure count over a sliding time window.
Increment and decrement active connection counts in APCu. Apply a sliding time window to that too so that any counter drift is rectified after a few minutes.
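As an illustration of the counter idea, here is a small Python sketch of a time-bucketed counter. It uses an in-process dict as a stand-in for APCu and does not model the WRStats API; the class name `WindowedCounter`, the key names, and the window and bucket sizes are all hypothetical. Increments and decrements are stored as deltas in the bucket for the current time, and buckets older than the window are discarded, so a leaked increment (say, from a worker that died before decrementing) disappears after the window passes.

```python
import time
from collections import defaultdict


class WindowedCounter:
    """Counter whose deltas expire after a sliding time window."""

    def __init__(self, window_seconds=300, bucket_seconds=10, clock=time.monotonic):
        self._bucket_seconds = bucket_seconds
        self._num_buckets = max(1, int(window_seconds // bucket_seconds))
        self._clock = clock
        # (key, bucket index) -> net delta recorded in that bucket.
        self._buckets = defaultdict(int)

    def _current_index(self):
        return int(self._clock() // self._bucket_seconds)

    def add(self, key, delta=1):
        """Record an increment (or a decrement, with a negative delta)."""
        self._buckets[(key, self._current_index())] += delta

    def total(self, key):
        """Sum the deltas still inside the window, never below zero."""
        oldest = self._current_index() - self._num_buckets + 1
        # Drop expired buckets; this is what corrects counter drift.
        for stale in [k for k in self._buckets if k[1] < oldest]:
            del self._buckets[stale]
        value = sum(v for (name, _), v in self._buckets.items() if name == key)
        # A decrement can outlive its matching increment, so clamp at zero.
        return max(0, value)


if __name__ == "__main__":
    counter = WindowedCounter(window_seconds=60, bucket_seconds=10)
    counter.add("db1:open")        # connection opened
    counter.add("db1:open", -1)    # connection closed
    counter.add("db1:failed")      # connection attempt failed
    print(counter.total("db1:open"), counter.total("db1:failed"))
```

A side effect of this design is that a connection genuinely held for longer than the window also drops out of the count, so the window length would need to be chosen with typical request durations in mind.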