
Make LoadMonitor server states more up-to-date and respond to outages more quickly
Closed, ResolvedPublic

Assigned To
Authored By
aaron
Oct 13 2020, 6:34 PM
Referenced Files
F36971965: g636465.png
May 2 2023, 1:35 AM
Restricted File
Mar 8 2023, 3:55 AM
Restricted File
Mar 8 2023, 3:52 AM
Restricted File
Mar 8 2023, 3:45 AM
Restricted File
Mar 8 2023, 2:40 AM
F36895727: isrd-stochastic-model.php
Mar 7 2023, 8:51 AM

Description

This task calls for several things:

  • Make the connectivity scaling factor converge faster so that connection attempts to downed servers stop sooner. Currently, the logs still get flooded with connection errors. We should both use a higher "moving average ratio" and possibly a better moving average function to better handle varying traffic.
  • Make the cache keys per DB server so that extra query groups do not require separate cache keys (even if they are a subset of the main traffic servers or at least overlap). When checking which servers need state updates, the order should be shuffled. This makes it easier to keep the data up-to-date, since it no longer requires a for-loop that always has to connect to everything.
  • Simplify the tiered apcu/WANCache logic to use either apcu (web mode) or the local cluster cache (CLI mode). Since both implement the BagOStuff interface, this would simplify the code significantly. Placeholder values should be used liberally for a few seconds when there is no stale value and the cache mutex is already held. Cluster cache updates from web requests should use WRITE_BACKGROUND to avoid latency.
  • Mitigate network slowness when LoadMonitor polls/gauges servers (e.g. lower the connection timeout and set a read timeout with mysqli). Maybe there could be a LoadBalancer::CONN_GAUGE_PROBE constant for a third connection class category to help with this. Past outages have involved connections hanging in the ACCEPT state, or queries (including heartbeat table and SHOW ones) being slow.
  • Consider adding the MySQL connected/running/max_connections status variables to the server state polling to see how close a server is to being overloaded.
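As a rough illustration of the "moving average ratio" point above (this is a sketch only, not the actual LoadMonitor implementation):

```php
<?php
// Illustrative sketch (not the actual LoadMonitor code): an exponential
// moving average of a server's observed "up" state. A higher moving
// average ratio makes the value converge faster after the server goes
// down, so connection attempts to it stop sooner.
function movingAverage( float $prior, float $observed, float $ratio ): float {
	return $ratio * $observed + ( 1.0 - $ratio ) * $prior;
}

// Server goes down: observations switch from 1.0 ("up") to 0.0 ("down")
foreach ( [ 0.1, 0.5 ] as $ratio ) {
	$up = 1.0;
	for ( $i = 0; $i < 5; $i++ ) {
		$up = movingAverage( $up, 0.0, $ratio );
	}
	printf( "ratio=%.1f => up=%.3f after 5 samples\n", $ratio, $up );
}
// prints:
// ratio=0.1 => up=0.590 after 5 samples
// ratio=0.5 => up=0.031 after 5 samples
```

With a ratio of 0.1 the state is still above 0.5 after five "down" observations, while a ratio of 0.5 has it near zero, which is the convergence-speed tradeoff at issue.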

Server failure scenarios we should be able to quickly handle:

  • Immediately refuses connection attempts (mysql not listening)
  • Immediately rejects connection attempts (max_connections)
  • Times out for connection attempts (packet loss, mysql accept() loop stuck)
  • Takes a long time to accept connections (packet loss, CPU use)
  • Takes a long time to run even trivial non-data queries like SET SESSION sql_mode = 'x' (packet loss, CPU/IO use)
  • Replication stops or slows to a crawl (packet loss, replicated query failure, corruption, CPU/IO use, futex deadlocks)

See T314020 for ideas about tracking connection attempt failures.

Event Timeline

Restricted Application added a subscriber: Aklapper.

Change 636465 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] rdbms: rewrite LoadMonitor caching of server states

https://gerrit.wikimedia.org/r/636465

aaron triaged this task as Lowest priority.Jan 7 2022, 1:10 AM
aaron renamed this task from Rewrite LoadMonitor to better handle cache regeneration and improve separation of concern to Make LoadMonitor server states more up-to-date and respond to outages more quickly.Oct 26 2022, 10:20 PM
aaron updated the task description. (Show Details)
aaron updated the task description. (Show Details)

Change 851777 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/core@master] rdbms: add Database method to get connection counts and lag

https://gerrit.wikimedia.org/r/851777

Change 868514 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/core@master] rdbms: add LoadBalancer::CONN_INTENT_PROBE flag for use by LoadMonitor

https://gerrit.wikimedia.org/r/868514

Change 868514 merged by jenkins-bot:

[mediawiki/core@master] rdbms: add CONN_UNTRACKED_GAUGE LoadBalancer flag for LoadMonitor gauging

https://gerrit.wikimedia.org/r/868514

Change 882792 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/core@master] rdbms: improve the moving average method in LoadMonitor

https://gerrit.wikimedia.org/r/882792

Change 882793 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/core@master] rdbms: tweak the refresh probability in LoadMonitor

https://gerrit.wikimedia.org/r/882793

Change 882792 merged by jenkins-bot:

[mediawiki/core@master] rdbms: improve the moving average method in LoadMonitor

https://gerrit.wikimedia.org/r/882792

Modelling the probability function from Gerrit change 882793 with STATE_TTR_MIN = 0.2 and STATE_TTR_MAX = 1.0:

| Request rate | Refresh rate r^1 | Refresh rate r^2 | Refresh rate r^3 |
| --- | --- | --- | --- |
| 1 | 1.00 | 1.00 | 1.00 |
| 2 | 1.23 | 1.08 | 1.03 |
| 4 | 1.48 | 1.21 | 1.12 |
| 8 | 1.81 | 1.39 | 1.24 |
| 16 | 2.18 | 1.61 | 1.38 |
| 32 | 2.58 | 1.85 | 1.55 |
| 64 | 2.99 | 2.12 | 1.73 |
| 128 | 3.37 | 2.39 | 1.92 |
| 256 | 3.72 | 2.68 | 2.13 |
| 512 | 4.02 | 2.96 | 2.34 |
| 1024 | 4.26 | 3.23 | 2.55 |
| 2048 | 4.45 | 3.48 | 2.77 |
| 4096 | 4.60 | 3.71 | 2.98 |
| 8192 | 4.71 | 3.92 | 3.19 |

Produced by P44694, validated with P44695.

You can express that as a cache hit ratio, but that only tells you that the best exponent is infinity, i.e. a step function. But a step function has the worst stampede protection. The idea is that there is some probability function which balances cache hit ratio with stampede protection.

The model above uses equally-spaced requests, so it can't easily tell you about stampede protection. Maybe it's time to wheel out the big guns: EventSimulator has an easy way to do Poisson-distributed requests.
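Concretely, the probability family being modelled can be sketched like this. This is an assumed reconstruction of the shape (using the STATE_TTR_MIN/STATE_TTR_MAX values from the model above), not the exact code from change 882793:

```php
<?php
// Assumed sketch of the refresh-probability family: below STATE_TTR_MIN
// seconds of age a refresh is never due, above STATE_TTR_MAX it always
// is, and in between the chance rises as x^k. A larger exponent k
// approaches a step function: better cache hit ratio, but worse
// stampede protection.
const STATE_TTR_MIN = 0.2;
const STATE_TTR_MAX = 1.0;

function refreshChance( float $age, int $k ): float {
	if ( $age <= STATE_TTR_MIN ) {
		return 0.0;
	}
	if ( $age >= STATE_TTR_MAX ) {
		return 1.0;
	}
	$x = ( $age - STATE_TTR_MIN ) / ( STATE_TTR_MAX - STATE_TTR_MIN );
	return $x ** $k;
}

function isStateRefreshDue( float $age, int $k ): bool {
	// Each caller rolls independently, which spreads regenerations out
	// instead of stampeding at a fixed expiry time.
	return ( mt_rand() / mt_getrandmax() ) < refreshChance( $age, $k );
}
```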

I modified the cubic formula after testing 4k/8k qps and 8-16ms generation delays.
Toying around with https://www.desmos.com/calculator/r0j2enc9ej was also useful.

I improved the testing script to tabulate more data and use a queue for writes that would be concurrent. {F36897387}

The old formula (0-500ms random TTL) gave these results:

isStateRefreshDueV1 (ac="ave contention", mc="max contention"):

| Req/s | Regen/s (2ms delay) | Regen/s (8ms delay) | Regen/s (16ms delay) |
| --- | --- | --- | --- |
| 1 | 0.81 [ac=0.00, mc=2] | 0.81 [ac=0.01, mc=4] | 0.82 [ac=0.01, mc=3] |
| 2 | 1.37 [ac=0.00, mc=3] | 1.38 [ac=0.01, mc=3] | 1.40 [ac=0.02, mc=4] |
| 4 | 2.16 [ac=0.00, mc=3] | 2.20 [ac=0.02, mc=4] | 2.25 [ac=0.03, mc=4] |
| 8 | 3.19 [ac=0.01, mc=3] | 3.28 [ac=0.03, mc=4] | 3.41 [ac=0.05, mc=5] |
| 16 | 4.56 [ac=0.01, mc=3] | 4.75 [ac=0.04, mc=4] | 5.00 [ac=0.08, mc=5] |
| 32 | 6.49 [ac=0.01, mc=4] | 6.86 [ac=0.05, mc=5] | 7.39 [ac=0.11, mc=6] |
| 64 | 9.25 [ac=0.02, mc=3] | 10.00 [ac=0.08, mc=5] | 11.06 [ac=0.16, mc=6] |
| 128 | 13.20 [ac=0.03, mc=4] | 14.72 [ac=0.11, mc=5] | 16.76 [ac=0.23, mc=7] |
| 256 | 18.89 [ac=0.04, mc=4] | 21.97 [ac=0.16, mc=5] | 26.35 [ac=0.35, mc=7] |
| 512 | 27.20 [ac=0.05, mc=4] | 33.72 [ac=0.23, mc=6] | 42.21 [ac=0.52, mc=9] |
| 1024 | 39.64 [ac=0.08, mc=4] | 52.04 [ac=0.34, mc=6] | 70.12 [ac=0.78, mc=11] |
| 2048 | 57.83 [ac=0.11, mc=4] | 83.39 [ac=0.51, mc=8] | 121.84 [ac=1.23, mc=11] |
| 4096 | 86.73 [ac=0.16, mc=5] | 140.22 [ac=0.78, mc=9] | 221.49 [ac=2.01, mc=15] |
| 8192 | 129.96 [ac=0.22, mc=6] | 242.93 [ac=1.22, mc=11] | 414.84 [ac=3.44, mc=22] |
| 16384 | 201.23 [ac=0.33, mc=6] | 440.19 [ac=1.99, mc=15] | 800.72 [ac=6.25, mc=29] |
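For reference, the old behaviour can be sketched roughly as follows. This is a guessed reconstruction from the "0-500ms random TTL" description, not the actual isStateRefreshDueV1() code:

```php
<?php
// Guessed reconstruction (not the actual isStateRefreshDueV1() code):
// each check draws a uniform 0-500ms threshold and considers the cached
// state due for refresh once its age exceeds that threshold.
function isStateRefreshDueV1( float $ageSec ): bool {
	$ttlSec = mt_rand( 0, 500 ) / 1000.0;
	return $ageSec > $ttlSec;
}
```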

The results of the new cubic formula and some other formulas are shown below:

isStateRefreshDuePow3DelayAware (ac="ave contention", mc="max contention"):

| Req/s | Regen/s (2ms delay) | Regen/s (8ms delay) | Regen/s (16ms delay) |
| --- | --- | --- | --- |
| 1 | 0.56 [ac=0.00, mc=2] | 0.50 [ac=0.00, mc=3] | 0.42 [ac=0.01, mc=3] |
| 2 | 0.79 [ac=0.00, mc=3] | 0.68 [ac=0.01, mc=3] | 0.55 [ac=0.01, mc=3] |
| 4 | 1.01 [ac=0.00, mc=3] | 0.86 [ac=0.01, mc=3] | 0.68 [ac=0.01, mc=4] |
| 8 | 1.24 [ac=0.00, mc=3] | 1.04 [ac=0.01, mc=3] | 0.82 [ac=0.01, mc=4] |
| 16 | 1.48 [ac=0.00, mc=3] | 1.25 [ac=0.01, mc=4] | 0.98 [ac=0.02, mc=4] |
| 32 | 1.76 [ac=0.00, mc=3] | 1.50 [ac=0.01, mc=4] | 1.18 [ac=0.02, mc=4] |
| 64 | 2.10 [ac=0.00, mc=3] | 1.79 [ac=0.01, mc=4] | 1.41 [ac=0.02, mc=4] |
| 128 | 2.49 [ac=0.00, mc=3] | 2.14 [ac=0.02, mc=5] | 1.71 [ac=0.03, mc=4] |
| 256 | 2.98 [ac=0.01, mc=3] | 2.58 [ac=0.02, mc=3] | 2.04 [ac=0.03, mc=4] |
| 512 | 3.57 [ac=0.01, mc=3] | 3.12 [ac=0.02, mc=4] | 2.51 [ac=0.04, mc=4] |
| 1024 | 4.28 [ac=0.01, mc=3] | 3.74 [ac=0.03, mc=4] | 3.05 [ac=0.05, mc=4] |
| 2048 | 5.12 [ac=0.01, mc=3] | 4.53 [ac=0.03, mc=4] | 3.72 [ac=0.06, mc=4] |
| 4096 | 6.07 [ac=0.01, mc=3] | 5.45 [ac=0.04, mc=3] | 4.61 [ac=0.07, mc=5] |
| 8192 | 7.13 [ac=0.01, mc=3] | 6.63 [ac=0.05, mc=4] | 5.34 [ac=0.08, mc=4] |
| 16384 | 8.85 [ac=0.02, mc=3] | 8.16 [ac=0.06, mc=4] | 7.03 [ac=0.10, mc=5] |

isStateRefreshDuePow4DelayAware (ac="ave contention", mc="max contention"):

| Req/s | Regen/s (2ms delay) | Regen/s (8ms delay) | Regen/s (16ms delay) |
| --- | --- | --- | --- |
| 1 | 0.55 [ac=0.00, mc=2] | 0.50 [ac=0.00, mc=3] | 0.44 [ac=0.01, mc=3] |
| 2 | 0.76 [ac=0.00, mc=3] | 0.68 [ac=0.01, mc=3] | 0.58 [ac=0.01, mc=4] |
| 4 | 0.96 [ac=0.00, mc=3] | 0.84 [ac=0.01, mc=3] | 0.71 [ac=0.01, mc=4] |
| 8 | 1.14 [ac=0.00, mc=3] | 1.00 [ac=0.01, mc=3] | 0.84 [ac=0.01, mc=4] |
| 16 | 1.32 [ac=0.00, mc=3] | 1.16 [ac=0.01, mc=3] | 0.97 [ac=0.02, mc=5] |
| 32 | 1.52 [ac=0.00, mc=4] | 1.35 [ac=0.01, mc=4] | 1.13 [ac=0.02, mc=5] |
| 64 | 1.75 [ac=0.00, mc=3] | 1.55 [ac=0.01, mc=4] | 1.31 [ac=0.02, mc=4] |
| 128 | 2.01 [ac=0.00, mc=3] | 1.80 [ac=0.01, mc=4] | 1.52 [ac=0.02, mc=4] |
| 256 | 2.33 [ac=0.00, mc=3] | 2.09 [ac=0.02, mc=3] | 1.77 [ac=0.03, mc=4] |
| 512 | 2.68 [ac=0.01, mc=3] | 2.43 [ac=0.02, mc=3] | 2.05 [ac=0.03, mc=4] |
| 1024 | 3.08 [ac=0.01, mc=3] | 2.81 [ac=0.02, mc=4] | 2.38 [ac=0.04, mc=4] |
| 2048 | 3.53 [ac=0.01, mc=2] | 3.24 [ac=0.02, mc=3] | 2.80 [ac=0.04, mc=4] |
| 4096 | 4.12 [ac=0.01, mc=3] | 3.75 [ac=0.03, mc=4] | 3.30 [ac=0.05, mc=4] |
| 8192 | 4.75 [ac=0.01, mc=2] | 4.51 [ac=0.03, mc=3] | 4.03 [ac=0.06, mc=4] |
| 16384 | 5.52 [ac=0.01, mc=2] | 5.34 [ac=0.04, mc=4] | 4.55 [ac=0.07, mc=3] |

isStateRefreshDuePow5DelayAware (ac="ave contention", mc="max contention"):

| Req/s | Regen/s (2ms delay) | Regen/s (8ms delay) | Regen/s (16ms delay) |
| --- | --- | --- | --- |
| 1 | 0.54 [ac=0.00, mc=2] | 0.50 [ac=0.00, mc=3] | 0.45 [ac=0.01, mc=3] |
| 2 | 0.74 [ac=0.00, mc=3] | 0.68 [ac=0.01, mc=3] | 0.60 [ac=0.01, mc=4] |
| 4 | 0.92 [ac=0.00, mc=3] | 0.84 [ac=0.01, mc=4] | 0.73 [ac=0.01, mc=4] |
| 8 | 1.08 [ac=0.00, mc=3] | 0.98 [ac=0.01, mc=4] | 0.85 [ac=0.01, mc=4] |
| 16 | 1.23 [ac=0.00, mc=3] | 1.12 [ac=0.01, mc=4] | 0.97 [ac=0.01, mc=5] |
| 32 | 1.39 [ac=0.00, mc=3] | 1.26 [ac=0.01, mc=4] | 1.10 [ac=0.02, mc=4] |
| 64 | 1.56 [ac=0.00, mc=3] | 1.43 [ac=0.01, mc=4] | 1.25 [ac=0.02, mc=4] |
| 128 | 1.76 [ac=0.00, mc=3] | 1.61 [ac=0.01, mc=3] | 1.41 [ac=0.02, mc=5] |
| 256 | 1.97 [ac=0.00, mc=3] | 1.83 [ac=0.01, mc=4] | 1.62 [ac=0.02, mc=4] |
| 512 | 2.20 [ac=0.00, mc=3] | 2.07 [ac=0.02, mc=4] | 1.84 [ac=0.03, mc=4] |
| 1024 | 2.51 [ac=0.01, mc=3] | 2.33 [ac=0.02, mc=3] | 2.07 [ac=0.03, mc=4] |
| 2048 | 2.82 [ac=0.01, mc=3] | 2.64 [ac=0.02, mc=3] | 2.41 [ac=0.04, mc=5] |
| 4096 | 3.13 [ac=0.01, mc=3] | 2.97 [ac=0.02, mc=3] | 2.70 [ac=0.04, mc=4] |
| 8192 | 3.59 [ac=0.01, mc=2] | 3.50 [ac=0.03, mc=3] | 3.14 [ac=0.04, mc=3] |
| 16384 | 3.95 [ac=0.01, mc=2] | 3.82 [ac=0.03, mc=3] | 3.57 [ac=0.05, mc=4] |

Change 882793 merged by jenkins-bot:

[mediawiki/core@master] rdbms: tweak the refresh probability in LoadMonitor

https://gerrit.wikimedia.org/r/882793

@aaron: Removing task assignee as this open task has been assigned for more than two years - See the email sent to task assignee on February 22nd, 2023.
Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome! :)
If this task has been resolved in the meantime, or should not be worked on by anybody ("declined"), please update its task status via "Add Action… 🡒 Change Status".
Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips on how to best manage your individual work in Phabricator. Thanks!

I compared https://gerrit.wikimedia.org/r/c/mediawiki/core/+/636465/ and git master using small scripts:

lm_test.php

echo "Testing with apc.enable_cli=" . ini_get( 'apc.enable_cli' ) . "\n";
$lb = \Wikimedia\TestingAccessWrapper::newFromObject(
	\MediaWiki\MediaWikiServices::getInstance()->getDBLoadBalancer()
);
$prior = [];
while ( true ) {
	$loads = $lb->groupLoads[''];
	$lb->getLoadMonitor()->scaleLoads( $loads, $lb->getLocalDomainID() );
	if ( $loads !== $prior ) {
		echo "loads=[" . implode( ',', $loads ) . "] @" . round( microtime( true ), 3 ) . "\n";
	}
	$prior = $loads;
	time_nanosleep( 0, 10000000 ); // 10ms poll interval
}

lm_test.sh

#!/bin/bash
sudo killall php
for i in {1..99}; do php maintenance/run.php eval < lm_test.php > /dev/null & done
php maintenance/run.php eval < lm_test.php

In terminal A:

./lm_test.sh

Then, in terminal B:

sudo systemctl stop mariadb@db3 && date +%s.%3N && sleep 20 && sudo systemctl start mariadb@db3 && date +%s.%3N

On my 3-server setup, it shows the convergence time of the load weights (db1,db2,db3) from [0,100,100] => [0,100,1] after stopping db3, then convergence from [0,100,1] => [0,100,99] after restarting it. I get ~5.5-7 seconds for master and ~2.5-3 seconds for https://gerrit.wikimedia.org/r/882793 .

I also tried adding a db5 load with a bogus port then directly injected a cache entry via eval.php:

$cache = \MediaWiki\MediaWikiServices::getInstance()->getMainWANObjectCache();
$cache->set(
	$cache->makeGlobalKey( 'rdbms-gauge', '2', 's1', 'db1', 'db5' ),
	[ 'up' => 1.0, 'lag' => 0, 'time' => microtime( true ), 'delay' => 0 ],
	60
);
echo microtime( true ) . "\n";

This yielded similar results.

I ran this with a usleep() added during the poll phase (1ms-1000ms). Based on the previously poor results, the isStateRefreshDue() method exponent was tweaked from $genRatio*128 to min($genRatio*64, 0.1) to stay more responsive to slow servers.

I reran the isStateRefreshDue() stochastic simulations for that below.
isStateRefreshDuePow4DelayAwareLate (ac="ave contention", mc="max contention", t="total time"):

| Req/s | Regen/s (16ms delay) | Regen/s (512ms delay) | Regen/s (1024ms delay) | Regen/s (4096ms delay) |
| --- | --- | --- | --- | --- |
| 1 (t=2000001) | 0.02 [ac=0.00, mc=2] | 0.02 [ac=0.01, mc=3] | 0.02 [ac=0.02, mc=4] | 0.01 [ac=0.05, mc=5] |
| 2 (t=1000002) | 0.02 [ac=0.00, mc=2] | 0.02 [ac=0.01, mc=3] | 0.02 [ac=0.02, mc=4] | 0.02 [ac=0.06, mc=6] |
| 4 (t=500000) | 0.02 [ac=0.00, mc=2] | 0.02 [ac=0.01, mc=3] | 0.02 [ac=0.02, mc=5] | 0.02 [ac=0.07, mc=5] |
| 8 (t=250000) | 0.03 [ac=0.00, mc=2] | 0.03 [ac=0.01, mc=3] | 0.03 [ac=0.03, mc=4] | 0.02 [ac=0.08, mc=5] |
| 16 (t=125000) | 0.03 [ac=0.00, mc=2] | 0.03 [ac=0.01, mc=3] | 0.03 [ac=0.03, mc=4] | 0.03 [ac=0.09, mc=8] |
| 32 (t=62500) | 0.03 [ac=0.00, mc=2] | 0.03 [ac=0.02, mc=3] | 0.04 [ac=0.03, mc=5] | 0.03 [ac=0.11, mc=7] |
| 64 (t=31250) | 0.04 [ac=0.00, mc=2] | 0.04 [ac=0.02, mc=3] | 0.04 [ac=0.04, mc=4] | 0.04 [ac=0.13, mc=6] |
| 128 (t=15625) | 0.05 [ac=0.00, mc=2] | 0.05 [ac=0.02, mc=3] | 0.05 [ac=0.05, mc=5] | 0.05 [ac=0.15, mc=8] |
| 256 (t=7812) | 0.05 [ac=0.00, mc=2] | 0.05 [ac=0.03, mc=3] | 0.06 [ac=0.05, mc=4] | 0.05 [ac=0.18, mc=5] |
| 512 (t=3906) | 0.06 [ac=0.00, mc=1] | 0.06 [ac=0.03, mc=3] | 0.07 [ac=0.06, mc=6] | 0.07 [ac=0.21, mc=6] |
| 1024 (t=1953) | 0.07 [ac=0.00, mc=2] | 0.07 [ac=0.04, mc=3] | 0.08 [ac=0.07, mc=4] | 0.09 [ac=0.26, mc=6] |
| 2048 (t=976) | 0.08 [ac=0.00, mc=1] | 0.09 [ac=0.04, mc=3] | 0.10 [ac=0.09, mc=4] | 0.11 [ac=0.32, mc=6] |
| 4096 (t=488) | 0.09 [ac=0.00, mc=2] | 0.11 [ac=0.05, mc=3] | 0.11 [ac=0.10, mc=2] | 0.14 [ac=0.37, mc=6] |
| 8192 (t=244) | 0.11 [ac=0.00, mc=1] | 0.10 [ac=0.05, mc=2] | 0.14 [ac=0.12, mc=3] | 0.19 [ac=0.47, mc=8] |

Change 914030 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/extensions/EventSimulator@master] Add ConnectTimeoutScenario and related tweaks

https://gerrit.wikimedia.org/r/914030

EventSimulator modelling of Gerrit 636465, tentative initial results. db1119 fails at t=5, waiting for the configured connection timeout and then incrementing the failed connection count metric. Failures of LoadMonitor related connections can be seen at t=6, and failures of regular connections can be seen at t=8. Aaron's patch reduces the failure count at t=6 and makes the ramp down of connection attempts about 0.5 seconds quicker.

g636465.png (704×1 px, 40 KB)

Change 636465 merged by jenkins-bot:

[mediawiki/core@master] rdbms: improve caching and state convergence in LoadMonitor

https://gerrit.wikimedia.org/r/636465

Change 914030 merged by jenkins-bot:

[mediawiki/extensions/EventSimulator@master] Add ConnectTimeoutScenario and related tweaks

https://gerrit.wikimedia.org/r/914030

What timeout did you use? The LoadMonitor probe connections use 1 second.

Change 922180 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/core@master] rdbms: avoid session variable SET query for LoadMonitor connections

https://gerrit.wikimedia.org/r/922180

Change 922180 merged by jenkins-bot:

[mediawiki/core@master] rdbms: avoid session variable SET query for LoadMonitor connections

https://gerrit.wikimedia.org/r/922180