Special:RunSingleJob waits for replication using LoadBalancer::waitForAll() after every job. @jcrespo reports that when there is replication lag, this causes so many connections waiting in SELECT MASTER_GTID_WAIT() that the connection limit is reached and the site goes down.
The point of waitForAll() is to throttle a loop so that the loop runs at the speed of the slowest slave. There's no point in calling it if the rate at which jobs are executed is not affected by the latency of each job.
My proposal is that the LoadBalancer::waitForAll() call be removed from JobExecutor. Instead, ChangeProp should monitor replication lag itself, and stop popping jobs from Kafka for a given section/partition if the lag is too high.
Also, the concurrency limits should be reviewed.
The exact algorithm that ChangeProp should use to monitor lag is a tricky subject. If it keeps executing jobs until 3s of lag is reached and then stops completely, job execution may leave the lag permanently stuck at 3s, which is too high. But if it backs off at, say, 0.2s, it may never get anything done. The difficulty is in assigning a cause to an increase in lag. Any lag that is caused by job execution should be a signal to reduce the rate. Lag that is caused by other things should be a signal too, but a weaker one. Executing jobs at a high rate may exacerbate pre-existing lag, but executing jobs at a near-zero rate should have no effect.
One possibility would be to have a PID controller which controls a rate limiter. The rate limiter could add a sleep time between jobs derived by combining the lag (P), the sum of the lag times tallied over some time interval (I), and the rate at which lag is increasing (D).
Incorporating the integral (I) term means that the controller can tolerate temporary lag but take action if the lag persists. The controller is linear, so there are no surprising thresholds: the rate will ramp up and down continuously.
Tuning a PID controller can be difficult, since it can oscillate if the I or D factors are too high. I like it because it's smart while still being possible to understand and debug. You can expose separate P, I and D metrics as well as the resulting added latency.
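To make the idea concrete, here is a rough sketch of such a controller in Python. It is not ChangeProp code: the class name, the gains, the 60s window and the sleep cap are all made-up values for illustration, and the clamp to a non-negative, bounded sleep is my own practical addition on top of the otherwise linear P + I + D sum. The I term is the trapezoidal sum of lag over the window, and the D term is the slope between the last two samples.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class LagPidThrottle:
    """Hypothetical PID-style throttle: maps replication lag to a sleep time."""

    kp: float = 0.5          # assumed gain on current lag (P)
    ki: float = 0.1          # assumed gain on lag summed over the window (I)
    kd: float = 0.2          # assumed gain on rate of change of lag (D)
    window: float = 60.0     # assumed interval (s) over which lag is tallied
    max_sleep: float = 10.0  # assumed upper bound on the added sleep (s)
    _samples: deque = field(default_factory=deque, init=False, repr=False)

    def update(self, now: float, lag: float) -> dict:
        """Record a lag sample taken at time `now`; return P, I, D and sleep."""
        prev = self._samples[-1] if self._samples else None
        self._samples.append((now, lag))
        # Forget samples that have fallen out of the integration window.
        while self._samples[0][0] < now - self.window:
            self._samples.popleft()

        p = self.kp * lag

        # Trapezoidal integral of lag over the retained samples.
        samples = list(self._samples)
        integral = sum(
            0.5 * (l0 + l1) * (t1 - t0)
            for (t0, l0), (t1, l1) in zip(samples, samples[1:])
        )
        i = self.ki * integral

        # Slope of lag between the previous sample and this one.
        d = 0.0
        if prev is not None and now > prev[0]:
            d = self.kd * (lag - prev[1]) / (now - prev[0])

        # A sleep cannot be negative, and the cap keeps the queue moving.
        sleep = min(max(p + i + d, 0.0), self.max_sleep)
        return {"p": p, "i": i, "d": d, "sleep": sleep}
```

The caller would take a lag sample per section, call update() with the current time, sleep for the returned duration before popping the next job, and export the p, i, d and sleep values as metrics so the tuning can be debugged.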
Lag can be measured using the MW API, e.g. https://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=dbrepllag&sishowalldb=
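A minimal sketch of reading that value, in Python for illustration only. It assumes format=json is appended to the URL above and that the response has the shape {"query": {"dbrepllag": [{"host": ..., "lag": ...}, ...]}}. The function names and the User-Agent string are hypothetical, and entries whose lag is not a number are skipped on the assumption that they mean "unknown".

```python
import json
import urllib.request

# The URL from above with format=json appended (an addition for this sketch).
API_URL = (
    "https://en.wikipedia.org/w/api.php"
    "?action=query&meta=siteinfo&siprop=dbrepllag&sishowalldb=&format=json"
)


def max_lag(response: dict) -> float:
    """Return the largest numeric lag reported in a siteinfo response."""
    entries = response.get("query", {}).get("dbrepllag", [])
    lags = [
        float(e["lag"])
        for e in entries
        # Skip missing or non-numeric values (bool is excluded explicitly
        # because it is a subclass of int in Python).
        if isinstance(e.get("lag"), (int, float))
        and not isinstance(e.get("lag"), bool)
    ]
    if not lags:
        raise ValueError("no numeric lag values in response")
    return max(lags)


def fetch_max_lag(url: str = API_URL, timeout: float = 5.0) -> float:
    """Fetch the siteinfo response over HTTP and return the worst lag."""
    req = urllib.request.Request(
        url, headers={"User-Agent": "lag-check-sketch/0.1"}  # hypothetical UA
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return max_lag(json.load(resp))
```

Taking the maximum matches the waitForAll() idea of running at the speed of the slowest slave; a per-host view would be needed if the throttle should ignore particular hosts.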