
RFC: Site-wide edit rate limiting with PoolCounter
Open, Medium, Public

Description

  • Affected components: MediaWiki Core, Wikibase.
  • Engineer(s) or team for initial implementation: WMDE (Wikidata team)
  • Code steward: TBD.

Motivation

Wikidata is a unique installation of MediaWiki. The edit rate on this wiki has gone up to 1,000 edits per minute and has been testing our infrastructure's scalability since the day it went live. The edits are mostly made by bots, and bots have the noratelimit right, meaning no rate limit can be applied to them.

Forcing a rate limit for bots on Wikidata was tried and caused several issues, so it had to be rolled back: see T184948: limit page creation and edit rate on Wikidata and T192690: Mass message broken on Wikidata after ratelimit workaround. One main reason is that bot operators want to edit at full speed when the infrastructure is quiet; forcing an arbitrary number like 100 edits per minute would not solve the issue and limits bots at times when the infrastructure can actually take more. It also broke MassMessage.

With the current flow of edits, the WDQS updater can't keep up and was sometimes lagging for days, so Wikidata now takes the median lag of the WDQS updater (divided by 60) into account for maxlag (see T221774: Add Wikidata query service lag to Wikidata maxlag). As a matter of policy, bots stop if maxlag is more than 5 (e.g. the maximum replication lag from the master database to a replica is more than five seconds, or the size of the job queue divided by $jobQueueLagFactor is bigger than five). This means that once the median WDQS lag reaches 5 minutes, most bots stop until the WDQS updater catches up; then maxlag drops below five, the bots start to edit again, WDQS starts to lag behind again, and so on. It has been oscillating like this for months now:
(Graph: an example of the oscillation over the last six hours.)
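
For illustration, the way these lag sources combine into the maxlag value that bots see can be sketched roughly as follows; the max() composition and all numbers are illustrative assumptions, not the exact MediaWiki implementation:

```python
# Sketch: how the effective maxlag reported to bots is composed, per the
# description above. The simple max() composition and the constants are
# illustrative assumptions only.

JOB_QUEUE_LAG_FACTOR = 1_000_000  # placeholder value for $jobQueueLagFactor

def effective_maxlag(db_replication_lag_s, job_queue_size, wdqs_median_lag_s):
    """Return the largest lag value reported by the three sources."""
    return max(
        db_replication_lag_s,                   # master-to-replica lag, seconds
        job_queue_size / JOB_QUEUE_LAG_FACTOR,  # job queue backlog, scaled
        wdqs_median_lag_s / 60,                 # WDQS updater median lag, scaled
    )

# A WDQS median lag of 300 s (5 minutes) alone pushes maxlag to 5, the
# threshold at which well-behaved bots stop editing.
print(effective_maxlag(0.5, 200_000, 300))  # -> 5.0
```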

Changing the factor, for example multiplying it by five (300), only changes the period of the oscillation: T244722: increase factor for query service that is taken into account for maxlag

It's important to note that the maxlag approach has been causing disruptions for pywikibot and other users that respect maxlag, even for read queries. You can see more in T243701: Wikidata maxlag repeatedly over 5s since Jan 20, 2020 (primarily caused by the query service). Even pywikibot's CI has issues because maxlag is high all the time: T242081: Pywikibot fails to access Wikidata due to high maxlag lately

The underlying problem, of course, is that the WDQS updater cannot handle the sheer flow of edits; it is currently a scalability bottleneck. This is being addressed in T244590: EPIC: Rework the WDQS updater as an event driven application, but we need to keep in mind that there will always be a bottleneck somewhere. We can't just dismiss the problem as "WDQS needs to be fixed". Properly communicating the stress on our infrastructure to our users, so they know when to slow down or stop, is important here, and the maxlag approach has proven to fail at this scale.

Requirements
  • There has to be a way to cap the edit rate site-wide without putting a cap on bots or individual accounts.
    • This can have multiple buckets; for example, bots in total should not make too many edits, so that admins can still run large batches without getting stuck in the same boat as the bots.
    • Also, page creation in Wikidata is several times more complex than making edits, so page creations should have a separate, smaller cap.
  • Starvation must not happen, meaning an enthusiastic bot must not eat all the quota all the time and prevent other bots from editing.
  • No more oscillating behavior.

Exploration

Proposal One: Semaphores

This type of problem has already been addressed in computer science, and semaphores [1] are usually the standard solution in these cases. We would have a dedicated semaphore, initialized with a value of N, for bots editing Wikidata. While an edit by a bot is being saved, that edit decrements the semaphore; when the value reaches zero, further requests have to hold off until an edit finishes, at which point one of the waiting connections is woken up and starts saving its edit. If the queue gets too long (say, N waiting requests), we simply stop and return a "maxlag reached" error to bots. First come, first served would avoid starvation.
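
As a rough sketch of this gating logic, here is what it could look like with an in-process semaphore standing in for PoolCounter (the pool size, queue limit, timeout and error type are all placeholders):

```python
import threading

N = 8  # illustrative pool size: concurrent bot edits allowed
edit_semaphore = threading.BoundedSemaphore(N)
waiting = 0
waiting_lock = threading.Lock()

class TooManyEditsError(Exception):
    """Raised instead of queueing, like a 'maxlag reached' API error."""

def save_bot_edit(do_save, queue_limit=N, timeout=5.0):
    """Run do_save() while holding one of the N edit slots."""
    global waiting
    with waiting_lock:
        if waiting >= queue_limit:
            # The wait queue is already full: reject rather than pile up.
            raise TooManyEditsError("too many queued edits, retry later")
        waiting += 1
    try:
        # Block (up to `timeout` seconds) until a slot frees up.
        if not edit_semaphore.acquire(timeout=timeout):
            raise TooManyEditsError("timed out waiting for an edit slot")
        try:
            return do_save()
        finally:
            edit_semaphore.release()
    finally:
        with waiting_lock:
            waiting -= 1
```

In the real setup the semaphore would live in the PoolCounter daemon rather than in a single web server process, so the limit applies site-wide.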

To implement this, we can use PoolCounter (which is basically SaaS, Semaphore as a Service), which has been working reliably for the past couple of years. PoolCounter is mostly used when an article is being reparsed, so that not too many MediaWiki nodes parse the same article at the same time (the "Michael Jackson effect"). PoolCounter is also already used to cap the total number of concurrent connections per IP to the ORES services; see T160692: Use poolcounter to limit number of connections to ores uwsgi.

Implications:

  • Using PoolCounter reduces the work needed to implement this as it's already well supported by MediaWiki.
  • This would artificially increase the edit saving time when too many edits are happening at the same time.
  • If done incorrectly, processes waiting for the semaphore might hold database (or other) locks for too long, or we might get a deadlock between a lock held in the database by one process and another process waiting for the semaphore to be freed by the first. Databases have good systems in place to avoid or surface deadlocks, but we don't have a system that handles deadlocks across the several locking systems a process might use (database, Redis lock manager, PoolCounter, etc.).
  • If an edit decrements several semaphores (e.g. a page creation is also an edit), there's a chance of deadlocks due to random network latency between processes waiting for each other.

Proposal Two: Continuous throttling

This is reflected in T240442: Design a continuous throttling policy for Wikidata bots. The problem with the current system is that "maxlag" is a hard limit: we can't tell bots to slow down as they approach the limit, so they continue at full speed until everything has to stop.

Implications:

  • There's no easy way to enforce this for our users.
  • There's always a chance of starvation caused by bots not respecting the policy.

It's worth mentioning that proposal one and two are not mutually exclusive.

[1]: A good and free book for people who are not very familiar with semaphores and their applications: The Little Book of Semaphores

Event Timeline

Ladsgroup created this task. · May 7 2020, 1:37 AM
Restricted Application added a subscriber: Aklapper. · May 7 2020, 1:37 AM
@Ladsgroup wrote:

[…] The edit rate on this wiki has been going up to 1,000 edits per minute and has been testing our infrastructure scalability […] The edits have been mostly done by bots […] and bot operators want to edit in full speed when the infrastructure is quiet and forcing an arbitrary number […] limits bots in times that the infrastructure can actually take more.

(Emphasis mine).

@Ladsgroup wrote:

WDQS updater can't keep up and […] we need to keep in mind that there always will be a bottleneck.

It sounds like there are no times when the infrastructure can just handle it all at the current rate. The above suggests that the current rate limit is too high, because we can't keep up with that rate even at normal/quiet times. Right?

If we lower the rate limit, would this pattern not go away? I suppose it could come back if bots use their burst capacity within a single minute, or when there are many different/new bots starting to do the same thing. In that case, the global protections of maxlag kick in automatically to restore us. Is that not good enough? Would the global rate limit behave differently in practice?

@Ladsgroup wrote:

[…] This has been oscillating like this for months:

It isn't said explicitly, but it sounds like the oscillating pattern is considered a problem. Is that right? What kinds of problems is it causing, and for whom/what? I can understand that regularly reaching a lag of 5s is not great, but it seems like an expected outcome if we set the bot maxlag to 5s. If we want the "situation normal" lag peaks to be lower, then we should set that maxlag parameter lower.

daniel added a subscriber: daniel. · May 13 2020, 8:25 PM

For reference, Brad used PoolCounter to impose a limit on Special:Contributions recently; see https://gerrit.wikimedia.org/r/c/mediawiki/core/+/551909

@Ladsgroup wrote:

[…] The edit rate on this wiki has been going up to 1,000 edits per minute and has been testing our infrastructure scalability […] The edits have been mostly done by bots […] and bot operators want to edit in full speed when the infrastructure is quiet and forcing an arbitrary number […] limits bots in times that the infrastructure can actually take more.

(Emphasis mine).

@Ladsgroup wrote:

WDQS updater can't keep up and […] we need to keep in mind that there always will be a bottleneck.

It sounds like there are no times when the infrastructure can just handle it all at the current rate. The above suggests that the current rate limit is too high, because we can't keep up with that rate even at normal/quiet times. Right?

No. Let me clarify:

  • By "the infrastructure can actually take more" I mean the times that there are less edits happening for example midnight when human edits are low. or days that a bot is broken/has nothing to do and other bots can go faster
  • "The above suggests that the current rate limit is too high," this is not correct, the problem is that there is no rate limit for bots at all. The group explicitly doesn't have a rate limit. Adding such ratelimit was tried and caused lots of issues (even with a pretty high number). In other words, inside the mediawiki, for bots, we are at mercy of them and based on contracts and API etiquettes, we tell them the pressure on the server and they adjust their speed based on that and maxlag is a proxy of a metric on the pressure of the server. If any bot doesn't respect maxlag, they'll be blocked. but the problem is that maxlag is not a good enough metrics to bots.

If we lower the rate limit, would this pattern not go away?

As I said before, there's no rate limit for bots.

I suppose it could come back if bots use their burst capacity within a single minute, or when there are many different/new bots starting to do the same thing.

Bursts of a lot of activity are fine; they make all bots stop so the system can recover. The problem right now is that the edit rate is too high virtually all the time.

In that case, the global protections of maxlag kick in automatically to restore us. Is that not good enough? Would the global rate limit behave differently in practice?

Yes, it would be different: it would keep the flow under control all the time instead of oscillating.

@Ladsgroup wrote:

[…] This has been oscillating like this for months:

It isn't said explicitly, but it sounds like the oscillating pattern is considered a problem. Is that right? What kinds of problems is it causing, and for whom/what? I can understand that regularly reaching a lag of 5s is not great, but it seems like an expected outcome if we set the bot maxlag to 5s. If we want the "situation normal" lag peaks to be lower, then we should set that maxlag parameter lower.

Well, it is a big problem. Please read T243701: Wikidata maxlag repeatedly over 5s since Jan 20, 2020 (primarily caused by the query service); as I mentioned, this pattern even broke the CI (Travis) of pywikibot.

Joe added a subscriber: Joe. · May 20 2020, 6:02 AM
  • "The above suggests that the current rate limit is too high," this is not correct, the problem is that there is no rate limit for bots at all. The group explicitly doesn't have a rate limit. Adding such ratelimit was tried and caused lots of issues (even with a pretty high number).

What kind of issues, specifically?

I find the idea that we can't impose an upper limit on edits per minute bizarre, in the abstract, but there might be good reasons for that.

Joe added a comment. · Edited · May 20 2020, 6:09 AM

So, while I find the idea of using poolcounter to limit the editing concurrency (it's not rate-limiting, which is different) a good proposal, and in general something desirable to have (including the possibility we tune it down to zero if we're in a crisis for instance), I think the fundamental problem reported here is that WDQS can't ingest the updates fast enough.

So the solution should be sought there: either we improve the performance of WDQS in ingesting updates (and I see there are future plans for that), or we stop considering it when calculating maxLag. We should not limit the edits happening to Wikidata just because a dependent system can't keep up the pace.

Tuning when in crisis is probably a more accurate description of what we want to aim for, whether that happens automatically or manually.

The issue of the WDQS updater should indeed be seen as a separate issue, and that is being solved separately.

Maxlag is currently the system being abused to provide some sort of rate limit on the site as a whole. You could say we have been in a bit of a constant crisis over the last six months regarding the expectations placed on the query service, which is critical to many workflows, versus what the service was able to deliver.

With that in mind, though, why do we have maxlag at all? We have the same problem with pure maxlag, as demonstrated at the weekend when one of the s8 DB servers was overwhelmed, with a lag of 9 for 12 hours.
Another element of maxlag, the dispatch system, ended up at about 15 (I think) for the same period.
But the effect of either of those systems reporting that value of maxlag is 0 edits by automated systems for a 12-hour period.
That isn't really desirable; instead, being able to control concurrency could be seen as an answer.

We could look at this weekend's issue again as an individual problem to fix, as with the query service, but as alluded to above, there will always be more crisis situations where this mechanism would help.

I can also see this from the other side of the fence: if we were in a situation where Wikidata was negatively impacting enwiki, I imagine a response would be to set Wikidata to read-only for a period, or to use maxlag to slow down editing. However, that isn't really desirable, and having a control mechanism, rather than just an on or off switch, would be great.

tstarling added a subscriber: tstarling. · Edited · May 21 2020, 12:38 AM

This proposal is effectively a dynamic rate limit except that instead of delivering an error message when it is exceeded, we will just hold the connection open, forcing the bot to wait. That's expensive in terms of server resources -- we'd rather have the client wait using only its own resources. A rate limit has a tunable parameter (the rate) which is not really knowable. Similarly, this proposal has a tunable parameter (the pool size) which is not really knowable. You have to tune the pool size down until the replag stops increasing, but then if the nature of the edits changes, or if the hardware changes, the optimal pool size will change.

I suggested at T202107 that the best method for globally controlling replication lag would be with a PID controller. A PID controller suppresses oscillation by having a memory of recent changes in the metric. The P (proportional) term is essentially as proposed at T240442 -- just back off proportionally as the lag increases. The problem with this is that it will settle into an equilibrium lag somewhere in the middle of the range. The I (integral) term addresses this by maintaining a rolling average and adjusting the control value until the average meets the desired value. This allows it to maintain approximately the same edit rate but with a lower average replication lag. The D (derivative) term causes the control value to be reduced more aggressively if the metric is rising quickly.

My proposal is to use a PID controller to set the Retry-After header. Clients would be strongly encouraged to respect that header. We could have say maxlag=auto to opt in to this system.

Really the client has to wait every time, so there needs to be a delay hint header like Retry-After with every response. So it's not exactly maxlag=auto.
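
Roughly, such a controller might look like the following sketch (PI form only, since the derivative term is dropped later in this discussion; the gains, sampling interval and clamp range are placeholders, not tuned values):

```python
class PIController:
    """Turn a lag-style metric into a suggested client back-off in seconds,
    e.g. a Retry-After value. Gains are untuned placeholders."""

    def __init__(self, k_p=1.0, k_i=0.1, out_min=0.0, out_max=600.0):
        self.k_p = k_p
        self.k_i = k_i
        self.out_min = out_min
        self.out_max = out_max
        self.integral = 0.0

    def update(self, error, dt):
        """`error` is the metric minus its target (0); `dt` is seconds since
        the previous sample. Returns a clamped back-off in seconds."""
        self.integral += error * dt
        # Clamp the integral so it cannot wind up far past the output range.
        max_i = self.out_max / self.k_i if self.k_i else 0.0
        self.integral = max(-max_i, min(max_i, self.integral))
        output = self.k_p * error + self.k_i * self.integral
        return max(self.out_min, min(self.out_max, output))

# e.g. sample the metric every 10 seconds:
# delay = controller.update(current_metric, dt=10)
# if delay > 0: respond with "Retry-After: <int(delay)>"
```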

  • "The above suggests that the current rate limit is too high," this is not correct, the problem is that there is no rate limit for bots at all. The group explicitly doesn't have a rate limit. Adding such ratelimit was tried and caused lots of issues (even with a pretty high number).

What kind of issues, specifically?

I find the idea that we can't impose an upper limit on edits per minute bizarre, in the abstract, but there might be good reasons for that.

It broke MassMessage (T192690: Mass message broken on Wikidata after ratelimit workaround); also see the discussions in T184948: limit page creation and edit rate on Wikidata.

So, while I find the idea of using poolcounter to limit the editing concurrency (it's not rate-limiting, which is different) a good proposal, and in general something desirable to have (including the possibility we tune it down to zero if we're in a crisis for instance), I think the fundamental problem reported here is that WDQS can't ingest the updates fast enough.

My opinion is that there will always be a bottleneck in the rate of digesting edits somewhere in the infrastructure; if we fix WDQS in the next couple of months, edits will also scale up and we might hit a similar issue in, for example, search index updates. See T243701#6152282

So the solution should be sought there: either we improve the performance of WDQS in ingesting updates (and I see there are future plans for that), or we stop considering it when calculating maxLag. We should not limit the edits happening to Wikidata just because a dependent system can't keep up the pace.

On paper they are dependent, but in reality they are not. When we didn't count WDQS lag into maxlag, the lag was sometimes as high as half a day (and growing). This actually caused issues because lots of tools and systems that edit Wikidata use WDQS: they were getting outdated data and doing basic GIGO, using it to add wrong data to Wikidata, and this feedback loop caused issues. Also, it's safe to assume WDQS might be lagged by maybe even half an hour, but when it's lagged for half a day, it breaks lots of implicit assumptions of tool builders, similar to the search index in Wikipedia lagging behind for a day.

This proposal is effectively a dynamic rate limit except that instead of delivering an error message when it is exceeded, we will just hold the connection open, forcing the bot to wait. That's expensive in terms of server resources -- we'd rather have the client wait using only its own resources. A rate limit has a tunable parameter (the rate) which is not really knowable. Similarly, this proposal has a tunable parameter (the pool size) which is not really knowable. You have to tune the pool size down until the replag stops increasing, but then if the nature of the edits changes, or if the hardware changes, the optimal pool size will change.

I suggested at T202107 that the best method for globally controlling replication lag would be with a PID controller. A PID controller suppresses oscillation by having a memory of recent changes in the metric. The P (proportional) term is essentially as proposed at T240442 -- just back off proportionally as the lag increases. The problem with this is that it will settle into an equilibrium lag somewhere in the middle of the range. The I (integral) term addresses this by maintaining a rolling average and adjusting the control value until the average meets the desired value. This allows it to maintain approximately the same edit rate but with a lower average replication lag. The D (derivative) term causes the control value to be reduced more aggressively if the metric is rising quickly.

My proposal is to use a PID controller to set the Retry-After header. Clients would be strongly encouraged to respect that header. We could have say maxlag=auto to opt in to this system.

That sounds like a good alternative that needs exploring. I haven't thought about it in depth but I promise to do so and come back to you.

This proposal is effectively a dynamic rate limit except that instead of delivering an error message when it is exceeded, we will just hold the connection open, forcing the bot to wait. That's expensive in terms of server resources -- we'd rather have the client wait using only its own resources. A rate limit has a tunable parameter (the rate) which is not really knowable. Similarly, this proposal has a tunable parameter (the pool size) which is not really knowable. You have to tune the pool size down until the replag stops increasing, but then if the nature of the edits changes, or if the hardware changes, the optimal pool size will change.

I suggested at T202107 that the best method for globally controlling replication lag would be with a PID controller. A PID controller suppresses oscillation by having a memory of recent changes in the metric. The P (proportional) term is essentially as proposed at T240442 -- just back off proportionally as the lag increases. The problem with this is that it will settle into an equilibrium lag somewhere in the middle of the range. The I (integral) term addresses this by maintaining a rolling average and adjusting the control value until the average meets the desired value. This allows it to maintain approximately the same edit rate but with a lower average replication lag. The D (derivative) term causes the control value to be reduced more aggressively if the metric is rising quickly.

My proposal is to use a PID controller to set the Retry-After header. Clients would be strongly encouraged to respect that header. We could have say maxlag=auto to opt in to this system.

I quite like the idea of using a PID controller, but there are three notes I want to mention:

  • With PID, we need to define three constants: K_p, K_i and K_d. If we had problems finding the pool size, this is going to be three times more complicated (I didn't find a standard way to determine these coefficients; maybe I'm missing something obvious).
  • We currently don't have infrastructure to hold the "maxlag" data over time so that we can calculate its derivative and integral. Should we use Redis? What would that look like? These are questions I don't have answers for. Do you have ideas for that?
  • I'm not sure "Retry-After" is a good header for 2xx responses. It's like "we accepted your edit, but 'retry' it after 2 seconds". I looked at RFC 7231 and it doesn't explicitly say we can't use it in 2xx responses, but I haven't seen it used in 2xx responses anywhere. We might be able to find a better header?

I hope you don't mind if I contradict my previous comment a bit, since my thinking is still evolving on this.

One problem with using lag as the metric is that it doesn't go negative, so the integral will not be pulled down while the service is idle. We could subtract a target lag, say 1 minute, but that loses some of the supposed benefit of including an integral term. A better metric would be updater load, i.e. demand/capacity. When the load is more than 100%, the lag increases at a rate of 1 second per second, but there's no further information in there as to how heavily overloaded it is. When the load is less than 100%, lag decreases until it reaches zero. While it's decreasing, the slope tells you something about how underloaded it is, but once it hits zero, you lose that information.

Load is average queue size, if you take the currently running batch as being part of the queue. WDQS currently does not monitor the queue size. I gather (after an hour or so of research, I'm new to all this) that with some effort, KafkaPoller could obtain an estimate of the queue size by subtracting the current partition offsets from KafkaConsumer.endOffsets().

Failing that, we can make a rough approximation from available data. We can get the average utilisation of the importer from the rdf-repository-import-time-cnt metric. You can see in Grafana that the derivative of this metric hovers between 0 and 1 when WDQS is not lagged, and remains near 1 when WDQS is lagged. The metric I would propose is to add replication lag to this utilisation metric, appropriately scaled: utilisation + K_lag * lag - 1 where K_lag is say 1/60s. This is a metric which is -1 at idle, 0 when busy with no lag, and 1 with 1 minute of lag. The control system would adjust the request rate to keep this metric (and its integral) at zero.
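
As a quick worked example of this metric (the values are made up; K_lag = 1/60 s as suggested above):

```python
K_LAG = 1 / 60  # per second, as suggested above

def load_metric(utilisation, lag_seconds):
    """-1 when idle, 0 when busy but keeping up, +1 at one minute of lag."""
    return utilisation + K_LAG * lag_seconds - 1

print(load_metric(0.0, 0))   # -1.0 -> idle, the controller can allow more edits
print(load_metric(1.0, 0))   #  0.0 -> fully busy but not lagging
print(load_metric(1.0, 60))  #  1.0 -> one minute behind, edits should slow down
```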

With PID, we need to define three constants: K_p, K_i and K_d. If we had problems finding the pool size, this is going to be three times more complicated (I didn't find a standard way to determine these coefficients; maybe I'm missing something obvious).

One way to simplify it is with K_d=0, i.e. make it a PI controller. Having the derivative in there probably doesn't add much. Then it's only two times more complicated. Although I added K_lag so I suppose we are still at 3. The idea is that it shouldn't matter too much exactly what K_p and K_i are set to -- the system should be stable and have low lag with a wide range of parameter values. So you just pick some values and see if it works.

We currently don't have infrastructure to hold the "maxlag" data over time so that we can calculate its derivative and integral. Should we use Redis? What would that look like? These are questions I don't have answers for. Do you have ideas for that?

WDQS lag is currently obtained by having an ApiMaxLagInfo hook handler which queries Prometheus, caching the result. Prometheus has a query language which can perform derivatives ("rate") and integrals ("sum_over_time") on metrics. So it would be the same system as now, just with a different Prometheus query.

I'm not sure "Retry-After" is a good header for 2xx responses. It's like "we accepted your edit, but 'retry' it after 2 seconds". I looked at RFC 7231 and it doesn't explicitly say we can't use it in 2xx responses, but I haven't seen it used in 2xx responses anywhere. We might be able to find a better header?

The wording in RFC 7231 suggests to me that it is acceptable to use Retry-After in a 2xx response. "Servers send the "Retry-After" header field to indicate how long the user agent ought to wait before making a follow-up request." That seems pretty close to what we're doing.

In summary, we query Prometheus for utilisation + lag / 60 - 1, both the most recent value and the sum over some longer time interval. The sum and the value are separately scaled, then they are added together, then the result is limited to some reasonable range like 0-600s. If it's >0, then we send it as a Retry-After header. Then we badger all bots into respecting the header.
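
A sketch of that end-to-end flow, assuming a Prometheus instant-query endpoint and a hypothetical recording rule for the metric above (the URL, metric name, scaling factors and clamp are all assumptions, not existing configuration):

```python
import requests  # any HTTP client would do

PROMETHEUS = "https://prometheus.example.org/api/v1/query"  # hypothetical URL
METRIC = "wdqs_load_metric"  # hypothetical recording rule: utilisation + lag/60 - 1

def prom_scalar(query):
    """Run an instant query and return the single value it yields."""
    resp = requests.get(PROMETHEUS, params={"query": query}, timeout=5)
    resp.raise_for_status()
    return float(resp.json()["data"]["result"][0]["value"][1])

def retry_after_seconds(k_value=60.0, k_sum=1.0):
    value = prom_scalar(METRIC)                            # most recent value
    summed = prom_scalar(f"sum_over_time({METRIC}[10m])")  # longer-term term
    raw = k_value * value + k_sum * summed                 # scale and add
    return min(600, max(0, int(raw)))                      # clamp to 0-600 s

# In the request handler (pseudo-wiring):
# delay = retry_after_seconds()
# if delay > 0:
#     response.headers["Retry-After"] = str(delay)
```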

Load is average queue size, if you take the currently running batch as being part of the queue. WDQS currently does not monitor the queue size. I gather (after an hour or so of research, I'm new to all this) that with some effort, KafkaPoller could obtain an estimate of the queue size by subtracting the current partition offsets from KafkaConsumer.endOffsets().

This metric is available in Grafana through kafka_burrow_partition_lag; the problem is that for some reason we stopped polling updates from Kafka and are now consuming the recent changes API. The reasons we disabled it are now fixed, so I believe we could enable it again.

In the ideal case the updater runs at full speed most of the time, as the effect of maxlag propagates fast enough that the system in place works for what it was designed for: making sure users don't query and see data that is too far out of date, and don't starve for too long once the threshold is green again.
One problem that the current maxlag strategy does not address properly is when a single server is lagged; then situations like this start to happen:


With the median across all pooled servers being used, the effect of maxlag no longer propagates fast enough: highly lagged servers see the effect of the edit-rate slowdown that happened 10 minutes ago, while others see their queues being emptied when they could have handled more. All of this being pretty much random (spikes on different servers happen at different times), it exacerbates the oscillation even more. Was taking the max or the sum instead of the median evaluated?

As said in a previous comment, there will always be a bottleneck somewhere. I feel that having a single fixed limit makes it a bit difficult to handle the variance in the edit rate and could encourage us to always tune it to a lower value to resolve such lag issues, without knowing when the system can handle more.
A solution around Retry-After and a PID controller seems a bit more flexible to me; the main drawback is that it relies on well-behaved clients (which is currently the case).

As for addressing the issue with the updater itself, we believe we have room for optimization by redesigning the way we perform updates. The current situation is clearly not ideal, but the updater can keep up with the update rate when bots are slowed down, which I hope gives us enough time to finish the work we started on this rewrite.

With the median across all pooled servers being used, the effect of maxlag no longer propagates fast enough: highly lagged servers see the effect of the edit-rate slowdown that happened 10 minutes ago, while others see their queues being emptied when they could have handled more. All of this being pretty much random (spikes on different servers happen at different times), it exacerbates the oscillation even more. Was taking the max or the sum instead of the median evaluated?

Yup, currently blocked on T238751

I hope you don't mind if I contradict my previous comment a bit, since my thinking is still evolving on this.

No worries at all. I'm also changing my mind quickly here.

One problem with using lag as the metric is that it doesn't go negative, so the integral will not be pulled down while the service is idle. We could subtract a target lag, say 1 minute, but that loses some of the supposed benefit of including an integral term. A better metric would be updater load, i.e. demand/capacity. When the load is more than 100%, the lag increases at a rate of 1 second per second, but there's no further information in there as to how heavily overloaded it is. When the load is less than 100%, lag decreases until it reaches zero. While it's decreasing, the slope tells you something about how underloaded it is, but once it hits zero, you lose that information.

Load is average queue size, if you take the currently running batch as being part of the queue. WDQS currently does not monitor the queue size. I gather (after an hour or so of research, I'm new to all this) that with some effort, KafkaPoller could obtain an estimate of the queue size by subtracting the current partition offsets from KafkaConsumer.endOffsets().

Failing that, we can make a rough approximation from available data. We can get the average utilisation of the importer from the rdf-repository-import-time-cnt metric. You can see in Grafana that the derivative of this metric hovers between 0 and 1 when WDQS is not lagged, and remains near 1 when WDQS is lagged. The metric I would propose is to add replication lag to this utilisation metric, appropriately scaled: utilisation + K_lag * lag - 1 where K_lag is say 1/60s. This is a metric which is -1 at idle, 0 when busy with no lag, and 1 with 1 minute of lag. The control system would adjust the request rate to keep this metric (and its integral) at zero.

With PID, we need to define three constants: K_p, K_i and K_d. If we had problems finding the pool size, this is going to be three times more complicated (I didn't find a standard way to determine these coefficients; maybe I'm missing something obvious).

One way to simplify it is with K_d=0, i.e. make it a PI controller. Having the derivative in there probably doesn't add much. Then it's only two times more complicated. Although I added K_lag so I suppose we are still at 3. The idea is that it shouldn't matter too much exactly what K_p and K_i are set to -- the system should be stable and have low lag with a wide range of parameter values. So you just pick some values and see if it works.

We currently don't have infrastructure to hold the "maxlag" data over time so that we can calculate its derivative and integral. Should we use Redis? What would that look like? These are questions I don't have answers for. Do you have ideas for that?

WDQS lag is currently obtained by having an ApiMaxLagInfo hook handler which queries Prometheus, caching the result. Prometheus has a query language which can perform derivatives ("rate") and integrals ("sum_over_time") on metrics. So it would be the same system as now, just with a different Prometheus query.

I might be a little YAGNI here, but I would love to have the maxlag numbers kept over time so that we build the PI controller using the maxlag value and not the lag of WDQS, mostly because WDQS will hopefully be fixed and handled later, but there will always be some sort of edit-rate bottleneck (job queue, replication, you name it). If you think we should work on WDQS for now, though, I'm okay with that. My thinking was to start with a P controller based on maxlag, and to build the infrastructure to keep the data over time (maybe Prometheus? query statsd? We already store all maxlag there, see here, but it seems broken at the moment) and add it there. I think oscillating around 3s is much better than oscillating around 5s, because over 5s the system doesn't accept the edit and the user has to re-send it.

The wording in RFC 7231 suggests to me that it is acceptable to use Retry-After in a 2xx response. "Servers send the "Retry-After" header field to indicate how long the user agent ought to wait before making a follow-up request." That seems pretty close to what we're doing.

Ack. I think we should communicate this with the tool developers (and the pywikibot folks) so they start taking the header into account all the time.

Dvorapa added a subscriber: Xqt. · Edited · May 30 2020, 8:10 PM

Note: as @Xqt pointed out, Retry-After currently just serves "5" every time the maxlag parameter is exceeded, so Pywikibot doesn't use it (it looks hard-coded to 5) and instead calculates its own retry delay based on the current maxlag and the number of attempts so far.

But anyway, it would be great to make Retry-After work (and not just switch between null and 5) and to adapt tools to use it, as discussed many times before.
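
On the client side, honouring a working Retry-After could look roughly like this (a sketch using the requests library; this is not pywikibot's actual retry logic, and the endpoint and retry policy are illustrative):

```python
import time
import requests

API = "https://www.wikidata.org/w/api.php"

def api_post_with_backoff(data, max_attempts=5):
    """POST an action API request and honour the Retry-After hint, if any."""
    for attempt in range(max_attempts):
        resp = requests.post(API, data={**data, "maxlag": 5, "format": "json"},
                             timeout=30)
        body = resp.json()
        if body.get("error", {}).get("code") != "maxlag":
            return body  # success, or some other error the caller handles
        # The server is lagged: sleep for its hint, else a growing fallback.
        retry_after = resp.headers.get("Retry-After")
        time.sleep(int(retry_after) if retry_after else 2 ** attempt)
    raise RuntimeError("gave up after repeated maxlag responses")
```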