
Design a continuous throttling policy for Wikidata bots
Open, Needs Triage, Public

Assigned To
None
Authored By
Pintoch
Dec 11 2019, 11:19 AM
Referenced Files
F31469548: proposedpolicy.png
Dec 11 2019, 11:19 AM
F31469424: maxlag.png
Dec 11 2019, 11:19 AM
F31469546: currentpolicy.png
Dec 11 2019, 11:19 AM
Tokens
"Manufacturing Defect?" token, awarded by Addshore.

Description

With the introduction of the WDQS lag in Wikidata's maxlag computation (T221774), we are now seeing the behaviour I feared: bots start and stop abruptly as the lag rises and falls.

maxlag.png (492×939 px, 29 KB)

https://grafana.wikimedia.org/dashboard/snapshot/mbbjQjo7FMnDAath4tuRyP7F9300Wj2S?orgId=1

This morning, bots that used maxlag=5 for their edits (as advised) could only edit about half of the time. This start-and-stop behaviour is not desirable: bots should slow down gradually as the lag increases, instead of running at full speed until the lag reaches 5.

We should agree on a better throttling policy and implement it in most bot editing frameworks (QuickStatements, Pywikibot, OpenRefine, …) to improve everyone's experience with the service.

The current policy looks like this, assuming a default rate of 1 edit/sec:

currentpolicy.png (395×534 px, 13 KB)

We could instead try something like this, with a gradual slowdown as soon as the lag goes above 2.5 s (half of the threshold at which editing should stop entirely):

proposedpolicy.png (480×640 px, 17 KB)

Would this be a sensible throttling policy to encourage? I believe it would avoid the start/stop behaviour shown above once it is adopted by most bots, which should not be too hard: by patching the most popular editing backends, we would cover most of the edit volume.
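For concreteness, a minimal sketch of what such a continuous policy could look like, assuming the default rate of 1 edit/sec and the 2.5 s / 5 s thresholds above (the piecewise-linear shape is an assumption here, standing in for the plotted curve):

```
def target_edit_rate(lag, full_rate=1.0, soft=2.5, hard=5.0):
    """Edits per second as a function of the reported lag (in seconds).

    Full speed below `soft`, linear slowdown between `soft` and `hard`,
    and no edits at all once the lag reaches `hard` (a maxlag=5 request
    would be rejected at that point anyway).
    """
    if lag <= soft:
        return full_rate
    if lag >= hard:
        return 0.0
    return full_rate * (hard - lag) / (hard - soft)
```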

Event Timeline

If clients are able to retrieve the current lag periodically (through some MediaWiki API call? which one?), then this should not require any server-side change. Clients can continue to use maxlag=5 but also throttle themselves using the smoothed function proposed.

As reported in IRC, maxlag can be checked with, for example, https://www.wikidata.org/w/api.php?action=query&format=json&maxlag=-1
Clients could also consider dynamically changing their maxlag value, rather than always having it set to 5.
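For illustration, that check could look like this in Python with requests (the exact shape of the error object is an assumption to double-check against the actual API response):

```
import requests

API = "https://www.wikidata.org/w/api.php"

def current_lag():
    """Ask the API for the current lag by sending an impossible maxlag=-1.

    The request is always rejected, and the returned error object reports
    the current lag (field name assumed from memory).
    """
    r = requests.get(API, params={"action": "query", "format": "json", "maxlag": -1})
    error = r.json().get("error", {})
    return float(error.get("lag", 0))
```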

Thanks! I think dynamically changing the maxlag value is likely to still introduce some thresholds, whereas a continuous slowdown (retrieving the lag and computing one's edit rate based on it) should in theory reach an equilibrium point.

In the meantime, Wikidata is really unusable with mass-editing tools. It is hard to convince people to respect maxlag=5 when that prevents them from editing half of the time, so I think it would be worth raising the WDQS factor again. We have identified which tools need to comply better, and having a small factor was useful for that, but we probably do not want to stay in this state for weeks (Widar is likely to take a long time to get fixed). We might not want to punish the polite ones too hard!

Just saw this - I'm wondering how you would implement it technically. You could generate a random number between 2.5 and 5, and deny the edit if maxlag is greater than your random number?
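Read literally, that suggestion amounts to probabilistic load shedding; a tiny sketch, reusing the 2.5 and 5 thresholds from the description:

```
import random

def allow_edit(lag, soft=2.5, hard=5.0):
    """Allow the edit unless the lag exceeds a randomly drawn threshold.

    The probability of being denied grows linearly as the lag moves
    from `soft` towards `hard`.
    """
    return lag <= random.uniform(soft, hard)
```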

It is actually possible to retrieve the current maxlag value from the API without making any edit (see @Addshore's comment above).
So, just retrieve the current maxlag value, compute your desired edit rate for that lag with the function plotted above, then sleep for the appropriate amount of time between any two edits to achieve this rate. Refresh the maxlag value from the server periodically.
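Put together, the client-side loop could look roughly like this (a sketch only; get_lag could be the API helper shown earlier and rate_for_lag the continuous policy from the description):

```
import time

def run_throttled(edits, get_lag, rate_for_lag, refresh_every=60):
    """Perform `edits` (an iterable of callables), pacing them from the reported lag.

    The lag value is only refreshed every `refresh_every` seconds, so that
    the status requests themselves do not add load.
    """
    lag, checked = get_lag(), time.monotonic()
    for edit in edits:
        if time.monotonic() - checked > refresh_every:
            lag, checked = get_lag(), time.monotonic()
        while rate_for_lag(lag) <= 0:            # above the hard threshold: back off
            time.sleep(refresh_every)
            lag, checked = get_lag(), time.monotonic()
        edit()
        time.sleep(1.0 / rate_for_lag(lag))
```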

It's possible that we could add some sort of suggested wait between actions to the output of maxlag, if that could make things easier.
It would avoid individuals trying to figure out how long to wait.

That's kind of what maxlag is: the time you should wait before knowing that whatever you have written has been replicated everywhere on the SQL servers.
We of course now have dispatching and the query service updates piled in there, which have slightly different dynamics.

Very broad idea, feel free to discard: I think using industry-wide standard throttling schemes like token bucket, leaky bucket, fixed-window counters or sliding-window counters might help here.
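For reference, a minimal single-threaded token bucket (the other schemes mentioned are variations on the same bookkeeping); how its refill rate would be tied to the reported lag is left open here:

```
import time

class TokenBucket:
    """Classic token bucket: refill at `rate` tokens per second, store at most `capacity`."""

    def __init__(self, rate, capacity):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.updated = capacity, time.monotonic()

    def acquire(self, tokens=1):
        """Block until `tokens` tokens are available, then spend them."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= tokens:
                self.tokens -= tokens
                return
            time.sleep((tokens - self.tokens) / self.rate)
```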

> Very broad idea, feel free to discard: I think using industry-wide standard throttling schemes like token bucket, leaky bucket, fixed-window counters or sliding-window counters might help here.

One of the primary questions we need to answer is whether we want to keep doing this client-side self-throttling, or switch to something more server-side.

> It's possible that we could add some sort of suggested wait between actions to the output of maxlag, if that could make things easier.
> It would avoid individuals trying to figure out how long to wait.

We already have that with the Retry-After value sent back in the HTTP headers, but the value is always 5 s. See also T210606.

In addition: should read access also be throttled?
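On the Retry-After point above, reading that header with requests might look like this (header name as mentioned in the comment; a missing header is treated as the current constant of 5 s):

```
import requests

API = "https://www.wikidata.org/w/api.php"

# A request rejected because of maxlag comes back with a Retry-After header
# (currently always 5 seconds, as noted above).
r = requests.get(API, params={"action": "query", "format": "json", "maxlag": -1})
wait = float(r.headers.get("Retry-After", 5))
```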

I have an idea. I think we should use PoolCounter (which is basically SaaS: Semaphore as a Service) to put a cap on the number of edits happening on Wikidata at the same time. It is already used when an article gets reparsed, so that not too many MediaWiki nodes parse the same article at the same time (the Michael Jackson effect).

Basically, once a request realizes it is going to make an edit on Wikidata, it decrements the "edit cap on Wikidata" semaphore (say it is initialized to 10, meaning only ten edits can happen on Wikidata at the same time). Once the semaphore reaches zero, PoolCounter keeps the 11th MediaWiki node trying to acquire the lock waiting, and responds once one of the ten current ones finishes; if more than, say, twenty are already waiting, it just responds with "too many edits happening". This means edit saving time might be artificially slow when more than ten edits are happening at the same time. Note that this already works fine for parsing articles (see the blog post), and I used it a while back on ORES to prevent more than four IPs from hitting ORES at the same time, to avoid intentional and unintentional DoSes; it works fine there as well.

PoolCounter is a pretty reliable service with almost zero downtime and already has good support inside MediaWiki.
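As a conceptual stand-in only (PoolCounter is a separate network service with its own protocol; this just illustrates the semaphore behaviour described above inside a single process, with made-up cap values):

```
import threading

EDIT_CAP = 10    # at most ten edits actually saving at the same time
QUEUE_CAP = 20   # at most twenty more allowed to wait; beyond that, reject

saving = threading.BoundedSemaphore(EDIT_CAP)
admitted = threading.BoundedSemaphore(EDIT_CAP + QUEUE_CAP)

def save_edit(do_edit):
    if not admitted.acquire(blocking=False):
        raise RuntimeError("Too many edits happening")  # shed load instead of queueing forever
    try:
        with saving:   # the 11th caller blocks here until a slot frees up
            return do_edit()
    finally:
        admitted.release()
```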

What do you think?

> What do you think?

Definitely worth considering.
Could be worth an RFC to get wider involvement?
This is essentially edit rate limiting for an entire site.

I'm not sure how ops would feel about artificially inflating save timing on Wikidata for the app servers, though.

T247459: Write RFC about site-wide edit rate limiting

> Very broad idea, feel free to discard: I think using industry-wide standard throttling schemes like token bucket, leaky bucket, fixed-window counters or sliding-window counters might help here.

> One of the primary questions we need to answer is whether we want to keep doing this client-side self-throttling, or switch to something more server-side.

I would have thought that it'd be obvious that this can't be done client side. They can cheat. They don't know what each other are doing. They don't know what other factors are affecting the servers.

As @Ladsgroup hints, this is a basic distributed systems engineering problem with known answers. In addition to rate limiting at ingress, it may be helpful to add backpressure signals between the various internal servers as well as add jitter to the Retry-After signals sent to clients.
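As a sketch of the jitter point only (the 5 s base matches the current constant Retry-After; the spread value is an arbitrary assumption):

```
import random

def retry_after_with_jitter(base=5.0, spread=0.5):
    """Suggested Retry-After value with jitter, so that throttled clients
    do not all come back at exactly the same moment."""
    return base * random.uniform(1.0 - spread, 1.0 + spread)
```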