
Design a continuous throttling policy for Wikidata bots
Open, Needs Triage, Public

Assigned To
None
Authored By
Pintoch
Dec 11 2019, 11:19 AM
Referenced Files
F31469548: proposedpolicy.png
Dec 11 2019, 11:19 AM
F31469424: maxlag.png
Dec 11 2019, 11:19 AM
F31469546: currentpolicy.png
Dec 11 2019, 11:19 AM
Tokens
"Manufacturing Defect?" token, awarded by Addshore.

Description

With the introduction of the WDQS lag in Wikidata's maxlag computation (T221774), we are now seeing the behaviour I feared: bots start and stop abruptly as the lag rises and falls.

maxlag.png (492×939 px, 29 KB)

https://grafana.wikimedia.org/dashboard/snapshot/mbbjQjo7FMnDAath4tuRyP7F9300Wj2S?orgId=1

This morning, bots that used maxlag=5 for their edits (as advised) could only edit about half of the time. This start-and-stop behaviour is not desirable: bots should slow down gradually as the lag increases, instead of running at full speed until the lag reaches 5.

We should agree on a better throttling policy and implement it in most bot editing frameworks (QuickStatements, Pywikibot, OpenRefine, …) to improve everyone's experience with the service.

The current policy looks like this, assuming a default rate of 1 edit/sec:

currentpolicy.png (395×534 px, 13 KB)

We could instead try something like this, with a gradual slowdown as soon as the lag goes above 2.5 s (half of the threshold at which editing should stop entirely):

proposedpolicy.png (480×640 px, 17 KB)

Would this be a sensible throttling policy to encourage? I believe it would avoid the start/stop behaviour shown above once it is adopted by most bots, which should not be too hard: by patching the most popular editing backends, we would cover most of the edit volume.
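For concreteness, a minimal sketch of what such a continuous policy could look like, assuming the default rate of 1 edit/sec and the 2.5 s / 5 s thresholds above (the piecewise-linear shape is an assumption here, standing in for the plotted curve):

```
def target_edit_rate(lag, full_rate=1.0, soft=2.5, hard=5.0):
    """Edits per second as a function of the reported lag (in seconds).

    Full speed below `soft`, linear slowdown between `soft` and `hard`,
    and no edits at all once the lag reaches `hard` (a maxlag=5 request
    would be rejected at that point anyway).
    """
    if lag <= soft:
        return full_rate
    if lag >= hard:
        return 0.0
    return full_rate * (hard - lag) / (hard - soft)
```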

Event Timeline

If clients are able to retrieve the current lag periodically (through some MediaWiki API call? which one?), then this should not require any server-side change. Clients can continue to use maxlag=5 but also throttle themselves using the smoothed function proposed.

As reported in IRC, maxlag can be checked with, for example, https://www.wikidata.org/w/api.php?action=query&format=json&maxlag=-1
Clients could also consider dynamically changing their maxlag value, rather than always having it set to 5.
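For illustration, that check could look like this in Python with requests (the exact shape of the error object is an assumption to double-check against the actual API response):

```
import requests

API = "https://www.wikidata.org/w/api.php"

def current_lag():
    """Ask the API for the current lag by sending an impossible maxlag=-1.

    The request is always rejected, and the returned error object reports
    the current lag (field name assumed from memory).
    """
    r = requests.get(API, params={"action": "query", "format": "json", "maxlag": -1})
    error = r.json().get("error", {})
    return float(error.get("lag", 0))
```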

Thanks! I think dynamically changing the maxlag value is likely to still introduce some thresholds, whereas a continuous slowdown (retrieving the lag and computing one's edit rate based on it) should in theory reach an equilibrium point.

In the meantime, Wikidata is really unusable with mass-editing tools. It is hard to convince people to respect maxlag=5 when that prevents them from editing half of the time, so I think it would be worth raising the WDQS factor again. We have identified which tools need to comply better, and having a small factor was useful for that, but we probably do not want to stay in this state for weeks (Widar is likely to take a long time to get fixed). We might not want to punish the polite ones too hard!

Just saw this - I'm wondering how you would implement it technically. You could generate a random number between 2.5 and 5, and deny the edit if maxlag is greater than your random number?
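Read literally, that suggestion amounts to probabilistic load shedding; a tiny sketch, reusing the 2.5 and 5 thresholds from the description:

```
import random

def allow_edit(lag, soft=2.5, hard=5.0):
    """Allow the edit unless the lag exceeds a randomly drawn threshold.

    The probability of being denied grows linearly as the lag moves
    from `soft` towards `hard`.
    """
    return lag <= random.uniform(soft, hard)
```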

It is actually possible to retrieve the current maxlag value from the API without making any edit (see @Addshore's comment above).
So, just retrieve the current maxlag value, compute your desired edit rate for that lag with the function plotted above, then sleep for the appropriate amount of time between any two edits to achieve this rate. Refresh the maxlag value from the server periodically.
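Put together, the client-side loop could look roughly like this (a sketch only; get_lag could be the API helper shown earlier and rate_for_lag the continuous policy from the description):

```
import time

def run_throttled(edits, get_lag, rate_for_lag, refresh_every=60):
    """Perform `edits` (an iterable of callables), pacing them from the reported lag.

    The lag value is only refreshed every `refresh_every` seconds, so that
    the status requests themselves do not add load.
    """
    lag, checked = get_lag(), time.monotonic()
    for edit in edits:
        if time.monotonic() - checked > refresh_every:
            lag, checked = get_lag(), time.monotonic()
        while rate_for_lag(lag) <= 0:            # above the hard threshold: back off
            time.sleep(refresh_every)
            lag, checked = get_lag(), time.monotonic()
        edit()
        time.sleep(1.0 / rate_for_lag(lag))
```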

It's possible that we could add some sort of suggested wait between actions to the output of maxlag, if that could make things easier.
It would avoid individuals trying to figure out how long to wait.

That's kind of what maxlag is: the time you should wait before knowing that whatever you have written has been replicated everywhere on the SQL servers.
We of course now have dispatching and the query service updates piled in there, which have slightly different dynamics.

Very broad idea, feel free to discard: I think using industry-wide standard throttling schemes like token bucket, leaky bucket, fixed-window counters or sliding-window counters might help here.
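For reference, a minimal single-threaded token bucket (the other schemes mentioned are variations on the same bookkeeping); how its refill rate would be tied to the reported lag is left open here:

```
import time

class TokenBucket:
    """Classic token bucket: refill at `rate` tokens per second, store at most `capacity`."""

    def __init__(self, rate, capacity):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.updated = capacity, time.monotonic()

    def acquire(self, tokens=1):
        """Block until `tokens` tokens are available, then spend them."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= tokens:
                self.tokens -= tokens
                return
            time.sleep((tokens - self.tokens) / self.rate)
```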

> Very broad idea, feel free to discard: I think using industry-wide standard throttling schemes like token bucket, leaky bucket, fixed-window counters or sliding-window counters might help here.

One of the primary questions we need to answer is whether we want to keep doing this client-side self-throttling, or switch to something more server-side.

> It's possible that we could add some sort of suggested wait between actions to the output of maxlag, if that could make things easier.
> It would avoid individuals trying to figure out how long to wait.

We already have that with the Retry-After value sent back in the HTTP headers, but the value is always 5 s. See also T210606.

In addition: should read access also be throttled?
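On the Retry-After point above, reading that header with requests might look like this (header name as mentioned in the comment; a missing header is treated as the current constant of 5 s):

```
import requests

API = "https://www.wikidata.org/w/api.php"

# A request rejected because of maxlag comes back with a Retry-After header
# (currently always 5 seconds, as noted above).
r = requests.get(API, params={"action": "query", "format": "json", "maxlag": -1})
wait = float(r.headers.get("Retry-After", 5))
```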

I have an idea. I think we should use PoolCounter (which is basically SaaS: Semaphore as a Service) to put a cap on the number of edits happening on Wikidata at the same time. It is already used when an article gets reparsed, so that not too many MediaWiki nodes parse the same article at the same time (the Michael Jackson effect).

Basically, once a request realizes it is going to make an edit on Wikidata, it decrements the "edit cap on Wikidata" semaphore (say it is initialized to 10, meaning only ten edits can happen on Wikidata at the same time). Once the semaphore reaches zero, PoolCounter keeps the 11th MediaWiki node trying to acquire the lock waiting, and responds once one of the ten current ones finishes; if more than, say, twenty are already waiting, it just responds with "too many edits happening". This means edit saving time might be artificially slow when more than ten edits are happening at the same time. Note that this already works fine for parsing articles (see the blog post), and I used it a while back on ORES to prevent more than four IPs from hitting ORES at the same time, to avoid intentional and unintentional DoSes; it works fine there as well.

PoolCounter is a pretty reliable service with almost zero downtime and already has good support inside MediaWiki.
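As a conceptual stand-in only (PoolCounter is a separate network service with its own protocol; this just illustrates the semaphore behaviour described above inside a single process, with made-up cap values):

```
import threading

EDIT_CAP = 10    # at most ten edits actually saving at the same time
QUEUE_CAP = 20   # at most twenty more allowed to wait; beyond that, reject

saving = threading.BoundedSemaphore(EDIT_CAP)
admitted = threading.BoundedSemaphore(EDIT_CAP + QUEUE_CAP)

def save_edit(do_edit):
    if not admitted.acquire(blocking=False):
        raise RuntimeError("Too many edits happening")  # shed load instead of queueing forever
    try:
        with saving:   # the 11th caller blocks here until a slot frees up
            return do_edit()
    finally:
        admitted.release()
```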

What do you think?

> What do you think?

Definitely worth considering.
Could be worth an RFC to get wider involvement?
This is essentially edit rate limiting for an entire site.

I'm not sure how ops would feel about artificially inflating save timing on Wikidata for the app servers, though.

T247459: Write RFC about site-wide edit rate limiting

> Very broad idea, feel free to discard: I think using industry-wide standard throttling schemes like token bucket, leaky bucket, fixed-window counters or sliding-window counters might help here.

> One of the primary questions we need to answer is whether we want to keep doing this client-side self-throttling, or switch to something more server-side.

I would have thought that it'd be obvious that this can't be done client side. They can cheat. They don't know what each other are doing. They don't know what other factors are affecting the servers.

As @Ladsgroup hints, this is a basic distributed systems engineering problem with known answers. In addition to rate limiting at ingress, it may be helpful to add backpressure signals between the various internal servers as well as add jitter to the Retry-After signals sent to clients.
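As a sketch of the jitter point only (the 5 s base matches the current constant Retry-After; the spread value is an arbitrary assumption):

```
import random

def retry_after_with_jitter(base=5.0, spread=0.5):
    """Suggested Retry-After value with jitter, so that throttled clients
    do not all come back at exactly the same moment."""
    return base * random.uniform(1.0 - spread, 1.0 + spread)
```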