- Affected components: Mediawiki Core, Wikibase.
- Engineer(s) or team for initial implementation: WMDE (Wikidata team)
- Code steward: TBD.
Wikidata is a unique installation of MediaWiki. The edit rate on this wiki has gone as high as 1,000 edits per minute and has been testing our infrastructure's scalability since the day it went live. Most edits are made by bots, and bots have the noratelimit right, meaning no rate limit can be applied to them.
Forcing a rate limit on bots in Wikidata was tried and caused several issues, so it had to be rolled back: see T184948: limit page creation and edit rate on Wikidata and T192690: Mass message broken on Wikidata after ratelimit workaround. One main reason is that bot operators want to edit at full speed when the infrastructure is quiet; forcing an arbitrary number like 100 edits per minute would not solve the issue and would limit bots at times when the infrastructure can actually take more. It also broke MassMessage.
With the current flow of edits, the WDQS updater can't keep up and has sometimes lagged for days, so Wikidata now takes the median lag of the WDQS updater (divided by 60) into account for maxlag (see T221774: Add Wikidata query service lag to Wikidata maxlag). As a matter of policy, bots stop if maxlag is more than 5 (e.g. the maximum replication lag from the master database to a replica is more than five seconds, or the size of the job queue divided by $jobQueueLagFactor is bigger than five). This means that once the median WDQS lag reaches five minutes, most bots stop until the WDQS updater catches up; then maxlag drops below five, the bots start editing again, WDQS starts to lag behind, and so on. It has been oscillating like this for months now:
(This is an example of the last six hours)
Changing the factor, for example multiplying it by five (to 300), only changes the period of the oscillation: T244722: increase factor for query service that is taken into account for maxlag
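The lag-to-maxlag mapping described above can be sketched as follows. The function and parameter names here are illustrative, not MediaWiki's actual implementation; the factors (60, or 300 after the change in T244722) and the "stop above 5" bot policy are taken from the text.

```python
def wikidata_maxlag(db_replication_lag_s, job_queue_size, wdqs_median_lag_s,
                    job_queue_lag_factor=1.0, wdqs_factor=60.0):
    """Effective maxlag: the worst of the contributing signals.

    The WDQS median lag is divided by wdqs_factor (60 originally,
    300 after T244722), so five minutes of query-service lag maps
    to a maxlag of 5 under the original factor.
    """
    return max(
        db_replication_lag_s,
        job_queue_size / job_queue_lag_factor,
        wdqs_median_lag_s / wdqs_factor,
    )


def bots_should_stop(maxlag, limit=5):
    # Bot policy: hold off while maxlag exceeds the limit.
    return maxlag > limit
```

Note that changing wdqs_factor only rescales when the threshold is crossed; it does not remove the hard cutoff that drives the oscillation.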
It's important to note that the maxlag approach has been causing disruptions for pywikibot and other clients that respect maxlag, even for read queries. You can see more in T243701: Wikidata maxlag repeatedly over 5s since Jan 20, 2020 (primarily caused by the query service). Even pywikibot's CI has issues because maxlag is high all the time: T242081: Pywikibot fails to access Wikidata due to high maxlag lately
The underlying problem, of course, is that the WDQS updater cannot handle the sheer flow of edits; it is currently a scalability bottleneck. This is being addressed in T244590: [Epic] Rework the WDQS updater as an event driven application, but we need to keep in mind that there will always be a bottleneck somewhere. We can't just dismiss the problem as "WDQS needs to be fixed". Communicating the stress on our infrastructure properly to our users, so they know when to slow down or stop, is the important part here, and the maxlag approach has proven to fail at this scale.
- There has to be a way to cap the edit rate site-wide without imposing a cap on individual bots or accounts.
- The cap can have multiple buckets; for example, bots in total should not make too many edits, so that admins can run large batches without being stuck in the same boat as bots.
- Page creation in Wikidata is several times more expensive than an ordinary edit, so page creations should have a separate, smaller cap.
- Starvation must not happen: an enthusiastic bot must not be able to eat all the quota all the time, preventing other bots from editing.
- No more oscillating behavior.
Proposal One: Semaphores
This type of problem is well studied in computer science, and semaphores are usually the standard solution in these cases. We would have a dedicated semaphore for bots editing Wikidata, initialized with a value of N. While an edit by a bot is being saved, it decrements the semaphore; once the value reaches zero, further requests have to hold off until an edit finishes and wakes up one of the waiting connections, which then starts saving its edit. If the queue is too long (say, longer than N), we can simply stop and return a "maxlag reached" error to bots. First come, first served would avoid starvation.
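The scheme above can be sketched with a counting semaphore and a bounded wait queue. This is an illustrative in-process model, not the MediaWiki implementation (which would use PoolCounter across many servers); the class and method names are invented for the example, and Python's semaphore wakeups are only approximately first come, first served.

```python
import threading


class EditGate:
    """N concurrent saves; up to max_queue further requests wait their
    turn; beyond that the caller is told to back off, analogous to
    returning a maxlag error to the bot."""

    def __init__(self, n, max_queue):
        self._sem = threading.BoundedSemaphore(n)
        self._waiting = 0
        self._max_queue = max_queue
        self._lock = threading.Lock()

    def try_acquire(self, timeout=30):
        if self._sem.acquire(blocking=False):
            return True               # a save slot was free, no waiting needed
        with self._lock:
            if self._waiting >= self._max_queue:
                return False          # queue too long -> signal "maxlag" to bot
            self._waiting += 1
        try:
            return self._sem.acquire(timeout=timeout)
        finally:
            with self._lock:
                self._waiting -= 1

    def release(self):
        self._sem.release()           # wakes one waiting request, if any
```

A caller would wrap the edit-save path in try_acquire/release; a False return maps to the error response that tells the bot to retry later.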
In order to implement this, we can use PoolCounter (which is basically SaaS, Semaphore as a Service), which has been working reliably for the past couple of years. PoolCounter is mostly used to ensure that not too many MediaWiki nodes reparse the same article at the same time (the "Michael Jackson effect"). PoolCounter is also already used to cap the total number of concurrent connections per IP to the ORES services; see T160692: Use poolcounter to limit number of connections to ores uwsgi.
- Using PoolCounter reduces the work needed to implement this as it's already well supported by MediaWiki.
- This would artificially increase the edit saving time when too many edits are happening at the same time.
- If done incorrectly, processes waiting for the semaphore might hold database (or other) locks for too long, or a deadlock might arise between a database lock held by one process and the semaphore that process is waiting on, held by another. Databases have good systems in place to avoid or surface deadlocks, but we don't have a system to handle deadlocks spanning the several locking systems a process might use (database, Redis lock manager, PoolCounter, etc.).
- If an edit decrements several semaphores (e.g. a page creation is also an edit), there is a chance of deadlock, since random network latency can leave different processes waiting for each other.
Proposal Two: Continuous throttling
This has been reflected in T240442: Design a continuous throttling policy for Wikidata bots. The problem with the current system is that maxlag is a hard limit: we can't tell bots to slow down as they approach the limit, so they continue at full speed until everything has to stop.
- There's no easy way to enforce this on our users.
- There's always a chance of starvation caused by bots that don't respect the policy.
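One possible shape for such a policy is to map the current lag to a suggested pause before the next edit, so throttling ramps up smoothly instead of cutting off at a threshold. The thresholds and the linear ramp below are illustrative assumptions, not the policy from T240442.

```python
def edit_delay(lag, soft=2.0, hard=5.0, max_delay=60.0):
    """Suggested pause (seconds) before a bot's next edit.

    Below `soft` there is no throttling; between `soft` and `hard`
    the delay grows linearly; at `hard` and above bots wait the full
    `max_delay`. All thresholds here are assumed values.
    """
    if lag <= soft:
        return 0.0
    if lag >= hard:
        return max_delay
    return max_delay * (lag - soft) / (hard - soft)
```

Because the delay grows before the hard limit is reached, the edit rate eases off gradually instead of oscillating between full speed and full stop.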
It's worth mentioning that proposal one and two are not mutually exclusive.
Proposal Three: Use a PID controller (T252091#6154167)
Possibly only the PI part of the controller (by setting k_d to zero). The user-facing part is that we calculate a Retry-After header and send it back to users with every response (including successful ones); bots have to respect that value and avoid making a subsequent request sooner (this requires a change in the bot policy).
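A PI controller for this purpose might look like the following sketch. The gains, setpoint, and clamping are illustrative assumptions, not values from T252091; the output would be sent to clients as the Retry-After header.

```python
class RetryAfterController:
    """PI controller (k_d = 0) that turns observed lag into a
    Retry-After value in seconds, clamped to [0, max_retry_after]."""

    def __init__(self, setpoint=5.0, kp=2.0, ki=0.1, max_retry_after=60.0):
        self.setpoint = setpoint          # lag we are willing to tolerate
        self.kp = kp                      # proportional gain (assumed value)
        self.ki = ki                      # integral gain (assumed value)
        self.max_retry_after = max_retry_after
        self.integral = 0.0

    def update(self, lag, dt=1.0):
        error = lag - self.setpoint       # positive when we are lagging
        self.integral += error * dt
        self.integral = max(self.integral, 0.0)   # simple anti-windup
        out = self.kp * error + self.ki * self.integral
        return min(max(out, 0.0), self.max_retry_after)
```

The integral term lets the suggested wait keep growing while lag stays above the setpoint, so persistent overload pushes bots to slow down further rather than oscillating around a hard threshold.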
: A good and free book for people who are not very familiar with semaphores and their applications: The Little Book of Semaphores