As a high-volume editor on Wikidata, I want to ensure that my edits do not impair query service responsiveness.
Problem:
maxlag is a parameter that API users can specify to avoid overloading the wiki: if I send an API request with maxlag=5, and the database replicas are currently more than five seconds behind the master, then MediaWiki will immediately refuse the request. Afterwards, I’m supposed to wait for a bit before retrying the request. See https://www.mediawiki.org/wiki/Manual:Maxlag_parameter.
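The recommended client behaviour (send maxlag, back off on a maxlag error, retry) can be sketched as follows. This is a minimal illustration, not a real client library; `call_with_maxlag` and `fake_api` are hypothetical names, and a production client would also honour the `Retry-After` header the server sends.

```python
import time

def call_with_maxlag(api_call, maxlag=5, retries=3, backoff=5.0):
    """Call a MediaWiki API function, retrying when the server reports
    a maxlag error. `api_call` is any callable that takes a params dict
    and returns the parsed JSON response (hypothetical helper)."""
    params = {"maxlag": maxlag, "format": "json"}
    response = {}
    for attempt in range(retries):
        response = api_call(params)
        error = response.get("error", {})
        if error.get("code") != "maxlag":
            return response  # success, or an unrelated error
        # Server is lagged: wait a bit before retrying, as the manual says.
        time.sleep(backoff)
    return response  # still lagged after all retries

# Simulated server: lagged on the first call, fine afterwards.
calls = []
def fake_api(params):
    calls.append(params)
    if len(calls) == 1:
        return {"error": {"code": "maxlag", "lag": 6.2}}
    return {"success": 1}

result = call_with_maxlag(fake_api, maxlag=5, backoff=0.01)
```

Here the first request is refused with a maxlag error, the helper sleeps, and the retry succeeds.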
Last year, we modified the API’s behavior so that maxlag enforcement takes into account not just the replication lag, but also the dispatch lag (T194950: Include Wikibase dispatch lag in API "maxlag" enforcing) – if the database replicas are fine, but change dispatching to client wikis is more than 5 minutes behind, then requests with maxlag=5 will still be rejected. (The dispatchLagToMaxLagFactor is configurable, 60 in production, so the threshold for dispatch lag is in minutes instead of seconds.)
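In effect, the lag that gets compared against the client’s maxlag value is the worse of the raw replication lag and the dispatch lag scaled down by the factor. A sketch of that combination (illustrative only, not the actual MediaWiki code):

```python
def effective_max_lag(replication_lag, dispatch_lag,
                      dispatch_lag_to_max_lag_factor=60):
    """Illustrative sketch: the effective lag is the worse of the
    replication lag (seconds) and the dispatch lag (seconds) divided
    by dispatchLagToMaxLagFactor (60 in production)."""
    return max(replication_lag,
               dispatch_lag / dispatch_lag_to_max_lag_factor)

# 2 s of replication lag, but dispatching is 6 minutes (360 s) behind:
# 360 / 60 = 6, so a request with maxlag=5 would be rejected.
lag = effective_max_lag(replication_lag=2, dispatch_lag=360)
```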
However, this does not take the query service lag into account – if updates on some or all of the WDQS servers start to lag behind, edits continue at full speed as long as database replication and client dispatching are not affected. This can happen because query service lag depends not just on the edit rate, but also on the size of the entities edited (on each edit, the full entity is reloaded, even if only a small part of it was changed, so editing large items has a disproportionate impact) and on the rate of external queries against the server.
BDD
GIVEN all the WDQS servers are lagged by more than one hour
WHEN I send a wbeditentity API request
AND I set the maxlag parameter to 5 seconds
THEN I should get a maxlag error
AND no edit should be made
(the GIVEN part describes a rather extreme case; once the open questions below are answered, it can perhaps be changed to a more realistic case)
Acceptance criteria:
- the effective max lag takes query service update lag into account
Open questions:
- What should the conversion factor be? (For dispatch lag, it’s 60 – five seconds of replication lag are equivalent to five minutes of dispatch lag.)
- Lag between different servers can differ significantly. Do we use the mean lag? The median? The maximum? Something else? (For dispatch lag, we seem to use the median.)
- maxlag affects all API requests, even ones that shouldn’t have any effect on query service lag, such as action=wbgetentities or action=query&meta=userinfo. Should we try to limit the impact of this change, e.g. by only applying query service lag to POST requests? (On the other hand, the same question should apply to dispatch lag, and we don’t seem to limit the impact of that as far as I can tell.)
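One possible answer to the first two questions, purely as a sketch: fold the median query service lag across servers into the effective lag, scaled by a configurable factor, analogous to dispatchLagToMaxLagFactor. Both the choice of the median and the factor of 60 are exactly the open questions above, not decisions; the function name is hypothetical.

```python
from statistics import median

def effective_max_lag(replication_lag, dispatch_lag, query_service_lags,
                      dispatch_factor=60, query_service_factor=60):
    """Hypothetical extension of the current behaviour: also consider
    the query service update lag (median across servers, scaled by a
    configurable factor, both of which are open questions)."""
    qs_lag = median(query_service_lags)
    return max(replication_lag,
               dispatch_lag / dispatch_factor,
               qs_lag / query_service_factor)

# BDD scenario above: all WDQS servers lagged by more than one hour,
# database replication and dispatching fine.
lag = effective_max_lag(replication_lag=0, dispatch_lag=0,
                        query_service_lags=[3600, 3700, 4000])
```

With a factor of 60, a median query service lag of over an hour yields an effective lag of over 60, so a request with maxlag=5 would be rejected, as the scenario requires.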
Break-down
- Add ADR
- Implement a caching, Prometheus-based solution in the Wikidata.org extension
- Add a cron job to update the cached lag info
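A cron job along those lines might periodically query Prometheus for the per-server update lag and write the result to a cache for the API code to read. The following sketch uses the real Prometheus instant-query HTTP API (`/api/v1/query`), but the endpoint URL and the metric name are assumptions for illustration; the actual WDQS lag metric may be named differently.

```python
import json
from urllib.request import urlopen
from urllib.parse import urlencode

PROMETHEUS = "http://prometheus.example/api/v1/query"  # hypothetical endpoint
QUERY = "wdqs_updater_lag_seconds"  # hypothetical metric name

def fetch_lags(query=QUERY, url=PROMETHEUS):
    """Fetch per-server lag values via a Prometheus instant query."""
    with urlopen(url + "?" + urlencode({"query": query})) as resp:
        return parse_lags(json.load(resp))

def parse_lags(response):
    """Extract {instance: lag_seconds} from a Prometheus vector result."""
    return {
        sample["metric"].get("instance", "?"): float(sample["value"][1])
        for sample in response["data"]["result"]
    }

# Example response in the shape /api/v1/query returns for a vector:
sample_response = {
    "status": "success",
    "data": {"resultType": "vector", "result": [
        {"metric": {"instance": "wdqs1003"}, "value": [1555555555, "42.5"]},
        {"metric": {"instance": "wdqs1004"}, "value": [1555555555, "7.0"]},
    ]},
}
lags = parse_lags(sample_response)
```

The cron job would then store the parsed values (e.g. in memcached) so that API requests never hit Prometheus directly.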