
Add Wikidata query service lag to Wikidata maxlag
Closed, Resolved · Public · 8 Story Points

Description

As a high-volume editor on Wikidata, I want to ensure that my edits do not impair query service responsiveness.

Problem:
maxlag is a parameter that API users can specify to avoid overloading the wiki: if I send an API request with maxlag=5, and the database replicas are currently more than five seconds behind the master, then MediaWiki will immediately refuse the request. Afterwards, I’m supposed to wait for a bit before retrying the request. See https://www.mediawiki.org/wiki/Manual:Maxlag_parameter.
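For illustration, a minimal client-side sketch of this behaviour (assuming Python with the `requests` library; the error code and the Retry-After header follow the manual page linked above):

```
import time
import requests

API = "https://www.wikidata.org/w/api.php"

def api_get(params, maxlag=5, max_attempts=5):
    """Send an API request with maxlag set; wait and retry if the servers are lagged."""
    params = dict(params, format="json", maxlag=maxlag)
    for attempt in range(max_attempts):
        response = requests.get(API, params=params)
        data = response.json()
        if data.get("error", {}).get("code") != "maxlag":
            return data
        # The server suggests how long to wait in the Retry-After header (default: 5 s).
        time.sleep(int(response.headers.get("Retry-After", 5)))
    raise RuntimeError("Servers still lagged after several retries")

# result = api_get({"action": "query", "meta": "siteinfo"})
```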

Last year, we modified the API’s behavior so that this takes into account not just the replication lag, but also the dispatch lag (T194950: Include Wikibase dispatch lag in API "maxlag" enforcing) – if the database replicas are fine, but change dispatching to client wikis is more than 5 minutes behind, then requests with maxlag=5 will still be rejected. (The dispatchLagToMaxLagFactor is configurable, 60 in production, so the threshold for dispatch lag is in minutes instead of seconds.)

However, this does not take the query service lag into account – if updates on some or all of the WDQS servers start to lag behind, edits will continue at full speed as long as database replication and client dispatching are not affected. This can happen because query service lag depends not just on edit rate but also on the size of the entities edited (on each edit, the full entity is reloaded, even if only a small part of it was edited, so editing large items has a disproportionate impact) and the rate of external queries against the server.

BDD
GIVEN all the WDQS servers are lagged by more than one hour
WHEN I send a wbeditentity API request
AND I set the maxlag parameter to 5 seconds
THEN I should get a maxlag error
AND no edit should be made

(the GIVEN part describes a rather extreme case; once the open questions below are clarified, it can perhaps be changed to a more realistic case)

Acceptance criteria:

  • the effective max lag takes query service update lag into account

Open questions:

  • What should the conversion factor be? (For dispatch lag, it’s 60 – five seconds of replication lag are equivalent to five minutes of dispatch lag.)
  • Lag between different servers can differ significantly. Do we use the mean lag? The median? The maximum? Something else? (For dispatch lag, we seem to use the median.)
  • maxlag affects all API requests, even ones that shouldn’t have any effect on query service lag, such as action=wbgetentity or action=query&meta=userinfo. Should we try to limit the impact of this change, e.g. by only using query service lag on POST requests? (On the other hand, the same question should apply to dispatch lag and we don’t seem to limit the impact of that as far as I can tell.)

Break-down

Details

Related Gerrit Patches:
[operations/mediawiki-config@master] wgWikidataOrgQueryServiceMaxLagFactor 60
[mediawiki/extensions/Wikidata.org@master] qsmaxlag, send altered lag info, not only raw lag
[mediawiki/extensions/Wikidata.org@wmf/1.35.0-wmf.5] qsmaxlag, send altered lag info, not only raw lag
[operations/mediawiki-config@master] wgWikidataOrgQueryServiceMaxLagFactor 180
[operations/puppet@production] mediawiki/wikidata maint cron for updateQueryServiceLag
[mediawiki/extensions/Wikidata.org@wmf/1.35.0-wmf.5] Add ApiMaxLagInfo hook
[mediawiki/extensions/Wikidata.org@master] onApiMaxLagInfo ttl to 1
[mediawiki/extensions/Wikidata.org@master] Maintenance script to update cached wdqs lag
[mediawiki/extensions/Wikidata.org@wmf/1.35.0-wmf.5] Maintenance script to update cached wdqs lag
[mediawiki/extensions/Wikidata.org@wmf/1.35.0-wmf.5] Switch from load_composer_autoloader AutoloadNamespaces
[mediawiki/extensions/Wikidata.org@wmf/1.35.0-wmf.5] Add WDQS lag into request maxlag logic.
[mediawiki/extensions/Wikidata.org@master] Add ApiMaxLagInfo hook
[mediawiki/extensions/Wikidata.org@master] Switch from load_composer_autoloader AutoloadNamespaces
[mediawiki/extensions/Wikidata.org@master] Add WDQS lag into request maxlag logic.
[mediawiki/extensions/Wikidata.org@master] Add ADR for the solution design on including WDQS lag into maxlag
[mediawiki/extensions/Wikibase@master] Add Wikidata query service lag to Wikidata maxlag

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change 532968 had a related patch set uploaded (by Alaa Sarhan; owner: Alaa Sarhan):
[mediawiki/extensions/Wikidata.org@master] Add ADR for the solution design on including WDQS lag into maxlag

https://gerrit.wikimedia.org/r/532968

Change 532968 merged by jenkins-bot:
[mediawiki/extensions/Wikidata.org@master] Add ADR for the solution design on including WDQS lag into maxlag

https://gerrit.wikimedia.org/r/532968

alaa_wmde removed alaa_wmde as the assignee of this task. · Sep 6 2019, 10:01 AM
Restricted Application added a project: User-Addshore. · Mon, Nov 18, 12:41 PM

Change 551520 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikidata.org@master] Maintenance script to update cached wdqs lag

https://gerrit.wikimedia.org/r/551520

Change 532495 merged by jenkins-bot:
[mediawiki/extensions/Wikidata.org@master] Add WDQS lag into request maxlag logic.

https://gerrit.wikimedia.org/r/532495

Change 551574 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikidata.org@master] Add ApiMaxLagInfo hook

https://gerrit.wikimedia.org/r/551574

Change 551582 had a related patch set uploaded (by Addshore; owner: Addshore):
[operations/puppet@production] DNM: mediawiki/wikidata maint cron for updateQueryServiceLag

https://gerrit.wikimedia.org/r/551582

Change 551640 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikidata.org@master] Switch from load_composer_autoloader AutoloadNamespaces

https://gerrit.wikimedia.org/r/551640

Change 551640 merged by jenkins-bot:
[mediawiki/extensions/Wikidata.org@master] Switch from load_composer_autoloader AutoloadNamespaces

https://gerrit.wikimedia.org/r/551640

Change 551520 merged by jenkins-bot:
[mediawiki/extensions/Wikidata.org@master] Maintenance script to update cached wdqs lag

https://gerrit.wikimedia.org/r/551520

@alaa_wmde @Matthias_Geisler_WMDE FYI: I asked this to be pulled into camp again and prioritized because we have had several big issues with significant community upset that would likely be avoided with this. So we need to get this done faster now.

Change 551574 merged by jenkins-bot:
[mediawiki/extensions/Wikidata.org@master] Add ApiMaxLagInfo hook

https://gerrit.wikimedia.org/r/551574

Change 551855 had a related patch set uploaded (by Addshore; owner: Alaa Sarhan):
[mediawiki/extensions/Wikidata.org@wmf/1.35.0-wmf.5] Add WDQS lag into request maxlag logic.

https://gerrit.wikimedia.org/r/551855

Change 551856 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikidata.org@wmf/1.35.0-wmf.5] Switch from load_composer_autoloader AutoloadNamespaces

https://gerrit.wikimedia.org/r/551856

Change 551857 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikidata.org@wmf/1.35.0-wmf.5] Maintenance script to update cached wdqs lag

https://gerrit.wikimedia.org/r/551857

Change 551858 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikidata.org@wmf/1.35.0-wmf.5] Add ApiMaxLagInfo hook

https://gerrit.wikimedia.org/r/551858

Just prepared the patches needed to backport this (as otherwise it won't go out with the train and land for 2-3 weeks).

Change 551855 merged by jenkins-bot:
[mediawiki/extensions/Wikidata.org@wmf/1.35.0-wmf.5] Add WDQS lag into request maxlag logic.

https://gerrit.wikimedia.org/r/551855

Change 551856 merged by jenkins-bot:
[mediawiki/extensions/Wikidata.org@wmf/1.35.0-wmf.5] Switch from load_composer_autoloader AutoloadNamespaces

https://gerrit.wikimedia.org/r/551856

Mentioned in SAL (#wikimedia-operations) [2019-11-19T17:11:30Z] <addshore@deploy1001> Synchronized php-1.35.0-wmf.5/extensions/Wikidata.org: T221774 - Wikidata.org extension (queryservice maxlag) [[gerrit:551855]] [[gerrit:551856]] (duration: 00m 54s)

Change 551869 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikidata.org@master] onApiMaxLagInfo ttl to 1

https://gerrit.wikimedia.org/r/551869

Change 551857 merged by jenkins-bot:
[mediawiki/extensions/Wikidata.org@wmf/1.35.0-wmf.5] Maintenance script to update cached wdqs lag

https://gerrit.wikimedia.org/r/551857

Mentioned in SAL (#wikimedia-operations) [2019-11-19T17:43:52Z] <addshore@deploy1001> Synchronized php-1.35.0-wmf.5/extensions/Wikidata.org: T221774 - Wikidata.org extension (queryservice maxlag, maint script) [[gerrit:551857]] (duration: 00m 52s)

Change 551869 merged by jenkins-bot:
[mediawiki/extensions/Wikidata.org@master] onApiMaxLagInfo ttl to 1

https://gerrit.wikimedia.org/r/551869

Change 551858 merged by jenkins-bot:
[mediawiki/extensions/Wikidata.org@wmf/1.35.0-wmf.5] Add ApiMaxLagInfo hook

https://gerrit.wikimedia.org/r/551858

Mentioned in SAL (#wikimedia-operations) [2019-11-19T18:20:28Z] <addshore@deploy1001> Synchronized php-1.35.0-wmf.5/extensions/Wikidata.org: T221774 - Wikidata.org extension (queryservice maxlag, hook) [[gerrit:551858]] (duration: 00m 53s)

Addshore added a comment. (Edited) · Tue, Nov 19, 6:20 PM

Just the cron job to go now
https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/551582/
The default settings mean that 1 hour of query lag = 1 second of maxlag.
I'll look at bringing this setting down once everything (including cron) is actually deployed.
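In other words, the raw WDQS lag is divided by the configured factor before it is compared against the client's maxlag value. A rough illustration (hypothetical helper name; the numbers mirror the statement above and the config values mentioned in this task):

```
def scaled_wdqs_lag(wdqs_lag_seconds, factor):
    """Scale raw query service lag into replication-lag-equivalent seconds."""
    return wdqs_lag_seconds / factor

# Default settings: 1 hour of query service lag counts as 1 second of maxlag.
assert scaled_wdqs_lag(3600, 3600) == 1.0
# With wgWikidataOrgQueryServiceMaxLagFactor = 60: 1 minute of lag counts as 1 second.
assert scaled_wdqs_lag(60, 60) == 1.0
```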

Change 551582 merged by Giuseppe Lavagetto:
[operations/puppet@production] mediawiki/wikidata maint cron for updateQueryServiceLag

https://gerrit.wikimedia.org/r/551582

Change 552069 had a related patch set uploaded (by Addshore; owner: Addshore):
[operations/mediawiki-config@master] wgWikidataOrgQueryServiceMaxLagFactor 180

https://gerrit.wikimedia.org/r/552069

Change 552069 merged by jenkins-bot:
[operations/mediawiki-config@master] wgWikidataOrgQueryServiceMaxLagFactor 180

https://gerrit.wikimedia.org/r/552069

Change 552071 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikidata.org@master] qsmaxlag, send altered lag info, not only raw lag

https://gerrit.wikimedia.org/r/552071

Change 552072 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikidata.org@wmf/1.35.0-wmf.5] qsmaxlag, send altered lag info, not only raw lag

https://gerrit.wikimedia.org/r/552072

Change 552072 merged by Addshore:
[mediawiki/extensions/Wikidata.org@wmf/1.35.0-wmf.5] qsmaxlag, send altered lag info, not only raw lag

https://gerrit.wikimedia.org/r/552072

Mentioned in SAL (#wikimedia-operations) [2019-11-20T14:50:53Z] <addshore@deploy1001> Synchronized php-1.35.0-wmf.5/extensions/Wikidata.org: T221774 - Wikidata.org extension (use altered lag, not raw lag) [[gerrit:552072]] (duration: 00m 53s)

Mentioned in SAL (#wikimedia-operations) [2019-11-20T14:52:39Z] <addshore@deploy1001> Synchronized wmf-config/InitialiseSettings.php: T221774 - wgWikidataOrgQueryServiceMaxLagFactor 180 [[gerrit:552069]] (duration: 00m 52s)

Now deployed and working

https://www.wikidata.org/w/api.php?action=query&titles=MediaWiki&format=json&maxlag=-1

Current maxlag for Wikidata is roughly 10 due to the query service.
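(A quick sketch of how a client could read that number, assuming the lag is reported in the error info string in the form seen later in this thread:)

```
import re
import requests

response = requests.get(
    "https://www.wikidata.org/w/api.php",
    params={"action": "query", "titles": "MediaWiki", "format": "json", "maxlag": -1},
)
# maxlag=-1 always fails, and the error info reports the current lag,
# e.g. "Waiting for all: 10.2 seconds lagged."
info = response.json()["error"]["info"]
match = re.search(r"([\d.]+) seconds lagged", info)
if match:
    print("Current effective maxlag:", float(match.group(1)))
```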

Change 552071 merged by jenkins-bot:
[mediawiki/extensions/Wikidata.org@master] qsmaxlag, send altered lag info, not only raw lag

https://gerrit.wikimedia.org/r/552071

Mentioned in SAL (#wikimedia-operations) [2019-11-20T15:36:08Z] <addshore@deploy1001> Synchronized wmf-config/InitialiseSettings.php: RESYNC T221774 - wgWikidataOrgQueryServiceMaxLagFactor 180 [[gerrit:552069]] (duration: 00m 52s)

The first sync didn't seem to do the trick, but the second one did.

The discovery team suggested that perhaps median might not be the best option.
We should investigate using https://config-master.wikimedia.org/pybal/eqiad/wdqs and https://config-master.wikimedia.org/pybal/codfw/wdqs to load the lag state of ONLY pooled servers, and then we could actually use max.

Will probably do this in a separate ticket.
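A rough sketch of that idea (hedged: the pybal file format assumed here is one Python-dict-style entry per line with 'host' and 'enabled' keys, and `lag_by_host` is a hypothetical mapping from host name to its update lag):

```
import ast
import requests

POOL_URLS = [
    "https://config-master.wikimedia.org/pybal/eqiad/wdqs",
    "https://config-master.wikimedia.org/pybal/codfw/wdqs",
]

def pooled_wdqs_hosts():
    """Return the WDQS hosts that pybal currently considers pooled."""
    hosts = []
    for url in POOL_URLS:
        for line in requests.get(url).text.splitlines():
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            entry = ast.literal_eval(line)  # assumed: {'host': 'wdqs1004.eqiad.wmnet', 'weight': 10, 'enabled': True}
            if entry.get("enabled"):
                hosts.append(entry["host"])
    return hosts

def max_pooled_lag(lag_by_host):
    """Use the maximum lag over pooled servers only, instead of the median over all."""
    pooled = [h for h in pooled_wdqs_hosts() if h in lag_by_host]
    return max(lag_by_host[h] for h in pooled) if pooled else 0.0
```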

Mentioned in SAL (#wikimedia-operations) [2019-11-20T18:42:56Z] <addshore@deploy1001> Synchronized wmf-config/InitialiseSettings.php: T221774 - wgWikidataOrgQueryServiceMaxLagFactor 170 (duration: 00m 53s)

Mentioned in SAL (#wikimedia-operations) [2019-11-20T18:51:14Z] <addshore@deploy1001> Synchronized wmf-config/InitialiseSettings.php: T221774 - wgWikidataOrgQueryServiceMaxLagFactor 120 (duration: 00m 54s)

Mentioned in SAL (#wikimedia-operations) [2019-11-20T18:56:43Z] <addshore@deploy1001> Synchronized wmf-config/InitialiseSettings.php: RESYNC T221774 - wgWikidataOrgQueryServiceMaxLagFactor 120 (duration: 00m 50s)

Bugreporter added subscribers: Pintoch, Tpt, Magnus, Bugreporter. (Edited) · Wed, Nov 20, 7:04 PM

Currently multiple tools are broken because the time it takes for maxlag to return to normal is much longer than the total time the tools keep retrying edits (see this and this).

> Currently multiple tools are broken because the time it takes for maxlag to return to normal is much longer than the total time the tools keep retrying edits (see this and this).

Isn't that a sign this change works as intended? Tools should be updated to handle maxlag in a graceful manner. The retry time should not be fixed, but should increase each time the error is hit: 5, 10, 20, 40, 80 seconds, etc. This also makes sure that not all tools restart at the same time.
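A sketch of such a retry loop (hypothetical helper; `do_edit` stands for whatever performs the API edit and returns the parsed response):

```
import random
import time

def retry_with_backoff(do_edit, base_wait=5, max_attempts=8):
    """Retry an edit, doubling the wait (5, 10, 20, 40, ... seconds) on each maxlag rejection."""
    wait = base_wait
    for attempt in range(max_attempts):
        response = do_edit()
        if response.get("error", {}).get("code") != "maxlag":
            return response
        # Add jitter so that all tools don't resume at exactly the same moment.
        time.sleep(wait + random.uniform(0, wait / 2))
        wait *= 2
    raise RuntimeError("maxlag still too high after all retries")
```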

Thanks for the notification! I would be happy to release a new version of OpenRefine with a patch applied - I can do this in the coming days. The exponential back-off suggested by @Multichill makes sense intuitively - could WMDE confirm that this is the policy they recommend? Happy to adapt the policy as required.

Notable that WDQS lag doesn't seem to be taking a blind bit of notice of this change :(

WDQS 1005-7 are all at ~3 hours.

https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&panelId=8&fullscreen

> Notable that WDQS lag doesn't seem to be taking a blind bit of notice of this change :(
> WDQS 1005-7 are all at ~3 hours.
> https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&panelId=8&fullscreen

Due to the apparent differences in lag pattern between some of the nodes, we need to try to get T238751 done so that we can use the most lagged server rather than the median (which is what we currently use).

Envlh added a subscriber: Envlh. · Thu, Nov 21, 12:38 PM

So this is what we get with an exponential back-off (1.5 factor), at the moment:

22:37:27.148 [..baseapi.WbEditingAction] [maxlag] Waiting for all: 5.4916666666667 seconds lagged. -- pausing for 1000 milliseconds. (19338ms)
22:37:28.729 [..baseapi.WbEditingAction] [maxlag] Waiting for all: 5.4916666666667 seconds lagged. -- pausing for 1500 milliseconds. (1581ms)
22:37:33.809 [..baseapi.WbEditingAction] [maxlag] Waiting for all: 5.4916666666667 seconds lagged. -- pausing for 2250 milliseconds. (5080ms)
22:37:37.931 [..baseapi.WbEditingAction] [maxlag] Waiting for all: 5.4916666666667 seconds lagged. -- pausing for 3375 milliseconds. (4122ms)
22:37:42.663 [..baseapi.WbEditingAction] [maxlag] Waiting for all: 5.4916666666667 seconds lagged. -- pausing for 5062 milliseconds. (4732ms)
22:37:49.437 [..baseapi.WbEditingAction] [maxlag] Waiting for all: 5.4916666666667 seconds lagged. -- pausing for 7593 milliseconds. (6774ms)
22:37:58.429 [..baseapi.WbEditingAction] [maxlag] Waiting for all: 5.4916666666667 seconds lagged. -- pausing for 11389 milliseconds. (8992ms)
22:38:18.217 [..baseapi.WbEditingAction] [maxlag] Waiting for all: 6 seconds lagged. -- pausing for 17083 milliseconds. (19788ms)
22:38:36.461 [..baseapi.WbEditingAction] [maxlag] Waiting for all: 6 seconds lagged. -- pausing for 25624 milliseconds. (18244ms)
22:39:05.013 [..baseapi.WbEditingAction] [maxlag] Waiting for all: 6.4666666666667 seconds lagged. -- pausing for 38436 milliseconds. (28552ms)

So it looks like this means no OpenRefine edits at all with these new rules, in the current situation.

Bugreporter added a comment. (Edited) · Thu, Nov 21, 10:06 PM

Actually, maxlag may stay above 5 seconds for up to one hour. Tools should not break (i.e. "give up") even if no edits can be made.

@Addshore what factor are we using right now? Most bots seem to have stopped so it might be on the low side.

Change 552474 had a related patch set uploaded (by Addshore; owner: Addshore):
[operations/mediawiki-config@master] wgWikidataOrgQueryServiceMaxLagFactor 60

https://gerrit.wikimedia.org/r/552474

> @Addshore what factor are we using right now? Most bots seem to have stopped so it might be on the low side.

I have slowly been lowering the factor, just about to set it at 60, so that 1 minute of lag = 1 second of max lag.
Once we get down this low, it should be fairly hard, if not impossible, to get into as bad a state as we have been in over the past days and week.
I would like to think the maxlag spikes will be much smaller, but we will see; I'm always open to changing the factor if it doesn't work so well.

As can be seen with the maxlag over the past 12 hours, it seems to be working.

This system will be much better when T238751 is also done.

Change 552474 merged by jenkins-bot:
[operations/mediawiki-config@master] wgWikidataOrgQueryServiceMaxLagFactor 60

https://gerrit.wikimedia.org/r/552474

Mentioned in SAL (#wikimedia-operations) [2019-11-22T12:32:27Z] <addshore@deploy1001> Synchronized wmf-config/InitialiseSettings.php: T221774 - wgWikidataOrgQueryServiceMaxLagFactor 60 (duration: 00m 53s)

Mentioned in SAL (#wikimedia-operations) [2019-11-22T12:34:11Z] <addshore@deploy1001> Synchronized wmf-config/InitialiseSettings.php: T221774 - wgWikidataOrgQueryServiceMaxLagFactor 60 RESYNC (duration: 00m 51s)

Addshore closed this task as Resolved. · Fri, Nov 22, 5:08 PM

Verified as working.
Work continues on T238751

One problem with the current policy (requesting all automated editing processes to use maxlag=5) is that this creates a binary threshold: either the query service lag is under the threshold, in which case bots will edit at full speed, or the query service lag is above the threshold, in which case they should all stop editing entirely. This is likely to create an oscillating behaviour, where all bots start and stop periodically. This is probably not ideal for either the infrastructure or the users.

Not all automated edits are long-running tasks where stopping for an hour is acceptable: many QuickStatements or OpenRefine batches are under 10 edits for instance. As a user of these tools, I would expect them to slow down when lag is high, but stopping entirely should be extremely rare, as it breaks workflows.

I think it would be preferable to get a continuous throttling behaviour: the higher the lag, the slower the edits.
This can already be achieved by clients by using a small maxlag parameter and increasing it gradually as they retry. For instance, start with maxlag=2, which fails, retry in 1 sec with maxlag=3, which fails, retry in 4 seconds with maxlag=4, which fails, retry in 8 seconds with maxlag=5, which fails, retry in 16 seconds with maxlag=6, which succeeds.

Would such a behaviour be acceptable?
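A client-side sketch of that idea (hypothetical; the starting value and waits loosely follow the example above, not any official recommendation):

```
import time

def edit_with_adaptive_maxlag(send_edit, start=2, ceiling=6):
    """Start with a strict maxlag and relax it gradually, waiting longer before each retry.

    `send_edit(maxlag)` performs the API edit and returns the parsed response.
    """
    wait = 1
    for maxlag in range(start, ceiling + 1):
        response = send_edit(maxlag)
        if response.get("error", {}).get("code") != "maxlag":
            return response      # edit went through (or failed for an unrelated reason)
        time.sleep(wait)         # too lagged for this maxlag value; ease off
        wait *= 2                # 1 s, 2 s, 4 s, 8 s, ...
    raise RuntimeError("Lag too high even at the most permissive maxlag value")
```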

Actually, some tools seem to be doing something like that already, since edits are still going through despite maxlag being above 5 for more than an hour now (Author Disambiguator does this, QuickStatements probably too, Edoderoobot too). So these tools use higher (more aggressive) maxlag values than 5.

I will therefore follow their example and publish a new version of OpenRefine which does that. It might be worth updating the official recommendations about this: if the rule is never to set maxlag higher than 5, then it looks like not many are actually following it…

> I think it would be preferable to get a continuous throttling behaviour: the higher the lag, the slower the edits.

Sounds sensible to me; 5 should perhaps still be the point at which to stop.
But if editing slows down when the "maxlag" is 0-5 then maybe we will stop reaching 5 in the first place :)

> Actually, some tools seem to be doing something like that already, since edits are still going through despite maxlag being above 5 for more than an hour now (Author Disambiguator does this, QuickStatements probably too, Edoderoobot too). So these tools use higher (more aggressive) maxlag values than 5.

I can go ahead and check the values probably.
It is also possible that some bots and tools do not send max lag values.

The theory, or original logic, of maxlag is that it reflects how lagged the SQL DBs are.
So, if you see a maxlag of 5, you can wait 5 seconds and in theory that lag might be gone by then.
We are now abusing that somewhat by including dispatching and query service lag, which don't quite fit that model, which is why factors are applied to them.
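Put differently, a simplified sketch of the combined calculation (the real logic lives in MediaWiki core plus the ApiMaxLagInfo hook in the Wikidata.org extension; the factor values are the production settings mentioned in this task):

```
def effective_max_lag(replication_lag, dispatch_lag, wdqs_lag,
                      dispatch_factor=60, wdqs_factor=60):
    """Report whichever scaled lag is worst.

    Replication lag is used as-is; dispatch and query service lag are divided
    by their factors so that minutes of lag compare against seconds.
    """
    return max(replication_lag,
               dispatch_lag / dispatch_factor,
               wdqs_lag / wdqs_factor)

# 3 s replication lag, 4 min dispatch lag, 6 min WDQS lag:
# requests with maxlag=5 are rejected because 360 / 60 = 6 > 5.
assert effective_max_lag(3, 240, 360) == 6.0
```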

OK! If you have ways to check what sort of maxlag values are being used, that would be great!

I will try to implement a responsible throttling strategy in OpenRefine, hoping that others will do the same.

> I will try to implement a responsible throttling strategy in OpenRefine, hoping that others will do the same.

Once this is done, if by any chance you want to share and explain how you did it, for example in a blog post, so we can share with other tool builders, it would be absolutely wonderful <3