
Wikidata maxlag repeatedly over 5s since Jan 20, 2020 (primarily caused by the query service)
Closed, ResolvedPublic

Description

Since Jan 20, 2020 the wikidata-queryservice lag is repeatedly climbing over 5s, as shown in https://grafana.wikimedia.org/d/000000156/wikidata-dispatch?orgId=1&refresh=1m&fullscreen&panelId=22&from=now-7d&to=now

This delays bot runs, making their duration unpredictable, which in turn makes interactive runs very hard, if not impossible.

Event Timeline


Over the past weeks, we noticed a huge increase of content in Wikidata. Maybe that's something worth looking at?

Over the past weeks, we noticed a huge increase of content in Wikidata. Maybe that's something worth looking at?

Wikidata content is growing at a fast and steady pace and has been for a few years now. For the last few months it's been expanding at a rate of around 3,500,000 new pages per month. So that seems unlikely to be connected.

While I understand the need for "slowing bot edits" in case of lag, I can see a problem with making the MW API and WDQS equivalent in SLA at the moment.

I wonder if it would make sense to ignore query service lag on GET requests? Those requests shouldn’t put any kind of load on the query service, after all.

That resonates with me. As a bot operator I totally understand and expect that my edits might be throttled. However, it is super painful for read-only workflows, like the one I describe in T244030: all I want is to read one wiki page, which can result in a 30 min wait time; as it's 'interactive', it makes for a very poor user experience.

Over the past weeks, we noticed a huge increase of content in Wikidata. Maybe that's something worth looking at?

Wikidata content is growing at a fast and steady pace and has been for a few years now. For the last few months it's been expanding at a rate of around 3,500,000 new pages per month. So that seems unlikely to be connected.

That rate is a lot higher than it was for the first 7 months of 2019, at close to or less than 1 million/month, so it could be related. But given the existing size of Wikidata, I'd call it a moderate increase, not a "huge increase", unless it's much bigger in some other metric than just number of items?

On the question of GET:

In T243701#5834792, @Lucas_Werkmeister_WMDE wrote:
I wonder if it would make sense to ignore query service lag on GET requests? Those requests shouldn’t put any kind of load on the query service, after all.

Is the idea here to split the "lag" parameter into separate ones for GETs and edits? That makes a lot of sense to me...

Legoktm raised the priority of this task from Medium to High. Feb 9 2020, 7:55 AM
Legoktm subscribed.

maxlag is intended to tell fully-automated bots to back off to help servers recover in times of excess lag, and the recommended setting is maxlag=5. If the server is constantly at maxlag of >=5, then it defeats the point, because bot owners will (rightly) ignore maxlag. Either the lag in WDQS needs to be fixed, or we need to introduce some scaling factor in Wikibase so that lag is usually under 5s like we have for the job queue.

Either the lag in WDQS needs to be fixed, or we need to introduce some scaling factor in Wikibase so that lag is usually under 5s like we have for the job queue.

There is a scaling factor; the actual threshold is 5 minutes, I believe (wgWikidataOrgQueryServiceMaxLagFactor is 60 in mediawiki-config.git).
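
For illustration, this is roughly how such a factor works (a minimal Python sketch; the real logic lives in Wikibase and mediawiki-config, and the function name here is made up):

```python
# Hypothetical sketch of how a scaling factor turns raw query service lag
# into the value compared against maxlag; not the actual Wikibase code.

def effective_query_service_lag(wdqs_lag_seconds, factor=60):
    """Scale raw WDQS lag so it is comparable to database replication lag."""
    # With factor=60, a WDQS lag of 300s (5 minutes) is reported as 5s,
    # i.e. exactly at the usual maxlag=5 threshold.
    return wdqs_lag_seconds / factor

print(effective_query_service_lag(300))  # 5.0  -> right at maxlag=5
print(effective_query_service_lag(900))  # 15.0 -> well over the threshold
```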

Yes, I think wgWikidataOrgQueryServiceMaxLagFactor should be way higher. Something like 120 or 300.

Hello all,
Here are some news: we are going to try to increase the maxlag threshold connected to the WDQS to 15 min, to see how it goes and whether most of the problems you encounter still occur. This change should be applied later this week.
Also, as it seems that the issue with accessing data comes from Pywikibot, we suggested that the developers remove this limit.

We just increased the factor to 180. If you are running a bot and still encountering issues frequently, please let us know.

I think increasing the factor will not make things better; it only increases the oscillation period. It even makes the query service worse (more lagged).

See https://grafana.wikimedia.org/d/000000170/wikidata-edits?orgId=1&from=1581429584959&to=1581542357438

If the rate of edits the Query Updater can handle is constant, changing the factor will not affect the average edit rate, nor the proportion of time the lag stays under a specific maxlag (assuming the edit rates above and below maxlag are constants independent of the actual maxlag value).
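
To spell the argument out (the symbols below are mine, not from the comment): in steady state the average incoming edit rate must equal the rate the updater can process, and that alone fixes the fraction of time spent below the threshold, regardless of where the threshold sits.

```latex
% r     = update rate the Query Updater can handle
% r_hi  = bot edit rate while the lag is below the threshold
% r_lo  = bot edit rate while the lag is above the threshold
% f     = fraction of time the lag stays below the threshold
\[
  f\,r_{\mathrm{hi}} + (1 - f)\,r_{\mathrm{lo}} = r
  \quad\Longrightarrow\quad
  f = \frac{r - r_{\mathrm{lo}}}{r_{\mathrm{hi}} - r_{\mathrm{lo}}}
\]
% f is independent of the threshold itself; a larger factor only stretches
% each fill-and-drain cycle without changing the share of time spent over it.
```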

@Bugreporter

I think increasing the factor will not make things better; it only increases the oscillation period

Yes that does seem to have happened - instead of a roughly 20 minute cycle, we now have about a 1-hour cycle.

Thanks all for your feedback. Since the change we performed didn't have the expected results, we're going to revert it today and keep looking for sustainable solutions.

The only way to resolve the issue is to increase the rate of edits the Query Updater can handle, or to reduce the number of triple changes per edit.

The only way to resolve the issue is to increase the rate of edits the Query Updater can handle, or to reduce the number of triple changes per edit.

Yes and no. Yes, WDQS update is our current bottleneck in the infrastructure for handling the sheer amount of edits on Wikidata (which is a good thing), but if we fix this bottleneck and make scaling possible, we will sooner or later hit another bottleneck, and maxlag is not a good way to help. We need to fix how we handle maxlag.

I have been thinking about this and I think I have a suggestion: bots and tools should respect maxlag before it reaches the threshold. We need to return the current lag value in the API as well, but tool developers should check the lag, even if it's only 2 or 3 seconds, then wait for that amount before making the next edit (or just sleep for that long right after the edit is made). This way they slow down before they reach the threshold, instead of having to back off completely until things return to normal and then starting the flood of edits again, and so on and so forth.

What do you think?

Currently Widar uses a policy of sleeping 3x+1 seconds if the lag (x seconds) is higher than one second. PetScan runs batches five in parallel, so for a lag of 10 seconds PetScan makes 5 edits every 31 seconds.
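
A minimal Python sketch of that kind of proportional back-off (an illustration of the policy described above, not Widar's or PetScan's actual code):

```python
import time

def sleep_for_lag(lag_seconds):
    """Pause proportionally to the reported lag (Widar-style 3x+1 policy),
    instead of only reacting once the hard maxlag threshold is exceeded."""
    if lag_seconds > 1:
        time.sleep(3 * lag_seconds + 1)  # e.g. a lag of 10s -> a 31s pause

# For a lag of 10 seconds this yields one edit every 31 seconds per worker,
# matching the "5 edits every 31 seconds" figure for 5 parallel batches.
```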

I have been thinking about this and I think I have a suggestion: bots and tools should respect maxlag before it reaches the threshold. We need to return the current lag value in the API as well, but tool developers should check the lag, even if it's only 2 or 3 seconds, then wait for that amount before making the next edit (or just sleep for that long right after the edit is made). This way they slow down before they reach the threshold, instead of having to back off completely until things return to normal and then starting the flood of edits again, and so on and so forth.

I'm not sure I follow; pywikibot at least has a write throttle, which is 10s by default, and the issue is still happening. Also, your proposal would only work if clients applied the lag handling only to write operations, not read ops; otherwise running bots would take forever.

By default pywikibot will do one edit every 10 seconds (or longer if you use parallel processes). This may be overridden by setting -pt:1 (or 0), and some bots run under this setting. Other bots may not even sleep between edits (they edit continuously), though most of them still follow maxlag.
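
For reference, the relevant Pywikibot settings, as I understand them, look roughly like this in user-config.py (check the Pywikibot documentation for the authoritative names and defaults):

```python
# user-config.py (Pywikibot) -- sketch, not authoritative
put_throttle = 10  # minimum seconds between write operations (default 10);
                   # running a script with -pt:1 lowers this to 1 second
maxlag = 5         # back off whenever the server reports lag above 5 seconds
```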

Bots that do not follow maxlag at all may be blocked; the current issue is that some bots do follow maxlag, but start and stop abruptly. This makes the lag oscillate. See T240442: Design a continuous throttling policy for Wikidata bots.

I have been thinking about this and I think I have a suggestion: bots and tools should respect maxlag before it reaches the threshold. We need to return the current lag value in the API as well, but tool developers should check the lag, even if it's only 2 or 3 seconds, then wait for that amount before making the next edit (or just sleep for that long right after the edit is made). This way they slow down before they reach the threshold, instead of having to back off completely until things return to normal and then starting the flood of edits again, and so on and so forth.

What do you think?

You don't get the maxlag value unless you exceed that threshold. What about the HTTP response header Retry-After? See T210606 and T172293.
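
As a sketch of what that could look like for a client (assuming, from memory, that a rejected request carries an error code of "maxlag" and a Retry-After header; verify against the API docs):

```python
import requests

API = "https://www.wikidata.org/w/api.php"

resp = requests.get(API, params={
    "action": "query", "meta": "siteinfo", "maxlag": 5, "format": "json",
})
data = resp.json()

if data.get("error", {}).get("code") == "maxlag":
    # Prefer the Retry-After header if present; fall back to the lag value
    # reported in the error body (field name assumed, not verified).
    wait = float(resp.headers.get("Retry-After", data["error"].get("lag", 5)))
    print(f"Server is lagged, waiting {wait}s before retrying")
```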

This is also extremely important for OpenRefine users. We cannot push our reconciliations into Wikidata if the lag > maxlag.

The RFC will take some time to get approved and implemented; that should be the mid-term plan.

Is there any short-term temporary solution until RFC gets done? Perhaps hardcode maxlag=4? Or temporarily revert WDQS inclusion into maxlag?

Is there any short-term temporary solution until RFC gets done? Perhaps hardcode maxlag=4? Or temporarily revert WDQS inclusion into maxlag?

I think this would be a decision for @Lydia_Pintscher; plus we need more info from the Search Platform team to know whether the WDQS issue is going to get fixed soon (or any ETA).

Is this some random spike? :D

obrazek.png (728×1 px, 67 KB)

I think this would be a decision for @Lydia_Pintscher; plus we need more info from the Search Platform team to know whether the WDQS issue is going to get fixed soon (or any ETA).

From the Search Platform side, we're still a few months away from a robust solution to WDQS updates. And even with our proposed streaming updater, we do expect lag to climb on occasion; that's the nature of asynchronous systems. We need a long-term, robust way to deal with WDQS lag.

I'm wondering if exposing both MySQL lag and WDQS lag as a single value is wise. Those 2 lags represent different things and clients should be free to behave differently depending on their needs.

Indeed, currently I would only see WDQS inclusion in maxlag as a band-aid while waiting for a proper solution.
We have been having some discussions on T252091 about this.

Ultimately we don't want the user to make the decision about whether they should be allowed to edit or not; in the case of bots and tools that should be a server-side decision.
We also don't want to block a user indefinitely, but instead just slow processing down in general.

I'm wondering if exposing both MySQL lag and WDQS lag as a single value is wise. Those 2 lags represent different things and clients should be free to behave differently depending on their needs.

I'm not 100% sure whether it's wise or not, but I want to point out that bots use this as "one number to model the stress on the infra" so they can back off if it's more than a certain threshold. It already takes the size of the job queue and divides it by 300, and also takes the dispatching lag (to client wikis) and divides it by 60 (IIRC), and picks the highest number. So in the sense of "all models are wrong but some are useful", it's a pretty polymorphic model already, and I don't think the issue is that.
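
As an illustration of that "one number" model, using the divisors mentioned above (which are from memory and may not match the current configuration):

```python
def combined_maxlag(db_lag, job_queue_size, dispatch_lag, wdqs_lag,
                    job_queue_factor=300, dispatch_factor=60, wdqs_factor=60):
    """Return the single lag figure reported to clients: the worst of the
    individually scaled lag sources (sketch, not the actual Wikibase code)."""
    return max(
        db_lag,
        job_queue_size / job_queue_factor,
        dispatch_lag / dispatch_factor,
        wdqs_lag / wdqs_factor,
    )

# Example: a WDQS lag of 600s dominates and reports as 10.0 -> over maxlag=5.
print(combined_maxlag(db_lag=0.2, job_queue_size=900, dispatch_lag=30, wdqs_lag=600))
```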

Adithyak1997 renamed this task from Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) to Wikidata maxlag repeatedly over 5s since Jan 20, 2020 (primarily caused by the query service). Jun 26 2020, 10:25 AM
Adithyak1997 updated the task description.

I, in my volunteer capacity and out of frustration, requested that the bot flags of two bots that each made 50K edits in the past 12 hours (= 2 per second) be revoked. That would fix the issue here. For anyone who wants to participate: https://www.wikidata.org/wiki/Wikidata:Administrators%27_noticeboard#Request_to_revoke_bot_flags_of_Edoderoobot_and_LargeDatasetBot

Something seems to be going on very recently that's a different pattern - did something change on the infrastructure side, or is there a change in usage pattern for the last few hours? Basically maxlag (WDQS lag specifically) has NOT gone below 5 (5 minutes for WDQS) for more than 1 hour. This hasn't happened, as far as I can tell, for many days, perhaps weeks or months. Typically maxlag recovers when bots stop editing after about 20-30 minutes, sometimes it takes almost an hour, but this is the longest delay in a long time. Specifically around 2020-07-21 14:04 the lag went over 5, and as of 15:18 it's grown to over 16.
(edit) Finally, at 15:35, it's coming down again (down to about 12 now).

I haven't been able to do any edits through Open Refine, and in https://phabricator.wikimedia.org/T255344 it was mentioned it could be due to the server lag.

I haven't uploaded this way before and am therefore trying a test set with just 20 items that need to be newly created. I have restarted Open Refine several times yesterday evening and again today during the day. It just stays at 0% for 1-2 hours, after which I decide to shut down the upload.

According to @Pintoch it is not a problem with OR, and I am logged in to Wikidata.
Looking for a solution, he directed me here.

Are there some tips you could give me?

Same problem here. Using Open Refine to edit Wikidata didn't work today when I tried to do a small ingest of 5 items (2 existing, 3 new).

Copying the code generated in Open Refine to QuickStatements 1 worked for adding claims to the 2 existing items, but created dozens of new items without statements for the 3 new items. (I have no idea whether this latter issue is related in any way to this ticket.)

It seems that since 8/25, WD maxlag has rarely approached 5 seconds (graph), which is substantially better than before.

Lydia_Pintscher claimed this task.

Yeah, let's call this resolved. Additional work is ongoing around improving the query service updater, which will further improve the situation.