
Wikidata maxlag repeatedly over 5s since Jan 20, 2020 (primarily caused by the query service)
Closed, ResolvedPublic

Description

Since Jan 20, 2020 the wikidata-queryservice lag is repeatedly climbing over 5s, as shown in https://grafana.wikimedia.org/d/000000156/wikidata-dispatch?orgId=1&refresh=1m&fullscreen&panelId=22&from=now-7d&to=now

This delays bot runs, making their duration unpredictable, which in turn makes interactive runs very hard, if not impossible.

Event Timeline


Over the past weeks, we noticed a huge increase of content in Wikidata. Maybe that's something worth looking at?

Over the past weeks, we noticed a huge increase of content in Wikidata. Maybe that's something worth looking at?

Wikidata content is growing at a fast and steady pace and has been for a few years now. For the last few months it's been expanding at a rate of around 3,500,000 new pages per month. So that seems unlikely to be connected.

While I understand the need for "slowing bot edits" in case of lag, I can see a problem with making the MW API and WDQS equivalent in SLA at the moment.

I wonder if it would make sense to ignore query service lag on GET requests? Those requests shouldn’t put any kind of load on the query service, after all.

That resonates with me. As a bot operator I totally understand and expect that my edits might be throttled. However, it is super painful for read-only workflows, like the one I describe in T244030: all I want is to read one wiki page, which can result in a 30 min wait time; as it's 'interactive', it makes for a very poor user experience.

Over the past weeks, we noticed a huge increase of content in Wikidata. Maybe that's something worth looking at?

Wikidata content is growing at a fast and steady pace and has been for a few years now. For the last few months it's been expanding at a rate of around 3,500,000 new pages per month. So that seems unlikely to be connected.

That rate is a lot higher than it was for the first 7 months of 2019, at close to or less than 1 million/month, so it could be related. But given the existing size of Wikidata, I'd call it a moderate increase, not a "huge increase", unless it's much bigger in some other metric than just number of items?

On the question of GET:

In T243701#5834792, @Lucas_Werkmeister_WMDE wrote:
I wonder if it would make sense to ignore query service lag on GET requests? Those requests shouldn’t put any kind of load on the query service, after all.

Is the idea here to split the "lag" parameter into separate ones for GETs and edits? That makes a lot of sense to me...

Legoktm raised the priority of this task from Medium to High. Feb 9 2020, 7:55 AM
Legoktm subscribed.

maxlag is intended to tell fully-automated bots to back off to help servers recover in times of excess lag, and the recommended setting is maxlag=5. If the server is constantly at maxlag of >=5, then it defeats the point, because bot owners will (rightly) ignore maxlag. Either the lag in WDQS needs to be fixed, or we need to introduce some scaling factor in Wikibase so that lag is usually under 5s like we have for the job queue.

Either the lag in WDQS needs to be fixed, or we need to introduce some scaling factor in Wikibase so that lag is usually under 5s like we have for the job queue.

There is a scaling factor; the actual threshold is 5 minutes, I believe (wgWikidataOrgQueryServiceMaxLagFactor is 60 in mediawiki-config.git).
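
For illustration, this is roughly how such a factor works (a minimal Python sketch; the real logic lives in Wikibase and mediawiki-config, and the function name here is made up):

```python
# Hypothetical sketch of how a scaling factor turns raw query service lag
# into the value compared against maxlag; not the actual Wikibase code.

def effective_query_service_lag(wdqs_lag_seconds, factor=60):
    """Scale raw WDQS lag so it is comparable to database replication lag."""
    # With factor=60, a WDQS lag of 300s (5 minutes) is reported as 5s,
    # i.e. exactly at the usual maxlag=5 threshold.
    return wdqs_lag_seconds / factor

print(effective_query_service_lag(300))  # 5.0  -> right at maxlag=5
print(effective_query_service_lag(900))  # 15.0 -> well over the threshold
```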

Yes, I think wgWikidataOrgQueryServiceMaxLagFactor should be way higher. Something like 120 or 300.

Hello all,
Here are some news: we are going to try to increase the maxlag threshold connected to the WDQS to 15 min, to see how it goes and whether most of the problems you encounter still occur. This change should be applied later this week.
Also, as it seems that the issue with accessing data comes from Pywikibot, we suggested that the developers remove this limit.

We just increased the factor to 180. If you are running a bot and still encountering issues frequently, please let us know.

I think increasing the factor will not make things better; it only increases the oscillation period. It even makes the query service worse (more lagged).

See https://grafana.wikimedia.org/d/000000170/wikidata-edits?orgId=1&from=1581429584959&to=1581542357438

If the rate of edits the Query Updater can handle is constant, changing the factor will not affect the average edit rate, nor the proportion of time the lag stays under a specific maxlag (assuming the edit rates above and below maxlag are constants independent of the actual maxlag value).
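
To spell the argument out (the symbols below are mine, not from the comment): in steady state the average incoming edit rate must equal the rate the updater can process, and that alone fixes the fraction of time spent below the threshold, regardless of where the threshold sits.

```latex
% r     = update rate the Query Updater can handle
% r_hi  = bot edit rate while the lag is below the threshold
% r_lo  = bot edit rate while the lag is above the threshold
% f     = fraction of time the lag stays below the threshold
\[
  f\,r_{\mathrm{hi}} + (1 - f)\,r_{\mathrm{lo}} = r
  \quad\Longrightarrow\quad
  f = \frac{r - r_{\mathrm{lo}}}{r_{\mathrm{hi}} - r_{\mathrm{lo}}}
\]
% f is independent of the threshold itself; a larger factor only stretches
% each fill-and-drain cycle without changing the share of time spent over it.
```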

@Bugreporter

I think increasing the factor will not make things better; it only increases the oscillation period

Yes that does seem to have happened - instead of a roughly 20 minute cycle, we now have about a 1-hour cycle.

Thanks all for your feedback. Since the change we performed didn't have the expected results, we're going to revert it today and keep looking for sustainable solutions.

The only way to resolve the issue is to increase the rate of edits the Query Updater can handle, or to reduce the number of triple changes per edit.

The only way to resolve the issue is to increase the rate of edits the Query Updater can handle, or to reduce the number of triple changes per edit.

Yes and no. Yes, WDQS update is our current bottleneck in the infrastructure for handling the sheer amount of edits on Wikidata (which is a good thing), but if we fix this bottleneck and make scaling possible, we will sooner or later hit another bottleneck, and maxlag is not a good way to help. We need to fix how we handle maxlag.

I have been thinking about this and I think I have a suggestion: bots and tools should respect maxlag before it reaches the threshold. We need to return the current lag value in the API as well, but tool developers should check the lag, even if it's only 2 or 3 seconds, then wait for that amount before making the next edit (or just sleep for that long right after the edit is made). This way they slow down before they reach the threshold, instead of having to back off completely until things return to normal and then starting the flood of edits again, and so on and so forth.

What do you think?

Currently Widar uses a policy of sleeping 3x+1 seconds if the lag (x seconds) is higher than one second. PetScan runs batches five in parallel, so for a lag of 10 seconds PetScan makes 5 edits every 31 seconds.
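
A minimal Python sketch of that kind of proportional back-off (an illustration of the policy described above, not Widar's or PetScan's actual code):

```python
import time

def sleep_for_lag(lag_seconds):
    """Pause proportionally to the reported lag (Widar-style 3x+1 policy),
    instead of only reacting once the hard maxlag threshold is exceeded."""
    if lag_seconds > 1:
        time.sleep(3 * lag_seconds + 1)  # e.g. a lag of 10s -> a 31s pause

# For a lag of 10 seconds this yields one edit every 31 seconds per worker,
# matching the "5 edits every 31 seconds" figure for 5 parallel batches.
```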

I have been thinking about this and I think I have a suggestion: bots and tools should respect maxlag before it reaches the threshold. We need to return the current lag value in the API as well, but tool developers should check the lag, even if it's only 2 or 3 seconds, then wait for that amount before making the next edit (or just sleep for that long right after the edit is made). This way they slow down before they reach the threshold, instead of having to back off completely until things return to normal and then starting the flood of edits again, and so on and so forth.

I'm not sure I follow; pywikibot at least has a write throttle, which is 10s by default, and the issue is still happening. Also, your proposal would only work if clients applied the lag handling only to write operations, not read ops; otherwise running bots would take forever.

By default pywikibot will do one edit every 10 seconds (or longer if you use parallel processes). This may be overridden by setting -pt:1 (or 0), and some bots run under this setting. Other bots may not even sleep between edits (they edit continuously), though most of them still follow maxlag.
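
For reference, the relevant Pywikibot settings, as I understand them, look roughly like this in user-config.py (check the Pywikibot documentation for the authoritative names and defaults):

```python
# user-config.py (Pywikibot) -- sketch, not authoritative
put_throttle = 10  # minimum seconds between write operations (default 10);
                   # running a script with -pt:1 lowers this to 1 second
maxlag = 5         # back off whenever the server reports lag above 5 seconds
```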

Bots that do not follow maxlag at all may be blocked; the current issue is that some bots do follow maxlag, but start and stop abruptly. This makes the lag oscillate. See T240442: Design a continuous throttling policy for Wikidata bots.

I have been thinking about this and I think I have a suggestion: bots and tools should respect maxlag before it reaches the threshold. We need to return the current lag value in the API as well, but tool developers should check the lag, even if it's only 2 or 3 seconds, then wait for that amount before making the next edit (or just sleep for that long right after the edit is made). This way they slow down before they reach the threshold, instead of having to back off completely until things return to normal and then starting the flood of edits again, and so on and so forth.

What do you think?

You don't get the maxlag value unless you exceed that threshold. What about the HTTP response header Retry-After? See T210606 and T172293.
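
As a sketch of what that could look like for a client (assuming, from memory, that a rejected request carries an error code of "maxlag" and a Retry-After header; verify against the API docs):

```python
import requests

API = "https://www.wikidata.org/w/api.php"

resp = requests.get(API, params={
    "action": "query", "meta": "siteinfo", "maxlag": 5, "format": "json",
})
data = resp.json()

if data.get("error", {}).get("code") == "maxlag":
    # Prefer the Retry-After header if present; fall back to the lag value
    # reported in the error body (field name assumed, not verified).
    wait = float(resp.headers.get("Retry-After", data["error"].get("lag", 5)))
    print(f"Server is lagged, waiting {wait}s before retrying")
```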

This is also extremely important for OpenRefine users. We cannot push our reconciliations into Wikidata if the lag > maxlag.

The RFC will take some time to get approved and implemented; that should be the mid-term plan.

Is there any short-term temporary solution until RFC gets done? Perhaps hardcode maxlag=4? Or temporarily revert WDQS inclusion into maxlag?

Is there any short-term temporary solution until RFC gets done? Perhaps hardcode maxlag=4? Or temporarily revert WDQS inclusion into maxlag?

I think this would be a decision for @Lydia_Pintscher; plus we need more info from the Search Platform team to know whether the WDQS issue is going to get fixed soon (or any ETA).

Is this some random spike? :D

obrazek.png (728×1 px, 67 KB)

I think this would be a decision for @Lydia_Pintscher; plus we need more info from the Search Platform team to know whether the WDQS issue is going to get fixed soon (or any ETA).

From the Search Platform side, we're still a few months away from a robust solution to WDQS updates. And even with our proposed streaming updater, we do expect lag to climb on occasion; that's the nature of asynchronous systems. We need a long-term, robust way to deal with WDQS lag.

I'm wondering if exposing both MySQL lag and WDQS lag as a single value is wise. Those 2 lags represent different things and clients should be free to behave differently depending on their needs.

Indeed, currently I would only see WDQS inclusion in maxlag as a band-aid while waiting for a proper solution.
We have been having some discussions on T252091 about this.

Ultimately we don't want the user to make the decision about whether they should be allowed to edit or not; in the case of bots and tools that should be a server-side decision.
We also don't want to block a user indefinitely, but instead just slow processing down in general.

I'm wondering if exposing both MySQL lag and WDQS lag as a single value is wise. Those 2 lags represent different things and clients should be free to behave differently depending on their needs.

I'm not 100% sure whether it's wise or not, but I want to point out that bots use this as "one number to model the stress on the infra" so they can back off if it's more than a certain threshold. It already takes the size of the job queue and divides it by 300, and also takes the dispatching lag (to client wikis) and divides it by 60 (IIRC), and picks the highest number. So in the sense of "all models are wrong but some are useful", it's a pretty polymorphic model already, and I don't think the issue is that.
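
As an illustration of that "one number" model, using the divisors mentioned above (which are from memory and may not match the current configuration):

```python
def combined_maxlag(db_lag, job_queue_size, dispatch_lag, wdqs_lag,
                    job_queue_factor=300, dispatch_factor=60, wdqs_factor=60):
    """Return the single lag figure reported to clients: the worst of the
    individually scaled lag sources (sketch, not the actual Wikibase code)."""
    return max(
        db_lag,
        job_queue_size / job_queue_factor,
        dispatch_lag / dispatch_factor,
        wdqs_lag / wdqs_factor,
    )

# Example: a WDQS lag of 600s dominates and reports as 10.0 -> over maxlag=5.
print(combined_maxlag(db_lag=0.2, job_queue_size=900, dispatch_lag=30, wdqs_lag=600))
```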

Adithyak1997 renamed this task from Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) to Wikidata maxlag repeatedly over 5s since Jan 20, 2020 (primarily caused by the query service). Jun 26 2020, 10:25 AM
Adithyak1997 updated the task description.

I, in my volunteer capacity and out of frustration, requested that the bot flags of two bots that each made 50K edits in the past 12 hours (= 2 per second) be revoked. That would fix the issue here. For anyone who wants to participate: https://www.wikidata.org/wiki/Wikidata:Administrators%27_noticeboard#Request_to_revoke_bot_flags_of_Edoderoobot_and_LargeDatasetBot

Something seems to be going on very recently that's a different pattern - did something change on the infrastructure side, or is there a change in usage pattern for the last few hours? Basically maxlag (WDQS lag specifically) has NOT gone below 5 (5 minutes for WDQS) for more than 1 hour. This hasn't happened, as far as I can tell, for many days, perhaps weeks or months. Typically maxlag recovers when bots stop editing after about 20-30 minutes, sometimes it takes almost an hour, but this is the longest delay in a long time. Specifically around 2020-07-21 14:04 the lag went over 5, and as of 15:18 it's grown to over 16.
(edit) Finally, at 15:35, it's coming down again (down to about 12 now).

I haven't been able to do any edits through Open Refine, and in https://phabricator.wikimedia.org/T255344 it was mentioned it could be due to the server lag.

I haven't uploaded this way before and am therefore trying a test set with just 20 items that need to be newly created. I have restarted Open Refine several times yesterday evening and again today during the day. It just stays at 0% for 1-2 hours, after which I decide to shut down the upload.

According to @Pintoch it is not a problem with OR, and I am logged in to Wikidata.
Looking for a solution, he directed me here.

Are there some tips you could give me?

Same problem here. Using Open Refine to edit Wikidata didn't work today when I tried to do a small ingest of 5 items (2 existing, 3 new).

Copying the code generated in Open Refine to QuickStatements 1 worked for adding claims to the 2 existing items, but created dozens of new items without statements for the 3 new items. (I have no idea whether this latter issue is related in any way to this ticket.)

It seems that since 8/25, WD maxlag has rarely approached 5 seconds (graph), which is substantially better than before.

Lydia_Pintscher claimed this task.

Yeah, let's call this resolved. Additional work is ongoing around improving the query service updater, which will further improve the situation.