Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service)
Open, High, Public

Description

Since Jan20, 2020 the wikidata-queryservice lag is repeatedly climbing over 5s, as shown in https://grafana.wikimedia.org/d/000000156/wikidata-dispatch?orgId=1&refresh=1m&fullscreen&panelId=22&from=now-7d&to=now

This delays bot runs, making their duration unpredictable, which in turn makes interactive runs very hard, if not impossible.

Event Timeline

Strainu created this task. · Sat, Jan 25, 11:18 PM
Restricted Application added a project: Wikidata. · Sat, Jan 25, 11:18 PM
Restricted Application added a subscriber: Aklapper.
Aklapper renamed this task from "Wikidata queryservice lag repeatedly over 5s since 1/20" to "Wikidata queryservice lag repeatedly over 5s since Jan20, 2020". · Sun, Jan 26, 1:57 AM
Aklapper updated the task description.
Dvorapa added a comment (edited). · Sun, Jan 26, 7:36 PM

We have been experiencing this in Pywikibot test environments and CI for the last couple of days too (T242081). I also remember discussing this issue with someone on IRC during the Christmas holidays.

Restricted Application added a project: Operations. · Sun, Jan 26, 7:48 PM
Dvorapa added a comment (edited). · Mon, Jan 27, 3:45 PM

Today the lag has been 8-10s for the whole day; none of the Pywikibot tests loading Wikidata succeed within the timeout.

I am not part of the Wikidata QS team, so I don't have answers, just questions :-D I am only chiming in because my team was tagged on this ticket. Please understand that we (SREs) are not directly in charge of this service and that someone else should answer with first-hand knowledge.

Could you provide documentation of where it is guaranteed that Wikidata query lag is going to be <5s? I searched, and the only thing I found was a draft saying:

Seconds or even a minute or two lag seems acceptable at this point

and

As anyone is free to use this endpoint, the traffic sees a lot of variability and thus the performance of the endpoint can vary quite a lot.

Discussions at T199228, which would be a good place to provide feedback/requirements, talk about setting it at <5 minutes of lag with some number of nines, not 5 seconds.

I would like to know if the WDQS team has promised <5s lag, as I would be surprised, given that the canonical data storage of all wikis (MariaDB), including Wikidata, is only considered lagged starting at 5-10 seconds of delay, and that would be a lower bound (a blocker) before indexing and postprocessing, not including downtime.

Please note that I am not saying this is invalid or impossible; I am just asking whether this should be a feature request rather than a bug report. As far as I understand, an interactive, pseudo-real-time query interface could be technically possible (?), but it would require a large architecture refactoring, and AFAIK it was not the goal of the current setup (?). Someone with actual knowledge, please feel free to correct me.

You mention:

Pywikibot test environments and CIs

But I don't see how Pywikibot depending on an external service (WDQS) for CI would be a good idea. Could you elaborate on what you are trying to achieve, so that we can better understand what the requirements are? Thank you!

Addshore renamed this task from "Wikidata queryservice lag repeatedly over 5s since Jan20, 2020" to "Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service)". · Mon, Jan 27, 4:29 PM
Addshore added a project: Wikidata-Campsite.
Addshore moved this task from Incoming to Blocked / Waiting / External on the Wikidata-Campsite board.
Addshore added a subscriber: Addshore.
Dvorapa added a comment (edited). · Mon, Jan 27, 4:32 PM

You mention:

Pywikibot test environments and CIs

But I don't see how Pywikibot depending on an external service (WDQS) for CI would be a good idea. Could you elaborate on what you are trying to achieve, so that we can better understand what the requirements are? Thank you!

Per T221774, WDQS lag is part of WD API maxlag. Thus if WDQS lag is high, API maxlag is also high, which makes Pywikibot tests (which call the WD API all the time) time out, as API maxlag has been repeatedly declared by several people in the past to always be <5s. I'm not sure how to solve the issue, as I don't know where I should look for the current API maxlag promises. Where have you found the WDQS lag promises?
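For context, the client side of the maxlag convention works roughly like this (a minimal sketch in Python with requests; the retry count is arbitrary and only the well-known parts of the protocol are used):

```
import time
import requests

API = "https://www.wikidata.org/w/api.php"

def api_get(params, maxlag=5, max_retries=5):
    """Query the API with the standard maxlag parameter.

    If the server considers itself more lagged than `maxlag` seconds,
    it answers with an error whose code is "maxlag" instead of serving
    the request; the client is expected to wait (honouring the
    Retry-After header when present) and try again.
    """
    params = dict(params, format="json", maxlag=maxlag)
    for _ in range(max_retries):
        resp = requests.get(API, params=params)
        data = resp.json()
        if data.get("error", {}).get("code") != "maxlag":
            return data
        time.sleep(int(resp.headers.get("Retry-After", 5)))
    raise RuntimeError("server stayed lagged for all retries")

# e.g. api_get({"action": "query", "meta": "siteinfo"})
```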

Seconds or even a minute or two lag seems acceptable at this point

and

As anyone is free to use this endpoint, the traffic sees a lot of variability and thus the performance of the endpoint can vary quite a lot.

Indeed, the maxlag referred to in this ticket does not reflect the actual lag value of the query service.
I believe 5s of maxlag for the query service = 5 minutes-ish of real lag.

Per T221774, WDQS lag is part of WD API maxlag. Thus if WDQS lag is high, API maxlag is also high, which makes Pywikibot tests (which call the WD API all the time) time out, as API maxlag has been repeatedly declared by several people in the past to always be <5s. I'm not sure how to solve the issue, as I don't know where I should look for the current API maxlag promises. Where have you found the WDQS lag promises?

Should tests be using the production API and site?
What are these tests doing?

Also relevant is T240442: Design a continuous throttling policy for Wikidata bots

(Please keep in mind I don't understand the differences; all I understand is that API lag is outrageously high, probably because WDQS lag is high, as those two were connected 2 months ago, and since then all our tests sometimes fail with 300s or 50min timeouts, which are the two default timeouts for Pywikibot tests.)

API maxlag has been repeatedly declared to always be <5s by several people in the past

That would sound about right for the MediaWiki API (5-10 seconds)...

If WDQS lag is high, API maxlag is also high

... but this sounds worrying if true (it is new to me); I would wait for the Wikidata experts to comment on that. The dependency does indeed look like a real problem to me, as I don't think WDQS can keep up with the same promise at the moment, based on my casual observation and understanding of the architecture. Thanks for bringing this up, as it has deeper infrastructure implications due to service dependencies. While I understand the need to slow bot edits in case of lag, I can see a problem with making the MW API and WDQS equivalent in SLA at the moment.

Where have you found WDQS lag promises?

I searched mediawiki.org and Phabricator for relevant discussions about expected uptime, latency, and lag.

For the specific issue you are facing, I might suggest reviewing the SLA expectations for the API (any of it), and timing out and erroring quickly, rather than waiting on lag, for non-interactive tests.

What are these tests doing?

They are obviously testing Pywikibot functions against several wiki project APIs. The WD API is usually asked for simple things, like some maintenance category/template for some language, which has randomly been taking 300s+ to respond in recent weeks, and today for the whole day.

For the specific issue you are facing, I might suggest reviewing the SLA expectations for the API (any of it), and timing out and erroring quickly, rather than waiting on lag, for non-interactive tests.

Yeah, there are two options: either make them fail earlier, or make them not fail the whole test suite because of their unresponsiveness. Or both...

I wonder if it would make sense to ignore query service lag on GET requests? Those requests shouldn’t put any kind of load on the query service, after all.

(On the other hand, that might make certain problems annoyingly difficult to debug, if the lag is sometimes there but not always.)

It's starting to feel like we should just create a better mechanism for this instead of using maxlag.
Per what I said in T240442#5815397, it might be nice to have Wikibase / MediaWiki tell a client how long to wait between requests, rather than indicating this maxlag and then processes having to stop when it reaches 5.
This would allow the software to make the decision for reads vs. writes etc., and also to decrease the rate slowly, rather than the very big jump we currently have at maxlag 5.
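A sketch of what the client side of such a mechanism might look like (everything here is hypothetical; "suggestedwait" is an invented response field, not an existing MediaWiki/Wikibase API key):

```
import time
import requests

API = "https://www.wikidata.org/w/api.php"

def edit_with_server_pacing(params):
    # Hypothetical client side of the proposal above: the API would
    # return a suggested pause with every response, and clients would
    # slow down gradually instead of hitting the hard maxlag=5 wall.
    resp = requests.post(API, data=dict(params, format="json")).json()
    time.sleep(resp.get("suggestedwait", 0))  # hypothetical, in seconds
    return resp
```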

Dvorapa added a subscriber: Xqt. · Tue, Jan 28, 10:15 AM
jijiki triaged this task as Medium priority. · Tue, Jan 28, 12:54 PM
Strainu added a comment (edited). · Wed, Jan 29, 7:58 AM

Should tests be using the production API and site?
What are these tests doing?

Please don't limit this bug to just tests. It is affecting normal, community-approved bot runs as well.

I would like to know if the WDQS team has promised <5s lag, as I would be surprised, given that the canonical data storage of all wikis (MariaDB), including Wikidata, is only considered lagged starting at 5-10 seconds of delay, and that would be a lower bound (a blocker) before indexing and postprocessing, not including downtime.

As others have mentioned, Pywikibot struggles with the MediaWiki lag, which is apparently linked to the WDQS one. However, in the Grafana link I provided there is a horizontal line at 5s, which suggests that 5s should be the upper limit for that particular lag.

@Addshore and others - the problem has deteriorated since Saturday - see this discussion on Wikidata: https://www.wikidata.org/wiki/Wikidata:Contact_the_development_team/Query_Service_and_search#WDQS_lag

Pasleim added a subscriber: Pasleim. · Tue, Feb 4, 7:59 PM
Xqt added a subscriber: Ladsgroup. · Thu, Feb 6, 7:41 AM

Over the past weeks, we have noticed a huge increase in content on Wikidata. Maybe that's something worth looking at?

Over the past weeks, we have noticed a huge increase in content on Wikidata. Maybe that's something worth looking at?

Wikidata content is growing at a fast and steady pace and has been for a few years now. For the last few months it's been expanding at a rate of around 3,500,000 new pages per month. So that seems unlikely to be connected.

While I understand the need to slow bot edits in case of lag, I can see a problem with making the MW API and WDQS equivalent in SLA at the moment.

I wonder if it would make sense to ignore query service lag on GET requests? Those requests shouldn’t put any kind of load on the query service, after all.

That resonates with me. As a bot operator I totally understand and expect that my edits might be throttled. However, it is super painful for read-only workflows, like the one I describe in T244030: all I want is to read one wiki page, which can result in a 30min wait time. As it's 'interactive', it makes for a very poor user experience.

Over the past weeks, we have noticed a huge increase in content on Wikidata. Maybe that's something worth looking at?

Wikidata content is growing at a fast and steady pace and has been for a few years now. For the last few months it's been expanding at a rate of around 3,500,000 new pages per month. So that seems unlikely to be connected.

That rate is a lot higher than it was for the first 7 months of 2019, at close to or less than 1 million/month, so it could be related. But given the existing size of Wikidata, I'd call it a moderate increase, not a "huge increase", unless it's much bigger in some metric other than just the number of items?

On the question of GET:

In T243701#5834792, @Lucas_Werkmeister_WMDE wrote:
I wonder if it would make sense to ignore query service lag on GET requests? Those requests shouldn’t put any kind of load on the query service, after all.

Is the idea here to split the "lag" parameter into separate ones for GETs and edits? That makes a lot of sense to me...

Legoktm raised the priority of this task from Medium to High. · Sun, Feb 9, 7:55 AM
Legoktm added a subscriber: Legoktm.

maxlag is intended to tell fully-automated bots to back off to help servers recover in times of excess lag, and the recommended setting is maxlag=5. If the server is constantly at maxlag >=5, then it defeats the point, because bot owners will (rightly) ignore maxlag. Either the lag in WDQS needs to be fixed, or we need to introduce some scaling factor in Wikibase so that lag is usually under 5s, like we have for the job queue.

Demian added a subscriber: Demian. · Mon, Feb 10, 2:34 AM

Either the lag in WDQS needs to be fixed, or we need to introduce some scaling factor in Wikibase so that lag is usually under 5s, like we have for the job queue.

There is a scaling factor; the actual threshold is 5 minutes, I believe (wgWikidataOrgQueryServiceMaxLagFactor is 60 in mediawiki-config.git).
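Assuming the factor simply divides the real query service lag before the maxlag comparison (my reading of the comment above, not verified against the Wikibase source), the arithmetic is:

```
WIKIDATA_QS_MAXLAG_FACTOR = 60  # wgWikidataOrgQueryServiceMaxLagFactor

def effective_lag(real_wdqs_lag_seconds):
    """Lag value the API would compare against a client's maxlag."""
    return real_wdqs_lag_seconds / WIKIDATA_QS_MAXLAG_FACTOR

effective_lag(300)  # -> 5.0, i.e. 5 minutes of real lag trips maxlag=5
```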

Yes, I think wgWikidataOrgQueryServiceMaxLagFactor should be way higher. Something like 120 or 300.

Hello all,
Here are some news: we are going to try to increase the maxlag connected to the WDQS to 15min, to see how it goes and whether most of the problems you encounter still occur. This change should be applied later this week.
Also, as it seems that the issue with accessing data comes from Pywikibot, we have suggested that the developers remove this limit.

We just increased the factor to 180. If you are running a bot and still encountering issues frequently, please let us know.

I think increasing the factor will not make things better; it only increases the oscillation period. It even makes the query service worse (more lagged).

See https://grafana.wikimedia.org/d/000000170/wikidata-edits?orgId=1&from=1581429584959&to=1581542357438

If the rate of edits the Query Updater can handle is a constant, changing the factor will affect neither the average edit rate nor the proportion of time the lag stays under a specific maxlag (assuming the edit rates above and below maxlag are constants independent of the actual maxlag).
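A toy simulation of this argument (assumed, much-simplified dynamics: a fixed updater throughput, and bots that stop at the threshold and resume once lag falls to half of it; none of the numbers are real measurements):

```
def simulate(threshold_s, updater_rate=10.0, bot_rate=25.0, seconds=360000):
    backlog, accepted, editing = 0.0, 0.0, True
    for _ in range(seconds):
        lag = backlog / updater_rate            # seconds needed to drain the backlog
        if editing and lag >= threshold_s:      # bots back off at the threshold...
            editing = False
        elif not editing and lag <= threshold_s / 2:
            editing = True                      # ...and pile back in once lag recovers
        incoming = bot_rate if editing else 0.0
        accepted += incoming
        backlog = max(0.0, backlog + incoming - updater_rate)
    return accepted / seconds                   # long-run accepted edit rate

for threshold in (300, 900):                    # e.g. a 5 min vs. 15 min real-lag cap
    print(threshold, round(simulate(threshold), 2))
# Both print ~10.0 (the updater rate): raising the threshold stretches the
# up/down cycle but leaves the long-run edit rate unchanged.
```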

@Bugreporter

I think increasing the factor will not make things better; it only increases the oscillation period

Yes, that does seem to have happened: instead of a roughly 20-minute cycle, we now have about a 1-hour cycle.

Thanks all for your feedback. Since the change we performed didn't have the expected results, we're going to revert it today and keep looking for sustainable solutions.

Bugreporter added a comment (edited). · Thu, Feb 13, 2:35 PM

The only way to resolve the issue is to increase the rate of edits the Query Updater can handle, or to reduce the number of triple changes per edit.

The only way to resolve the issue is to increase the rate of edits the Query Updater can handle, or to reduce the number of triple changes per edit.

Yes and no. Yes, WDQS updating is currently our infrastructure bottleneck in handling the sheer volume of edits on Wikidata (which is a good thing), but if we fix this bottleneck and make scaling possible, we will sooner or later hit another bottleneck, and maxlag is not a good way to help. We need to fix how we handle maxlag.

I have been thinking about this and I think I have a suggestion: bots and tools should respect maxlag before it reaches the threshold. We need to return the value of the maxlag in the API as well, but tool developers should check the maxlag, even if it's 3 or 2 seconds, then wait for that amount and make the next edit after that (or just sleep right after the edit is made). This way they slow down before they reach the threshold and have to back off to let things go back to normal, at which point they start the flood of edits again, and so on and so forth.

What do you think?
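In client pseudocode, the suggestion would look roughly like this (a sketch; current_lag() and submit() are stand-ins, not existing Pywikibot or API functions):

```
import time

def current_lag():
    """Stand-in: read the lag value that, under this proposal, the API
    would report alongside every response (even below the 5s threshold)."""
    raise NotImplementedError

def submit(edit):
    """Stand-in for the actual API write."""
    raise NotImplementedError

def paced_edits(edits):
    for edit in edits:
        lag = current_lag()
        if lag > 0:
            time.sleep(lag)  # slow down in proportion to the lag, well
                             # before the hard threshold is ever reached
        submit(edit)
```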

Currently Widar uses a policy of sleeping 3x+1 seconds if the lag (x seconds) is higher than one second. PetScan runs batches five in parallel, so for a lag of 10 seconds PetScan makes 5 edits every 31 seconds.
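In other words (a sketch of the policy as described above, with a hypothetical function name):

```
def widar_sleep_seconds(lag):
    # Widar's reported policy: sleep 3x+1 seconds when the lag x
    # exceeds one second, otherwise don't sleep.
    return 3 * lag + 1 if lag > 1 else 0

widar_sleep_seconds(10)  # -> 31; with PetScan's 5 parallel batches,
                         # that works out to 5 edits every 31 seconds
```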

I have been thinking about this and I think I have a suggestion: bots and tools should respect maxlag before it reaches the threshold. We need to return the value of the maxlag in the API as well, but tool developers should check the maxlag, even if it's 3 or 2 seconds, then wait for that amount and make the next edit after that (or just sleep right after the edit is made). This way they slow down before they reach the threshold and have to back off to let things go back to normal, at which point they start the flood of edits again, and so on and so forth.

I'm not sure I follow; Pywikibot at least has a write throttle, which is 10s by default, and the issue is still happening. Also, your proposal would only work if clients applied the lag-based waiting only to write operations, not read ops; otherwise running bots would take forever.

Bugreporter added a comment (edited). · Fri, Feb 14, 2:05 PM

By default Pywikibot will do one edit every 10 seconds (or longer if you use parallel processes). This may be overridden by setting -pt:1 (or 0), and some bots run under this setting. Other bots may not even sleep between edits (they edit continuously), though most of them still follow maxlag.
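For reference, the throttle in question is Pywikibot's put_throttle setting (a minimal sketch of the relevant configuration; check the Pywikibot documentation for the authoritative defaults):

```
# user-config.py: Pywikibot's write throttle, i.e. the minimum number of
# seconds between write operations (default 10; the -pt global option
# overrides it per run, e.g. -pt:1 or -pt:0)
put_throttle = 10
```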

Bots that do not follow maxlag at all may be blocked; the current issue is that some bots do follow maxlag, but start and stop abruptly, which makes the lag oscillate. T240442: Design a continuous throttling policy for Wikidata bots proposes a continuous policy to address this.