
Add Wikidata query service lag to Wikidata maxlag
Open, High · Public · 8 Story Points

Description

As a high-volume editor on Wikidata, I want to ensure that my edits do not impair query service responsiveness.

Problem:
maxlag is a parameter that API users can specify to avoid overloading the wiki: if I send an API request with maxlag=5, and the database replicas are currently more than five seconds behind the master, then MediaWiki will immediately refuse the request. Afterwards, I’m supposed to wait for a bit before retrying the request. See https://www.mediawiki.org/wiki/Manual:Maxlag_parameter.

Last year, we modified the API’s behavior so that this takes into account not just the replication lag, but also the dispatch lag (T194950: Include Wikibase dispatch lag in API "maxlag" enforcing) – if the database replicas are fine, but change dispatching to client wikis is more than 5 minutes behind, then requests with maxlag=5 will still be rejected. (The dispatchLagToMaxLagFactor is configurable, 60 in production, so the threshold for dispatch lag is in minutes instead of seconds.)
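
(For illustration only, not the actual Wikibase code: the effect of the factor, roughly, with invented variable names:)

$dispatchLagSeconds = 360;               // 6 minutes of dispatch lag
$dispatchLagToMaxLagFactor = 60;         // production value mentioned above
$effectiveLag = $dispatchLagSeconds / $dispatchLagToMaxLagFactor;  // 6
// A request with maxlag=5 would be rejected, since 6 > 5, even with zero replication lag.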

However, this does not take the query service lag into account – if updates on some or all of the WDQS servers start to lag behind, edits will continue at full speed as long as database replication and client dispatching are not affected. This can happen because query service lag depends not just on edit rate but also on the size of the entities edited (on each edit, the full entity is reloaded, even if only a small part of it was edited, so editing large items has a disproportionate impact) and the rate of external queries against the server.

BDD
GIVEN all the WDQS servers are lagged by more than one hour
WHEN I send a wbeditentity API request
AND I set the maxlag parameter to 5 seconds
THEN I should get a maxlag error
AND no edit should be made

(the GIVEN part describes a rather extreme case; once the open questions below are answered, it can perhaps be changed to a more realistic case)

Acceptance criteria:

  • the effective max lag takes query service update lag into account

Open questions:

  • What should the conversion factor be? (For dispatch lag, it’s 60 – five seconds of replication lag are equivalent to five minutes of dispatch lag.)
  • Lag between different servers can differ significantly. Do we use the mean lag? The median? The maximum? Something else? (For dispatch lag, we seem to use the median.)
  • maxlag affects all API requests, even ones that shouldn’t have any effect on query service lag, such as action=wbgetentity or action=query&meta=userinfo. Should we try to limit the impact of this change, e. g. by only using query service lag on POST requests? (On the other hand, the same question should apply to dispatch lag and we don’t seem to limit the impact of that as far as I can tell.)

Break-down

  • Adding query service lag to wikidata maxlag - main change
  • Add configuration option for Wikibase to take query service update lag into account

Event Timeline

Restricted Application added a subscriber: Aklapper. · Apr 24 2019, 2:14 PM

[Implementing this task] could be slightly tricky.
Right now the options for getting the maxlag are querying each server individually, or querying Prometheus.
Also, waiting on an external service to return a result for the maxlag before performing an action might take too long.
A maxlag value for the WDQS machines could be stored in some cache to check against, and updated periodically, perhaps by a DeferrableUpdate or something similar?
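
(A minimal sketch of that caching idea, using a WAN-cache callback here rather than a DeferrableUpdate just for brevity, and assuming a hypothetical fetchWdqsLagSeconds() helper for the slow external lookup; the cache key and TTL are made up:)

use MediaWiki\MediaWikiServices;

// Keep the last known WDQS lag in the WAN cache so that API requests only read
// a cached value instead of waiting on an external service.
$cache = MediaWikiServices::getInstance()->getMainWANObjectCache();
$lagSeconds = $cache->getWithSetCallback(
	$cache->makeKey( 'wikibase-wdqs-lag' ),
	60, // refresh at most once a minute
	function () {
		return fetchWdqsLagSeconds(); // hypothetical helper, e.g. asking Prometheus
	}
);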

Lucas_Werkmeister_WMDE updated the task description.

What should the conversion factor be? (For dispatch lag, it’s 60 – five seconds of replication lag are equivalent to five minutes of dispatch lag.)

Based on the last three months, 5 minutes seems fine but we could go for 10 if needed.

Lag between different servers can differ significantly. Do we use the mean lag? The median? The maximum? Something else? (For dispatch lag, we seem to use the median.)

As seen in the previously linked graph, there are some differences, yes. I would at least ignore the ones under 1 minute; the median of the ones above that should give a good indication.

maxlag affects all API requests, even ones that shouldn’t have any effect on query service lag, such as action=wbgetentity or action=query&meta=userinfo. Should we try to limit the impact of this change, e. g. by only using query service lag on POST requests? (On the other hand, the same question should apply to dispatch lag and we don’t seem to limit the impact of that as far as I can tell.)

(I realized yesterday why this isn’t as big a deal as I thought – API users could decide to only specify maxlag on POST requests themselves. Though I suppose we shouldn’t encourage that, because in the case of replication lag you really want all requests to be throttled.)

Lucas_Werkmeister_WMDE set the point value for this task to 8.

Let’s start with 10 minutes (factor of 120) and lower it to 5 (60) if necessary. And use the median lag of all the servers for now.
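
(Illustrative arithmetic only, using the numbers from this comment:)

$medianQueryServiceLag = 600;                      // 10 minutes of median WDQS update lag, in seconds
$factor = 120;
$effectiveLag = $medianQueryServiceLag / $factor;  // 5, i.e. enough to trip requests sent with maxlag=5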

hoo claimed this task. · May 3 2019, 12:00 PM
hoo added a subscriber: hoo.

Possible way to do this:

Create PrometheusBlazegraphLagService class which internally fetches the lag from a given Blazegraph instance like curl "http://prometheus.svc.eqiad.wmnet/ops/api/v1/query?query=scalar(time()%20-%20blazegraph_lastupdated%7Binstance%3D%22wdqs1005.eqiad.wmnet%3A9193%22%7D)" (where wdqs1005.eqiad.wmnet is to be replaced by the hostname). That would be cached (given we don't want to hit Prometheus often and as we care for lag in the 30-60m range, fetching this once every 1-5m should be fine… this could maybe even be done in a Job). We would do that for all known wdqs instances and then sum/average/… the results.

This value would then be used for adjusting maxlag, as described above.
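
(A rough sketch of such a service, using the class name and query from above; the Prometheus scalar() response shape is assumed here and caching is left out:)

class PrometheusBlazegraphLagService {
	private $prometheusUrl;

	public function __construct( $prometheusUrl ) {
		$this->prometheusUrl = $prometheusUrl;
	}

	/**
	 * @param string $instance e.g. "wdqs1005.eqiad.wmnet:9193"
	 * @return float|null lag in seconds, or null if Prometheus could not be reached
	 */
	public function getLag( $instance ) {
		$query = rawurlencode( 'scalar(time() - blazegraph_lastupdated{instance="' . $instance . '"})' );
		$json = file_get_contents( $this->prometheusUrl . '/api/v1/query?query=' . $query );
		if ( $json === false ) {
			return null;
		}
		$data = json_decode( $json, true );
		// scalar() results come back as [ <evaluation timestamp>, "<value>" ]
		return isset( $data['data']['result'][1] ) ? (float)$data['data']['result'][1] : null;
	}
}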

Things to consider:

  • Is going to Prometheus directly the right thing to do?
  • How often can we sanely hit Prometheus?
  • Where do we want to manage the list of WDQS instances for this? (Or, I guess, can we also ask Prometheus for all metrics at once?)
hoo added a subscriber: fgiunchedi. · May 3 2019, 12:02 PM
hoo added a subscriber: Smalyshev. · May 6 2019, 11:00 AM

Alternative approach:

Ask the WDQS instances directly by using MediaWiki's SparqlClient with SELECT * WHERE { <http://www.wikidata.org> schema:dateModified ?y }. That would still require caching, probably doing all of this in a job, and unfortunately we would still need a list of all WDQS instances we care about.
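
(Sketched per instance with a plain HTTP call rather than the actual SparqlClient class; the internal endpoint URL below is an assumption:)

// Hypothetical per-instance lag check via SPARQL: compare the data's
// schema:dateModified with the current time.
$endpoint = 'http://wdqs1005.eqiad.wmnet/sparql'; // assumed internal endpoint URL
$sparql = 'SELECT * WHERE { <http://www.wikidata.org> schema:dateModified ?y }';
$response = json_decode( file_get_contents( $endpoint . '?format=json&query=' . rawurlencode( $sparql ) ), true );
$dateModified = $response['results']['bindings'][0]['y']['value'];
$lagSeconds = time() - strtotime( $dateModified );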

@Smalyshev @fgiunchedi What do you think? Go through Prometheus or ask the instances individually or …?

A very minor note: WDQS depends heavily on MediaWiki nodes to function, so making MediaWiki apps depend on WDQS (which we have already done in WikibaseQualityConstraints) creates a cyclic dependency of the kind that bit Wikipedia a lot with the ORES extension and the ORES service. Overall it's not an issue, but every dependency should be as loose as possible: if WDQS doesn't respond, times out, or errors out for whatever reason, gracefully ignore it.

Smalyshev added a comment (edited). · May 7 2019, 5:11 PM

Go through Prometheus or ask the instances individually or …?

Prometheus & Icinga know what the lag is for each server, so I think it's better to ask them? I am not sure if we have good support for scheduled recurring jobs on the wiki - if we do, we could just make a script that asks Prometheus and puts the data into the Wikidata DB somewhere.

if the wdqs didn't respond or timed out and errored out for whatever reason, gracefully ignore.

Yes, this should happen anyway - if WDQS does not respond, there should be a graceful failure route, and internal functions should use sane timeouts for each specific function.

hoo added a comment (edited). · May 8 2019, 12:22 PM

@Smalyshev Do you think it would be enough to look at http://prometheus.svc.eqiad.wmnet/ops/api/v1/query?query=blazegraph_lastupdated and http://prometheus.svc.codfw.wmnet/ops/api/v1/query?query=blazegraph_lastupdated (no matter which DC MW is running in) and just take all servers into account? Or do we need a whitelist/blacklist (or both) or some other mechanism to make sure we don't e.g. take servers into account that are being maintained?

If we use median to aggregate the lags (or maybe even a higher percentile?) we might have robust enough results even if a few servers are in maintenance?

Or maybe filter on cluster?

Probably Prometheus needs to define a discovery URL: "http://prometheus.discovery.wmnet" (similar to http://ores.discovery.wmnet). @fgiunchedi knows better though.

do we need a whitelist/blacklist (or both) or some other mechanism to make sure we don't e.g. take servers into account that are being maintained?

Ideally, only servers in the public pool, or maybe the public+internal pool, should be taken into account; this would exclude maintained servers (as they are depooled for maintenance) and test servers (as they wouldn't be in the pool). But I am not sure whether it's possible or how hard it is to do something like that. Somebody from Operations should probably know more about this.

Possible way to do this:
Create PrometheusBlazegraphLagService class which internally fetches the lag from a given Blazegraph instance like curl "http://prometheus.svc.eqiad.wmnet/ops/api/v1/query?query=scalar(time()%20-%20blazegraph_lastupdated%7Binstance%3D%22wdqs1005.eqiad.wmnet%3A9193%22%7D)" (where wdqs1005.eqiad.wmnet is to be replaced by the hostname). That would be cached (given we don't want to hit Prometheus often and as we care for lag in the 30-60m range, fetching this once every 1-5m should be fine… this could maybe even be done in a Job). We would do that for all known wdqs instances and then sum/average/… the results.
This value would then be used for adjusting maxlag, as described above.
Things to consider:

  • Is going to Prometheus directly the right thing to do?
  • How often can we sanely hit Prometheus?
  • Where do we want to manage the list of WDQS instances for this? (Or, I guess, can we also ask Prometheus for all metrics at once?)

Re: frequency, even once a minute would be fine, since the query isn't heavy to run. And yes, you can ask about all instances at once, or e.g. take the max().

Which leads me to a question re: servers in maintenance: where/how is (or will be) the list of all instances and/or instances in maintenance maintained? I'm asking because, if the list of instances that should be queried is known anyway, IMHO it'd be simpler to query the lag via SPARQL and keep Prometheus out of the loop entirely. I'm saying this because IIRC the "lastupdated" value would go Blazegraph -> Prometheus -> MediaWiki.

hoo added a comment (edited). · May 13 2019, 3:03 PM

@fgiunchedi @Smalyshev How about I do something like this:

// Ask Prometheus for the last-updated timestamp of every Blazegraph instance:
$result = json_decode( file_get_contents( 'http://prometheus.svc.eqiad.wmnet/ops/api/v1/query?query=blazegraph_lastupdated' ), true )['data']['result'];
foreach ( $result as $resultByInstance ) {
	echo $resultByInstance['metric']['instance'] . ' (' . $resultByInstance['metric']['cluster'] . ') has been updated at ' . $resultByInstance['value'][1] . PHP_EOL;
}

This currently outputs:
wdqs1004:9193 (wdqs) has been updated at 1557755605
wdqs1005:9193 (wdqs) has been updated at 1557755946
wdqs1006:9193 (wdqs) has been updated at 1557758395
wdqs1003:9193 (wdqs-internal) has been updated at 1557758386
wdqs1007:9193 (wdqs-internal) has been updated at 1557758374
wdqs1008:9193 (wdqs-internal) has been updated at 1557758353
wdqs1009:9193 (wdqs-test) has been updated at 1557758403
wdqs1010:9193 (wdqs-test) has been updated at 1557758409

Now if we do this once per DC and filter with a list of clusters we care about via config ("wdqs" and "wdqs-internal" probably), would that work out?
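
(For illustration, the cluster filtering and median aggregation on top of the $result loop above could look roughly like this; it is only a sketch, not the patch:)

$allowedClusters = [ 'wdqs', 'wdqs-internal' ]; // would come from configuration
$lags = [];
foreach ( $result as $resultByInstance ) {
	if ( !in_array( $resultByInstance['metric']['cluster'], $allowedClusters, true ) ) {
		continue; // e.g. skip wdqs-test
	}
	$lags[] = time() - (int)$resultByInstance['value'][1];
}
sort( $lags );
// (upper) median of the remaining instances, 0 if nothing was found
$medianLag = $lags ? $lags[ (int)floor( count( $lags ) / 2 ) ] : 0;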

Note that wdqs-test should not be included in maxlag in any case. These are test servers, which can be lagged significantly for any number of reasons. Also, I'm not sure how easy it is to see if a server is in the pool, but it'd be nice if it were possible to exclude depooled servers, because that usually means the server is under maintenance.

Change 512393 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[mediawiki/extensions/Wikibase@master] Add Wikidata query service lag to Wikidata maxlag

https://gerrit.wikimedia.org/r/512393

Addshore moved this task from incoming to in progress on the Wikidata board. · Jun 21 2019, 11:25 PM

@maho re feedback on https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/512393

Do we even want to do this via Prometheus (WikimediaPrometheusSparqlEndpointReplicationStatus)

Sounds like the right thing to do to me too, as per https://phabricator.wikimedia.org/T221774#5165165. Warning though: I have very limited knowledge so far of what Prometheus is set up to monitor in our stack. Links are welcome :)

If I understand correctly, its main and probably only downside is when Prometheus has outdated data on the lag for some reason (if a WDQS instance is down/depooled? or the refresh job fails often?), which is the same scenario that raises your last questions, I believe:

Shall we disregard lag data if it's really old

I think here I'd rather keep counting that lag. It is, in the end, the latest known lag from the perspective of the user querying for it (last refresh, refresh rate and interval are implementation details). And on that implementation detail: if the latest known lag is much farther in the past than I expected it to be, then for safety's sake I should assume the instance is actually lagging at least that much, rather than risk ignoring that lag, which might in edge cases result in processing an edit instead of rejecting it due to lag.

Does that sound reasonable? Or did I misunderstand too much in there?

If yes, where should WikimediaPrometheusSparqlEndpointReplicationStatus live? Here or in the Wikidata.org extension?

That thing would definitely be better off living outside of the Wikibase codebase, if possible, in an extension/external library that is more Wikidata-specific. Is there such a place? What is the "Wikidata.org extension"? And did you want to name it WikidataPrometheusSparqlEndpointReplicationStatus (Wikidata instead of Wikimedia), or why Wikimedia?

I'm also wondering if the whole thing (the interface and other implementations like SparqlEndpointReplicationStatusStateHandler, etc.) could also live outside Wikibase Repo? (I'm assuming Wikibase Repo has no code related to WDQS, but I can certainly be off here, haven't checked yet.)

Do we (also?) want to have a pure SPARQL implementation?

If we do the Prometheus-based solution, then I wouldn't keep two implementations unless we want to use one as a fallback for the other. Are you thinking of using the SPARQL implementation as a fallback in case Prometheus is not reachable?

hoo added a comment. · Jul 1 2019, 4:46 PM

@maho re feedback on https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/512393

Do we even want to do this via Prometheus (WikimediaPrometheusSparqlEndpointReplicationStatus)

Sounds like the right thing to do to me too, as per https://phabricator.wikimedia.org/T221774#5165165. Warning though: I have very limited knowledge so far of what Prometheus is set up to monitor in our stack. Links are welcome :)
If I understand correctly, its main and probably only downside is when Prometheus has outdated data on the lag for some reason (if a WDQS instance is down/depooled? or the refresh job fails often?), which is the same scenario that raises your last questions, I believe:

Shall we disregard lag data if it's really old

I think here I'd rather keep counting that lag. It is, in the end, the latest known lag from the perspective of the user querying for it (last refresh, refresh rate and interval are implementation details). And on that implementation detail: if the latest known lag is much farther in the past than I expected it to be, then for safety's sake I should assume the instance is actually lagging at least that much, rather than risk ignoring that lag, which might in edge cases result in processing an edit instead of rejecting it due to lag.
Does that sound reasonable? Or did I misunderstand too much in there?

Yeah, that sounds reasonable.

If yes, where should WikimediaPrometheusSparqlEndpointReplicationStatus live? Here or in the Wikidata.org extension?

That thing would definitely be better off living outside of the Wikibase codebase, if possible, in an extension/external library that is more Wikidata-specific. Is there such a place? What is the "Wikidata.org extension"? And did you want to name it WikidataPrometheusSparqlEndpointReplicationStatus (Wikidata instead of Wikimedia), or why Wikimedia?
I'm also wondering if the whole thing (the interface and other implementations like SparqlEndpointReplicationStatusStateHandler, etc.) could also live outside Wikibase Repo? (I'm assuming Wikibase Repo has no code related to WDQS, but I can certainly be off here, haven't checked yet.)

We could create a small hook in Wikibase that lets us inject arbitrary things into max lag… and then in Wikidata.org do all the interaction with Prometheus and inject the lag.

Do we (also?) want to have a pure SPARQL implementation?

If we do the Prometheus-based solution, then I wouldn't keep two implementations unless we want to use one as a fallback for the other. Are you thinking of using the SPARQL implementation as a fallback in case Prometheus is not reachable?

I mostly thought that might be useful for third-party instances that only run one-instance query services... but it might be better to stick with YAGNI and just implement this in Wikidata.org for Wikidata.

If yes, where should WikimediaPrometheusSparqlEndpointReplicationStatus live? Here or in the Wikidata.org extension?

That thing would definitely be better off living outside of the Wikibase codebase, if possible, in an extension/external library that is more Wikidata-specific. Is there such a place? What is the "Wikidata.org extension"? And did you want to name it WikidataPrometheusSparqlEndpointReplicationStatus (Wikidata instead of Wikimedia), or why Wikimedia?
I'm also wondering if the whole thing (the interface and other implementations like SparqlEndpointReplicationStatusStateHandler, etc.) could also live outside Wikibase Repo? (I'm assuming Wikibase Repo has no code related to WDQS, but I can certainly be off here, haven't checked yet.)

We could create a small hook in Wikibase that lets us inject arbitrary things into max lag… and then in Wikidata.org do all the interaction with Prometheus and inject the lag.

There already is a hook for it, that’s how Wikibase does it :) maxlag is managed by core, after all.
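
(For reference, the core hook in question is presumably ApiMaxLagInfo; a minimal sketch of injecting an extra lag source through it, with the helper, numbers and 'type' value invented here:)

$wgHooks['ApiMaxLagInfo'][] = function ( array &$lagInfo ) {
	$queryServiceLag = getCachedQueryServiceLagSeconds(); // hypothetical helper
	$factor = 120;                                        // proposed conversion factor from above
	$scaledLag = $queryServiceLag / $factor;
	if ( $scaledLag > $lagInfo['lag'] ) {
		$lagInfo = [
			'host' => 'query.wikidata.org',
			'lag' => $scaledLag,
			'type' => 'wikibase-queryservice',
		];
	}
};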

replies to 2 comments inline

That thing would definitely be better off living outside of the Wikibase codebase, if possible, in an extension/external library that is more Wikidata-specific. Is there such a place? What is the "Wikidata.org extension"? And did you want to name it WikidataPrometheusSparqlEndpointReplicationStatus (Wikidata instead of Wikimedia), or why Wikimedia?

The Wikidata.org extension might make sense.
BUT
Once SDoC and Commons start sending data to the WDQS, we will also need this code on Commons.
The Wikidata.org extension is only on wikidata.org.

There is also the WikimediaEvents extension, which by name this doesn't totally fit into, but which might be the "right" place for now, conditionally run for wikibases hooked up to a query service.

I'm also wondering if the whole thing (the interface and other implementations like SparqlEndpointReplicationStatusStateHandler, etc.) could also live outside Wikibase Repo? (I'm assuming Wikibase Repo has no code related to WDQS, but I can certainly be off here, haven't checked yet.)

Do we (also?) want to have a pure SPARQL implementation?

If we do the Prometheus-based solution, then I wouldn't keep two implementations unless we want to use one as a fallback for the other. Are you thinking of using the SPARQL implementation as a fallback in case Prometheus is not reachable?

Thinking about this more, I think this code could live just fine in Wikibase itself, and just be configurable.
There is no reason that other users of Wikibase and the query service would not want to use this.

We could create a small hook in Wikibase that lets us inject arbitrary things into max lag… and then in Wikidata.org do all the interaction with Prometheus and inject the lag.

See the comment above about wikidata.org vs the Wikibase repo in general.

Do we (also?) want to have a pure SPARQL implementation?

If we do the Prometheus-based solution, then I wouldn't keep two implementations unless we want to use one as a fallback for the other. Are you thinking of using the SPARQL implementation as a fallback in case Prometheus is not reachable?

I mostly thought that might be useful for third-party instances that only run one-instance query services... but it might be better to stick with YAGNI and just implement this in Wikidata.org for Wikidata.

I agree, not yet :) Let's just fix it for us first.

WMDE-leszek triaged this task as High priority. · Mon, Aug 5, 2:09 PM
noarave removed hoo as the assignee of this task. · Thu, Aug 8, 10:10 AM