
Add Wikidata query service lag to Wikidata maxlag
Open, Needs Triage · Public · 8 Story Points

Description

As a high-volume editor on Wikidata, I want to ensure that my edits do not impair query service responsiveness.

Problem:
maxlag is a parameter that API users can specify to avoid overloading the wiki: if I send an API request with maxlag=5, and the database replicas are currently more than five seconds behind the master, then MediaWiki will immediately refuse the request. Afterwards, I’m supposed to wait for a bit before retrying the request. See https://www.mediawiki.org/wiki/Manual:Maxlag_parameter.
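
For illustration, a minimal client-side sketch of that behaviour (the error code "maxlag" and the Retry-After header come from the manual linked above; the fixed five-second back-off and the example query are just placeholders):

<?php
// Example only: send an API request with maxlag=5 and back off while the
// servers report lag. A real client should honour the Retry-After header.
$url = 'https://www.wikidata.org/w/api.php?action=query&meta=siteinfo&format=json&maxlag=5';
$context = stream_context_create( [ 'http' => [ 'user_agent' => 'maxlag-example/0.1' ] ] );
do {
	$response = json_decode( file_get_contents( $url, false, $context ), true );
	$lagged = isset( $response['error'] ) && $response['error']['code'] === 'maxlag';
	if ( $lagged ) {
		sleep( 5 ); // wait a bit before retrying, as the manual recommends
	}
} while ( $lagged );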

Last year, we modified the API’s behavior so that this takes into account not just the replication lag, but also the dispatch lag (T194950: Include Wikibase dispatch lag in API "maxlag" enforcing) – if the database replicas are fine, but change dispatching to client wikis is more than 5 minutes behind, then requests with maxlag=5 will still be rejected. (The dispatchLagToMaxLagFactor is configurable, 60 in production, so the threshold for dispatch lag is in minutes instead of seconds.)
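
For reference, that factor is a Wikibase repo setting; a configuration sketch (assuming it is set via $wgWBRepoSettings like other repo settings, with the production value of 60):

// One second of maxlag budget corresponds to 60 seconds of dispatch lag,
// i.e. maxlag=5 starts rejecting requests once dispatching is ~5 minutes behind.
$wgWBRepoSettings['dispatchLagToMaxLagFactor'] = 60;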

However, this does not take the query service lag into account – if updates on some or all of the WDQS servers start to lag behind, edits will continue at full speed as long as database replication and client dispatching are not affected. This can happen because query service lag depends not just on edit rate but also on the size of the entities edited (on each edit, the full entity is reloaded, even if only a small part of it was edited, so editing large items has a disproportionate impact) and the rate of external queries against the server.

BDD
GIVEN all the WDQS servers are lagged by more than one hour
WHEN I send a wbeditentity API request
AND I set the maxlag parameter to 5 seconds
THEN I should get a maxlag error
AND no edit should be made

(the GIVEN part describes a rather extreme case; once the open questions below are clarified, it can perhaps be changed to a more realistic one)

Acceptance criteria:

  • the effective max lag takes query service update lag into account

Open questions:

  • What should the conversion factor be? (For dispatch lag, it’s 60 – five seconds of replication lag are equivalent to five minutes of dispatch lag.)
  • Lag between different servers can differ significantly. Do we use the mean lag? The median? The maximum? Something else? (For dispatch lag, we seem to use the median.)
  • maxlag affects all API requests, even ones that shouldn’t have any effect on query service lag, such as action=wbgetentity or action=query&meta=userinfo. Should we try to limit the impact of this change, e. g. by only using query service lag on POST requests? (On the other hand, the same question should apply to dispatch lag and we don’t seem to limit the impact of that as far as I can tell.)

Event Timeline

Restricted Application added a subscriber: Aklapper. · Apr 24 2019, 2:14 PM

[Implementing this task] could be slightly tricky.

Right now the options for getting the maxlag are querying each server individually, or querying prometheus.

Also, waiting on an external service to return a result for the maxlag before performing an action might take too long.

A maxlag value for the wdqs machines could be stored in some cache for checking, and updated periodically, perhaps by a DeferrableUpdate or something similar?

Lucas_Werkmeister_WMDE updated the task description.

What should the conversion factor be? (For dispatch lag, it’s 60 – five seconds of replication lag are equivalent to five minutes of dispatch lag.)

Based on the last three months, 5 minutes seems fine but we could go for 10 if needed.

Lag between different servers can differ significantly. Do we use the mean lag? The median? The maximum? Something else? (For dispatch lag, we seem to use the median.)

As seen in the previously linked graph, there are some differences, yes. I would at least ignore the ones under 1 minute; the median of the ones above that should give a good indication.

maxlag affects all API requests, even ones that shouldn’t have any effect on query service lag, such as action=wbgetentity or action=query&meta=userinfo. Should we try to limit the impact of this change, e. g. by only using query service lag on POST requests? (On the other hand, the same question should apply to dispatch lag and we don’t seem to limit the impact of that as far as I can tell.)

(I realized yesterday why this isn’t as big a deal as I thought – API users could decide to only specify maxlag on POST requests themselves. Though I suppose we shouldn’t encourage that, because in the case of replication lag you really want all requests to be throttled.)

Lucas_Werkmeister_WMDE set the point value for this task to 8.

Let’s start with 10 minutes (factor of 120) and lower it to 5 (60) if necessary. And use the median lag of all the servers for now.
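
A minimal sketch of how that decision could feed into the effective lag (the function and variable names are hypothetical, not actual Wikibase code; only the factor of 120 and the use of the median come from this comment):

/**
 * Convert per-server query service lags (in seconds) into a value comparable
 * to database replication lag for maxlag purposes. Hypothetical helper.
 */
function getQueryServiceMaxLagContribution( array $lagsPerServer, $factor = 120 ) {
	if ( $lagsPerServer === [] ) {
		return 0;
	}
	sort( $lagsPerServer );
	$count = count( $lagsPerServer );
	$middle = (int)floor( ( $count - 1 ) / 2 );
	// Median lag of all servers, as decided above.
	$median = $count % 2
		? $lagsPerServer[$middle]
		: ( $lagsPerServer[$middle] + $lagsPerServer[$middle + 1] ) / 2;
	// With a factor of 120, ten minutes of query service lag (600 s) count as
	// 5 seconds of lag, i.e. enough to trip the common maxlag=5 threshold.
	return $median / $factor;
}

// The effective lag would then be the worst of the individual sources, e.g.:
// $effectiveLag = max( $replicationLag, $dispatchLag / 60, getQueryServiceMaxLagContribution( $lags ) );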

hoo claimed this task.
hoo added a subscriber: hoo.

Possible way to do this:

Create a PrometheusBlazegraphLagService class which internally fetches the lag from a given Blazegraph instance like curl "http://prometheus.svc.eqiad.wmnet/ops/api/v1/query?query=scalar(time()%20-%20blazegraph_lastupdated%7Binstance%3D%22wdqs1005.eqiad.wmnet%3A9193%22%7D)" (where wdqs1005.eqiad.wmnet is to be replaced by the hostname). That would be cached (given we don't want to hit Prometheus often and as we care for lag in the 30-60m range, fetching this once every 1-5m should be fine… this could maybe even be done in a Job). We would do that for all known wdqs instances and then sum/average/… the results.

This value would then be used for adjusting maxlag, as described above.
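
A rough sketch of what that could look like (the class name follows the proposal above; BagOStuff and the then-current Http::get helper are used for caching and HTTP, and everything else is illustrative rather than actual Wikibase code):

class PrometheusBlazegraphLagService {

	private $cache; // BagOStuff
	private $prometheusUrl; // e.g. 'http://prometheus.svc.eqiad.wmnet/ops'
	private $ttl; // cache TTL in seconds

	public function __construct( BagOStuff $cache, $prometheusUrl, $ttl = 120 ) {
		$this->cache = $cache;
		$this->prometheusUrl = $prometheusUrl;
		$this->ttl = $ttl;
	}

	/**
	 * Lag of one instance in seconds, cached so that Prometheus is only hit
	 * every few minutes rather than on every API request.
	 * @return float|null null if Prometheus could not be reached
	 */
	public function getLag( $instance ) {
		$key = $this->cache->makeKey( 'wdqs-lag', $instance );
		$lag = $this->cache->get( $key );
		if ( $lag === false ) {
			$query = rawurlencode(
				'scalar(time() - blazegraph_lastupdated{instance="' . $instance . '"})'
			);
			$json = Http::get( $this->prometheusUrl . '/api/v1/query?query=' . $query );
			if ( !is_string( $json ) ) {
				return null; // fail gracefully instead of blocking edits
			}
			$data = json_decode( $json, true );
			// For a scalar query the result is [ <timestamp>, "<value>" ].
			$lag = (float)( $data['data']['result'][1] ?? 0 );
			$this->cache->set( $key, $lag, $this->ttl );
		}
		return $lag;
	}
}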

Things to consider:

  • Is going to Prometheus directly the right thing to do?
  • How often can we sanely hit Prometheus?
  • Where do we want to manage the list of WDQS instances for this? (Or I guess can we also ask Prometheus for all metrics at once?)
hoo added a subscriber: fgiunchedi. · May 3 2019, 12:02 PM
hoo added a subscriber: Smalyshev. · May 6 2019, 11:00 AM

Alternative approach:

Ask the WDQS instances directly by using MediaWiki's SparqlClient with SELECT * WHERE { <http://www.wikidata.org> schema:dateModified ?y }. That would still require caching, probably doing all of this in a job and unfortunately we would still need a list of all WDQS instances we care about.
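
A sketch of that alternative, assuming SparqlClient::query() returns rows keyed by variable name (the helper function itself is hypothetical):

use MediaWiki\Sparql\SparqlClient;

// Hypothetical: compute the update lag of one WDQS instance from the
// schema:dateModified triple of the wikidata.org node.
function getWdqsLagViaSparql( SparqlClient $client ) {
	$rows = $client->query(
		'SELECT * WHERE { <http://www.wikidata.org> schema:dateModified ?y }'
	);
	if ( !$rows ) {
		return null; // no result: treat the lag as unknown
	}
	// ?y is an xsd:dateTime literal, e.g. "2019-05-06T11:00:00Z"
	$lastUpdated = strtotime( $rows[0]['y'] );
	return $lastUpdated ? time() - $lastUpdated : null;
}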

@Smalyshev @fgiunchedi What do you think? Go through Prometheus or ask the instances individually or …?

A very minor note: WDQS depends heavily on mediawiki nodes to function, so making mediawiki depend on WDQS (which we have already done in WikibaseQualityConstraints) creates a cyclic dependency, which bit Wikipedia badly with the ores extension and the ores service. Overall it's not a blocker, but every dependency should be as loose as possible: if wdqs doesn't respond, times out or errors out for whatever reason, gracefully ignore it.

Smalyshev added a comment (edited). · May 7 2019, 5:11 PM

Go through Prometheus or ask the instances individually or …?

prometheus & icinga know what the lag is for each server, so I think it's better to ask them? I am not sure if we have good support for scheduled recurring jobs on wiki – if we do, we could just make a script that asks prometheus and puts the data into the Wikidata DB somewhere.
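
A sketch of what such a scheduled script could look like (a hypothetical maintenance script, not existing code; it stores the per-server lags in the local-cluster object cache rather than a DB table, purely for illustration):

<?php
// Run periodically, e.g. from cron or a systemd timer; the maxlag check can
// then read the cached value instead of talking to Prometheus itself.
require_once __DIR__ . '/Maintenance.php';

class UpdateQueryServiceLag extends Maintenance {
	public function execute() {
		$json = file_get_contents(
			'http://prometheus.svc.eqiad.wmnet/ops/api/v1/query?query=blazegraph_lastupdated'
		);
		if ( $json === false ) {
			return; // Prometheus unreachable; keep the previously stored value
		}
		$now = time();
		$lags = [];
		foreach ( json_decode( $json, true )['data']['result'] as $row ) {
			$lags[ $row['metric']['instance'] ] = $now - (int)$row['value'][1];
		}
		$cache = ObjectCache::getLocalClusterInstance();
		$cache->set( $cache->makeKey( 'wdqs-lags' ), $lags, 300 );
	}
}

$maintClass = UpdateQueryServiceLag::class;
require_once RUN_MAINTENANCE_IF_MAIN;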

if the wdqs didn't respond or timed out and errored out for whatever reason, gracefully ignore.

Yes, this should happen anyway – if WDQS does not respond, there should be a graceful failure route, and internal functions should use sane timeouts for each specific function.

hoo added a comment (edited). · May 8 2019, 12:22 PM

@Smalyshev Do you think it would be enough to look at http://prometheus.svc.eqiad.wmnet/ops/api/v1/query?query=blazegraph_lastupdated and http://prometheus.svc.codfw.wmnet/ops/api/v1/query?query=blazegraph_lastupdated (no matter which DC MW is running in) and just take all servers into account? Or do we need a whitelist/blacklist (or both) or some other mechanism to make sure we don't e.g. take servers into account that are being maintained?

If we use median to aggregate the lags (or maybe even a higher percentile?) we might have robust enough results even if a few servers are in maintenance?

Or maybe filter on cluster?

Probably prometheus needs to define a discovery url: "http://prometheus.discovery.wmnet" (similar to http://ores.discovery.wmnet) @fgiunchedi knows better though.

do we need a whitelist/blacklist (or both) or some other mechanism to make sure we don't e.g. take servers into account that are being maintained?

Ideally, only servers in the public pool, or maybe the public+internal pool, should be taken into account; this would exclude servers under maintenance (as they are depooled for maintenance) and test servers (as they wouldn't be in the pool). But I am not sure whether it's possible or how hard it is to do something like that. Somebody from Operations should probably know more about this.

Possible way to do this:

Create a PrometheusBlazegraphLagService class which internally fetches the lag from a given Blazegraph instance like curl "http://prometheus.svc.eqiad.wmnet/ops/api/v1/query?query=scalar(time()%20-%20blazegraph_lastupdated%7Binstance%3D%22wdqs1005.eqiad.wmnet%3A9193%22%7D)" (where wdqs1005.eqiad.wmnet is to be replaced by the hostname). That would be cached (given we don't want to hit Prometheus often and as we care for lag in the 30-60m range, fetching this once every 1-5m should be fine… this could maybe even be done in a Job). We would do that for all known wdqs instances and then sum/average/… the results.

This value would then be used for adjusting maxlag, as described above.

Things to consider:

  • Is going to Prometheus directly the right thing to do?
  • How often can we sanely hit Prometheus?
  • Where do we want to manage the list of WDQS instances for this? (Or I guess can we also ask Prometheus for all metrics at once?)

re: frequency, even once a minute would be fine, since the query isn't heavy to run. And yes, you can ask about all instances at once, or e.g. take the max().

Which leads me to a question re: servers in maintenance: where / how is (or will be) the list of all instances and/or instances in maintenance maintained? I'm asking because, if the list of instances that should be queried is known anyway, IMHO it'd be simpler to query the lag via sparql and keep prometheus out of the loop entirely – IIRC the "lastupdated" value would otherwise go blazegraph -> prometheus -> mediawiki.

hoo added a comment (edited). · May 13 2019, 3:03 PM

@fgiunchedi @Smalyshev How about I do something like this:

// Fetch blazegraph_lastupdated for all instances in one Prometheus query
// and print the per-instance timestamps and cluster names:
$result = json_decode( file_get_contents( 'http://prometheus.svc.eqiad.wmnet/ops/api/v1/query?query=blazegraph_lastupdated' ), true )['data']['result'];
foreach ( $result as $resultByInstance ) {
	echo $resultByInstance['metric']['instance'] . ' (' . $resultByInstance['metric']['cluster'] . ') has been updated at ' . $resultByInstance['value'][1] . PHP_EOL;
}
wdqs1004:9193 (wdqs) has been updated at 1557755605
wdqs1005:9193 (wdqs) has been updated at 1557755946
wdqs1006:9193 (wdqs) has been updated at 1557758395
wdqs1003:9193 (wdqs-internal) has been updated at 1557758386
wdqs1007:9193 (wdqs-internal) has been updated at 1557758374
wdqs1008:9193 (wdqs-internal) has been updated at 1557758353
wdqs1009:9193 (wdqs-test) has been updated at 1557758403
wdqs1010:9193 (wdqs-test) has been updated at 1557758409

Now if we do this once per DC and filter with a list of clusters we care about via config ("wdqs" and "wdqs-internal" probably), would that work out?
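
For illustration, a hedged sketch of that filtering step on top of the snippet above (the $clustersToMonitor variable is made up; only the cluster names come from this comment):

// Hypothetical config: clusters whose lag should count towards maxlag.
$clustersToMonitor = [ 'wdqs', 'wdqs-internal' ];

$lags = [];
foreach ( $result as $resultByInstance ) {
	if ( !in_array( $resultByInstance['metric']['cluster'], $clustersToMonitor, true ) ) {
		continue; // skips wdqs-test and anything else we don't care about
	}
	// value[0] is the evaluation timestamp, value[1] the blazegraph_lastupdated value
	$lags[ $resultByInstance['metric']['instance'] ] =
		$resultByInstance['value'][0] - $resultByInstance['value'][1];
}
// $lags now maps instance => lag in seconds and can be fed into the
// median/factor calculation discussed earlier.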

Note that wdqs-test should not be included in maxlag in any case. These are test servers, which can be lagged significantly for any number of reasons. Also, I'm not sure how easy it is to see whether a server is in the pool, but it'd be nice if it were possible to exclude depooled servers, because that usually means the server is under maintenance.

Change 512393 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[mediawiki/extensions/Wikibase@master] Add Wikidata query service lag to Wikidata maxlag

https://gerrit.wikimedia.org/r/512393