
WDQS server/updater performance issues
Closed, Invalid · Public

Description

The situation with update lag keeps deteriorating (it's 2 hours behind now and not improving), and it looks like we've reached the bottleneck for capacity. The servers with no load seem to be keeping up fine, but the loaded ones keep falling behind.

Possible solutions:

  • Lower query timeout
  • Add more servers to serve the load
  • Reduce throttling thresholds

It looks like the update load has increased significantly recently, and we have to keep up somehow.

Any other ideas are welcome. We have a long-term plan to look into update performance within Blazegraph, but it will probably take significant time to develop something working, and in the meantime we have servers crumbling under the load.

Event Timeline

Restricted Application added a project: Wikidata. · View Herald Transcript · Nov 10 2018, 1:17 AM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Smalyshev triaged this task as Unbreak Now! priority. · Nov 10 2018, 1:17 AM
Restricted Application added subscribers: Liuxinyu970226, TerraCodes. · View Herald Transcript · Nov 10 2018, 1:17 AM

Looking at the servers, we have very low update throughput:

01:17:43.485 [main] INFO  org.wikidata.query.rdf.tool.Updater - Polled up to 2018-11-09T23:13:34Z at (3.5, 3.9, 1.9) updates per second and (543.7, 598.2, 291.9) milliseconds per second

This probably means query load is too high for them. Normal throughput should be at least 8-10 updates per second.
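To make the log line above easier to monitor, the throughput triples can be pulled out mechanically. This is an illustrative sketch, not part of the Updater: the regex and the field names (`u1`/`u5`/`u15` for the three updates-per-second figures, assumed here to be rolling averages) are our own.

```python
import re

# Hypothetical parser for the Updater's "Polled up to ..." log line.
LOG_RE = re.compile(
    r"Polled up to (?P<ts>\S+) at "
    r"\((?P<u1>[\d.]+), (?P<u5>[\d.]+), (?P<u15>[\d.]+)\) updates per second "
    r"and \((?P<m1>[\d.]+), (?P<m5>[\d.]+), (?P<m15>[\d.]+)\) milliseconds per second"
)

def parse_updater_line(line: str) -> dict:
    m = LOG_RE.search(line)
    if m is None:
        raise ValueError("not an Updater poll line")
    d = m.groupdict()
    # Timestamp stays a string; all rate figures become floats.
    return {k: (d[k] if k == "ts" else float(d[k])) for k in d}

line = ("01:17:43.485 [main] INFO  org.wikidata.query.rdf.tool.Updater - "
        "Polled up to 2018-11-09T23:13:34Z at (3.5, 3.9, 1.9) updates per second "
        "and (543.7, 598.2, 291.9) milliseconds per second")
stats = parse_updater_line(line)
print(stats["u1"])  # 3.5 updates/s, well below the healthy 8-10 range
```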

One thing to consider here to stop the situation getting too terrible would be to add the wdqs lag to the maxlag for wikidata.org

The edit rate on wikidata drops massively when the maxlag hits 5 for either DBs or the dispatching process.

Looking at the edit rate and page-creation rate during the period when the query service seemed to struggle to update, it doesn't look like there was any real increase for the ~20 hours in question (10 hours in which the lag was rising and 10 in which it was falling). There was also no real increase in the bytes of data added during that time.

Looking at the wdqs dashboards too, I don't really see any sharp increase in queries, updates, load, etc. Perhaps an increase in the number of procs running?

I guess there is one updater per wdqs host? And the updaters run on the same machine that answers queries? I wonder if running them on a different host might provide a small bit of processing relief for the wdqs hosts?

Regarding what I said in T209201#4738324 re adding the wdqs lag to maxlag... that could be slightly tricky.

Right now the options for getting the maxlag are querying each server individually, or querying Prometheus.

Also, waiting on an external service to return a maxlag result before performing an action might take too long.

A maxlag for the wdqs machines could be stored in some cache for checking, and updated periodically, perhaps by a DeferrableUpdate or something similar?
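The cache-plus-periodic-refresh idea above can be sketched in a language-neutral way. This is not MediaWiki code (the real thing would be PHP, likely a DeferrableUpdate writing to memcached); it is just an illustration of the pattern: API requests read a cached lag value instantly, and only a stale cache triggers the slow query against the wdqs hosts or Prometheus.

```python
import time

class CachedLag:
    """Illustrative sketch: serve the most recently computed WDQS lag from a
    local cache, recomputing it only after a TTL expires, so that request
    handling never blocks on querying the WDQS hosts."""

    def __init__(self, fetch_lag, ttl_seconds=60):
        self.fetch_lag = fetch_lag      # slow callable (servers / Prometheus)
        self.ttl = ttl_seconds
        self._value = None
        self._fetched_at = 0.0

    def get(self):
        now = time.monotonic()
        if self._value is None or now - self._fetched_at > self.ttl:
            # In MediaWiki this refresh would run as a deferred/periodic job.
            self._value = self.fetch_lag()
            self._fetched_at = now
        return self._value

cache = CachedLag(fetch_lag=lambda: 7.2, ttl_seconds=60)
print(cache.get())  # 7.2, fetched once; later calls within the TTL hit the cache
```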

FYI that is now filed as T209459, and afaik only affects requests with the "oresscores" param

Could it be related to this report?

We are not using RC changes now, but Kafka stream, so not likely.

I wonder if running them on a different host might provide a small bit of processing relief for the wdqs hosts?

The Updater process does not consume a lot of CPU (most of the work is done by Blazegraph; the Updater just creates the query and downloads the data), so moving it is not very likely to have much effect.

Addshore added a comment. · Edited · Nov 16 2018, 3:46 PM

So it sounds like the only place to fix this is within Blazegraph itself?
Do we have any idea of the rate of changes that starts to cause issues?

Again looking at the edit rate and creation rate on wikidata in relation to the lags on wdqs I don't really see a correlation.
From here the correlation must be between general wdqs request load and the updates :/

If the result of this is that the request timeout etc are lowered it might be worth revisiting T187424 and or T104762
Fully live, real-time queries are great, but maybe they need to be limited some more while we still allow people to make the larger queries.

So it sounds like the only place to fix this is within Blazegraph itself?

One of the solutions may be to try to figure out how to do faster updates. Another would be to add servers to the production cluster to spread query load. The pattern seems to be very dependent on query load, and does not happen on the non-public cluster.

Again looking at the edit rate and creation rate on wikidata in relation to the lags on wdqs I don't really see a correlation.

I don't think it's edit load by itself. It's edit load PLUS data size PLUS query load that pushes the public servers over the edge where we see lags. Edit load alone doesn't do it. That said, I notice a significant increase pattern here: https://grafana.wikimedia.org/dashboard/db/wikidata-edits?refresh=1m&panelId=7&fullscreen&orgId=1 (note that both the RC and Kafka streams process revisions, so the RC count is the metric to look at).

If the result of this is that the request timeout etc. are lowered it might be worth revisiting

Yes, but it's a project for which we currently don't have enough bandwidth. We're working on changing that. As a side note, I have a bot that is meant to run queued, stored requests: https://commons.wikimedia.org/wiki/User:TabulistBot but it looks like nobody is interested in using it.

One thing to consider here to stop the situation getting too terrible would be to add the wdqs lag to the maxlag for wikidata.org

@Addshore I am not very well informed on this one: what's maxlag, and how does it work?

Smalyshev lowered the priority of this task from Unbreak Now! to High. · Nov 17 2018, 1:44 AM

One thing to consider here to stop the situation getting too terrible would be to add the wdqs lag to the maxlag for wikidata.org

@Addshore I am not very well informed on this one: what's maxlag, and how does it work?

maxlag is a parameter that API users can specify to avoid overloading the wiki: if I send an API request with maxlag=5, and the database replicas are currently more than five seconds behind the master, then MediaWiki will immediately refuse the request. Afterwards, I’m supposed to wait for a bit before retrying the request. See https://www.mediawiki.org/wiki/Manual:Maxlag_parameter.
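The maxlag protocol described above is straightforward to sketch from the client side. The URL construction below targets the real Wikidata API endpoint, but the retry loop is exercised here against simulated responses rather than the live API; a production client should also honour the Retry-After header, which this sketch omits.

```python
import time
from urllib.parse import urlencode

API = "https://www.wikidata.org/w/api.php"

def build_url(maxlag=5):
    # Any read request can carry maxlag; MediaWiki rejects it with error
    # code "maxlag" when replica lag exceeds the threshold.
    params = {"action": "query", "meta": "siteinfo",
              "format": "json", "maxlag": maxlag}
    return API + "?" + urlencode(params)

def call_with_maxlag(fetch, retries=3, backoff=5.0):
    """fetch() -> parsed JSON response. Retries when the API reports maxlag.
    Sketch only; real clients should wait as advised by Retry-After."""
    for _ in range(retries):
        resp = fetch()
        if resp.get("error", {}).get("code") != "maxlag":
            return resp
        time.sleep(backoff)
    raise RuntimeError("gave up: servers still lagged")

# Simulated responses: first call rejected for lag, second succeeds.
responses = iter([
    {"error": {"code": "maxlag",
               "info": "Waiting for a database server: 7 seconds lagged."}},
    {"batchcomplete": ""},
])
print(call_with_maxlag(lambda: next(responses), backoff=0))  # {'batchcomplete': ''}
```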

Last year, we modified the API’s behavior so that this takes into account not just the replication lag, but also the dispatch lag (T194950: Include Wikibase dispatch lag in API "maxlag" enforcing) – if the database replicas are fine, but change dispatching to client wikis is more than 5 minutes behind, then requests with maxlag=5 will still be rejected. (The dispatchLagToMaxLagFactor is configurable, 60 in production, so the threshold for dispatch lag should be in minutes instead of seconds if I’m not mistaken.) So if we can more or less easily get the (average? median? max?) query service update lag from within Wikibase, then it might make sense to include that lag in the calculation as well, with another configurable factor.
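The factor arithmetic above can be made concrete with a small worked example. Assumptions are flagged in the comments: the dispatch factor of 60 mirrors the production dispatchLagToMaxLagFactor mentioned above, but the wdqs factor (and the idea of taking the max of the scaled lags) is our own hypothetical sketch of how a third lag source could be folded in.

```python
# Hypothetical sketch: scale each slower-moving lag down by its configurable
# factor so all three are comparable with the seconds-based maxlag threshold,
# then take the worst. dispatch_factor=60 matches production; wdqs_factor is
# an assumption for illustration.
def effective_max_lag(db_lag_s, dispatch_lag_s, wdqs_lag_s,
                      dispatch_factor=60, wdqs_factor=60):
    return max(db_lag_s,
               dispatch_lag_s / dispatch_factor,
               wdqs_lag_s / wdqs_factor)

# DB replicas fine (1s), dispatch 10 minutes behind: 600/60 = 10 "seconds"
# of effective lag, so a request with maxlag=5 would be rejected.
print(effective_max_lag(1.0, 600.0, 0.0))  # 10.0
```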

As outlined in T209201#4738340, the implementation might be tricky, but I think this should in principle be doable, and would probably be a good idea.

One thing to consider here to stop the situation getting too terrible would be to add the wdqs lag to the maxlag for wikidata.org

Now a dedicated task: T221774: Add Wikidata query service lag to Wikidata maxlag

Addshore moved this task from In Progress to Monitoring on the Wikidata board. · Jun 22 2019, 10:38 PM
Addshore closed this task as Invalid. · Apr 17 2020, 6:45 PM

Invalid, as I believe this is now tracked under T235759.

Restricted Application removed a subscriber: Liuxinyu970226. · View Herald Transcript · Apr 17 2020, 6:45 PM