
WDQS has had high update lag for the last week
Closed, Resolved · Public

Description

Lag on the public WDQS servers is climbing, reaching > 12h. This makes a lot of workflows unusable.

We don't have a clear understanding of what exactly is going wrong. Update rate is a recurring issue; we might just have reached a tipping point. There are a few potential medium-term improvements:

  • switch to a new updater process: T212826
  • increase parallelism in updater: T238045

Also, T221774: Add Wikidata query service lag to Wikidata maxlag should in the future suspend most Wikidata edits if update lag is too high, allowing the servers to recover automatically.
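For context on how that is expected to work: API clients that send a `maxlag` parameter get their requests rejected with a `maxlag` error (and a `Retry-After` header) while the reported lag is above their threshold, and T221774 folds WDQS update lag into that reported number. A minimal, illustrative sketch of a client honouring it (the endpoint handling and retry policy below are assumptions, not a prescribed implementation):

```
import time
import requests

API = "https://www.wikidata.org/w/api.php"

def api_request(params, maxlag=5, retries=5):
    """POST to the API, backing off whenever the reported lag
    (which T221774 extends to include WDQS update lag) exceeds maxlag."""
    params = dict(params, maxlag=maxlag, format="json")
    for _ in range(retries):
        resp = requests.post(API, data=params)
        body = resp.json()
        if body.get("error", {}).get("code") != "maxlag":
            return body  # success, or an unrelated error handled elsewhere
        # The server reports it is lagged: honour the suggested delay and retry.
        time.sleep(float(resp.headers.get("Retry-After", 5)))
    raise RuntimeError("lag stayed above the maxlag threshold; giving up")
```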

Event Timeline

Gehel triaged this task as High priority. Nov 14 2019, 9:33 AM
Gehel updated the task description.

Mentioned in SAL (#wikimedia-operations) [2019-11-14T09:55:32Z] <gehel> depool wdqs (public) eqiad - high lag - T238229

Now that all traffic is going to codfw, lag should come down. It will probably start to rise again as the load on codfw increases.

One thing that seems odd (to an outsider like me who knows very little about the system) is that some servers seem to be performing so much worse than others.

Is there a simple reason for this (e.g. an entire cluster having problems?), or does this suggest there may also be issues with load balancing, i.e. which servers pick up which queries?

This is probably part of the reason:

These clusters are in active/active mode (traffic is sent to both), but due to how we route traffic with GeoDNS, the primary cluster (usually eqiad) sees most of the traffic.

The eqiad public cluster gets most of the query load, so the other clusters have an easier time keeping up with updates. Within that cluster, wdqs1005 was depooled yesterday for (probably) unrelated reasons (T238232); that's probably why it has recovered since then. I don't understand why wdqs1006 isn't as lagged as wdqs1004, though.
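The per-server numbers above come from the dashboards, but one can also ask the public endpoint how fresh the backend answering a given request is; which backend that is depends on the GeoDNS/LVS routing described above. A rough sketch, assuming the commonly used `schema:dateModified` freshness triple (repeated runs may land on different servers):

```
from datetime import datetime, timezone
import requests

ENDPOINT = "https://query.wikidata.org/sparql"
# Freshness triple maintained by the updater for the wiki itself.
QUERY = "SELECT * WHERE { <http://www.wikidata.org> schema:dateModified ?updated }"

def wdqs_lag_seconds():
    """Approximate update lag of whichever WDQS backend answers this request."""
    resp = requests.get(
        ENDPOINT,
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "wdqs-lag-check example"},
    )
    resp.raise_for_status()
    value = resp.json()["results"]["bindings"][0]["updated"]["value"]
    last_update = datetime.fromisoformat(value.replace("Z", "+00:00"))
    return (datetime.now(timezone.utc) - last_update).total_seconds()

print(f"lag: {wdqs_lag_seconds():.0f}s")
```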

Mentioned in SAL (#wikimedia-operations) [2019-11-14T13:35:22Z] <gehel> depool wdqs1004 to allow catching up on lag - T238229

Current situation:

  • wdqs1004 is depooled to see if it helps it catch up on lag
  • wdqs1005 has had its journal reset as part of T238232
  • wdqs1006 has caught up on lag on its own

Currently, the public WDQS endpoint should be exposing fairly low lag. None of the underlying issues have been addressed, so it is likely that the situation will degrade again.

Mentioned in SAL (#wikimedia-operations) [2019-11-14T20:04:25Z] <gehel> reloading data on wdqs1004 from wdqs1007 to catch up on lag faster - T238229

Change 551189 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] wdqs: move wdqs1007 from internal to public cluster

https://gerrit.wikimedia.org/r/551189

Change 551189 merged by Gehel:
[operations/puppet@production] wdqs: move wdqs1007 from internal to public cluster

https://gerrit.wikimedia.org/r/551189

One additional server has been moved to the public cluster to provide more resources. Let's see if it helps.

Mentioned in SAL (#wikimedia-operations) [2019-11-20T16:03:58Z] <gehel> depool wdqs1004 to allow catching up on lag - T238229

Looks like depooling wdqs1004 resulted in a pretty quick drop in lag: 1.5 hours was enough, and the lag hasn't grown back in more than 12 hours.

It's not a pretty solution, but is there any point in regularly rotating each wdqs instance out of the pool? That might keep the lag low enough for the service to remain usable.
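Purely to illustrate the rotation idea suggested above: the helpers below (`depool_host`, `pool_host`, `lag_seconds`) are hypothetical stand-ins for the real pooling tooling and per-host lag metric, and the thresholds are made up.

```
import time

# Hypothetical stand-ins for the real pooling commands and lag metric.
def depool_host(host): raise NotImplementedError
def pool_host(host): raise NotImplementedError
def lag_seconds(host): raise NotImplementedError

WDQS_PUBLIC = ["wdqs1004", "wdqs1005", "wdqs1006", "wdqs1007"]
LAG_THRESHOLD = 10 * 60   # rotate a host out once it lags more than 10 minutes
CAUGHT_UP = 60            # consider it recovered below one minute of lag

def rotate_once():
    """One pass of 'rotate lagging hosts out of the pool': depool at most
    one host, wait for it to catch up on updates, then repool it."""
    for host in WDQS_PUBLIC:
        if lag_seconds(host) < LAG_THRESHOLD:
            continue
        depool_host(host)
        try:
            while lag_seconds(host) > CAUGHT_UP:
                time.sleep(60)
        finally:
            pool_host(host)
        return host  # never take more than one host out at a time
    return None
```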

I looked into some of the requests and made a quick poke at the agent filtering.
I didn't tag the task, so here is the link: https://gerrit.wikimedia.org/r/#/c/wikidata/query/deploy/+/552236/

Change 552277 had a related patch set uploaded (by Gehel; owner: Gehel):
[wikidata/query/rdf@master] Start dropping requests when load is too high.

https://gerrit.wikimedia.org/r/552277
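The patch itself lives in the Java query service; as a language-agnostic sketch of the load-shedding pattern it introduces (the thresholds and metrics below are illustrative, not the values from Gerrit 552277): reject queries up front with a 503 when the machine is already overloaded, so that update processing keeps getting CPU time.

```
import os
import threading
from http import HTTPStatus

MAX_LOAD_PER_CPU = 1.5   # illustrative threshold, not the deployed value
MAX_INFLIGHT = 64        # illustrative cap on concurrent queries
_inflight = 0
_lock = threading.Lock()

def should_drop_request():
    """True when a new query should be rejected instead of queued."""
    load1, _, _ = os.getloadavg()
    if load1 / os.cpu_count() > MAX_LOAD_PER_CPU:
        return True
    with _lock:
        return _inflight >= MAX_INFLIGHT

def handle_query(run_query):
    global _inflight
    if should_drop_request():
        # Tell clients to back off rather than piling work onto a hot box.
        return HTTPStatus.SERVICE_UNAVAILABLE
    with _lock:
        _inflight += 1
    try:
        return run_query()
    finally:
        with _lock:
            _inflight -= 1
```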

Mentioned in SAL (#wikimedia-operations) [2019-11-22T09:27:39Z] <gehel> depool wdqs1007 to allow to catch up on lag - T238229

T221774 was done as part 1 of adding query service lag to maxlag (already deployed).
T238751 continues the work in this area, meaning the lag of the most lagged pooled server will be used instead of the current median.
We should be able to get that deployed next week.
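To make the median-versus-max change concrete, a toy example with made-up per-server lag numbers (the real aggregation happens in the lag-reporting code, not here):

```
from statistics import median

# Made-up update lag, in seconds, for the pooled public servers.
pooled_lag = {"wdqs1004": 43000, "wdqs1005": 120, "wdqs1006": 300, "wdqs1007": 90}

print(median(pooled_lag.values()))  # 210.0  — old behaviour: one badly lagged
                                    # server barely moves the reported value
print(max(pooled_lag.values()))     # 43000  — T238751: the worst pooled server
                                    # decides, so edits slow down while any
                                    # pooled server is far behind
```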

Mentioned in SAL (#wikimedia-operations) [2019-11-23T18:19:57Z] <gehel> repool wdqs1007, caught up on lag - T238229

Change 552277 merged by jenkins-bot:
[wikidata/query/rdf@master] Start dropping requests when load is too high.

https://gerrit.wikimedia.org/r/552277

I filed T240540 as a follow-up to all of this lag, to look into what crazy things people are using the query service for.

Addshore claimed this task.

(attached screenshot: image.png)

The lag from the weeks covered in this ticket is now more or less gone.
I'll close this; everything else can be handled in the tickets that were created as a result of this one.