
Depool wdqs1007
Closed, ResolvedPublic

Assigned To: Lucas_Werkmeister_WMDE
Authored By: Lucas_Werkmeister_WMDE
Oct 31 2022, 9:45 AM

Description

@RKemper restarted wdqs1007 earlier today (SAL), and it’s currently catching up on almost two days’ worth of update lag (Grafana):

image.png (258×916 px, 37 KB)

However, it looks like the server hasn’t been depooled – I got a query response served from it after a few retries:

$ curl -s -i --data-urlencode 'query=ASK{}' https://query.wikidata.org/sparql | grep -i '^x-served-by:'
x-served-by: wdqs1007
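
For reference, the pooled state can be double-checked with conftool from a cluster management host; a sketch, assuming the standard confctl selector for this hostname (output illustrative):

$ sudo confctl select 'name=wdqs1007.eqiad.wmnet' get
{"wdqs1007.eqiad.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=eqiad,cluster=wdqs,service=wdqs"}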

This means that some users are getting stale query responses; but more importantly, it means that the Wikidata maxlag has shot up (Grafana)

image.png (258×916 px, 21 KB)

and the edit rate has plummeted accordingly (Grafana):
image.png (258×916 px, 68 KB)

Thanks to recent work in T238751 and T315423, maxlag should no longer take servers like this into account once they have been depooled. Please manually depool this server so that the maxlag can (hopefully) recover.
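
For the record, depooling is normally a one-liner, either on the host itself or via conftool; a sketch, assuming the usual tooling (commands illustrative, not verified against the current puppetized setup):

$ sudo depool    # run on wdqs1007 itself
$ sudo confctl select 'name=wdqs1007.eqiad.wmnet' set/pooled=no    # or from a cluster management host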

Event Timeline

(Automatically depooling lagged servers is apparently not implemented yet: T270614)

Marking as UBN, since it’s blocking all bot edits on Wikidata (except for badly behaved bots that ignore maxlag).
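
For context, a well-behaved bot sends the maxlag parameter with its requests and backs off when the API reports lag above the threshold; a sketch with an illustrative threshold of 5 seconds:

$ curl -s 'https://www.wikidata.org/w/api.php?action=query&meta=siteinfo&maxlag=5&format=json'

While the reported lag exceeds 5 seconds, this returns an error with code "maxlag" (plus a Retry-After header), and the client is expected to wait and retry instead of editing.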

Mentioned in SAL (#wikimedia-operations) [2022-10-31T09:52:37Z] <gehel> depooling wdqs1007 while it catches up on lag - T322010

Lucas_Werkmeister_WMDE claimed this task.

Thanks! Looks like it’s working – api.php?maxlag=-1 is now complaining about a different, less lagged server:

{
    "error": {
        "code": "maxlag",
        "info": "Waiting for wdqs1015: 1.8 seconds lagged.",
        "host": "wdqs1015",
        "lag": 1.8,
        "type": "wikibase-queryservice",
        "queryserviceLag": 108,
        "*": "See https://www.wikidata.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at &lt;https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/&gt; for notice of API deprecations and breaking changes."
    },
    "servedby": "mw1376"
}
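
For reference, that response came from a request along these lines: a maxlag of -1 is always exceeded, so the API responds with the maxlag error and names the server it would be waiting for:

$ curl -s 'https://www.wikidata.org/w/api.php?action=query&meta=siteinfo&maxlag=-1&format=json'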

Maxlag and edit rate both recovered on Grafana as well:

image.png (258×916 px, 21 KB)

image.png (258×916 px, 34 KB)

Re-opening; I'll close this once the server has caught up and is repooled.

dcausse lowered the priority of this task from Unbreak Now! to High. (Edited Oct 31 2022, 10:24 AM)
dcausse subscribed.

While fixing T238751, I think the criterion for propagating the max lag changed from the median across all servers to the most lagged server. Unfortunately, I don't think this reflects the reality of the service: because of Blazegraph instabilities, we can still have servers that are lagged for reasons unrelated to the edit throughput.

Blazegraph can still fail (e.g. T242453), which will cause the lag to rise, and we don't have the tooling to automatically depool lagged servers (T270614).

Could we change the max lag back to the median of all pooled servers? I don't think the Search team has the capacity to maintain that level of service, at least not without automatic solutions like T242453 and T270614.
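
To illustrate with made-up per-server lag values (hostnames and numbers purely illustrative, in seconds): with one outlier like wdqs1007, the most-lagged aggregation reports the outlier, while the median stays close to the healthy servers; e.g. with jq:

$ lags='{"wdqs1007": 170000, "wdqs1015": 1.8, "wdqs1004": 1.2}'
$ echo "$lags" | jq '[.[]] | max'
170000
$ echo "$lags" | jq '[.[]] | sort | .[(length/2|floor)]'
1.8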

Mentioned in SAL (#wikimedia-operations) [2022-10-31T12:18:13Z] <gehel> repooling wdqs1007 - catched up on lag - T322010

> Could we change the max lag back to the median of all pooled servers? […]

I made a task for that: T322030