
Depool wdqs1007
Closed, ResolvedPublic

Assigned To: Lucas_Werkmeister_WMDE
Authored By: Lucas_Werkmeister_WMDE
Oct 31 2022, 9:45 AM

Description

@RKemper restarted wdqs1007 earlier today (SAL), and it’s currently catching up on almost two days’ worth of update lag (Grafana):

image.png (258×916 px, 37 KB)

However, it looks like the server hasn’t been depooled – I got a query response served from it after a few retries:

$ curl -s -i --data-urlencode 'query=ASK{}' https://query.wikidata.org/sparql | grep -i '^x-served-by:'
x-served-by: wdqs1007
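
For reference, the pooled state can be double-checked with conftool from a cluster management host; a sketch, assuming the standard confctl selector for this hostname (output illustrative):

$ sudo confctl select 'name=wdqs1007.eqiad.wmnet' get
{"wdqs1007.eqiad.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=eqiad,cluster=wdqs,service=wdqs"}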

This means that some users are getting stale query responses; but more importantly, it means that the Wikidata maxlag has shot up (Grafana)

image.png (258×916 px, 21 KB)

and the edit rate has plummeted accordingly (Grafana):
image.png (258×916 px, 68 KB)

Thanks to recent work in T238751 and T315423, maxlag should no longer take servers like this into account once they have been depooled. Please manually depool this server so that the maxlag can (hopefully) recover.
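
For the record, depooling is normally a one-liner, either on the host itself or via conftool; a sketch, assuming the usual tooling (commands illustrative, not verified against the current puppetized setup):

$ sudo depool    # run on wdqs1007 itself
$ sudo confctl select 'name=wdqs1007.eqiad.wmnet' set/pooled=no    # or from a cluster management host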

Event Timeline

(Automatically depooling lagged servers is apparently not implemented yet: T270614)

Marking as UBN, since it’s blocking all bot edits on Wikidata (except for badly behaved bots that ignore maxlag).
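
For context, a well-behaved bot sends the maxlag parameter with its requests and backs off when the API reports lag above the threshold; a sketch with an illustrative threshold of 5 seconds:

$ curl -s 'https://www.wikidata.org/w/api.php?action=query&meta=siteinfo&maxlag=5&format=json'

While the reported lag exceeds 5 seconds, this returns an error with code "maxlag" (plus a Retry-After header), and the client is expected to wait and retry instead of editing.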

Mentioned in SAL (#wikimedia-operations) [2022-10-31T09:52:37Z] <gehel> depooling wdqs1007 while it catches up on lag - T322010

Lucas_Werkmeister_WMDE claimed this task.

Thanks! Looks like it’s working – api.php?maxlag=-1 is now complaining about a different, less lagged server:

{
    "error": {
        "code": "maxlag",
        "info": "Waiting for wdqs1015: 1.8 seconds lagged.",
        "host": "wdqs1015",
        "lag": 1.8,
        "type": "wikibase-queryservice",
        "queryserviceLag": 108,
        "*": "See https://www.wikidata.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at &lt;https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/&gt; for notice of API deprecations and breaking changes."
    },
    "servedby": "mw1376"
}
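
For reference, that response came from a request along these lines: a maxlag of -1 is always exceeded, so the API responds with the maxlag error and names the server it would be waiting for:

$ curl -s 'https://www.wikidata.org/w/api.php?action=query&meta=siteinfo&maxlag=-1&format=json'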

Maxlag and edit rate both recovered on Grafana as well:

image.png (258×916 px, 21 KB)

image.png (258×916 px, 34 KB)

Re-opening; I'll close this once the server has caught up and is repooled.

dcausse lowered the priority of this task from Unbreak Now! to High. (Edited Oct 31 2022, 10:24 AM)
dcausse subscribed.

While fixing T238751, I think the criterion for propagating the max lag changed from the median across all servers to the most lagged server. Unfortunately, I don't think this reflects the reality of the service: because of Blazegraph instabilities, we can still have servers that are lagged for reasons unrelated to the edit throughput.

Blazegraph can still fail (e.g. T242453), which will cause the lag to rise, and we don't have the tooling to automatically depool lagged servers (T270614).

Could we change the max lag back to the median of all pooled servers? I don't think the Search team has the capacity to maintain that level of service, at least not without automatic solutions like T242453 and T270614.
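
To illustrate with made-up per-server lag values (hostnames and numbers purely illustrative, in seconds): with one outlier like wdqs1007, the most-lagged aggregation reports the outlier, while the median stays close to the healthy servers; e.g. with jq:

$ lags='{"wdqs1007": 170000, "wdqs1015": 1.8, "wdqs1004": 1.2}'
$ echo "$lags" | jq '[.[]] | max'
170000
$ echo "$lags" | jq '[.[]] | sort | .[(length/2|floor)]'
1.8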

Mentioned in SAL (#wikimedia-operations) [2022-10-31T12:18:13Z] <gehel> repooling wdqs1007 - catched up on lag - T322010

> Could we change the max lag back to the median of all pooled servers? […]

I made a task for that: T322030