
Review maps outage happened on Feb 17th 2025
Closed, Resolved, Public

Description

Today I did the following:

  • Pooled two more k8s nodes in eqiad and codfw into kartotherian-k8s-ssl (the LVS service).
  • After a while, I depooled one bare metal node from eqiad (maps1005, since maps1006 was already depooled). Same thing in codfw.

The eqiad cluster became unstable, and SRE got paged for maps.wikimedia.org's high 50X error rate. From a quick check I didn't see anything problematic on postgres, nor excessive load on CPU/memory resources.

On maps1007, I noticed a lot of errors like the following, starting from around 14:27 UTC:

{"name":"kartotherian","hostname":"maps1007","pid":628,"level":50,"err":{"message":"ETIMEDOUT","name":"kartotherian","stack":"HTTPError: ETIMEDOUT\n    at request.then (/srv/deployment/kartotherian/deploy-cache/revs/483e8c3722435327559da0328fd604e00381ff5b/node_modules/preq/index.js:246:19)\n    at tryCatcher (/srv/deployment/kartotherian/deploy-cache/revs/483e8c3722435327559da0328fd604e00381ff5b/node_modules/bluebird/js/release/util.js:16:23)\n    at Promise._settlePromiseFromHandler (/srv/deployment/kartotherian/deploy-cache/revs/483e8c3722435327559da0328fd604e00381ff5b/node_modules/bluebird/js/release/promise.js:547:31)\n    at Promise._settlePromise (/srv/deployment/kartotherian/deploy-cache/revs/483e8c3722435327559da0328fd604e00381ff5b/node_modules/bluebird/js/release/promise.js:604:18)\n    at Promise._settlePromise0 (/srv/deployment/kartotherian/deploy-cache/revs/483e8c3722435327559da0328fd604e00381ff5b/node_modules/bluebird/js/release/promise.js:649:10)\n    at Promise._settlePromises (/srv/deployment/kartotherian/deploy-cache/revs/483e8c3722435327559da0328fd604e00381ff5b/node_modules/bluebird/js/release/promise.js:725:18)\n    at _drainQueueStep (/srv/deployment/kartotherian/deploy-cache/revs/483e8c3722435327559da0328fd604e00381ff5b/node_modules/bluebird/js/release/async.js:93:12)\n    at _drainQueue (/srv/deployment/kartotherian/deploy-cache/revs/483e8c3722435327559da0328fd604e00381ff5b/node_modules/bluebird/js/release/async.js:86:9)\n    at Async._drainQueues (/srv/deployment/kartotherian/deploy-cache/revs/483e8c3722435327559da0328fd604e00381ff5b/node_modules/bluebird/js/release/async.js:102:5)\n    at Immediate.Async.drainQueues [as _onImmediate] (/srv/deployment/kartotherian/deploy-cache/revs/483e8c3722435327559da0328fd604e00381ff5b/node_modules/bluebird/js/release/async.js:15:14)\n    at runCallback (timers.js:705:18)\n    at tryOnImmediate (timers.js:676:5)\n    at processImmediate (timers.js:658:5)\n    at process.topLevelDomainCallback 
(domain.js:126:23)","status":504,"headers":{"content-type":"application/problem+json"},"body":{"type":"internal_http_error","detail":"ETIMEDOUT","internalStack":"Error: ETIMEDOUT\n    at Timeout.setTimeout [as _onTimeout] (/srv/deployment/kartotherian/deploy-cache/revs/483e8c3722435327559da0328fd604e00381ff5b/node_modules/preq/index.js:15:27)\n    at ontimeout (timers.js:436:11)\n    at tryOnTimeout (timers.js:300:5)\n    at listOnTimeout (timers.js:263:5)\n    at Timer.processTimers (timers.js:223:10)","internalURI":"https://en.wikipedia.org/w/api.php","internalQuery":"{\"format\":\"json\",\"formatversion\":\"2\",\"action\":\"query\",\"revids\":\"1271251461\",\"prop\":\"mapdata\",\"mpdlimit\":\"max\",\"mpdgroups\":\"_e6a88e1f1e949f77482ae1341c45109c46b39044\"}","internalErr":"ETIMEDOUT","internalMethod":"get"},"levelPath":"error"},"msg":"ETIMEDOUT","time":"2025-02-17T14:27:06.358Z","v":0}

The error mentions the MediaWiki PHP API, but nothing was ongoing there at the time of the outage, so it is not clear to me why these timeouts appeared.

At first I tried to repool the bare metal nodes, but it didn't help much. A simple local systemctl restart kartotherian didn't work on maps1007: the host kept going back to throwing a lot of ETIMEDOUT errors. I then roll-restarted all the maps1* kartotherians in short batches (one at a time via cumin, with 10 seconds of wait time before the next restart), and after a while the cluster came back to its normal state.

Metrics timeframe for the outage:

For some reason the bare-metal "Performance" panel showed what appears to be a huge increase in load, even if its meaning is not clear to me (percentiles are mentioned, but the scale seems to be that of a counter). If we assume it is latency, something slowed down in the cluster while the outage was ongoing, which fits with the timeouts.
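
If the panel is built on Prometheus-style histogram data, the confusion is plausible: the underlying series are cumulative counters per latency bucket, and a percentile only emerges after interpolating across them. A minimal sketch of that interpolation (the same idea as Prometheus's histogram_quantile(); the bucket values are made up):

```python
def quantile_from_buckets(q, buckets):
    """Estimate the q-th quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by bound.
    Linear interpolation inside the bucket containing the target rank,
    mirroring what Prometheus's histogram_quantile() does.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # Interpolate the rank's position inside this bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Hypothetical request-latency buckets (seconds -> cumulative requests).
buckets = [(0.1, 600), (0.5, 900), (1.0, 1000)]
p50 = quantile_from_buckets(0.5, buckets)   # lands in the first bucket
p99 = quantile_from_buckets(0.99, buckets)  # lands in the last bucket
```

So even though each series grows like a counter, the quantity plotted after this transformation is a latency.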

On the k8s side I didn't see anything weird, even if I have to say that the metrics are still not 100% perfect.

The timeline of the depools is listed in https://sal.toolforge.org/production?p=0&q=maps&d=2025-02-17. For some reason codfw seemed unaffected, while eqiad took the hit.

What was different? In eqiad maps1006 was already depooled due to an old load test (we forgot to repool it), and then I depooled maps1005 roughly 10 minutes before the start of the outage.

Summary: in eqiad, depooling bare metal nodes seems to put huge pressure on the remaining bare metal nodes, which end up with increased latency and timeouts. Repooling and a rolling restart seem to fix it.

@Jgiannelos have you seen something similar before? Any theory?

Event Timeline

A possible scenario is the following:

  • We have mixed servers (k8s and bare metal)
  • The old bare metal environment is allowed to talk to the public MediaWiki API directly (e.g. en.wikipedia.org), whereas the new k8s setup doesn't.
  • When we receive a request for a snapshot with an overlay geoshape (let's call it A), the service makes an HTTP API request to itself (let's call this request B).
  • In the old bare metal setup the request (B) for the "internal" geoshapes is sent to the public kartotherian endpoint.
  • It could be the case that one of the initial requests A was handled by a bare metal server and the follow-up (B) "internal" requests ended up on k8s, which doesn't allow external traffic.
  • This caused timeouts and, because requests were hanging, an increase in load.
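
The last point, hanging requests driving load up, follows directly from Little's law (L = λW): at a fixed arrival rate, the number of in-flight requests grows linearly with how long each one is held open. A back-of-the-envelope sketch with made-up rates:

```python
def in_flight(arrival_rate_rps, avg_duration_s):
    # Little's law: concurrency L = arrival rate (lambda) * time in system (W).
    return arrival_rate_rps * avg_duration_s

# Hypothetical numbers, purely for illustration:
healthy = in_flight(200, 0.05)  # 200 req/s answered in 50 ms -> 10 concurrent
hanging = in_flight(200, 10.0)  # same traffic hanging until a 10 s timeout -> 2000 concurrent
```

A 200x jump in concurrent requests looks exactly like a "huge increase in load" on the serving nodes.
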
Aklapper renamed this task from Review maps outage happened on Feb 17th to Review maps outage happened on Feb 17th 2025.Feb 18 2025, 10:36 AM

Thanks a lot for the brainbounce :)

From logstash I don't see anything logged by the k8s eqiad pods during the outage; if request B had failed, I'd have expected to see some failures (though maybe kartotherian doesn't currently log them for some reason).

In the use case outlined above, request A generated the internal geoshape request B to the public kartotherian endpoint, that is maps.wikimedia.org. At that point, request B can be served by either a pod or a bare metal node, but both should be able to return the right HTTP response. Am I missing something? In any case we should verify in staging (k8s only) whether we can reproduce this case.

I agree that it is something related to the bare metal nodes being left with fewer resources, because for some reason they cannot use what is offered by the k8s pods. I am wondering if we should just depool all the bare metals entirely in codfw (which seems to receive less traffic), after pooling in all 5 k8s pods, and see if any trouble emerges. Maybe the new version can handle the traffic correctly, while the old one needs a specific number of bare metal nodes to be up, otherwise it starts failing.

I am not happy that we don't have a clear root cause yet; trying the above procedure seems a shot in the dark and I don't love working this way, but maybe it is the only way forward.

Lemme know your thoughts @Jgiannelos :)

The errors not showing up on the k8s side could be because there was actually no error: ETIMEDOUT is raised on the client side, while the server side just hangs.
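
This is easy to demonstrate locally: a server that accepts a connection but never answers produces no error on its own side, while the client's timeout fires. A small self-contained sketch of the general mechanism (not kartotherian code):

```python
import socket
import threading
import time

def silent_server(sock):
    # Accept the connection but never send a byte, like a backend that
    # blackholes traffic: any error only materializes on the client side.
    conn, _ = sock.accept()
    time.sleep(2)  # hold the connection open, say nothing
    conn.close()

srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
threading.Thread(target=silent_server, args=(srv,), daemon=True).start()

cli = socket.create_connection(srv.getsockname(), timeout=0.5)
try:
    cli.recv(1)           # client waits for a response that never comes
    timed_out = False
except socket.timeout:    # raised purely on the client side
    timed_out = True
cli.close()
srv.close()
```

The server side logs nothing at all, which would match the empty logstash view for the pods.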

> The errors not showing up on the k8s side could be because there was actually no error: ETIMEDOUT is raised on the client side, while the server side just hangs.

Something doesn't add up though: the HTTP 504 stack trace mentions "internalURI":"https://en.wikipedia.org/w/api.php" for the ETIMEDOUT, not maps.wikimedia.org.

I agree, but the only env that could hang on en.wikipedia.org is k8s; the maps bare metal nodes can talk to that endpoint directly.

I want to run a test via the LVS load balancing config, namely put more weight on the k8s pods/workers (5 in total in each DC) so that more requests will be served there. If my theory is correct, namely that the bare metals can be overwhelmed really easily, we shouldn't see any impact, just more requests served by k8s. Otherwise we'll see a rise in timeouts, and a rollback to the previous step should be easy.
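
For reference, under weighted scheduling a backend's expected share of new connections is its weight divided by the total pooled weight. A quick sketch with hypothetical weights (the real pybal weights may differ):

```python
def traffic_shares(weights):
    # Expected share per backend = weight / total pooled weight.
    total = sum(weights.values())
    return {host: w / total for host, w in weights.items()}

# Hypothetical: 5 k8s workers at weight 20, 4 bare metals at weight 10.
weights = {f"kubernetes100{i}": 20 for i in range(1, 6)}
weights.update({f"maps100{i}": 10 for i in range(1, 5)})
shares = traffic_shares(weights)
k8s_share = sum(s for h, s in shares.items() if h.startswith("kubernetes"))
# 5*20 / (5*20 + 4*10) = 100/140, i.e. roughly 71% of new connections to k8s
```
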

I took a quick look this morning after seeing the discussion in IRC, and it looks like none of the k8s worker nodes that have been added to the kartotherian-k8s-ssl service in conftool actually have the LVS IP configured on their loopback device. In terms of service turnup steps, kartotherian-k8s-ssl was never added to profile::lvs::realserver::pools on these hosts.

Not 100% sure what the routing behavior would be in that case ... Perhaps hairpin back to LVS? Worst case, a loop? (e.g., there was a comment in IRC about seeing TTL exceeded)

In any case, since the desired end-state is to have all k8s workers be aware of the LVS IP, it probably makes sense to add the service to the pools list in hieradata/role/common/kubernetes/worker.yaml.

For debugging purposes, this is a URL that requests a snapshot with an overlay map from en.wikipedia.org:

https://maps.wikimedia.org/img/osm-intl,a,a,a,300x200.png?lang=en&domain=en.wikipedia.org&title=Alabama&revid=1268325471&groups=_1b46af921bb4e1f090a1b07748a50bd2e1f322fc

From my local env, when I try this request, the following shows up in the trace logs, so indeed it should eventually request en.wikipedia.org:

[2025-02-19T05:00:20.053Z] TRACE: kartotherian/20 on 48bfdbb6c162: Outgoing request (request_id=6b03dd00-ee7e-11ef-b32e-cf29ca820246, levelPath=trace/req)
    out_request: {
      "method": "post",
      "uri": "https://en.wikipedia.org/w/api.php",
      "headers": {
        "user-agent": "kartotherian",
        "x-request-id": "6b03dd00-ee7e-11ef-b32e-cf29ca820246"
      },
      "body": {
        "format": "json",
        "formatversion": "2",
        "action": "query",
        "revids": "1268325471",
        "prop": "mapdata",
        "mpdlimit": "max",
        "mpdgroups": [
          "_1b46af921bb4e1f090a1b07748a50bd2e1f322fc"
        ]
      }
    }
    --
    request: {
      "url": "/img/osm-intl,a,a,a,300x200.png?lang=en&domain=en.wikipedia.org&title=Alabama&revid=1268325471&groups=_1b46af921bb4e1f090a1b07748a50bd2e1f322fc",
      "headers": {
        "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:135.0) Gecko/20100101 Firefox/135.0",
        "x-request-id": "6b03dd00-ee7e-11ef-b32e-cf29ca820246"
      },
      "method": "GET",
      "params": {
        "0": "/img/osm-intl,a,a,a,300x200.png"
      },
      "query": {
        "lang": "en",
        "domain": "en.wikipedia.org",
        "title": "Alabama",
        "revid": "1268325471",
        "groups": [
          "_1b46af921bb4e1f090a1b07748a50bd2e1f322fc"
        ]
      },
      "remoteAddress": "192.168.65.1",
      "remotePort": 48165
    }

Change #1120893 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::kubernetes::worker: add kartotherian-k8s-ssl to the lvs pools

https://gerrit.wikimedia.org/r/1120893

Change #1120893 merged by Elukey:

[operations/puppet@production] role::kubernetes::worker: add kartotherian-k8s-ssl to the lvs pools

https://gerrit.wikimedia.org/r/1120893

> I took a quick look this morning after seeing the discussion in IRC, and it looks like none of the k8s worker nodes that have been added to the kartotherian-k8s-ssl service in conftool actually have the LVS IP configured on their loopback device. In terms of service turnup steps, kartotherian-k8s-ssl was never added to profile::lvs::realserver::pools on these hosts.
>
> Not 100% sure what the routing behavior would be in that case ... Perhaps hairpin back to LVS? Worst case, a loop? (e.g., there was a comment in IRC about seeing TTL exceeded)
>
> In any case, since the desired end-state is to have all k8s workers be aware of the LVS IP, it probably makes sense to add the service to the pools list in hieradata/role/common/kubernetes/worker.yaml.

This is almost certainly the root cause of the problem. We were discussing on IRC why the Prometheus network probes sometimes failed, and it was due to the k8s workers not being able to accept traffic from the load balancer. When I removed bare metals from the LB I put more pressure on the remaining nodes, which almost certainly hit timeouts due to the k8s workers not being reachable.
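
As a rough model of why the depools amplified the problem, assume (hypothetically) that every new connection scheduled to a pooled-but-unreachable k8s worker is simply lost: the failure probability is then the broken share of the total pooled weight, and it grows as healthy bare metals are removed.

```python
def new_connection_failure_rate(healthy_weight, broken_weight):
    # A new connection scheduled to a pooled-but-unreachable backend is lost,
    # so the failure probability is the broken fraction of the total weight.
    return broken_weight / (healthy_weight + broken_weight)

# Hypothetical equal weights: 5 broken k8s workers pooled the whole time.
before = new_connection_failure_rate(healthy_weight=6, broken_weight=5)  # 6 bare metals up
after = new_connection_failure_rate(healthy_weight=4, broken_weight=5)   # 2 depooled
```

Each depool shifts more of the schedule onto the blackholed workers, which matches the cluster getting worse right after the depools.
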

@Jgiannelos see the above; my impression is that it was me causing the issue and not the code itself. In the example that you posted there is indeed a call to en.wikipedia.org, but I thought we had some rewrite-to-localhost proxy config in place for that use case (I am pretty sure we tested it in staging).

I just depooled two bare metal nodes from each DC (like I did on Monday) and everything seems stable, so I think we got the root cause :)

Last step before closing: write the incident report.

Change #1121309 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] services: bump allocated memory for Kartotherian

https://gerrit.wikimedia.org/r/1121309

Change #1121309 merged by Elukey:

[operations/deployment-charts@master] services: bump allocated memory for Kartotherian

https://gerrit.wikimedia.org/r/1121309

Change #1121315 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] services: update cpu resources for kartotherian's mesh/statsd containers

https://gerrit.wikimedia.org/r/1121315

Change #1121315 merged by Elukey:

[operations/deployment-charts@master] services: update cpu resources for kartotherian's mesh/statsd containers

https://gerrit.wikimedia.org/r/1121315

I started reviewing the metrics related to Monday's outage, but I didn't find what I expected. Let's start with the timings:

https://sal.toolforge.org/production?p=1&q=&d=2025-02-17

From 12:49 UTC to 14:09 UTC the eqiad LVS load balancer contained backends (the k8s workers) without kartotherian's loopback IP configured, so a connection to kartotherian.discovery.wmnet:6543 sometimes ended up in a connection failure (due to looping between LVS and the wikikube workers, which didn't accept the IP packets for the kartotherian SVC IP). In theory I'd have expected to see some impact registered on the CDN side as 502/503s for maps.wikimedia.org, but nothing. I am very happy that no harm was caused to external users during that timeframe, but I don't understand why :D

I rechecked the IRC logs to see if I had somehow missed any notification about wikikube workers behind the kartotherian LVS not passing health checks, but nothing was registered before the 14:15 mark (after that, the maps bare metals started to fail health checks).

Edit: mystery solved after talking with Valentin. ATS heavily reuses connections to the backends, as one can see in this graph. When I pooled the k8s workers they passed pybal's health checks, but ATS failed to connect to them, so it kept relying on its existing connections to the "healthy" bare metals. This reduced the impact by a lot :)
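
A rough model of that mitigation, with made-up numbers: only requests that need a brand-new backend connection can be scheduled onto a broken backend, so heavy connection reuse shrinks the effective error rate dramatically.

```python
def effective_error_rate(broken_fraction, reuse_fraction):
    # Requests riding an existing keep-alive connection stay on a backend
    # already known to work; only new connections can hit a broken backend.
    return broken_fraction * (1 - reuse_fraction)

# Hypothetical: 5 of 11 pooled backends unreachable, and the proxy reuses
# an existing connection for 99% of requests.
naive = effective_error_rate(5 / 11, 0.0)        # no reuse: ~45% of requests fail
with_reuse = effective_error_rate(5 / 11, 0.99)  # heavy reuse: well under 1%
```

That would explain why the CDN barely registered any 502/503s despite almost half the pooled backends being unreachable.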