Today I did the following:
- Pooled two more k8s nodes in eqiad and codfw into kartotherian-k8s-ssl (LVS service).
- After a while, depooled one bare-metal node from eqiad (maps1005, since maps1006 was already depooled), and did the same in codfw.
The eqiad cluster then became unstable, and SRE was paged for a high rate of 50X errors on maps.wikimedia.org. A quick check showed nothing problematic on Postgres and no excessive CPU/memory load.
On maps1007, I noticed a lot of errors like the following, starting around 14:27 UTC:
{"name":"kartotherian","hostname":"maps1007","pid":628,"level":50,"err":{"message":"ETIMEDOUT","name":"kartotherian","stack":"HTTPError: ETIMEDOUT\n at request.then (/srv/deployment/kartotherian/deploy-cache/revs/483e8c3722435327559da0328fd604e00381ff5b/node_modules/preq/index.js:246:19)\n at tryCatcher (/srv/deployment/kartotherian/deploy-cache/revs/483e8c3722435327559da0328fd604e00381ff5b/node_modules/bluebird/js/release/util.js:16:23)\n at Promise._settlePromiseFromHandler (/srv/deployment/kartotherian/deploy-cache/revs/483e8c3722435327559da0328fd604e00381ff5b/node_modules/bluebird/js/release/promise.js:547:31)\n at Promise._settlePromise (/srv/deployment/kartotherian/deploy-cache/revs/483e8c3722435327559da0328fd604e00381ff5b/node_modules/bluebird/js/release/promise.js:604:18)\n at Promise._settlePromise0 (/srv/deployment/kartotherian/deploy-cache/revs/483e8c3722435327559da0328fd604e00381ff5b/node_modules/bluebird/js/release/promise.js:649:10)\n at Promise._settlePromises (/srv/deployment/kartotherian/deploy-cache/revs/483e8c3722435327559da0328fd604e00381ff5b/node_modules/bluebird/js/release/promise.js:725:18)\n at _drainQueueStep (/srv/deployment/kartotherian/deploy-cache/revs/483e8c3722435327559da0328fd604e00381ff5b/node_modules/bluebird/js/release/async.js:93:12)\n at _drainQueue (/srv/deployment/kartotherian/deploy-cache/revs/483e8c3722435327559da0328fd604e00381ff5b/node_modules/bluebird/js/release/async.js:86:9)\n at Async._drainQueues (/srv/deployment/kartotherian/deploy-cache/revs/483e8c3722435327559da0328fd604e00381ff5b/node_modules/bluebird/js/release/async.js:102:5)\n at Immediate.Async.drainQueues [as _onImmediate] (/srv/deployment/kartotherian/deploy-cache/revs/483e8c3722435327559da0328fd604e00381ff5b/node_modules/bluebird/js/release/async.js:15:14)\n at runCallback (timers.js:705:18)\n at tryOnImmediate (timers.js:676:5)\n at processImmediate (timers.js:658:5)\n at process.topLevelDomainCallback 
(domain.js:126:23)","status":504,"headers":{"content-type":"application/problem+json"},"body":{"type":"internal_http_error","detail":"ETIMEDOUT","internalStack":"Error: ETIMEDOUT\n at Timeout.setTimeout [as _onTimeout] (/srv/deployment/kartotherian/deploy-cache/revs/483e8c3722435327559da0328fd604e00381ff5b/node_modules/preq/index.js:15:27)\n at ontimeout (timers.js:436:11)\n at tryOnTimeout (timers.js:300:5)\n at listOnTimeout (timers.js:263:5)\n at Timer.processTimers (timers.js:223:10)","internalURI":"https://en.wikipedia.org/w/api.php","internalQuery":"{\"format\":\"json\",\"formatversion\":\"2\",\"action\":\"query\",\"revids\":\"1271251461\",\"prop\":\"mapdata\",\"mpdlimit\":\"max\",\"mpdgroups\":\"_e6a88e1f1e949f77482ae1341c45109c46b39044\"}","internalErr":"ETIMEDOUT","internalMethod":"get"},"levelPath":"error"},"msg":"ETIMEDOUT","time":"2025-02-17T14:27:06.358Z","v":0}

The error mentions the MediaWiki PHP API, but nothing relevant was ongoing at the time of the outage, so it is not clear to me why it appears here.
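These bunyan records are dense; when triaging, a tiny filter like the following (an illustrative sketch, not part of our tooling) pulls out the fields that actually matter, showing that the 504 came from an upstream call to the MediaWiki API rather than from kartotherian itself:

```python
import json

def summarize(line: str) -> dict:
    """Extract the interesting fields from a kartotherian bunyan log line."""
    rec = json.loads(line)
    err = rec.get("err", {})
    body = err.get("body", {})
    return {
        "time": rec.get("time"),
        "host": rec.get("hostname"),
        "status": err.get("status"),
        "upstream": body.get("internalURI"),
        "error": body.get("internalErr"),
    }

# Trimmed-down record shaped like the one pasted above:
line = json.dumps({
    "hostname": "maps1007",
    "time": "2025-02-17T14:27:06.358Z",
    "err": {
        "status": 504,
        "body": {
            "internalURI": "https://en.wikipedia.org/w/api.php",
            "internalErr": "ETIMEDOUT",
        },
    },
})
print(summarize(line))
```

Running this over the full journal output (one JSON object per line) makes it easy to count timeouts per upstream and per host.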
At first I tried to repool the bare-metal nodes, but it didn't help much. A simple local systemctl restart kartotherian on maps1007 didn't work either; the host quickly went back to throwing a lot of ETIMEDOUT errors. I then roll-restarted kartotherian on all the maps1* hosts in short batches (one at a time via cumin, with a 10-second wait before the next restart), and after a while the cluster came back to its normal state.
Metrics timeframe for the outage:
For some reason the bare-metal "Performance" panel showed what appears to be a huge increase in load, although its meaning is not clear to me (percentiles are mentioned, but the scale looks like that of a counter). If we read it as latency, something in the cluster slowed down while the outage was ongoing, which would fit with the timeouts.
On the k8s side I didn't see anything weird, though I have to say those metrics are still not 100% reliable.
The timeline of the depools is listed in https://sal.toolforge.org/production?p=0&q=maps&d=2025-02-17. For some reason codfw seemed unaffected, while eqiad took the hit.
What was different? In eqiad, maps1006 was already depooled from an old load test (we forgot to repool it), and I depooled maps1005 just a few minutes (roughly 10) before the start of the outage.
Summary: in eqiad, depooling bare-metal nodes seems to put heavy pressure on the remaining bare-metal nodes, which end up with increased latency and timeouts. Repooling plus a roll restart seems to fix it.
@Jgiannelos have you seen something similar before? Any theory?