
`api-backend` regularly crashing due to OOM
Closed, Resolved (Public)

Description

For example, the last state of one of the pods:

Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    0
  Started:      Wed, 03 Jan 2024 09:15:13 +0000
  Finished:     Wed, 03 Jan 2024 09:17:09 +0000

As we can see from the restart counts, this is happening regularly:

$ k get pods | grep api-app-backend 
api-app-backend-5686fff447-grv26                 1/1     Running     54 (11m ago)    12d
api-app-backend-5686fff447-rskck                 1/1     Running     61 (10m ago)    12d
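
To make the restart counts easier to watch over time, here is a minimal sketch of a filter for the output above. `restart_report` is a hypothetical helper name, not something that exists in our tooling, and it assumes the default `kubectl get pods` column layout (NAME, READY, STATUS, RESTARTS, AGE). It only parses text from stdin, so it does not talk to the cluster itself.

```shell
#!/bin/sh
# Hypothetical helper (not part of any existing tooling).
# Reads `kubectl get pods` style output on stdin and prints "<pod> <restarts>"
# for every pod whose restart count is at or above the given threshold.
# Assumes the default column order: NAME READY STATUS RESTARTS AGE.
# The RESTARTS column may be followed by "(11m ago)"; only the number is used.
restart_report() {
    threshold="${1:-1}"   # default: report any pod with at least one restart
    awk -v min="$threshold" '
        $1 == "NAME" { next }                      # skip the header row if present
        NF >= 4 && $4 + 0 >= min { print $1, $4 + 0 }
    '
}
```

Usage would look something like `kubectl get pods | grep api-app-backend | restart_report 10`, which lists only the pods that have restarted ten or more times.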

As a result, the hosted wikis keep going offline and coming back online.

Event Timeline

Looking at the graph (image.png), there was a sudden increase in CPU and memory consumption around 09:00 UTC on the 2nd (yesterday). This doesn't line up perfectly with when we started working more aggressively on T353624, but it is suspiciously close.

@Fring and I talked and decided that the quickest way to get feedback is to reduce the number of adhoc-jobs that run from 4 to 1. This was done in https://github.com/wmde/wbaas-deploy/pull/1340

This seems to have resolved the issue. We can see memory has been stable for a while now and there have been no restarts for hours.

Tarrow claimed this task.