`api-backend` regularly crashing due to OOM
Closed, ResolvedPublic

Assigned To

Authored By

	Tarrow
	Jan 3 2024, 10:00 AM

Description

For example, the last state of one container:

Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    0
  Started:      Wed, 03 Jan 2024 09:15:13 +0000
  Finished:     Wed, 03 Jan 2024 09:17:09 +0000
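To put a number on how short-lived that container was, the two timestamps above can be subtracted. This is only a sketch: the helper name `container_lifetime_seconds` is made up for illustration, and it assumes the timestamps keep the RFC 2822 style shown in the output.

```python
from email.utils import parsedate_to_datetime


def container_lifetime_seconds(started: str, finished: str) -> float:
    """Return the seconds between two RFC 2822 style timestamps."""
    # parsedate_to_datetime handles strings such as
    # "Wed, 03 Jan 2024 09:15:13 +0000" and returns an aware datetime.
    delta = parsedate_to_datetime(finished) - parsedate_to_datetime(started)
    return delta.total_seconds()


if __name__ == "__main__":
    # Values copied from the "Last State" output in the description.
    seconds = container_lifetime_seconds(
        "Wed, 03 Jan 2024 09:15:13 +0000",
        "Wed, 03 Jan 2024 09:17:09 +0000",
    )
    print(f"container ran for {seconds:.0f}s before being terminated")
```

For the example above this comes to 116 seconds, so the container ran for just under two minutes before it was killed.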

The restart counts show that this is happening regularly:

$ k get pods | grep api-app-backend 
api-app-backend-5686fff447-grv26                 1/1     Running     54 (11m ago)    12d
api-app-backend-5686fff447-rskck                 1/1     Running     61 (10m ago)    12d
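The same check could be scripted rather than eyeballed. The following is a minimal sketch that parses a pod listing in the column layout shown above (name, ready, status, restarts with an optional "(... ago)" note, age) and flags pods above a restart threshold. The function names and the threshold of 10 are assumptions for illustration, not something used in this task.

```python
import re

# Sample listing copied from the task description.
SAMPLE = """\
api-app-backend-5686fff447-grv26                 1/1     Running     54 (11m ago)    12d
api-app-backend-5686fff447-rskck                 1/1     Running     61 (10m ago)    12d
"""

# Columns: NAME READY STATUS RESTARTS [(last restart)] AGE
ROW = re.compile(
    r"^(?P<name>\S+)\s+(?P<ready>\d+/\d+)\s+(?P<status>\S+)\s+"
    r"(?P<restarts>\d+)(?:\s+\((?P<last>[^)]*)\))?\s+(?P<age>\S+)$"
)


def parse_pods(text):
    """Parse pod rows; lines that do not look like a pod row are skipped."""
    pods = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            pods.append({
                "name": match["name"],
                "restarts": int(match["restarts"]),
                "last_restart": match["last"],  # None if no "(... ago)" note
                "age": match["age"],
            })
    return pods


def frequently_restarting(pods, threshold=10):
    """Return names of pods whose restart count exceeds the (assumed) threshold."""
    return [pod["name"] for pod in pods if pod["restarts"] > threshold]


if __name__ == "__main__":
    for name in frequently_restarting(parse_pods(SAMPLE)):
        print(name)
```

With the listing above, both pods are flagged, since 54 and 61 restarts are well over the illustrative threshold.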

As a result, the hosted wikis keep going offline and coming back online.

Related Objects

Mentioned In: T356259: [Timebox 8h] - Investigate if the backend api appears unable to cope with production load to get wikiWithDomain
T354282: Alert on regularly restarting containers
Mentioned Here: T353624: 🟡 Rebuild Queryservice data for all production wikis

Event Timeline

Tarrow created this task. Jan 3 2024, 10:00 AM

Restricted Application added a subscriber: Aklapper. Jan 3 2024, 10:00 AM

Looking at the attached image (F41649546), there was a sudden increase in CPU and memory consumption around 9am UTC on the 2nd (yesterday). This doesn't correspond perfectly with when we started working more aggressively on T353624, but it is suspiciously close.

@Fring and I talked and decided that the quickest way to get feedback is to reduce the number of adhoc-jobs that run from 4 to 1. This was done in https://github.com/wmde/wbaas-deploy/pull/1340

Tarrow mentioned this in T354282: Alert on regularly restarting containers. Jan 3 2024, 2:42 PM

This seems to have resolved the issue. We can see memory has been stable for a while now and there have been no restarts for hours.

Tarrow mentioned this in T356259: [Timebox 8h] - Investigate if the backend api appears unable to cope with production load to get wikiWithDomain. Jan 31 2024, 10:30 AM

Tarrow closed this task as Resolved. Feb 7 2024, 12:06 PM

Tarrow claimed this task.

	F41649546: image.png
	Jan 3 2024, 11:21 AM
