
[harbor] Redis using all available memory
Closed, Resolved · Public · BUG REPORT

Assigned To: dcaro
Authored By: fnegri, Jan 2 2024, 10:40 AM
Referenced Files
F41709936: image.png (Jan 23 2024, 10:53 AM)
F41651533: image.png (Jan 5 2024, 1:36 PM)
F41650292: image.png (Jan 4 2024, 8:20 AM)
F41649531: Screenshot 2024-01-03 at 11.31.50.png (Jan 3 2024, 10:32 AM)
F41648690: Screenshot 2024-01-02 at 12.49.37.png (Jan 2 2024, 11:50 AM)
F41648666: image.png (Jan 2 2024, 11:18 AM)

Description

tools-harbor-1 is crashing often because Redis takes up all the available memory.

The alert "Project tools instance tools-harbor-1 is down" has fired multiple times in the past week.

top shows redis-server using 93% of RAM, the load average is very high, and kswapd is using 18% CPU, even though free shows no swap configured at all.

top - 10:25:23 up 11 min,  2 users,  load average: 51.97, 42.77, 23.21
Tasks: 189 total,   2 running, 187 sleeping,   0 stopped,   0 zombie
%Cpu(s):  2.8 us,  6.8 sy,  0.0 ni,  0.4 id, 89.9 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  16010.4 total,    150.9 free,  15808.9 used,     50.7 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.     10.4 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
    587 root      20   0 1781280  18956      0 S  24.8   0.1   0:25.26 contain+
     93 root      20   0       0      0      0 R  18.4   0.0   4:28.16 kswapd0
    580 prometh+  20   0 1753576  35632      0 D   6.2   0.2   0:39.25 prometh+
   3759 10000     20   0 1372824  72652      0 S   5.8   0.4   1:59.40 harbor_+
   1512 10000     20   0  730296   9732      0 S   5.6   0.1   0:22.18 harbor_+
   4903 10000     20   0 1674324 359736      0 D   5.6   2.2   1:57.93 harbor_+
    637 root      20   0 3310224  65940      0 S   4.7   0.4   0:36.64 dockerd
   5436 root      20   0   17364   3856    504 D   1.5   0.0   0:03.17 htop
   1522 10000     20   0  729856   5888      0 S   1.3   0.0   0:07.76 registr+
   2341 root      20   0 1303092   8796      0 S   1.1   0.1   0:06.91 prometh+
      1 root      20   0  164376   2984      0 D   0.9   0.0   0:06.78 systemd
    280 root      20   0   64764   1076      4 D   0.9   0.0   0:05.66 systemd+
   4872 root      20   0 1452068   4528      0 S   0.9   0.0   0:06.84 contain+
   1760 root      20   0 1453220   5452      0 S   0.8   0.0   0:01.11 contain+
   1835 999       20   0   15.7g  14.6g      0 S   0.8  93.4   1:41.24 redis-s+
   5468 root      20   0   13032   4100    600 D   0.8   0.0   0:01.90 wmf-aut+
   5484 root      20   0   17616    368     28 D   0.8   0.0   0:00.24 exim4

root@tools-harbor-1:~# free
               total        used        free      shared  buff/cache   available
Mem:        16394604    16186616      155984        1164       52004       11976
Swap:              0           0           0

Pasting a quick link to the free memory Grafana chart: https://grafana.wmcloud.org/goto/bBFK8SKIk?orgId=1

Event Timeline

fnegri triaged this task as High priority. Jan 2 2024, 10:41 AM

Mentioned in SAL (#wikimedia-cloud) [2024-01-02T10:42:51Z] <dcaro> flushed the redis db on tools-harbor-1 (T354176)

root@tools-harbor-1:~# journalctl -n 10000 |grep Killed
Jan 02 07:57:18 tools-harbor-1 kernel: Out of memory: Killed process 244209 (redis-server) total-vm:16464084kB, anon-rss:15318476kB, file-rss:0kB, shmem-rss:0kB, UID:999 pgtables:30096kB oom_score_adj:0
Jan 02 08:56:18 tools-harbor-1 kernel: Out of memory: Killed process 262071 (redis-server) total-vm:16464484kB, anon-rss:15335456kB, file-rss:0kB, shmem-rss:0kB, UID:999 pgtables:30128kB oom_score_adj:0
Jan 02 08:56:18 tools-harbor-1 kernel: Out of memory: Killed process 262083 (redis-server) total-vm:16540244kB, anon-rss:15335172kB, file-rss:0kB, shmem-rss:0kB, UID:999 pgtables:30140kB oom_score_adj:0
Jan 02 08:56:18 tools-harbor-1 kernel: Out of memory: Killed process 262086 (redis-server) total-vm:16540260kB, anon-rss:15335396kB, file-rss:0kB, shmem-rss:0kB, UID:999 pgtables:30140kB oom_score_adj:0
Jan 02 08:56:18 tools-harbor-1 kernel: Out of memory: Killed process 256026 (redis-server) total-vm:16539872kB, anon-rss:15335064kB, file-rss:0kB, shmem-rss:0kB, UID:999 pgtables:30144kB oom_score_adj:0
Jan 02 10:13:26 tools-harbor-1 kernel: Out of memory: Killed process 262197 (redis-server) total-vm:16464076kB, anon-rss:15347316kB, file-rss:0kB, shmem-rss:0kB, UID:999 pgtables:30156kB oom_score_adj:0
dcaro changed the task status from Open to In Progress. Jan 2 2024, 10:44 AM
dcaro claimed this task.
dcaro added a project: User-dcaro.
dcaro moved this task from To refine to Doing on the User-dcaro board.

I stopped Harbor while the VM was starting:

root@tools-harbor-1:/srv/ops/harbor# docker-compose down

Then started up redis only:

root@tools-harbor-1:/srv/ops/harbor# docker-compose up -d redis
Creating network "harbor_harbor" with the default driver
Creating harbor-log ... done
Creating redis      ... done

And started a shell to connect to the db while it was loading the keys on startup:

root@tools-harbor-1:/srv/ops/harbor# docker exec -ti redis bash
redis [ ~ ]$ redis-cli
127.0.0.1:6379> info
...
# Keyspace
db0:keys=86724,expires=84749,avg_ttl=0
db1:keys=65,expires=65,avg_ttl=0
db2:keys=282164,expires=276869,avg_ttl=0

db2 is the one getting out of hand:

# Keyspace
db0:keys=86724,expires=84749,avg_ttl=0
db1:keys=65,expires=65,avg_ttl=0
db2:keys=555330,expires=544869,avg_ttl=0
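
For reference, a quick way to see which keys dominate db2 before flushing would be redis-cli's built-in scanning helpers (a sketch from inside the redis container, same as the session above; --bigkeys samples via SCAN, so the result is approximate):

redis [ ~ ]$ redis-cli -n 2 --bigkeys                    # sample the biggest keys in db2, grouped by type
redis [ ~ ]$ redis-cli -n 2 --scan --pattern '*' | head  # peek at key names to spot the dominant pattern
redis [ ~ ]$ redis-cli -n 2 dbsize                       # total number of keys in db2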

So I truncated all the dbs (we lose the cleanup run logs):

127.0.0.1:6379> FLUSHALL async

And that freed a lot of memory:

root@tools-harbor-1:/srv/ops/harbor# free -h
               total        used        free      shared  buff/cache   available
Mem:            15Gi       865Mi        13Gi       1.0Mi       1.6Gi        14Gi
Swap:             0B          0B          0B

We might want to reconsider T344433: [harbor] See if we can replace the per-project cleanup policy with a harbor-wide one

Some docs about memory management: https://redis.io/docs/management/optimization/memory-optimization/

> If maxmemory is not set Redis will keep allocating memory as it sees fit and thus it can (gradually) eat up all your free memory. Therefore it is generally advisable to configure some limits. You may also want to set maxmemory-policy to noeviction (which is not the default value in some older versions of Redis).
> It makes Redis return an out-of-memory error for write commands if and when it reaches the limit - which in turn may result in errors in the application but will not render the whole machine dead because of memory starvation.
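
For reference, a minimal redis.conf sketch of those two settings (the 12gb figure is only an illustrative value for a 16 GB VM, not a recommendation, and how Harbor's redis container would be made to load such a file is not covered here):

# redis.conf (sketch)
maxmemory 12gb              # hard cap on Redis memory usage
maxmemory-policy noeviction # return errors on writes instead of evicting keys once the cap is hit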

Mentioned in SAL (#wikimedia-cloud) [2024-01-02T11:06:06Z] <dcaro> restart toolsdb database to flush connections (T354176)

Added a 1h log deletion policy to harbor:

image.png (419×884 px, 30 KB)

dcaro changed the task status from In Progress to Stalled. Jan 3 2024, 10:25 AM

There's still a downward trend that is likely to result in OOM after a few days :/

Screenshot 2024-01-03 at 11.31.50.png (1×2 px, 242 KB)

Looking back 90 days, the rate at which memory was being used was ~0.089 GB/day; currently it's around 0.065 GB/day, which will take ~198 days (a bit more than 6 months) to exhaust the available memory.

Let's wait a few days and see if it stabilizes, though I think it should have done so already (I set the cleanup to keep only 1h of logs).

> Looking back 90 days, the rate at which memory was being used was ~0.089 GB/day; currently it's around 0.065 GB/day, which will take ~198 days (a bit more than 6 months) to exhaust the available memory.
>
> Let's wait a few days and see if it stabilizes, though I think it should have done so already (I set the cleanup to keep only 1h of logs).

Wait, that makes no sense, the current rate is per-hour xd

That is, 0.065 GB/hour, or 1.56 GB/day, which makes it a bit more than 8 days to exhaust the current memory. We can wait a few days, but yes, let's keep an eye on it.
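
For completeness, the corrected back-of-the-envelope numbers (using the ~13 GB shown free right after the flush; all figures are rough):

0.065 GB/hour × 24 hours/day ≈ 1.56 GB/day
13 GB free ÷ 1.56 GB/day ≈ 8.3 days until the next OOM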

There has been a substantial increase in the number of pulls on the Harbor side: https://grafana.wmcloud.org/d/m9V1RQs4k/harbor-overview?orgId=1&from=now-30d&to=now

But I don't think that alone should have been enough to cause such an increase in memory usage.

Memory usage has started flattening out \o/, crossing fingers:

image.png (899×1 px, 58 KB)

Still going down slowly, but now it also goes up from time to time:

image.png (1×3 px, 108 KB)

taavi added a subscriber: Andrew.
taavi subscribed.

Is there a reason we could not configure the maximum memory limit for Redis?

> Is there a reason we could not configure the maximum memory limit for Redis?

We probably could. We have not done it because it would not improve the situation much, and it would mean complicating the configuration (even if just a bit; it would have to be investigated).
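
For scale, the runtime change itself would only be a couple of commands; a sketch, assuming the limit is set interactively inside the redis container (the 12gb figure is just an illustrative value for this 16 GB VM, and a CONFIG SET does not survive recreating the container unless the setting is also persisted in the config):

root@tools-harbor-1:/srv/ops/harbor# docker exec -ti redis redis-cli
127.0.0.1:6379> CONFIG SET maxmemory 12gb
OK
127.0.0.1:6379> CONFIG SET maxmemory-policy noeviction
OK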

Today the minimum free memory hit 12.1 GB; it still seems to be going down very slowly, will keep looking.

> Is there a reason we could not configure the maximum memory limit for Redis?

According to the docs, it will lead to a Redis out-of-memory error instead of a kernel OOM kill, which would likely cause Harbor to crash anyway. I'm not sure that type of Harbor crash would trigger an alert, while the OOM kill does, so that's one reason I'm not sure setting the limit in the Redis config is a net gain.
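
Either way, one option on the alerting side would be to warn on low available memory before either OOM path is hit; a purely hypothetical Prometheus rule sketch, assuming node_exporter metrics are scraped for this instance (the alert name, threshold, and instance selector are made up for illustration and are not an existing rule):

groups:
  - name: harbor-memory                         # hypothetical group name
    rules:
      - alert: HarborLowAvailableMemory         # hypothetical alert name
        # fires when less than 10% of RAM has been available for 15 minutes
        expr: node_memory_MemAvailable_bytes{instance=~"tools-harbor-1.*"} / node_memory_MemTotal_bytes{instance=~"tools-harbor-1.*"} < 0.10
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "tools-harbor-1 is running low on available memory"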

Having a quick look at the graph (there was a hiccup during the last outage, though), it seems that it takes ~6 days/GB and there are 11.8 GB left. Let's revisit this in three weeks.

Still seems to be going down, but slower and slower:

image.png (1×2 px, 51 KB)

This might be related to T356037: [harbor] cleanup execution + task tables. I flushed Redis again because of that one.

After the cleanup, Redis memory usage is stable with >14 GB of RAM free... so closing this :)