
[harbor] Stabilize the current installation
Closed, Resolved · Public

Description

Harbor keeps crashing fairly often. The service almost always runs with nearly all of its 16G of RAM in use, and when memory gets fully exhausted the host starts (very slow) swapping.

Event Timeline

taavi created this task.

It is apparently redis eating most of the RAM:

aborrero@tools-harbor-1:~$ sudo ps aux  | grep [r]edis
999         5050  6.3 58.4 9881940 9589176 ?     Ssl  Jul24  57:12 redis-server *:6379


The redis logs show high activity too:

aborrero@tools-harbor-1:~$ sudo docker logs redis
[..]
1:M 25 Jul 2023 08:45:47.087 * 10000 changes in 60 seconds. Saving...
1:M 25 Jul 2023 08:45:47.197 * Background saving started by pid 14240
14240:C 25 Jul 2023 08:46:21.346 * DB saved on disk
14240:C 25 Jul 2023 08:46:21.485 * Fork CoW for RDB: current 124 MB, peak 124 MB, average 87 MB
1:M 25 Jul 2023 08:46:21.644 * Background saving terminated with success
1:M 25 Jul 2023 08:47:22.009 * 10000 changes in 60 seconds. Saving...
1:M 25 Jul 2023 08:47:22.144 * Background saving started by pid 14265
14265:C 25 Jul 2023 08:47:53.719 * DB saved on disk
14265:C 25 Jul 2023 08:47:53.844 * Fork CoW for RDB: current 157 MB, peak 157 MB, average 117 MB
1:M 25 Jul 2023 08:47:54.018 * Background saving terminated with success
1:M 25 Jul 2023 08:48:55.033 * 10000 changes in 60 seconds. Saving...
1:M 25 Jul 2023 08:48:55.171 * Background saving started by pid 14295
14295:C 25 Jul 2023 08:49:28.478 * DB saved on disk
14295:C 25 Jul 2023 08:49:28.608 * Fork CoW for RDB: current 33 MB, peak 33 MB, average 20 MB
1:M 25 Jul 2023 08:49:28.761 * Background saving terminated with success
1:M 25 Jul 2023 08:50:29.080 * 10000 changes in 60 seconds. Saving...
1:M 25 Jul 2023 08:50:29.273 * Background saving started by pid 14319
14319:C 25 Jul 2023 08:51:01.007 * DB saved on disk
14319:C 25 Jul 2023 08:51:01.128 * Fork CoW for RDB: current 160 MB, peak 160 MB, average 105 MB
1:M 25 Jul 2023 08:51:01.320 * Background saving terminated with success

The OOM is caused by redis taking too much memory. The root cause of that is unclear, but here are some suspects:

Lots of jobs running quite frequently

We run a job on every repo every 5 min to delete dangling images (so users have space to create new ones). Each run creates one log entry, which works out to 3000 projects * (24h * 60m/h / 5m) = 864k job runs per day (plus a few more for global jobs). These logs are to some extent cached by harbor (https://github.com/goharbor/harbor/issues/8537).
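For scale, the arithmetic works out like this:

# Estimate of retention-job executions (and logs) per day, numbers from above.
projects = 3000
runs_per_project_per_day = 24 * 60 // 5  # one run every 5 minutes -> 288
print(projects * runs_per_project_per_day)  # 864000, i.e. ~864k runs/day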

For this, we could try:

  • Using only the global job to clean up dangling images, run more often (given the number of projects this might also be troublesome, though probably less so; it would work best combined with reducing the number of projects)
  • Reducing the frequency we run the jobs at; this might be troublesome for users, as they might not be able to build a new image until the cleanup runs (currently 5 min at most)
  • Somehow modifying harbor to keep the last N logs instead of a minimum of one day of logs
  • Triggering a cleanup from the build service before building a new image (this would allow removing all the retention policies from all the projects; see the sketch after this list)
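A minimal sketch of that last option, assuming a hypothetical Harbor URL and robot account, and using the Harbor v2 artifacts API directly rather than the retention-job machinery:

import requests

HARBOR = "https://harbor.example.org/api/v2.0"  # hypothetical URL
AUTH = ("robot$cleanup", "secret")              # hypothetical robot account

def delete_untagged(project, repo):
    """Delete dangling (untagged) artifacts from one repository."""
    # Note: repo names containing "/" must be double-URL-encoded in the path.
    url = f"{HARBOR}/projects/{project}/repositories/{repo}/artifacts"
    resp = requests.get(url, params={"page_size": 100}, auth=AUTH)
    resp.raise_for_status()
    for artifact in resp.json():
        if not artifact.get("tags"):  # no tags left -> dangling image
            requests.delete(f"{url}/{artifact['digest']}", auth=AUTH).raise_for_status()

The build service could call delete_untagged() for the tool's own repository right before pushing, which would make both the per-project retention policies and their job logs unnecessary.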

Frequent all-project scans

We scan all the projects (and their policies) on every run of maintain-harbor; this forces harbor to load the info for every project into its cache each time and keep it there.

Things we could do:

  • Reducing the frequency of iteration; this might mean that users don't get a project quickly enough after account creation, or that the policy to clean up dangling images is not put in place in time

This would be alleviated by T337386: [builds-api.start] Create the harbor project beforehand if it does not exist, which would leave only the policy creation to maintain-harbor, something we could consider removing too.

That will also allow removing all the empty projects and keeping only those that actually have images in them, relieving the whole system.
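A rough sketch of what the T337386 flow could look like from builds-api, against the Harbor v2 projects API (the URL and credentials are placeholders):

import requests

HARBOR = "https://harbor.example.org/api/v2.0"  # hypothetical URL
AUTH = ("robot$builds", "secret")               # hypothetical robot account

def ensure_project(name):
    """Create the harbor project only if it does not exist yet."""
    # HEAD /projects?project_name=... returns 200 if it exists, 404 otherwise.
    head = requests.head(f"{HARBOR}/projects", params={"project_name": name}, auth=AUTH)
    if head.status_code == 404:
        requests.post(f"{HARBOR}/projects", json={"project_name": name}, auth=AUTH).raise_for_status()

With project creation handled on demand at build time, maintain-harbor would no longer need to iterate over every project on every run.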

dcaro renamed this task from Harbor keeps freezing to [harbor] Stabilize the current installation. Aug 17 2023, 12:12 PM
dcaro changed the status of subtask T337386: [builds-api.start] Create the harbor project beforehand if it does not exist from Stalled to In Progress.
dcaro changed the status of subtask T344435: [harbor] Cleanup empty projects from Open to In Progress.

This should be fixed now (though it still needs careful monitoring to be sure), since the two patches that we expect to address it have been merged.

dcaro changed the task status from Open to In Progress. Aug 30 2023, 7:47 AM
dcaro claimed this task.
dcaro moved this task from Next Up to In Progress on the Toolforge Build Service (Iteration 19) board.

After removing the empty projects and flushing redis (redis-cli flushall async), we got it down to <200MB xd


Let's wait a bit, but I'd say it's a really nice change.
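For the careful-monitoring part, a quick way to watch redis memory from the host (a sketch assuming redis-py and the container port exposed on localhost, as the ps output above suggests):

import redis  # redis-py

r = redis.Redis(host="localhost", port=6379)
mem = r.info("memory")
print(mem["used_memory_human"], "peak:", mem["used_memory_peak_human"])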

dcaro moved this task from In Progress to Done on the Toolforge Build Service (Iteration 19) board.

Harbor has been stable for a few days, with a peak memory usage of ~1G from redis. I think we can close this.