Harbor keeps crashing fairly often. It looks like the service consistently runs with almost all of its 16G of RAM in use, and when the memory fills up completely this leads to (very slow) swapping.
Description
Status | Assigned | Task
---|---|---
Resolved | bd808 | T133777 Tools that should get archived/deleted (tracking)
Resolved | Andrew | T170355 Figure out process for deleting an unused tool
In Progress | Andrew | T332514 Store state information for the disable tool process outside NFS
Resolved | dcaro | T342563 [harbor] Stabilize the current installation
Resolved | Raymond_Ndibe | T337386 [builds-api.start] Create the harbor project beforehand if it does not exist
Open | None | T334629 Update maintain_kubeusers to use the toolstate database
Resolved | dcaro | T334585 [cookbooks.wmcs.toolforge.component.deploy] Add secrets support when deploying
Resolved | dcaro | T304532 buildservice: migrate to helmfile from kustomize
Resolved | dcaro | T344433 [harbor] See if we can replace the per-project cleanup policy with a harbor-wide one
Resolved | dcaro | T344434 [maintain-harbor] Remove project creation code
Resolved | dcaro | T344435 [harbor] Cleanup empty projects
Event Timeline
It is apparently redis eating most of the RAM:
```
aborrero@tools-harbor-1:~$ sudo ps aux | grep [r]edis
999   5050  6.3 58.4 9881940 9589176 ?  Ssl  Jul24  57:12 redis-server *:6379
```
The redis logs show high activity too:
```
aborrero@tools-harbor-1:~$ sudo docker logs redis
[..]
1:M 25 Jul 2023 08:45:47.087 * 10000 changes in 60 seconds. Saving...
1:M 25 Jul 2023 08:45:47.197 * Background saving started by pid 14240
14240:C 25 Jul 2023 08:46:21.346 * DB saved on disk
14240:C 25 Jul 2023 08:46:21.485 * Fork CoW for RDB: current 124 MB, peak 124 MB, average 87 MB
1:M 25 Jul 2023 08:46:21.644 * Background saving terminated with success
1:M 25 Jul 2023 08:47:22.009 * 10000 changes in 60 seconds. Saving...
1:M 25 Jul 2023 08:47:22.144 * Background saving started by pid 14265
14265:C 25 Jul 2023 08:47:53.719 * DB saved on disk
14265:C 25 Jul 2023 08:47:53.844 * Fork CoW for RDB: current 157 MB, peak 157 MB, average 117 MB
1:M 25 Jul 2023 08:47:54.018 * Background saving terminated with success
1:M 25 Jul 2023 08:48:55.033 * 10000 changes in 60 seconds. Saving...
1:M 25 Jul 2023 08:48:55.171 * Background saving started by pid 14295
14295:C 25 Jul 2023 08:49:28.478 * DB saved on disk
14295:C 25 Jul 2023 08:49:28.608 * Fork CoW for RDB: current 33 MB, peak 33 MB, average 20 MB
1:M 25 Jul 2023 08:49:28.761 * Background saving terminated with success
1:M 25 Jul 2023 08:50:29.080 * 10000 changes in 60 seconds. Saving...
1:M 25 Jul 2023 08:50:29.273 * Background saving started by pid 14319
14319:C 25 Jul 2023 08:51:01.007 * DB saved on disk
14319:C 25 Jul 2023 08:51:01.128 * Fork CoW for RDB: current 160 MB, peak 160 MB, average 105 MB
1:M 25 Jul 2023 08:51:01.320 * Background saving terminated with success
```
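To narrow down what inside redis is actually using the memory, something like the following could be run against the container. This is a rough sketch: it assumes the compose service is named redis and that redis-cli is available in the image.
```
# memory accounted for by redis itself
sudo docker exec redis redis-cli info memory | grep -E 'used_memory_human|used_memory_rss_human'
# number of keys in the keyspace
sudo docker exec redis redis-cli dbsize
# sample the biggest keys to see what kind of data dominates
sudo docker exec redis redis-cli --bigkeys
```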
The OOM is redis taking too much memory. The cause of that is still unclear, but here are some suspects:
Lots of jobs running quite frequently
We run a job on every repo every 5 minutes to delete dangling images (so users have space to create new ones). This creates one log entry for each run, that is 3000 projects * (24h * 60m/h / 5m) = 864k job runs per day (plus a few more for global jobs), and this is to some extent cached by harbor (https://github.com/goharbor/harbor/issues/8537).
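As a quick sanity check of that estimate (3000 projects, one run per project every 5 minutes):
```
$ echo $((3000 * 24 * 60 / 5))
864000
```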
For this, we can try:
- Using only the global job to clean up dangling images, run more often (given the number of projects this might also be troublesome, though probably less so; it would work best combined with reducing the number of projects)
- Reducing the frequency we run the jobs at; this might be troublesome for users, as they might not be able to build a new image until the cleanup runs (currently 5 min at most)
- Somehow modifying harbor to keep the last N logs instead of a minimum of 1 day of logs
- Triggering a cleanup before building a new image from the build service (this would allow removing all the policies from all the projects; see the sketch after this list)
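A rough sketch of that last option, using the Harbor v2 REST API to delete untagged (dangling) artifacts in a repository right before a new build pushes to it. The host, credentials, PROJECT and REPO variables are placeholders, and the exact endpoints and response fields should be verified against the deployed Harbor version:
```
HARBOR=https://harbor.example.org        # placeholder URL
AUTH="builduser:${HARBOR_PASSWORD}"      # placeholder credentials

# list artifacts with their tags, pick the untagged ones, and delete them
curl -su "$AUTH" \
  "$HARBOR/api/v2.0/projects/$PROJECT/repositories/$REPO/artifacts?page_size=100&with_tag=true" |
  jq -r '.[] | select(.tags == null) | .digest' |
  while read -r digest; do
    curl -su "$AUTH" -X DELETE \
      "$HARBOR/api/v2.0/projects/$PROJECT/repositories/$REPO/artifacts/$digest"
  done
```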
Frequent all-project scans
We scan all the projects (and their policies) on every run of maintain-harbor; this forces harbor to load all the info for all the projects into its cache every time and keep it there.
Things we could do:
- Reduce the frequency of iteration; this might mean that users don't get a project quickly enough after account creation, or don't get the policy in place to clean up dangling images.
This would be alleviated by T337386: [builds-api.start] Create the harbor project beforehand if it does not exist, which would leave only the policy creation to maintain-harbor, something we could think of removing too.
That would also allow removing all the empty projects and keeping only those that actually have images in them, relieving the whole system.
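A rough sketch of what removing the empty projects could look like against the Harbor v2 REST API (what T344435 ended up tracking). HARBOR and AUTH are placeholders, and pagination is simplified to a single page:
```
HARBOR=https://harbor.example.org        # placeholder URL
AUTH="admin:${HARBOR_PASSWORD}"          # placeholder credentials

# find projects with no repositories and delete them by id
curl -su "$AUTH" "$HARBOR/api/v2.0/projects?page_size=100" |
  jq -r '.[] | select(.repo_count == 0) | .project_id' |
  while read -r id; do
    curl -su "$AUTH" -X DELETE "$HARBOR/api/v2.0/projects/$id"
  done
```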
This should be fixed now (though it still needs careful monitoring to be sure), since the two patches that we expect to address it have been merged.
After removing the empty projects and flushing redis (redis-cli flushall async), we got it down to <200MB xd
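For reference, a sketch of how that flush and a before/after memory check can be done from the host (the compose service name redis is an assumption):
```
sudo docker exec redis redis-cli info memory | grep used_memory_human   # before
sudo docker exec redis redis-cli flushall async
sudo docker exec redis redis-cli info memory | grep used_memory_human   # after
```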
Let's wait a bit, but I'd say it's a really nice change.
Harbor has been stable for a few days, with a peak memory usage of ~1G from redis. I think we can close this.