Harbor keeps crashing fairly often. It looks like the service consistently runs with almost all of its 16G of RAM in use, and when the memory fills up completely this leads to (very slow) swapping.
Description
Status | Assigned | Task
---|---|---
Resolved | bd808 | T133777 Tools that should get archived/deleted (tracking)
Resolved | Andrew | T170355 Figure out process for deleting an unused tool
In Progress | Andrew | T332514 Store state information for the disable tool process outside NFS
Resolved | dcaro | T342563 [harbor] Stabilize the current installation
Resolved | Raymond_Ndibe | T337386 [builds-api.start] Create the harbor project beforehand if it does not exist
Open | None | T334629 Update maintain_kubeusers to use the toolstate database
Resolved | dcaro | T334585 [cookbooks.wmcs.toolforge.component.deploy] Add secrets support when deploying
Resolved | dcaro | T304532 buildservice: migrate to helmfile from kustomize
Resolved | dcaro | T344433 [harbor] See if we can replace the per-project cleanup policy with a harbor-wide one
Resolved | dcaro | T344434 [maintain-harbor] Remove project creation code
Resolved | dcaro | T344435 [harbor] Cleanup empty projects
Event Timeline
It is apparently redis eating most of the RAM:
```
aborrero@tools-harbor-1:~$ sudo ps aux | grep [r]edis
999   5050  6.3 58.4 9881940 9589176 ?  Ssl  Jul24  57:12 redis-server *:6379
```
The redis logs show high activity too:
```
aborrero@tools-harbor-1:~$ sudo docker logs redis
[..]
1:M 25 Jul 2023 08:45:47.087 * 10000 changes in 60 seconds. Saving...
1:M 25 Jul 2023 08:45:47.197 * Background saving started by pid 14240
14240:C 25 Jul 2023 08:46:21.346 * DB saved on disk
14240:C 25 Jul 2023 08:46:21.485 * Fork CoW for RDB: current 124 MB, peak 124 MB, average 87 MB
1:M 25 Jul 2023 08:46:21.644 * Background saving terminated with success
1:M 25 Jul 2023 08:47:22.009 * 10000 changes in 60 seconds. Saving...
1:M 25 Jul 2023 08:47:22.144 * Background saving started by pid 14265
14265:C 25 Jul 2023 08:47:53.719 * DB saved on disk
14265:C 25 Jul 2023 08:47:53.844 * Fork CoW for RDB: current 157 MB, peak 157 MB, average 117 MB
1:M 25 Jul 2023 08:47:54.018 * Background saving terminated with success
1:M 25 Jul 2023 08:48:55.033 * 10000 changes in 60 seconds. Saving...
1:M 25 Jul 2023 08:48:55.171 * Background saving started by pid 14295
14295:C 25 Jul 2023 08:49:28.478 * DB saved on disk
14295:C 25 Jul 2023 08:49:28.608 * Fork CoW for RDB: current 33 MB, peak 33 MB, average 20 MB
1:M 25 Jul 2023 08:49:28.761 * Background saving terminated with success
1:M 25 Jul 2023 08:50:29.080 * 10000 changes in 60 seconds. Saving...
1:M 25 Jul 2023 08:50:29.273 * Background saving started by pid 14319
14319:C 25 Jul 2023 08:51:01.007 * DB saved on disk
14319:C 25 Jul 2023 08:51:01.128 * Fork CoW for RDB: current 160 MB, peak 160 MB, average 105 MB
1:M 25 Jul 2023 08:51:01.320 * Background saving terminated with success
```
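To narrow down what inside redis is actually using the memory, something like the following could be run against the container. This is a rough sketch: it assumes the compose service is named redis and that redis-cli is available in the image.
```
# memory accounted for by redis itself
sudo docker exec redis redis-cli info memory | grep -E 'used_memory_human|used_memory_rss_human'
# number of keys in the keyspace
sudo docker exec redis redis-cli dbsize
# sample the biggest keys to see what kind of data dominates
sudo docker exec redis redis-cli --bigkeys
```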
The OOM is redis taking too much memory. The cause of that is still unclear, but here are some suspects:
Lots of jobs running quite frequently
We run a job on every repo every 5 minutes to delete dangling images (so users have space to create new ones). This creates one log entry for each run, that is 3000 projects * (24h * 60m/h / 5m) = 864k job runs per day (plus a few more for global jobs), and this is to some extent cached by harbor (https://github.com/goharbor/harbor/issues/8537).
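As a quick sanity check of that estimate (3000 projects, one run per project every 5 minutes):
```
$ echo $((3000 * 24 * 60 / 5))
864000
```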
For this, we can try:
- Using only the global job to clean up dangling images, run more often (given the number of projects this might also be troublesome, though probably less so; it would work best combined with reducing the number of projects)
- Reducing the frequency we run the jobs at; this might be troublesome for users, as they might not be able to build a new image until the cleanup runs (currently 5 min at most)
- Somehow modifying harbor to keep the last N logs instead of a minimum of 1 day of logs
- Triggering a cleanup before building a new image from the build service (this would allow removing all the policies from all the projects; see the sketch after this list)
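A rough sketch of that last option, using the Harbor v2 REST API to delete untagged (dangling) artifacts in a repository right before a new build pushes to it. The host, credentials, PROJECT and REPO variables are placeholders, and the exact endpoints and response fields should be verified against the deployed Harbor version:
```
HARBOR=https://harbor.example.org        # placeholder URL
AUTH="builduser:${HARBOR_PASSWORD}"      # placeholder credentials

# list artifacts with their tags, pick the untagged ones, and delete them
curl -su "$AUTH" \
  "$HARBOR/api/v2.0/projects/$PROJECT/repositories/$REPO/artifacts?page_size=100&with_tag=true" |
  jq -r '.[] | select(.tags == null) | .digest' |
  while read -r digest; do
    curl -su "$AUTH" -X DELETE \
      "$HARBOR/api/v2.0/projects/$PROJECT/repositories/$REPO/artifacts/$digest"
  done
```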
Frequent all-project scans
We scan all the projects (and their policies) on every run of maintain-harbor; this forces harbor to load all the info for all the projects into its cache every time and keep it there.
Things we could do:
- Reduce the frequency of iteration; this might mean that users don't get a project quickly enough after account creation, or don't get the policy in place to clean up dangling images.
This would be alleviated by T337386: [builds-api.start] Create the harbor project beforehand if it does not exist, which would leave only the policy creation to maintain-harbor, something we could think of removing too.
That would also allow removing all the empty projects and keeping only those that actually have images in them, relieving the whole system.
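A rough sketch of what removing the empty projects could look like against the Harbor v2 REST API (what T344435 ended up tracking). HARBOR and AUTH are placeholders, and pagination is simplified to a single page:
```
HARBOR=https://harbor.example.org        # placeholder URL
AUTH="admin:${HARBOR_PASSWORD}"          # placeholder credentials

# find projects with no repositories and delete them by id
curl -su "$AUTH" "$HARBOR/api/v2.0/projects?page_size=100" |
  jq -r '.[] | select(.repo_count == 0) | .project_id' |
  while read -r id; do
    curl -su "$AUTH" -X DELETE "$HARBOR/api/v2.0/projects/$id"
  done
```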
This should be fixed now (though it still needs careful monitoring to be sure), since the two patches that we expect to address it have been merged.
After removing the empty projects and flushing redis (redis-cli flushall async), we got it down to <200MB xd
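For reference, a sketch of how that flush and a before/after memory check can be done from the host (the compose service name redis is an assumption):
```
sudo docker exec redis redis-cli info memory | grep used_memory_human   # before
sudo docker exec redis redis-cli flushall async
sudo docker exec redis redis-cli info memory | grep used_memory_human   # after
```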
Let's wait a bit, but I'd say it's a really nice change.
Harbor has been stable for a few days, with a peak memory usage of ~1G from redis. I think we can close this.