
[harbor] cleanup execution + task tables
Closed, ResolvedPublic

Description

It seems they are not being cleaned up properly, and entries are piling up.

The EXECUTION_SWEEP job should be cleaning those up, but it seems to be failing:

root@tools-harbor-1:~# docker logs -f harbor-jobservice | grep EXECUTION
2024-01-29T07:10:23Z [ERROR] [/jobservice/runner/redis.go:123]: Job 'EXECUTION_SWEEP:cca525ce26b4964cb78176f8' exit with error: run error: {"errors":[{"code":"UNKNOWN","message":"failed to delete executions: ERROR: update or delete on table \"execution\" violates foreign key constraint \"task_execution_id_fkey\" on table \"task\" (SQLSTATE 23503)"}]}
2024-01-29T07:10:23Z [INFO] [/jobservice/worker/cworker/c_worker.go:77]: Job incoming: {"name":"EXECUTION_SWEEP","id":"4cd4e18d8459011f54bc31ea","t":1706385630,"args":null}
2024-01-29T07:10:23Z [INFO] [/jobservice/runner/redis.go:197]: Retrying job EXECUTION_SWEEP:4cd4e18d8459011f54bc31ea, revision: 1706512223
2024-01-29T07:11:48Z [INFO] [/pkg/task/sweep_job.go:150]: [EXECUTION_SWEEP] start to sweep, retain latest 10 executions
2024-01-29T07:11:48Z [INFO] [/pkg/task/sweep_job.go:160]: [EXECUTION_SWEEP] listed 2 candidate executions for sweep
2024-01-29T07:11:48Z [INFO] [/pkg/task/sweep_job.go:180]: [EXECUTION_SWEEP] end to sweep, 2 executions were deleted in total, elapsed time: 46.454234ms
2024-01-29T09:40:06Z [ERROR] [/jobservice/runner/redis.go:123]: Job 'EXECUTION_SWEEP:4cd4e18d8459011f54bc31ea' exit with error: run error: {"errors":[{"code":"UNKNOWN","message":"failed to delete executions: ERROR: update or delete on table \"execution\" violates foreign key constraint \"task_execution_id_fkey\" on table \"task\" (SQLSTATE 23503)"}]}
2024-01-29T09:40:06Z [INFO] [/jobservice/worker/cworker/c_worker.go:77]: Job incoming: {"name":"EXECUTION_SWEEP","id":"9cd1eb3276592fab4cb9344a","t":1706389233,"args":null}
2024-01-29T09:40:06Z [INFO] [/jobservice/runner/redis.go:197]: Retrying job EXECUTION_SWEEP:9cd1eb3276592fab4cb9344a, revision: 1706521206

And the tables (in tools) are getting bigger and bigger:

harbor=> select count(*) from task;
 count  
--------
 265942
(1 row)


harbor=> select count(*) from execution;
  count   
----------
 13025405
(1 row)

Note that the job_log table is empty.

This might include cleaning up non-deleted retention policies from deleted projects:

harbor=> select count(*) from project as p join project_metadata as pm on p.project_id=pm.project_id where p.deleted='t' and pm.name='retention_id';
 count 
-------
  3189
(1 row)

Info for the scheduled jobs - retention policies - schedules:

Tables involved:

  • retention_policy - inside the data json column you find .scope.level=="project" and .scope.ref, which is the project id
  • project - the deleted=='t' column tells you whether the project is deleted
  • project_metadata - has a project_id column matching project.project_id, and rows with name=='retention_id' whose value column is the retention_policy.id
  • schedule - has vendor_type=='RETENTION' and a json column callback_func_param whose .PolicyID is the retention_policy.id

See the cleanup script used (note: the original delete_metadata took its argument as retention_id but used $metadata_id; fixed here):

#!/bin/bash

# Pull the DB password out of harbor.yml
export PGPASSWORD="$(grep -A 10 ttg4ncgzifw.svc.trove.eqiad1.wikimedia.cloud harbor.yml | grep password | awk '{print $2}')"

function mypsql() {
    local command="${1?}"
    local separator="${2:-_}"
    psql \
        -P pager \
        -h ttg4ncgzifw.svc.trove.eqiad1.wikimedia.cloud \
        -t \
        -A \
        -F"$separator" \
        -c "$command" \
        harbor \
        harbor
}

function delete_retention() {
    local retention_id="${1?}"
    mypsql "select * from retention_policy where id='$retention_id'" " | "
    if [[ "$FOR_REAL" = "yes" ]]; then
        mypsql "delete from retention_policy where id='$retention_id'" " | "
    fi
}

function delete_metadata() {
    local metadata_id="${1?}"
    mypsql "select * from project_metadata where id='$metadata_id'" " | "
    if [[ "$FOR_REAL" = "yes" ]]; then
        mypsql "delete from project_metadata where id='$metadata_id'" " | "
    fi
}

function delete_schedule() {
    local retention_id="${1?}"
    mypsql "select * from schedule where vendor_type='RETENTION' and (callback_func_param::json ->> 'PolicyID') = '$retention_id'" " | "
    if [[ "$FOR_REAL" = "yes" ]]; then
        mypsql "delete from schedule where vendor_type='RETENTION' and (callback_func_param::json ->> 'PolicyID') = '$retention_id'" " | "
    fi
}

function main() {
    mypsql \
        "select p.project_id, id as metadata_id, value as retention_id
         from project as p join project_metadata as pm on p.project_id=pm.project_id
         where p.deleted='t' and pm.name='retention_id'" \
        "_" > /tmp/ids_to_cleanup.txt
    while read -r idline; do
        project_id="${idline%%_*}"
        metadata_id="${idline%_*}"
        metadata_id="${metadata_id#*_}"
        retention_id="${idline##*_}"
        echo "Got project_id: $project_id, retention_id=$retention_id, metadata_id=$metadata_id"
        delete_schedule "$retention_id"
        delete_retention "$retention_id"
        delete_metadata "$metadata_id"
        echo ""
    done </tmp/ids_to_cleanup.txt
}

main "$@"
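The underscore-splitting of mypsql's output in main can be sanity-checked in isolation; the idline value below is a made-up example, not real data:

```shell
# Parse a line like the ones mypsql emits with the "_" separator:
# <project_id>_<metadata_id>_<retention_id>
idline="42_77_13"                # hypothetical example values
project_id="${idline%%_*}"       # drop everything after the first "_"
metadata_id="${idline%_*}"       # drop the last field...
metadata_id="${metadata_id#*_}"  # ...then the first, keeping the middle one
retention_id="${idline##*_}"     # drop everything before the last "_"
echo "project_id=$project_id metadata_id=$metadata_id retention_id=$retention_id"
```

This only splits cleanly because none of the three id columns can themselves contain an underscore.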

Info for the execution + task

  • task - with execution_id referencing execution.id, and vendor_type=="RETENTION"
  • execution - whose id column is the target of the task.execution_id foreign key
DELETE from execution where execution.id not in (select execution_id from task);

#### Changed to a more performant query:
## Ran several times in batches of 1000000
delete from execution where id in (select execution.id from execution left join task on task.execution_id = execution.id where task.execution_id is NULL limit 1000000);
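Since each batch removes at most 1000000 rows, the query has to be re-run until a batch deletes nothing. A sketch of a loop automating that (the runner argument is a stand-in for a wrapper like the mypsql function above; psql's "DELETE n" status tag is what gets parsed):

```shell
# Repeat the batched orphan-execution delete until a batch removes 0 rows.
# "$1" is a command that runs a SQL statement and prints psql's status tag
# (e.g. "DELETE 1000000"); pass in a psql wrapper such as mypsql.
batched_delete() {
  local run_sql="${1:?need a sql-runner command}"
  local deleted
  while :; do
    deleted="$("$run_sql" "delete from execution where id in (
        select execution.id from execution
        left join task on task.execution_id = execution.id
        where task.execution_id is NULL limit 1000000);" \
      | awk '/DELETE/ {print $2}')"
    echo "deleted $deleted rows"
    if [[ "$deleted" = "0" ]]; then
      break
    fi
  done
}
```

Each iteration holds locks only for its own batch, which keeps the transactions short compared to a single multi-million-row delete.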

Event Timeline

Some maybe relevant upstream bugs:
https://github.com/goharbor/harbor/issues/17611 -> closed, failing to cleanup execution + job_logs
https://github.com/goharbor/harbor/issues/17255 -> open, not removing the scheduled tasks when deleting a project (they generate the execution entries when running)

dcaro updated the task description.

Easy ones first: cleaning up old executions without a task:

DELETE from execution where execution.id not in (select execution_id from task);
dcaro changed the task status from Open to In Progress.Jan 30 2024, 9:33 AM
dcaro claimed this task.
dcaro moved this task from Next Up to In Progress on the Toolforge (Toolforge iteration 04) board.

Mentioned in SAL (#wikimedia-cloud) [2024-01-30T09:33:26Z] <dcaro> cleaning up old schedules on harbor (T356037)

Mentioned in SAL (#wikimedia-cloud) [2024-01-30T10:16:57Z] <dcaro> restarting harbor and flushing redis to regenerate cache data (T356037)

After restarting to flush redis:

root@tools-harbor-1:/srv/ops/harbor# docker exec -ti redis redis-cli
127.0.0.1:6379> FLUSHALL async
OK
(0.94s)
root@tools-harbor-1:/srv/ops/harbor# docker-compose restart
Restarting harbor-exporter   ... done
Restarting nginx             ... done
Restarting harbor-jobservice ... done
Restarting harbor-core       ... done
Restarting registryctl       ... done
Restarting redis             ... done
Restarting harbor-portal     ... done
Restarting registry          ... done
Restarting harbor-log        ... done

Everything came up as healthy, but we lost the ability to log in to harbor.

To solve it I stopped everything, re-ran the prepare script, and started everything again:

root@tools-harbor-1:/srv/ops/harbor# docker-compose stop
Stopping harbor-exporter   ... done
Stopping nginx             ... done
Stopping harbor-jobservice ... done
Stopping harbor-core       ... done
Stopping registryctl       ... done
Stopping redis             ... done
Stopping harbor-portal     ... done
Stopping registry          ... done
Stopping harbor-log        ... done
root@tools-harbor-1:/srv/ops/harbor# docker-compose down
Removing harbor-exporter   ... done
Removing nginx             ... done
Removing harbor-jobservice ... done
Removing harbor-core       ... done
Removing registryctl       ... done
Removing redis             ... done
Removing harbor-portal     ... done
Removing registry          ... done
Removing harbor-log        ... done
Removing network harbor_harbor

root@tools-harbor-1:/srv/ops/harbor# ./prepare 
prepare base dir is set to /srv/ops/harbor
WARNING:root:WARNING: HTTP protocol is insecure. Harbor will deprecate http protocol in the future. Please make sure to upgrade to https
Clearing the configuration file: /config/nginx/nginx.conf
Clearing the configuration file: /config/db/env
Clearing the configuration file: /config/exporter/env
Clearing the configuration file: /config/portal/nginx.conf
Clearing the configuration file: /config/jobservice/config.yml
Clearing the configuration file: /config/jobservice/env
Clearing the configuration file: /config/log/logrotate.conf
Clearing the configuration file: /config/log/rsyslog_docker.conf
Clearing the configuration file: /config/registryctl/config.yml
Clearing the configuration file: /config/registryctl/env
Clearing the configuration file: /config/core/app.conf
Clearing the configuration file: /config/core/env
Clearing the configuration file: /config/registry/passwd
Clearing the configuration file: /config/registry/root.crt
Clearing the configuration file: /config/registry/config.yml
Generated configuration file: /config/portal/nginx.conf
Generated configuration file: /config/log/logrotate.conf
Generated configuration file: /config/log/rsyslog_docker.conf
Generated configuration file: /config/nginx/nginx.conf
Generated configuration file: /config/core/env
Generated configuration file: /config/core/app.conf
Generated configuration file: /config/registry/config.yml
Generated configuration file: /config/registryctl/env
Generated configuration file: /config/registryctl/config.yml
Generated configuration file: /config/db/env
Generated configuration file: /config/jobservice/env
Generated configuration file: /config/jobservice/config.yml
loaded secret from file: /data/secret/keys/secretkey
Generated configuration file: /config/exporter/env
Generated configuration file: /compose_location/docker-compose.yml
Clean up the input dir

root@tools-harbor-1:/srv/ops/harbor# docker-compose up -d
Creating network "harbor_harbor" with the default driver
Creating harbor-log ... done
Creating registryctl   ... done
Creating registry      ... done
Creating redis         ... done
Creating harbor-portal ... done
Creating harbor-core   ... done
Creating nginx             ... done
Creating harbor-jobservice ... done
Creating harbor-exporter   ... done

And that worked :/

Update on deleting all the executions: given that there are very many of them, trying now to run a query that should be more performant than using 'not in':

delete from execution where id in (select execution.id from execution left join task on task.execution_id = execution.id where task.execution_id is NULL);
dcaro updated the task description.

Mentioned in SAL (#wikimedia-cloud) [2024-01-30T15:16:53Z] <dcaro> restart harbor now that the db is clean (T356037)

dcaro moved this task from In Progress to Done on the Toolforge (Toolforge iteration 04) board.

\o/ Only 60 schedules (down from >3k), 6 tasks and 6 executions left in the database. Will monitor it a bit to see if there are any errors in the logs or if the schedules fail to trigger, but it seems nice now :)