
[harbor] cleanup execution + task tables
Closed, ResolvedPublic

Description

It seems they are not being cleaned up properly, and entries are piling up.

The EXECUTION_SWEEP job should be cleaning those up, but it seems to be failing:

root@tools-harbor-1:~# docker logs -f harbor-jobservice | grep EXECUTION
2024-01-29T07:10:23Z [ERROR] [/jobservice/runner/redis.go:123]: Job 'EXECUTION_SWEEP:cca525ce26b4964cb78176f8' exit with error: run error: {"errors":[{"code":"UNKNOWN","message":"failed to delete executions: ERROR: update or delete on table \"execution\" violates foreign key constraint \"task_execution_id_fkey\" on table \"task\" (SQLSTATE 23503)"}]}
2024-01-29T07:10:23Z [INFO] [/jobservice/worker/cworker/c_worker.go:77]: Job incoming: {"name":"EXECUTION_SWEEP","id":"4cd4e18d8459011f54bc31ea","t":1706385630,"args":null}
2024-01-29T07:10:23Z [INFO] [/jobservice/runner/redis.go:197]: Retrying job EXECUTION_SWEEP:4cd4e18d8459011f54bc31ea, revision: 1706512223
2024-01-29T07:11:48Z [INFO] [/pkg/task/sweep_job.go:150]: [EXECUTION_SWEEP] start to sweep, retain latest 10 executions
2024-01-29T07:11:48Z [INFO] [/pkg/task/sweep_job.go:160]: [EXECUTION_SWEEP] listed 2 candidate executions for sweep
2024-01-29T07:11:48Z [INFO] [/pkg/task/sweep_job.go:180]: [EXECUTION_SWEEP] end to sweep, 2 executions were deleted in total, elapsed time: 46.454234ms
2024-01-29T09:40:06Z [ERROR] [/jobservice/runner/redis.go:123]: Job 'EXECUTION_SWEEP:4cd4e18d8459011f54bc31ea' exit with error: run error: {"errors":[{"code":"UNKNOWN","message":"failed to delete executions: ERROR: update or delete on table \"execution\" violates foreign key constraint \"task_execution_id_fkey\" on table \"task\" (SQLSTATE 23503)"}]}
2024-01-29T09:40:06Z [INFO] [/jobservice/worker/cworker/c_worker.go:77]: Job incoming: {"name":"EXECUTION_SWEEP","id":"9cd1eb3276592fab4cb9344a","t":1706389233,"args":null}
2024-01-29T09:40:06Z [INFO] [/jobservice/runner/redis.go:197]: Retrying job EXECUTION_SWEEP:9cd1eb3276592fab4cb9344a, revision: 1706521206

And the tables (in tools) are getting bigger and bigger:

harbor=> select count(*) from task;
 count  
--------
 265942
(1 row)


harbor=> select count(*) from execution;
  count   
----------
 13025405
(1 row)

Note that the job_log table is empty.

This might include cleaning up non-deleted retention policies from deleted projects:

harbor=> select count(*) from project as p join project_metadata as pm on p.project_id=pm.project_id where p.deleted='t' and pm.name='retention_id';
 count 
-------
  3189
(1 row)

Info for the scheduled jobs - retention policies - schedules:

Tables involved:

  • retention_policy - inside the data json column you find .scope.level=="project" and .scope.ref, which is the project id
  • project - the deleted=='t' column tells you whether the project is deleted
  • project_metadata - has a project_id column matching project.project_id, and rows with name=='retention_id' whose value column is the retention_policy.id
  • schedule - has vendor_type=='RETENTION' and a json column callback_func_param whose .PolicyID is the retention_policy.id

See the cleanup script used (note: the original delete_metadata took its argument as retention_id but used $metadata_id; fixed here):

#!/bin/bash

# Pull the DB password out of harbor.yml
export PGPASSWORD="$(grep -A 10 ttg4ncgzifw.svc.trove.eqiad1.wikimedia.cloud harbor.yml | grep password | awk '{print $2}')"

function mypsql() {
    local command="${1?}"
    local separator="${2:-_}"
    psql \
        -P pager \
        -h ttg4ncgzifw.svc.trove.eqiad1.wikimedia.cloud \
        -t \
        -A \
        -F"$separator" \
        -c "$command" \
        harbor \
        harbor
}

function delete_retention() {
    local retention_id="${1?}"
    mypsql "select * from retention_policy where id='$retention_id'" " | "
    if [[ "$FOR_REAL" = "yes" ]]; then
        mypsql "delete from retention_policy where id='$retention_id'" " | "
    fi
}

function delete_metadata() {
    local metadata_id="${1?}"
    mypsql "select * from project_metadata where id='$metadata_id'" " | "
    if [[ "$FOR_REAL" = "yes" ]]; then
        mypsql "delete from project_metadata where id='$metadata_id'" " | "
    fi
}

function delete_schedule() {
    local retention_id="${1?}"
    mypsql "select * from schedule where vendor_type='RETENTION' and (callback_func_param::json ->> 'PolicyID') = '$retention_id'" " | "
    if [[ "$FOR_REAL" = "yes" ]]; then
        mypsql "delete from schedule where vendor_type='RETENTION' and (callback_func_param::json ->> 'PolicyID') = '$retention_id'" " | "
    fi
}

function main() {
    mypsql \
        "select p.project_id, id as metadata_id, value as retention_id
         from project as p join project_metadata as pm on p.project_id=pm.project_id
         where p.deleted='t' and pm.name='retention_id'" \
        "_" > /tmp/ids_to_cleanup.txt
    while read -r idline; do
        project_id="${idline%%_*}"
        metadata_id="${idline%_*}"
        metadata_id="${metadata_id#*_}"
        retention_id="${idline##*_}"
        echo "Got project_id: $project_id, retention_id=$retention_id, metadata_id=$metadata_id"
        delete_schedule "$retention_id"
        delete_retention "$retention_id"
        delete_metadata "$metadata_id"
        echo ""
    done </tmp/ids_to_cleanup.txt
}

main "$@"
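The underscore-splitting of mypsql's output in main can be sanity-checked in isolation; the idline value below is a made-up example, not real data:

```shell
# Parse a line like the ones mypsql emits with the "_" separator:
# <project_id>_<metadata_id>_<retention_id>
idline="42_77_13"                # hypothetical example values
project_id="${idline%%_*}"       # drop everything after the first "_"
metadata_id="${idline%_*}"       # drop the last field...
metadata_id="${metadata_id#*_}"  # ...then the first, keeping the middle one
retention_id="${idline##*_}"     # drop everything before the last "_"
echo "project_id=$project_id metadata_id=$metadata_id retention_id=$retention_id"
```

This only splits cleanly because none of the three id columns can themselves contain an underscore.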

Info for the execution + task

  • task - with execution_id referencing execution.id, and vendor_type=="RETENTION"
  • execution - whose id column is the target of the task.execution_id foreign key
DELETE from execution where execution.id not in (select execution_id from task);

#### Changed to a more performant query:
## Ran several times in batches of 1000000
delete from execution where id in (select execution.id from execution left join task on task.execution_id = execution.id where task.execution_id is NULL limit 1000000);
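Since each batch removes at most 1000000 rows, the query has to be re-run until a batch deletes nothing. A sketch of a loop automating that (the runner argument is a stand-in for a wrapper like the mypsql function above; psql's "DELETE n" status tag is what gets parsed):

```shell
# Repeat the batched orphan-execution delete until a batch removes 0 rows.
# "$1" is a command that runs a SQL statement and prints psql's status tag
# (e.g. "DELETE 1000000"); pass in a psql wrapper such as mypsql.
batched_delete() {
  local run_sql="${1:?need a sql-runner command}"
  local deleted
  while :; do
    deleted="$("$run_sql" "delete from execution where id in (
        select execution.id from execution
        left join task on task.execution_id = execution.id
        where task.execution_id is NULL limit 1000000);" \
      | awk '/DELETE/ {print $2}')"
    echo "deleted $deleted rows"
    if [[ "$deleted" = "0" ]]; then
      break
    fi
  done
}
```

Each iteration holds locks only for its own batch, which keeps the transactions short compared to a single multi-million-row delete.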

Event Timeline

Some maybe relevant upstream bugs:
https://github.com/goharbor/harbor/issues/17611 -> closed, failing to cleanup execution + job_logs
https://github.com/goharbor/harbor/issues/17255 -> open, not removing the scheduled tasks when deleting a project (they generate the execution entries when running)

dcaro updated the task description.

Easy ones first: cleaning up old executions without a task:

DELETE from execution where execution.id not in (select execution_id from task);
dcaro changed the task status from Open to In Progress.Jan 30 2024, 9:33 AM
dcaro claimed this task.
dcaro moved this task from Next Up to In Progress on the Toolforge (Toolforge iteration 04) board.

Mentioned in SAL (#wikimedia-cloud) [2024-01-30T09:33:26Z] <dcaro> cleaning up old schedules on harbor (T356037)

Mentioned in SAL (#wikimedia-cloud) [2024-01-30T10:16:57Z] <dcaro> restarting harbor and flushing redis to regenerate cache data (T356037)

After restarting to flush redis:

root@tools-harbor-1:/srv/ops/harbor# docker exec -ti redis redis-cli
127.0.0.1:6379> FLUSHALL async
OK
(0.94s)
root@tools-harbor-1:/srv/ops/harbor# docker-compose restart
Restarting harbor-exporter   ... done
Restarting nginx             ... done
Restarting harbor-jobservice ... done
Restarting harbor-core       ... done
Restarting registryctl       ... done
Restarting redis             ... done
Restarting harbor-portal     ... done
Restarting registry          ... done
Restarting harbor-log        ... done

Everything came up as healthy, but we lost the ability to log in to harbor.

To solve it I stopped everything, re-ran the prepare script, and started everything again:

root@tools-harbor-1:/srv/ops/harbor# docker-compose stop
Stopping harbor-exporter   ... done
Stopping nginx             ... done
Stopping harbor-jobservice ... done
Stopping harbor-core       ... done
Stopping registryctl       ... done
Stopping redis             ... done
Stopping harbor-portal     ... done
Stopping registry          ... done
Stopping harbor-log        ... done
root@tools-harbor-1:/srv/ops/harbor# docker-compose down
Removing harbor-exporter   ... done
Removing nginx             ... done
Removing harbor-jobservice ... done
Removing harbor-core       ... done
Removing registryctl       ... done
Removing redis             ... done
Removing harbor-portal     ... done
Removing registry          ... done
Removing harbor-log        ... done
Removing network harbor_harbor

root@tools-harbor-1:/srv/ops/harbor# ./prepare 
prepare base dir is set to /srv/ops/harbor
WARNING:root:WARNING: HTTP protocol is insecure. Harbor will deprecate http protocol in the future. Please make sure to upgrade to https
Clearing the configuration file: /config/nginx/nginx.conf
Clearing the configuration file: /config/db/env
Clearing the configuration file: /config/exporter/env
Clearing the configuration file: /config/portal/nginx.conf
Clearing the configuration file: /config/jobservice/config.yml
Clearing the configuration file: /config/jobservice/env
Clearing the configuration file: /config/log/logrotate.conf
Clearing the configuration file: /config/log/rsyslog_docker.conf
Clearing the configuration file: /config/registryctl/config.yml
Clearing the configuration file: /config/registryctl/env
Clearing the configuration file: /config/core/app.conf
Clearing the configuration file: /config/core/env
Clearing the configuration file: /config/registry/passwd
Clearing the configuration file: /config/registry/root.crt
Clearing the configuration file: /config/registry/config.yml
Generated configuration file: /config/portal/nginx.conf
Generated configuration file: /config/log/logrotate.conf
Generated configuration file: /config/log/rsyslog_docker.conf
Generated configuration file: /config/nginx/nginx.conf
Generated configuration file: /config/core/env
Generated configuration file: /config/core/app.conf
Generated configuration file: /config/registry/config.yml
Generated configuration file: /config/registryctl/env
Generated configuration file: /config/registryctl/config.yml
Generated configuration file: /config/db/env
Generated configuration file: /config/jobservice/env
Generated configuration file: /config/jobservice/config.yml
loaded secret from file: /data/secret/keys/secretkey
Generated configuration file: /config/exporter/env
Generated configuration file: /compose_location/docker-compose.yml
Clean up the input dir

root@tools-harbor-1:/srv/ops/harbor# docker-compose up -d
Creating network "harbor_harbor" with the default driver
Creating harbor-log ... done
Creating registryctl   ... done
Creating registry      ... done
Creating redis         ... done
Creating harbor-portal ... done
Creating harbor-core   ... done
Creating nginx             ... done
Creating harbor-jobservice ... done
Creating harbor-exporter   ... done

And that worked :/

Update on deleting all the executions: given that there are very many of them, trying now to run a query that should be more performant than using 'not in':

delete from execution where id in (select execution.id from execution left join task on task.execution_id = execution.id where task.execution_id is NULL);
dcaro updated the task description.

Mentioned in SAL (#wikimedia-cloud) [2024-01-30T15:16:53Z] <dcaro> restart harbor now that the db is clean (T356037)

dcaro moved this task from In Progress to Done on the Toolforge (Toolforge iteration 04) board.

\o/ Only 60 schedules (down from >3k), 6 tasks and 6 executions left in the database. Will monitor it a bit to see if there are any errors in the logs or if the schedules fail to trigger, but it seems nice now :)