See incident document https://docs.google.com/document/d/1s1f4Lss1Znw3oUav3We6c09QXAY7ja00IsOQyVuQQtw/edit
Description
| Status | Assigned | Task |
|---|---|---|
| Resolved | Raymond_Ndibe | T370843 [harbor] 2024-07-24 Tools harbor db out of space |
| Resolved | dcaro | T370845 [harbor] Investigate how to deactivate wal from trove for postgres databases |
| Open | None | T354728 [trove] does not expose amount of disk space used |
Event Timeline
The culprit tables using the space are the old execution and task tables:
harbor=> select schemaname as table_schema,
relname as table_name,
pg_size_pretty(pg_relation_size(relid)) as data_size
from pg_catalog.pg_statio_user_tables
order by pg_relation_size(relid) desc;
table_schema | table_name | data_size
--------------+-----------------------------+------------
public | execution | 1952 MB
public | task | 612 MB
public | schedule | 9880 kB
public | retention_policy | 1784 kB
public | artifact_blob | 1416 kB
public | project_metadata | 1064 kB
public | project | 936 kB
public | project_member | 792 kB
public | cve_allowlist | 728 kB
public | blob | 584 kB
public | quota_usage | 424 kB
public | quota | 344 kB
public | artifact | 320 kB
public | project_blob | 192 kB
public | tag | 40 kB
public | audit_log | 32 kB
public | repository | 24 kB
public | job_queue_status | 8192 bytes
public | schema_migrations | 8192 bytes
public | alembic_version | 8192 bytes
public | role | 8192 bytes
public | data_migrations | 8192 bytes
public | access | 8192 bytes
public | role_permission | 8192 bytes
public | permission_policy | 8192 bytes
public | robot | 8192 bytes
public | immutable_tag_rule | 8192 bytes
public | artifact_trash | 8192 bytes
public | harbor_user | 8192 bytes
public | properties | 8192 bytes
public | p2p_preheat_instance | 0 bytes

The cleanup might not be working correctly.
A manual run succeeded, but cleaned up nothing:
2024-07-24T08:40:04Z [INFO] [/jobservice/job/impl/purge/purge.go:93]: Purge audit job start
2024-07-24T08:40:04Z [INFO] [/jobservice/job/impl/purge/purge.go:94]: job parameters map[audit_retention_hour:1 dry_run:false include_operations:create,delete,pull]
2024-07-24T08:40:04Z [INFO] [/jobservice/job/impl/purge/purge.go:114]: Purge operation parameter, retention_hour=1, include_operations:[create delete pull], dry_run:false
2024-07-24T08:40:04Z [INFO] [/jobservice/job/impl/purge/purge.go:119]: Purged 0 rows of audit logs
Manually disabled wal (like https://phabricator.wikimedia.org/T343683) and restarted the DB.
There are some options we can tweak from trove directly, but they are not the same ones I changed manually; I'll open a task to investigate.
Restarting the jobservice container and looking at the logs shows some errors on startup that affect the EXECUTION_SWEEP job (the job that should be cleaning up that table):
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: SYSTEM_ARTIFACT_CLEANUP:0
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: EXECUTION_SWEEP:-1
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:2144
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:545
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:2641
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3213
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:1341
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:1448
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3212
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3244
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3175
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3241
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3221
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3229
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3211
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3238
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:564
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3225
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3208
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3210
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3240
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3228
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3232
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:1
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3145
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:219
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3239
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3220
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3237
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3243
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:1736
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3226
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:2848
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3217
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3224
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:947
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3234
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3245
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3219
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3006
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3222
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3227
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3230
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:2625
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3242
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3223
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3235
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3236
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3231
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3246
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3209
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:950
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3218
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3233
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3216
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3164
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3215
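To get a quick overview of how many schedules fail per vendor type, the flood of errors above can be grouped with a small ad-hoc sketch (the sample lines are copied from the log above; in practice you would pipe in the full docker logs output):

```python
import re
from collections import Counter

# A few sample lines copied from the harbor-jobservice log above.
log = """\
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: EXECUTION_SWEEP:-1
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:2144
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:545
"""

# Count the failing schedules per vendor type.
pattern = re.compile(r"no execution found for schedule: (\w+):(-?\d+)")
counts = Counter(match.group(1) for match in pattern.finditer(log))
print(counts)  # Counter({'RETENTION': 2, 'EXECUTION_SWEEP': 1})
```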
I did a quick cleanup of the db manually:
harbor=> delete from task where vendor_type in('IMAGE_SCAN', 'RETENTION') AND status not in('Pending', 'Scheduled', 'Running') AND creation_time< NOW() - INTERVAL '1 DAY';
...
harbor=> DELETE from execution where execution.id not in (select execution_id from task);
...
harbor=> vacuum full analyze;
...
harbor=> select schemaname as table_schema,
relname as table_name,
pg_size_pretty(pg_relation_size(relid)) as data_size
from pg_catalog.pg_statio_user_tables
order by pg_relation_size(relid) desc;
table_schema | table_name | data_size
--------------+-----------------------------+------------
public | task | 6336 kB
public | execution | 5144 kB
public | project | 800 kB
...

So we went from ~2 GB to ~6 MB. Next step: make the EXECUTION_SWEEP schedule run so we don't have to do this manually.
So there are three related tables in the postgres database, execution, task and schedule, where the vendor_type is EXECUTION_SWEEP.
The code that throws the error we see on startup of harbor-jobs is:
https://github.com/goharbor/harbor/blob/ccceacfa73db3cb26e2dd3ef8ffa8f706eef3030/src/jobservice/sync/schedule.go#L362
It's iterating over the schedules, then getting the first execution that matches the vendor_id/vendor_type tuple, and from that the first task with that execution_id. I can do that manually in the DB without issues:
harbor=> select execution.id from schedule left join execution using(vendor_type,vendor_id) where vendor_type='EXECUTION_SWEEP' limit 1;
id
-----------
241475809
(1 row)
harbor=> select * from task where execution_id=241475809;
id | execution_id | job_id | status | status_code | status_revision | status_message | run_count | extra_attrs | creation_time | start_time | update_time | end_time | vendor_type
---------+--------------+--------------------------+---------+-------------+-----------------+----------------+-----------+-------------+----------------------------+---------------------+---------------------+---------------------+-----------------
2099342 | 241475809 | 9f22b791b5ccf21163071c35 | Success | 3 | 1706446838 | | 1 | {} | 2024-01-28 13:00:38.731168 | 2024-01-28 13:00:38 | 2024-01-29 13:13:47 | 2024-01-29 13:10:38 | EXECUTION_SWEEP
(1 row)

Then the code goes on to load the policies from redis for that, but it already fails when trying to get the execution.
To get to the DB:
root@tools-harbor-1:~# cat /srv/ops/harbor/harbor.yml | grep '^ password:'| cut -d: -f2; psql -h 5ujoynvlt5c.svc.trove.eqiad1.wikimedia.cloud harbor harbor
Jobs are stored in redis:
root@toolsbeta-harbor-1:~# docker exec -ti redis redis-cli -n 2
127.0.0.1:6379[2]> scan 0 match *9ab67b659a81a000d4a945e4* count 10000
1) "0"
2) 1) "{harbor_job_service_namespace}:job_stats:9ab67b659a81a000d4a945e4"
127.0.0.1:6379[2]> zrange {harbor_job_service_namespace}:job_stats:9ab67b659a81a000d4a945e4 0 -1
(error) WRONGTYPE Operation against a key holding the wrong kind of value
127.0.0.1:6379[2]> get {harbor_job_service_namespace}:job_stats:9ab67b659a81a000d4a945e4
(error) WRONGTYPE Operation against a key holding the wrong kind of value
127.0.0.1:6379[2]> info {harbor_job_service_namespace}:job_stats:9ab67b659a81a000d4a945e4
127.0.0.1:6379[2]> getrange {harbor_job_service_namespace}:job_stats:9ab67b659a81a000d4a945e4 0 -1
(error) WRONGTYPE Operation against a key holding the wrong kind of value
127.0.0.1:6379[2]> type {harbor_job_service_namespace}:job_stats:9ab67b659a81a000d4a945e4
hash
127.0.0.1:6379[2]> HGETALL {harbor_job_service_namespace}:job_stats:9ab67b659a81a000d4a945e4
1) "cron_spec"
2) ""
3) "unique"
4) "0"
5) "enqueue_time"
6) "1723453204"
7) "ack"
8) "{\"status\":\"Success\",\"revision\":1723453204,\"check_in_at\":0}"
9) "update_time"
10) "1723453204"
11) "web_hook_url"
12) "http://core:8080/service/notifications/tasks/2623104"
13) "numeric_policy_id"
14) "0"
15) "id"
16) "9ab67b659a81a000d4a945e4"
17) "ref_link"
18) "/api/v1/jobs/9ab67b659a81a000d4a945e4"
19) "name"
20) "EXECUTION_SWEEP"
21) "kind"
22) "Generic"
23) "revision"
24) "1723453204"
25) "status"
26) "Success"
27) "parameters"
28) "{\"execution_retain_counts\":{\"EXECUTION_SWEEP\":10,\"GARBAGE_COLLECTION\":50,\"IMAGE_SCAN\":1,\"P2P_PREHEAT\":50,\"PURGE_AUDIT_LOG\":10,\"REPLICATION\":50,\"RETENTION\":50,\"SCAN_ALL\":1,\"SCAN_DATA_EXPORT\":50,\"SLACK\":50,\"SYSTEM_ARTIFACT_CLEANUP\":50,\"WEBHOOK\":50}}"
29) "run_at"
30) "0"

And tools does not have an entry for any of the execution_sweep jobs:
harbor=> select * from task where execution_id in (select execution.id from schedule left join execution using(vendor_type,vendor_id) where vendor_type='EXECUTION_SWEEP' limit 1);
id | execution_id | job_id | status | status_code | status_revision | status_message | run_count | extra_attrs | creation_time | start_time | update_time | end_time | vendor_type
---------+--------------+--------------------------+---------+-------------+-----------------+----------------+-----------+-------------+----------------------------+---------------------+---------------------+---------------------+-----------------
2099342 | 241475809 | 9f22b791b5ccf21163071c35 | Success | 3 | 1706446838 | | 1 | {} | 2024-01-28 13:00:38.731168 | 2024-01-28 13:00:38 | 2024-01-29 13:13:47 | 2024-01-29 13:10:38 | EXECUTION_SWEEP
(1 row)
harbor=>
\q
root@tools-harbor-1:~# docker exec -ti redis redis-cli -n 2
127.0.0.1:6379[2]> scan 0 match *9f22b791b5ccf21163071c35* count 1000
1) "65312"
2) (empty array)

I played with this a bit and it looks like we missed some important things.
- We mistakenly thought the error in the logs couldn't be reproduced manually, because we could run the query the harbor code is having issues with on our end without any problem. It turns out that's not exactly true.
The assumption was that harbor iterates over schedules, takes each schedule's vendor_type and vendor_id, and uses those directly to query the execution table. What it does instead is:
- iterate over schedules and, for each schedule, take the schedule.id;
- query the execution table for a record matching that schedule.id as vendor_id and the constant jobNameScheduler (whose value is SCHEDULER) as vendor_type.
This can be seen by looking more closely at the block at https://github.com/goharbor/harbor/blob/ccceacfa73db3cb26e2dd3ef8ffa8f706eef3030/src/jobservice/sync/schedule.go#L364
An attempt to reproduce this explanation was also successful.
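The lookup harbor actually performs can be sketched with an in-memory simulation (sqlite3 standing in for the postgres db; the ids mirror the toolsbeta/tools data shown in the queries that follow — the missing SCHEDULER execution on tools is exactly what triggers the error):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE schedule  (id INTEGER, vendor_type TEXT, vendor_id INTEGER);
    CREATE TABLE execution (id INTEGER, vendor_type TEXT, vendor_id INTEGER);
    -- Both environments have the EXECUTION_SWEEP schedule...
    INSERT INTO schedule VALUES (19, 'EXECUTION_SWEEP', -1);    -- toolsbeta
    INSERT INTO schedule VALUES (3218, 'EXECUTION_SWEEP', -1);  -- tools
    -- ...but only toolsbeta has the matching SCHEDULER execution.
    INSERT INTO execution VALUES (562701, 'SCHEDULER', 19);
""")

def find_execution(schedule_id):
    # What harbor does: vendor_type is the constant 'SCHEDULER' (jobNameScheduler)
    # and vendor_id is the schedule's own id -- NOT the schedule's
    # vendor_type/vendor_id columns.
    return con.execute(
        "SELECT id FROM execution WHERE vendor_type = 'SCHEDULER' AND vendor_id = ?",
        (schedule_id,),
    ).fetchall()

print(find_execution(19))    # toolsbeta: [(562701,)]
print(find_execution(3218))  # tools: [] -> 'no execution found for schedule'
```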
Toolsbeta
harbor=> select * from schedule where vendor_id=-1 and vendor_type='EXECUTION_SWEEP';
id | creation_time | update_time | vendor_type | vendor_id | cron | callback_func_name | callback_func_param | cron_type | extra_attrs | revision
----+----------------------------+----------------------------+-----------------+-----------+-------------+--------------------------+---------------------+-----------+-------------+------------
19 | 2023-10-20 10:23:53.947166 | 2023-10-20 10:23:53.947166 | EXECUTION_SWEEP | -1 | 0 0 * * * * | EXECUTION_SWEEP_CALLBACK | null | Custom | null | 1723500003
(1 row)
harbor=> select * from execution where vendor_id=19 and vendor_type='SCHEDULER';
id | vendor_type | vendor_id | status | status_message | trigger | extra_attrs | start_time | end_time | revision | update_time
--------+-------------+-----------+---------+----------------+---------+-------------+----------------------------+---------------------+----------+---------------------
562701 | SCHEDULER | 19 | Success | | MANUAL | {} | 2023-10-20 10:23:53.948532 | 2024-08-12 22:00:03 | 7147 | 2024-08-12 22:00:10
(1 row)

Tools
harbor=> select * from schedule where vendor_id=-1 and vendor_type='EXECUTION_SWEEP';
  id  |       creation_time        |        update_time         |   vendor_type   | vendor_id |    cron     |    callback_func_name    | callback_func_param | cron_type | extra_attrs |  revision
------+----------------------------+----------------------------+-----------------+-----------+-------------+--------------------------+---------------------+-----------+-------------+------------
 3218 | 2023-10-25 08:36:26.460075 | 2023-10-25 08:36:26.460075 | EXECUTION_SWEEP |        -1 | 0 0 * * * * | EXECUTION_SWEEP_CALLBACK | null                | Custom    | null        | 1706536837
(1 row)

harbor=> select * from execution where vendor_id=3218 and vendor_type='SCHEDULER';
 id | vendor_type | vendor_id | status | status_message | trigger | extra_attrs | start_time | end_time | revision | update_time
----+-------------+-----------+--------+----------------+---------+-------------+------------+----------+----------+-------------
(0 rows)
The above discovery is significant because worker.getTask is invoked by worker.Run. worker.Run itself is invoked in an infinite for loop by worker.Start, which only exits the loop when worker.Run returns no error. When no execution is found, worker.getTask returns a no execution found error to worker.Run, which in turn returns the same error to worker.Start, so we never get out of the worker.Start infinite for-loop (I am not sure this infinite loop is itself a problem, though, because worker.Start invokes worker.Run in a non-blocking way).
You can see this loop behavior with docker logs harbor-jobservice | grep -F '[ERROR] [/jobservice/sync/schedule.go' | head -n 10000. Observe how it keeps repeating the same entries over and over.
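The control flow described above can be modeled with a minimal sketch (the function names mirror the harbor code; the bodies are simplified stand-ins, and a demo cap replaces the real loop's unbounded retries):

```python
# Simplified model of the jobservice sync loop: Start keeps calling Run until
# Run returns no error, and Run fails whenever getTask finds no execution.
MAX_DEMO_ATTEMPTS = 5  # cap for the demo only; the real loop retries forever

def get_task(has_execution):
    if not has_execution:
        raise LookupError("no execution found for schedule: EXECUTION_SWEEP:-1")
    return "task"

def run(has_execution):
    return get_task(has_execution)

def start(has_execution):
    attempts = 0
    while attempts < MAX_DEMO_ATTEMPTS:  # stands in for Go's `for {}`
        attempts += 1
        try:
            run(has_execution)
            return attempts  # only way out of the loop: Run succeeded
        except LookupError:
            continue  # error -> loop again, hitting the same error every time
    return attempts

print(start(has_execution=True))   # exits after 1 attempt
print(start(has_execution=False))  # never succeeds, hits the demo cap of 5
```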
Possible next steps
- Figure out why there is no record of the schedule in the execution table. Did we remove the records manually? Is it that for some reason harbor can successfully create a schedule but can't create an execution, even though both are handled by the same method https://github.com/goharbor/harbor/blob/ccceacfa73db3cb26e2dd3ef8ffa8f706eef3030/src/pkg/scheduler/scheduler.go#L95 ?
- What happens if we delete the schedule records that can't be found in the execution table from the schedule table? (this will probably be an issue)
- Do we have to manually create an execution and corresponding task for the failing schedules above? Could that solve our problem?
- Also something to think about: majority of our schedules are RETENTION (I dare say more than 80%). Can the fact that we have all of those schedules scheduled to run at the same time cause resource hogging?
Those exist already, but feel free to inject new ones (being careful, make backups and such).
Also something to think about: majority of our schedules are RETENTION (I dare say more than 80%). Can the fact that we have all of those schedules scheduled to run at the same time cause resource hogging?
The RETENTION schedules are created per-project, in tools we have several, in toolsbeta we have only a few. It could be causing resource contention but there's no trace at all that it even tried to run the EXECUTION_SWEEP, while all the RETENTION ones get run OK, and others like GARBAGE_COLLECTION.
When we created the 3.5k projects at the same time you could see the jobs taking some time to run (the pending jobs would pile up), but I have not seen it since we only create them on-demand
No @dcaro, execution records for the failing schedules do not exist. We only have one schedule with the id 3218, vendor_id -1 and vendor_type EXECUTION_SWEEP in the tools harbor db. I don't know the reason for this, but according to the code in the schedule.go file, harbor only uses that schedule id (3218) as vendor_id and the constant value SCHEDULER as vendor_type. Trying to get an execution record just like harbor does, using vendor_id 3218 and vendor_type SCHEDULER, returns zero rows for tools. Doing the same for toolsbeta returns a single row. The same applies to all the schedules that throw the no execution found for schedule error.
Can you elaborate? (add also queries if needed).
As far as I can see in the code, there's no filtering of which schedules it iterates over:
https://github.com/goharbor/harbor/blob/ccceacfa73db3cb26e2dd3ef8ffa8f706eef3030/src/jobservice/sync/schedule.go#L197
So it just goes over all of them, then for each of them it gets all the executions with a matching vendor_id/vendor_type tuple.
Focusing only on the EXECUTION_SWEEP one (the one with id 3218):
2024-08-13T14:31:58Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: EXECUTION_SWEEP:-1
You can also find 6 executions with that tuple:
harbor=> select * from execution where vendor_id=-1 and vendor_type='EXECUTION_SWEEP';
id | vendor_type | vendor_id | status | status_message | trigger | extra_attrs | start_time | end_time | revision | update_time
-----------+-----------------+-----------+---------+----------------+----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------+---------------------+----------+---------------------
241475809 | EXECUTION_SWEEP | -1 | Success | | SCHEDULE | {"execution_retain_counts":{"EXECUTION_SWEEP":10,"GARBAGE_COLLECTION":50,"IMAGE_SCAN":1,"P2P_PREHEAT":50,"PURGE_AUDIT_LOG":10,"REPLICATION":50,"RETENTION":50,"SCAN_ALL":1,"SCAN_DATA_EXPORT":50,"SLACK":50,"SYSTEM_ARTIFACT_CLEANUP":50,"WEBHOOK":50}} | 2024-01-28 13:00:38.729526 | 2024-01-29 13:10:38 | 1 | 2024-01-29 13:11:33
242293361 | EXECUTION_SWEEP | -1 | Success | | SCHEDULE | {"execution_retain_counts":{"EXECUTION_SWEEP":10,"GARBAGE_COLLECTION":50,"IMAGE_SCAN":1,"P2P_PREHEAT":50,"PURGE_AUDIT_LOG":10,"REPLICATION":50,"RETENTION":50,"SCAN_ALL":1,"SCAN_DATA_EXPORT":50,"SLACK":50,"SYSTEM_ARTIFACT_CLEANUP":50,"WEBHOOK":50}} | 2024-01-29 10:00:42.098661 | 0001-01-01 00:00:00 | 1 | 2024-01-29 13:13:48
242371241 | EXECUTION_SWEEP | -1 | Success | | SCHEDULE | {"execution_retain_counts":{"EXECUTION_SWEEP":10,"GARBAGE_COLLECTION":50,"IMAGE_SCAN":1,"P2P_PREHEAT":50,"PURGE_AUDIT_LOG":10,"REPLICATION":50,"RETENTION":50,"SCAN_ALL":1,"SCAN_DATA_EXPORT":50,"SLACK":50,"SYSTEM_ARTIFACT_CLEANUP":50,"WEBHOOK":50}} | 2024-01-29 12:00:34.410626 | 0001-01-01 00:00:00 | 1 | 2024-01-29 13:13:48
242332298 | EXECUTION_SWEEP | -1 | Success | | SCHEDULE | {"execution_retain_counts":{"EXECUTION_SWEEP":10,"GARBAGE_COLLECTION":50,"IMAGE_SCAN":1,"P2P_PREHEAT":50,"PURGE_AUDIT_LOG":10,"REPLICATION":50,"RETENTION":50,"SCAN_ALL":1,"SCAN_DATA_EXPORT":50,"SLACK":50,"SYSTEM_ARTIFACT_CLEANUP":50,"WEBHOOK":50}} | 2024-01-29 11:00:40.478287 | 0001-01-01 00:00:00 | 1 | 2024-01-29 13:13:48
242410187 | EXECUTION_SWEEP | -1 | Success | | SCHEDULE | {"execution_retain_counts":{"EXECUTION_SWEEP":10,"GARBAGE_COLLECTION":50,"IMAGE_SCAN":1,"P2P_PREHEAT":50,"PURGE_AUDIT_LOG":10,"REPLICATION":50,"RETENTION":50,"SCAN_ALL":1,"SCAN_DATA_EXPORT":50,"SLACK":50,"SYSTEM_ARTIFACT_CLEANUP":50,"WEBHOOK":50}} | 2024-01-29 13:00:40.303609 | 0001-01-01 00:00:00 | 1 | 2024-01-29 13:13:48
242449127 | EXECUTION_SWEEP | -1 | Success | | SCHEDULE | {"execution_retain_counts":{"EXECUTION_SWEEP":10,"GARBAGE_COLLECTION":50,"IMAGE_SCAN":1,"P2P_PREHEAT":50,"PURGE_AUDIT_LOG":10,"REPLICATION":50,"RETENTION":50,"SCAN_ALL":1,"SCAN_DATA_EXPORT":50,"SLACK":50,"SYSTEM_ARTIFACT_CLEANUP":50,"WEBHOOK":50}} | 2024-01-29 14:00:37.550226 | 0001-01-01 00:00:00 | 1 | 2024-01-29 14:51:38
(6 rows)

This is what the harbor code is doing to get executions:
executions, err := w.coreExecutionManager.List(ctx, &q.Query{
	Keywords: map[string]interface{}{
		"vendor_type": scheduler.JobNameScheduler,
		"vendor_id":   schedule.ID,
	},
})

As we can see, the code doesn't use schedule.vendor_id and schedule.vendor_type at all: it uses the constant JobNameScheduler, whose value is SCHEDULER (I checked), as vendor_type, and schedule.ID as vendor_id.
The queries that prove this can be found in my first comment https://phabricator.wikimedia.org/T370843#10060402
Oh, I see, nice! that looks quite promising :)
The execution info for that vendor type has a policy id, we will have to figure that out too
By looking at https://github.com/goharbor/harbor/blob/ccceacfa73db3cb26e2dd3ef8ffa8f706eef3030/src/jobservice/sync/schedule.go#L249, I suspect that the policy should be OK as long as it has a unique id (it seems to regenerate it if it's not there).
Yes, that is true. The redis side will probably autocorrect if it can find the execution. What we don't know yet is why those executions can't be found in the first place. We can manually recreate them and see what happens. For the retention schedules we may need to check whether those are still necessary or are relics from the time we were still creating everything at once. Finding out why they are not in the execution table, and what happens when they are created, seems to be key to our problems.
They might have been removed the last time we did a cleanup (or lost when the DB crashed).
+1 to trying to create them manually, for the retention ones, worst case scenario we can remove them and recreate them from the API/UI, but the EXECUTION_SWEEP can't be created from there.
Update:
- I deleted the schedules from the schedule table whose records can't be found in the execution table. I did that after I was certain that system schedules will be recreated if they can't be found (part of this is handled by https://github.com/goharbor/harbor/blob/ccceacfa73db3cb26e2dd3ef8ffa8f706eef3030/src/controller/task/sweep.go#L102).
Doing this fixed the no execution found for schedule: EXECUTION_SWEEP:-1 error.
- After doing the above, I started seeing a different set of errors (or maybe they were there before and we just weren't focusing on them):
root@tools-harbor-1:/home/raymond-ndibe# docker logs harbor-jobservice --since 2024-08-13T22:55:00Z --until 2024-08-13T23:00:10Z | grep -F '[ERROR]' | grep -E --color 'ERROR|$'
2024-08-13T22:55:00Z [ERROR] [/jobservice/runtime/bootstrap.go:168]: trigger hook event error: {"errors":[{"code":"NOT_FOUND","message":"task with job ID d511104b4c19cdf5f2a724d2 not found"}]}
2024-08-13T22:55:00Z [ERROR] [/jobservice/hook/hook_agent.go:155]: Retry: sending hook event error: {"errors":[{"code":"NOT_FOUND","message":"task with job ID d511104b4c19cdf5f2a724d2 not found"}]}
2024-08-13T22:55:02Z [ERROR] [/jobservice/runtime/bootstrap.go:168]: trigger hook event error: {"errors":[{"code":"NOT_FOUND","message":"task with job ID 28c540ee73e94163526eb898 not found"}]}
2024-08-13T22:55:02Z [ERROR] [/jobservice/hook/hook_agent.go:155]: Retry: sending hook event error: {"errors":[{"code":"NOT_FOUND","message":"task with job ID 28c540ee73e94163526eb898 not found"}]}
2024-08-13T22:55:02Z [ERROR] [/jobservice/runtime/bootstrap.go:168]: trigger hook event error: {"errors":[{"code":"NOT_FOUND","message":"task with job ID 28c540ee73e94163526eb898 not found"}]}
2024-08-13T22:55:02Z [ERROR] [/jobservice/hook/hook_agent.go:155]: Retry: sending hook event error: {"errors":[{"code":"NOT_FOUND","message":"task with job ID 28c540ee73e94163526eb898 not found"}]}
2024-08-13T22:55:02Z [ERROR] [/jobservice/runtime/bootstrap.go:168]: trigger hook event error: {"errors":[{"code":"NOT_FOUND","message":"task with job ID 28c540ee73e94163526eb898 not found"}]}
2024-08-13T22:55:02Z [ERROR] [/jobservice/hook/hook_agent.go:155]: Retry: sending hook event error: {"errors":[{"code":"NOT_FOUND","message":"task with job ID 28c540ee73e94163526eb898 not found"}]}
2024-08-13T23:00:08Z [ERROR] [/pkg/task/sweep_job.go:175]: [RETENTION] failed to batch clean candidates, error: failed to delete executions: ERROR: update or delete on table "execution" violates foreign key constraint "task_execution_id_fkey" on table "task" (SQLSTATE 23503)
2024-08-13T23:00:08Z [ERROR] [/pkg/task/sweep_job.go:110]: [RETENTION] failed to run sweep, error: failed to delete executions: ERROR: update or delete on table "execution" violates foreign key constraint "task_execution_id_fkey" on table "task" (SQLSTATE 23503)
2024-08-13T23:00:08Z [ERROR] [/jobservice/runner/redis.go:122]: Job 'EXECUTION_SWEEP:6520ae7a7c0d27a933aadb2b' exit with error: run error: {"errors":[{"code":"UNKNOWN","message":"failed to delete executions: ERROR: update or delete on table \"execution\" violates foreign key constraint \"task_execution_id_fkey\" on table \"task\" (SQLSTATE 23503)"}]}
root@tools-harbor-1:/home/raymond-ndibe#

The errors from /jobservice/hook/hook_agent.go:155 and /jobservice/runtime/bootstrap.go:168 look like the result of some race condition, and after a couple of retries they go away. The more interesting error is the one from /pkg/task/sweep_job.go (it seems to be the same one reported by /jobservice/runner/redis.go:122).
This error is the reason executions and tasks are not getting deleted, and therefore the reason our database keeps blowing up.
The actual cleanup of tasks and executions is handled by https://github.com/goharbor/harbor/blob/ccceacfa73db3cb26e2dd3ef8ffa8f706eef3030/src/pkg/task/sweep_manager.go#L162.
In that file, tasks are always deleted before executions, to avoid the foreign-key constraint error we see above. But tasks are only deleted when their status_code is 3. It turns out we had some stale tasks in Pending and Stopped states that no longer exist in redis; those tasks were preventing us from deleting the many task records in the task table, which in turn prevented deleting the execution records in the execution table.
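The dependency can be demonstrated with sqlite3 standing in for postgres (the schema is reduced to the columns that matter; status_code 3 marks a finished task, as described above):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # sqlite leaves FK enforcement off by default
con.executescript("""
    CREATE TABLE execution (id INTEGER PRIMARY KEY);
    CREATE TABLE task (
        id INTEGER PRIMARY KEY,
        execution_id INTEGER REFERENCES execution(id),
        status_code INTEGER
    );
    INSERT INTO execution VALUES (241475809);
    INSERT INTO task VALUES (1, 241475809, 3);  -- finished task, sweepable
    INSERT INTO task VALUES (2, 241475809, 1);  -- stale task, not status_code 3
""")

# The sweep deletes finished tasks first, then tries to delete the execution...
con.execute("DELETE FROM task WHERE status_code = 3")
fk_violation = False
try:
    con.execute("DELETE FROM execution WHERE id = 241475809")
except sqlite3.IntegrityError:
    fk_violation = True  # the stale task still references the execution
print(fk_violation)  # True: the analogue of the SQLSTATE 23503 error in the logs

# After removing the stale tasks (the manual fix), the execution can go too.
con.execute("DELETE FROM task WHERE status_code != 3")
con.execute("DELETE FROM execution WHERE id = 241475809")
print(con.execute("SELECT count(*) FROM execution").fetchone()[0])  # 0
```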
The next thing I did was find the culprits with select * from task where status_code!=3; and then delete them using delete from task where status_code!=3;
In the next few minutes we'll know whether this worked. I still don't know how it all began in the first place, so I can't confidently say it won't happen again.
Seems to be working rn 💃🕺💃🕺💃🕺💃🕺.
- Execution table went from 50k+ records to 4k+
- Similar thing for tasks
root@tools-harbor-1:/home/raymond-ndibe# docker logs harbor-jobservice --since 2024-08-14T03:00:03Z | grep -F '[INFO] [/pkg/task/sweep_job.go'
2024-08-14T03:00:04Z [INFO] [/pkg/task/sweep_job.go:91]: start to run sweep job
2024-08-14T03:00:06Z [INFO] [/pkg/task/sweep_job.go:150]: [P2P_PREHEAT] start to sweep, retain latest 50 executions
2024-08-14T03:00:06Z [INFO] [/pkg/task/sweep_job.go:160]: [P2P_PREHEAT] listed 0 candidate executions for sweep
2024-08-14T03:00:06Z [INFO] [/pkg/task/sweep_job.go:180]: [P2P_PREHEAT] end to sweep, 0 executions were deleted in total, elapsed time: 2.52058ms
2024-08-14T03:00:06Z [INFO] [/pkg/task/sweep_job.go:150]: [PURGE_AUDIT_LOG] start to sweep, retain latest 10 executions
2024-08-14T03:00:06Z [INFO] [/pkg/task/sweep_job.go:160]: [PURGE_AUDIT_LOG] listed 1 candidate executions for sweep
2024-08-14T03:00:06Z [INFO] [/pkg/task/sweep_job.go:180]: [PURGE_AUDIT_LOG] end to sweep, 1 executions were deleted in total, elapsed time: 18.053031ms
2024-08-14T03:00:06Z [INFO] [/pkg/task/sweep_job.go:150]: [RETENTION] start to sweep, retain latest 50 executions
2024-08-14T03:00:07Z [INFO] [/pkg/task/sweep_job.go:160]: [RETENTION] listed 58109 candidate executions for sweep
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:180]: [RETENTION] end to sweep, 58109 executions were deleted in total, elapsed time: 2.768114649s
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:150]: [SCAN_ALL] start to sweep, retain latest 1 executions
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:160]: [SCAN_ALL] listed 0 candidate executions for sweep
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:180]: [SCAN_ALL] end to sweep, 0 executions were deleted in total, elapsed time: 1.258635ms
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:150]: [SCAN_DATA_EXPORT] start to sweep, retain latest 50 executions
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:160]: [SCAN_DATA_EXPORT] listed 0 candidate executions for sweep
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:180]: [SCAN_DATA_EXPORT] end to sweep, 0 executions were deleted in total, elapsed time: 860.929µs
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:150]: [SLACK] start to sweep, retain latest 50 executions
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:160]: [SLACK] listed 0 candidate executions for sweep
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:180]: [SLACK] end to sweep, 0 executions were deleted in total, elapsed time: 925.737µs
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:150]: [GARBAGE_COLLECTION] start to sweep, retain latest 50 executions
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:160]: [GARBAGE_COLLECTION] listed 1 candidate executions for sweep
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:180]: [GARBAGE_COLLECTION] end to sweep, 1 executions were deleted in total, elapsed time: 10.888631ms
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:150]: [IMAGE_SCAN] start to sweep, retain latest 1 executions
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:160]: [IMAGE_SCAN] listed 0 candidate executions for sweep
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:180]: [IMAGE_SCAN] end to sweep, 0 executions were deleted in total, elapsed time: 760.149µs
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:150]: [SYSTEM_ARTIFACT_CLEANUP] start to sweep, retain latest 50 executions
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:160]: [SYSTEM_ARTIFACT_CLEANUP] listed 0 candidate executions for sweep
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:180]: [SYSTEM_ARTIFACT_CLEANUP] end to sweep, 0 executions were deleted in total, elapsed time: 1.58897ms
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:150]: [WEBHOOK] start to sweep, retain latest 50 executions
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:160]: [WEBHOOK] listed 0 candidate executions for sweep
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:180]: [WEBHOOK] end to sweep, 0 executions were deleted in total, elapsed time: 655.146µs
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:150]: [EXECUTION_SWEEP] start to sweep, retain latest 10 executions
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:160]: [EXECUTION_SWEEP] listed 0 candidate executions for sweep
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:180]: [EXECUTION_SWEEP] end to sweep, 0 executions were deleted in total, elapsed time: 2.033957ms
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:150]: [REPLICATION] start to sweep, retain latest 50 executions
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:160]: [REPLICATION] listed 0 candidate executions for sweep
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:180]: [REPLICATION] end to sweep, 0 executions were deleted in total, elapsed time: 720.633µs
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:115]: end to run sweep job
root@tools-harbor-1:/home/raymond-ndibe#
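The sweep job above retains the latest N executions per vendor type, so a quick way to confirm it keeps working is to count what remains in the `execution` table per vendor type. This is a sketch only; it assumes the `execution` table has `vendor_type` and `start_time` columns (true for recent Harbor releases, but verify against the deployed schema first):

```sql
-- Hedged sketch: count remaining executions per vendor type.
-- Assumes execution(vendor_type, start_time) exist in this Harbor version.
select vendor_type,
       count(*)        as executions,
       min(start_time) as oldest
from execution
group by vendor_type
order by count(*) desc;
```

If `RETENTION` (or any other vendor type) grows well past its configured retain count between sweep runs, the cleanup has likely stopped working again.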
We should monitor this for a few more days or weeks before we can confidently close the task as resolved.
This can be closed now; the biggest table is currently under 100 MB:
harbor=> select schemaname as table_schema,
       relname as table_name,
       pg_size_pretty(pg_relation_size(relid)) as data_size
from pg_catalog.pg_statio_user_tables
order by pg_relation_size(relid) desc;
 table_schema |     table_name      | data_size
--------------+---------------------+-----------
 public       | execution           | 89 MB
 public       | task                | 81 MB
 public       | project             | 832 kB
 public       | artifact_blob       | 824 kB
 public       | project_metadata    | 656 kB
 public       | project_member      | 648 kB
 public       | cve_allowlist       | 568 kB
 public       | blob                | 440 kB
 public       | artifact            | 336 kB
 public       | project_blob        | 216 kB
 ...
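One caveat for future reference: deleting rows only marks tuples dead, so `pg_relation_size` shrinks only after vacuuming (a plain `VACUUM` rarely returns space to the OS; `VACUUM FULL` does, at the cost of an exclusive lock). A hedged sketch, using the standard `pg_stat_user_tables` view, to check that the deleted rows are actually being reclaimed:

```sql
-- Sketch: verify autovacuum is keeping up on the two big tables.
-- pg_stat_user_tables is standard PostgreSQL; no Harbor-specific assumptions.
select relname,
       n_live_tup,
       n_dead_tup,
       last_autovacuum
from pg_catalog.pg_stat_user_tables
where relname in ('execution', 'task');
```

A persistently large `n_dead_tup` with a stale `last_autovacuum` would mean the space is freed logically but still held on disk.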

