[harbor] 2024-07-24 Tools harbor db out of space
Closed, ResolvedPublic

Event Timeline

dcaro triaged this task as High priority.
dcaro updated the task description. (Show Details)

The tables using up the space are the old execution and task tables:

harbor=> select schemaname as table_schema,
       relname as table_name,
       pg_size_pretty(pg_relation_size(relid)) as data_size
from pg_catalog.pg_statio_user_tables
order by pg_relation_size(relid) desc;
 table_schema |         table_name          | data_size  
--------------+-----------------------------+------------
 public       | execution                   | 1952 MB
 public       | task                        | 612 MB
 public       | schedule                    | 9880 kB
 public       | retention_policy            | 1784 kB
 public       | artifact_blob               | 1416 kB
 public       | project_metadata            | 1064 kB
 public       | project                     | 936 kB
 public       | project_member              | 792 kB
 public       | cve_allowlist               | 728 kB
 public       | blob                        | 584 kB
 public       | quota_usage                 | 424 kB
 public       | quota                       | 344 kB
 public       | artifact                    | 320 kB
 public       | project_blob                | 192 kB
 public       | tag                         | 40 kB
 public       | audit_log                   | 32 kB
 public       | repository                  | 24 kB
 public       | job_queue_status            | 8192 bytes
 public       | schema_migrations           | 8192 bytes
 public       | alembic_version             | 8192 bytes
 public       | role                        | 8192 bytes
 public       | data_migrations             | 8192 bytes
 public       | access                      | 8192 bytes
 public       | role_permission             | 8192 bytes
 public       | permission_policy           | 8192 bytes
 public       | robot                       | 8192 bytes
 public       | immutable_tag_rule          | 8192 bytes
 public       | artifact_trash              | 8192 bytes
 public       | harbor_user                 | 8192 bytes
 public       | properties                  | 8192 bytes
 public       | p2p_preheat_instance        | 0 bytes

The cleanup might not be working correctly, even though in the UI log rotation is set up and reported as "ok":

image.png (899×3 px, 144 KB)

A manual run succeeded, but cleaned up nothing:

2024-07-24T08:40:04Z [INFO] [/jobservice/job/impl/purge/purge.go:93]: Purge audit job start
2024-07-24T08:40:04Z [INFO] [/jobservice/job/impl/purge/purge.go:94]: job parameters map[audit_retention_hour:1 dry_run:false include_operations:create,delete,pull]
2024-07-24T08:40:04Z [INFO] [/jobservice/job/impl/purge/purge.go:114]: Purge operation parameter, retention_hour=1, include_operations:[create delete pull], dry_run:false
2024-07-24T08:40:04Z [INFO] [/jobservice/job/impl/purge/purge.go:119]: Purged 0 rows of audit logs

There are some options we can tweak from Trove directly, but they are not the same ones I changed manually; I'll open a task to investigate.

dcaro renamed this task from [harbor] 2024-07-24 Tools harbor db out of space to [harbor] Investigate how to deactivate WAL from Trove for Postgres databases. Jul 24 2024, 9:14 AM
dcaro updated the task description. (Show Details)
dcaro renamed this task from [harbor] Investigate how to deactivate WAL from Trove for Postgres databases to [harbor] 2024-07-24 Tools harbor db out of space. Jul 24 2024, 9:16 AM
dcaro updated the task description. (Show Details)

Restarting the jobservice container and looking at the logs shows some errors on startup that affect the EXECUTION_SWEEP job (the one that should be cleaning up that table):

2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: SYSTEM_ARTIFACT_CLEANUP:0
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: EXECUTION_SWEEP:-1
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:2144
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:545
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:2641
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3213
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:1341
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:1448
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3212
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3244
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3175
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3241
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3221
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3229
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3211
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3238
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:564
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3225
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3208
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3210
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3240
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3228
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3232
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:1
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3145
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:219
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3239
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3220
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3237
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3243
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:1736
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3226
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:2848
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3217
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3224
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:947
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3234
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3245
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3219
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3006
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3222
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3227
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3230
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:2625
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3242
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3223
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3235
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3236
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3231
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3246
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3209
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:950
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3218
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3233
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3216
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3164
2024-07-30T08:28:46Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: RETENTION:3215

I did a quick cleanup of the db manually:

harbor=> delete from task where vendor_type in('IMAGE_SCAN', 'RETENTION') AND status not in('Pending', 'Scheduled', 'Running') AND creation_time< NOW() - INTERVAL '1 DAY'; 
...
harbor=> DELETE from execution where execution.id not in (select execution_id from task);                                                                                   
...
harbor=> vacuum full analyze;
...
harbor=> select schemaname as table_schema,
       relname as table_name,
       pg_size_pretty(pg_relation_size(relid)) as data_size
from pg_catalog.pg_statio_user_tables
order by pg_relation_size(relid) desc;
 table_schema |         table_name          | data_size  
--------------+-----------------------------+------------
 public       | task                        | 6336 kB
 public       | execution                   | 5144 kB
 public       | project                     | 800 kB
...

So we went from ~2G to ~6M; the next step is to make the EXECUTION_SWEEP schedule run so we don't have to do this manually.
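For reference, the cleanup criteria above can be sketched against a toy schema (Python with an in-memory SQLite database; the tables are trimmed to just the columns the DELETEs touch, and the sample rows are illustrative, not real Harbor data):

```python
import sqlite3
from datetime import datetime, timedelta

# Toy schema with only the columns the manual cleanup touches; this is an
# illustration, not the real Harbor schema.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE execution (id INTEGER PRIMARY KEY, vendor_type TEXT);
CREATE TABLE task (
    id INTEGER PRIMARY KEY,
    execution_id INTEGER,
    vendor_type TEXT,
    status TEXT,
    creation_time TEXT
);
""")

old = (datetime.now() - timedelta(days=2)).isoformat(" ")
recent = datetime.now().isoformat(" ")
db.executemany(
    "INSERT INTO task (execution_id, vendor_type, status, creation_time) VALUES (?, ?, ?, ?)",
    [
        (1, "RETENTION", "Success", old),      # old and finished -> deleted
        (2, "RETENTION", "Running", old),      # still running    -> kept
        (3, "IMAGE_SCAN", "Success", recent),  # too recent       -> kept
    ],
)
db.executemany("INSERT INTO execution (id, vendor_type) VALUES (?, ?)",
               [(1, "RETENTION"), (2, "RETENTION"), (3, "IMAGE_SCAN")])

# Same criteria as the DELETEs above: finished IMAGE_SCAN/RETENTION tasks
# older than a day, then any execution left without a task.
cutoff = (datetime.now() - timedelta(days=1)).isoformat(" ")
db.execute("""
DELETE FROM task
WHERE vendor_type IN ('IMAGE_SCAN', 'RETENTION')
  AND status NOT IN ('Pending', 'Scheduled', 'Running')
  AND creation_time < ?
""", (cutoff,))
db.execute("DELETE FROM execution WHERE id NOT IN (SELECT execution_id FROM task)")

remaining_tasks = db.execute("SELECT count(*) FROM task").fetchone()[0]
remaining_executions = db.execute("SELECT count(*) FROM execution").fetchone()[0]
print(remaining_tasks, remaining_executions)  # 2 2
```

Only the old, finished task and its now-orphaned execution get deleted; running and recent rows survive, matching the intent of the manual SQL.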

So there are three related tables in the Postgres database, execution, task and schedule, where the vendor_type is EXECUTION_SWEEP.

The code that throws the error we see on startup of harbor-jobs is:
https://github.com/goharbor/harbor/blob/ccceacfa73db3cb26e2dd3ef8ffa8f706eef3030/src/jobservice/sync/schedule.go#L362

It's iterating over the schedules, then getting the first execution that matches the vendor_id/vendor_type tuple, and from that the first task with that execution_id. I can do that manually in the DB without issues:

harbor=> select execution.id from schedule left join execution using(vendor_type,vendor_id) where vendor_type='EXECUTION_SWEEP' limit 1;
    id     
-----------
 241475809
(1 row)

harbor=> select * from task where execution_id=241475809;
   id    | execution_id |          job_id          | status  | status_code | status_revision | status_message | run_count | extra_attrs |       creation_time        |     start_time      |     update_time     |      end_time       |   vendor_type   
---------+--------------+--------------------------+---------+-------------+-----------------+----------------+-----------+-------------+----------------------------+---------------------+---------------------+---------------------+-----------------
 2099342 |    241475809 | 9f22b791b5ccf21163071c35 | Success |           3 |      1706446838 |                |         1 | {}          | 2024-01-28 13:00:38.731168 | 2024-01-28 13:00:38 | 2024-01-29 13:13:47 | 2024-01-29 13:10:38 | EXECUTION_SWEEP
(1 row)

Then the code goes on to load the policies from redis for that, but it fails already when trying to get the execution.

To get to the DB:

root@tools-harbor-1:~# cat /srv/ops/harbor/harbor.yml | grep '^    password:'| cut -d: -f2; psql -h 5ujoynvlt5c.svc.trove.eqiad1.wikimedia.cloud harbor harbor

Jobs are stored in Redis:

root@toolsbeta-harbor-1:~# docker exec -ti redis redis-cli -n 2
127.0.0.1:6379[2]> scan 0 match *9ab67b659a81a000d4a945e4* count 10000
1) "0"
2) 1) "{harbor_job_service_namespace}:job_stats:9ab67b659a81a000d4a945e4"
127.0.0.1:6379[2]> zrange {harbor_job_service_namespace}:job_stats:9ab67b659a81a000d4a945e4 0 -1
(error) WRONGTYPE Operation against a key holding the wrong kind of value
127.0.0.1:6379[2]> get {harbor_job_service_namespace}:job_stats:9ab67b659a81a000d4a945e4
(error) WRONGTYPE Operation against a key holding the wrong kind of value
127.0.0.1:6379[2]> info {harbor_job_service_namespace}:job_stats:9ab67b659a81a000d4a945e4
127.0.0.1:6379[2]> getrange {harbor_job_service_namespace}:job_stats:9ab67b659a81a000d4a945e4 0 -1
(error) WRONGTYPE Operation against a key holding the wrong kind of value
127.0.0.1:6379[2]> type {harbor_job_service_namespace}:job_stats:9ab67b659a81a000d4a945e4
hash
127.0.0.1:6379[2]> HGETALL {harbor_job_service_namespace}:job_stats:9ab67b659a81a000d4a945e4
 1) "cron_spec"
 2) ""
 3) "unique"
 4) "0"
 5) "enqueue_time"
 6) "1723453204"
 7) "ack"
 8) "{\"status\":\"Success\",\"revision\":1723453204,\"check_in_at\":0}"
 9) "update_time"
10) "1723453204"
11) "web_hook_url"
12) "http://core:8080/service/notifications/tasks/2623104"
13) "numeric_policy_id"
14) "0"
15) "id"
16) "9ab67b659a81a000d4a945e4"
17) "ref_link"
18) "/api/v1/jobs/9ab67b659a81a000d4a945e4"
19) "name"
20) "EXECUTION_SWEEP"
21) "kind"
22) "Generic"
23) "revision"
24) "1723453204"
25) "status"
26) "Success"
27) "parameters"
28) "{\"execution_retain_counts\":{\"EXECUTION_SWEEP\":10,\"GARBAGE_COLLECTION\":50,\"IMAGE_SCAN\":1,\"P2P_PREHEAT\":50,\"PURGE_AUDIT_LOG\":10,\"REPLICATION\":50,\"RETENTION\":50,\"SCAN_ALL\":1,\"SCAN_DATA_EXPORT\":50,\"SLACK\":50,\"SYSTEM_ARTIFACT_CLEANUP\":50,\"WEBHOOK\":50}}"
29) "run_at"
30) "0"
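The parameters field in the job_stats hash above is a JSON document; decoding it shows the per-vendor execution_retain_counts the sweep runs with (the string below is copied from the HGETALL output):

```python
import json

# "parameters" value copied verbatim from the HGETALL output above.
parameters = (
    '{"execution_retain_counts":{"EXECUTION_SWEEP":10,"GARBAGE_COLLECTION":50,'
    '"IMAGE_SCAN":1,"P2P_PREHEAT":50,"PURGE_AUDIT_LOG":10,"REPLICATION":50,'
    '"RETENTION":50,"SCAN_ALL":1,"SCAN_DATA_EXPORT":50,"SLACK":50,'
    '"SYSTEM_ARTIFACT_CLEANUP":50,"WEBHOOK":50}}'
)
retain = json.loads(parameters)["execution_retain_counts"]
print(retain["EXECUTION_SWEEP"], retain["RETENTION"])  # 10 50
```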

And tools does not have an entry for any of the execution_sweep jobs:

harbor=> select * from task where execution_id in (select execution.id from schedule left join execution using(vendor_type,vendor_id) where vendor_type='EXECUTION_SWEEP' limit 1);
   id    | execution_id |          job_id          | status  | status_code | status_revision | status_message | run_count | extra_attrs |       creation_time        |     start_time      |     update_time     |      end_time       |   vendor_type   
---------+--------------+--------------------------+---------+-------------+-----------------+----------------+-----------+-------------+----------------------------+---------------------+---------------------+---------------------+-----------------
 2099342 |    241475809 | 9f22b791b5ccf21163071c35 | Success |           3 |      1706446838 |                |         1 | {}          | 2024-01-28 13:00:38.731168 | 2024-01-28 13:00:38 | 2024-01-29 13:13:47 | 2024-01-29 13:10:38 | EXECUTION_SWEEP
(1 row)

harbor=> 
\q
root@tools-harbor-1:~# docker exec -ti redis redis-cli -n 2
127.0.0.1:6379[2]> scan 0 match *9f22b791b5ccf21163071c35* count 1000
1) "65312"
2) (empty array)

I played with this a bit and it looks like we missed some important things.

  • We mistakenly thought that the error in the logs couldn't be reproduced manually, because we could run the query we believed the harbor code was having issues with without any problem. It turns out that's not exactly true.

The assumption was that harbor iterates over the schedules, gets each schedule's vendor_type and vendor_id, and uses those directly to query the execution table. What it does instead is:

  • Iterate over the schedules and, for each schedule, take its schedule.id.
  • Query the execution table for a record matching that schedule.id as vendor_id and the constant jobNameScheduler (whose value is SCHEDULER) as vendor_type.

This can be seen in the code by looking more closely at the block https://github.com/goharbor/harbor/blob/ccceacfa73db3cb26e2dd3ef8ffa8f706eef3030/src/jobservice/sync/schedule.go#L364

An attempt to reproduce the above explanation was also successful.
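To make the corrected explanation concrete, here is a toy reproduction (Python with an in-memory SQLite database, not harbor code; schema trimmed to the relevant columns, with the ids taken from the toolsbeta and tools instances):

```python
import sqlite3

# Toy reproduction of the lookup jobservice actually does: schedule.id is
# used as execution.vendor_id and the constant 'SCHEDULER' as
# execution.vendor_type -- NOT schedule.vendor_type/schedule.vendor_id.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE schedule (id INTEGER PRIMARY KEY, vendor_type TEXT, vendor_id INTEGER);
CREATE TABLE execution (id INTEGER PRIMARY KEY, vendor_type TEXT, vendor_id INTEGER);
""")
# toolsbeta-like state: a SCHEDULER execution exists for schedule 19
db.execute("INSERT INTO schedule VALUES (19, 'EXECUTION_SWEEP', -1)")
db.execute("INSERT INTO execution VALUES (562701, 'SCHEDULER', 19)")
# tools-like state: schedule 3218 has no matching SCHEDULER execution
db.execute("INSERT INTO schedule VALUES (3218, 'EXECUTION_SWEEP', -1)")

def executions_for(schedule_id):
    # the query the code at schedule.go#L364 effectively performs
    return db.execute(
        "SELECT id FROM execution WHERE vendor_type = 'SCHEDULER' AND vendor_id = ?",
        (schedule_id,),
    ).fetchall()

print(executions_for(19))    # [(562701,)]  -> toolsbeta: found
print(executions_for(3218))  # []           -> tools: 'no execution found'
```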

Toolsbeta

harbor=> select * from schedule where vendor_id=-1 and vendor_type='EXECUTION_SWEEP';
 id |       creation_time        |        update_time         |   vendor_type   | vendor_id |    cron     |    callback_func_name    | callback_func_param | cron_type | extra_attrs |  revision  
----+----------------------------+----------------------------+-----------------+-----------+-------------+--------------------------+---------------------+-----------+-------------+------------
 19 | 2023-10-20 10:23:53.947166 | 2023-10-20 10:23:53.947166 | EXECUTION_SWEEP |        -1 | 0 0 * * * * | EXECUTION_SWEEP_CALLBACK | null                | Custom    | null        | 1723500003
(1 row)

harbor=> select * from execution where vendor_id=19 and vendor_type='SCHEDULER';
   id   | vendor_type | vendor_id | status  | status_message | trigger | extra_attrs |         start_time         |      end_time       | revision |     update_time     
--------+-------------+-----------+---------+----------------+---------+-------------+----------------------------+---------------------+----------+---------------------
 562701 | SCHEDULER   |        19 | Success |                | MANUAL  | {}          | 2023-10-20 10:23:53.948532 | 2024-08-12 22:00:03 |     7147 | 2024-08-12 22:00:10
(1 row)

Tools

harbor=> select * from schedule where vendor_id=-1 and vendor_type='EXECUTION_SWEEP';
  id  |       creation_time        |        update_time         |   vendor_type   | vendor_id |    cron     |    callback_func_name    | callback_func_param | cron_type | extra_attrs |  revision  
------+----------------------------+----------------------------+-----------------+-----------+-------------+--------------------------+---------------------+-----------+-------------+------------
 3218 | 2023-10-25 08:36:26.460075 | 2023-10-25 08:36:26.460075 | EXECUTION_SWEEP |        -1 | 0 0 * * * * | EXECUTION_SWEEP_CALLBACK | null                | Custom    | null        | 1706536837
(1 row)


harbor=> select * from execution where vendor_id=3218 and vendor_type='SCHEDULER';
 id | vendor_type | vendor_id | status | status_message | trigger | extra_attrs | start_time | end_time | revision | update_time 
----+-------------+-----------+--------+----------------+---------+-------------+------------+----------+----------+-------------
(0 rows)

The above discovery is significant because worker.getTask is invoked by worker.Run. worker.Run is itself invoked in an infinite for loop by worker.Start, which only exits the loop when worker.Run returns no error. When no execution is found by worker.getTask, it returns the no execution found * error to worker.Run, which in turn returns the same error to worker.Start; as a result we will never get out of the worker.Start infinite for loop. (I am not sure this infinite loop is a problem, though, because worker.Start invokes worker.Run in a non-blocking way.)
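The retry behavior described above can be sketched like this (a Python illustration of the Go control flow, not harbor code; the max_attempts cap is added here only so the sketch terminates):

```python
# Python illustration of the worker.Start / worker.Run control flow:
# Start keeps retrying Run until it returns no error, so a permanently
# missing execution means it retries forever. max_attempts exists only
# so this sketch terminates.
def get_task():
    # stands in for worker.getTask when the execution row is missing
    raise LookupError("no execution found for schedule: EXECUTION_SWEEP:-1")

def run():
    try:
        get_task()
        return None      # success: Start may stop retrying
    except LookupError as err:
        return err       # propagate the error up to Start

def start(max_attempts):
    attempts = 0
    while attempts < max_attempts:   # the real loop has no bound
        attempts += 1
        if run() is None:
            break                    # the only way out: Run succeeded
    return attempts

attempts = start(max_attempts=5)
print(attempts)  # 5: every attempt failed, so the cap was exhausted
```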

You can see this loop behavior with docker logs harbor-jobservice | grep -F '[ERROR] [/jobservice/sync/schedule.go' | head -n 10000. Observe how it keeps repeating the same errors over and over.

Possible next steps

  • Figure out why there is no record of the schedule in the execution table. Did we remove the records manually? Is it that for some reason harbor can successfully create a schedule but not an execution, even though both are handled by the same method https://github.com/goharbor/harbor/blob/ccceacfa73db3cb26e2dd3ef8ffa8f706eef3030/src/pkg/scheduler/scheduler.go#L95 ?
  • What happens if we delete the schedule records that have no counterpart in the execution table from the schedule table? (this will probably be an issue)

Do we have to manually create an execution and a corresponding task for the failing schedules above? Can that solve our problem?

  • Also something to think about: the majority of our schedules are RETENTION (I dare say more than 80%). Can the fact that all of those schedules are set to run at the same time cause resource hogging?

Do we have to manually create an execution and a corresponding task for the failing schedules above? Can that solve our problem?

Those exist already, but feel free to inject new ones (being careful, make backups and such).

Also something to think about: the majority of our schedules are RETENTION (I dare say more than 80%). Can the fact that all of those schedules are set to run at the same time cause resource hogging?

The RETENTION schedules are created per-project; in tools we have several, in toolsbeta only a few. It could be causing resource contention, but there's no trace at all that it even tried to run the EXECUTION_SWEEP, while all the RETENTION ones run OK, as do others like GARBAGE_COLLECTION.

When we created the 3.5k projects at the same time you could see the jobs taking some time to run (the pending jobs would pile up), but I have not seen that since, as we now only create them on-demand.

image.png (509×1 px, 59 KB)

No @dcaro, execution records for the failing schedules do not exist. We only have one schedule with id 3218, vendor_id -1 and vendor_type EXECUTION_SWEEP in the tools harbor db. I don't know the reason for this, but according to the code in the schedule.go file, harbor only makes use of the schedule id (3218 above) as vendor_id, together with the constant value SCHEDULER as vendor_type. Trying to get an execution record the way harbor does, with vendor_id 3218 and vendor_type SCHEDULER, returns zero rows for tools; doing the same for toolsbeta returns a single row. The same applies to all the schedules that throw the no execution found for schedule error.

Can you elaborate? (Also add the queries if needed.)

As far as I can see in the code, there's no limit on which schedules it iterates over:
https://github.com/goharbor/harbor/blob/ccceacfa73db3cb26e2dd3ef8ffa8f706eef3030/src/jobservice/sync/schedule.go#L197

So it just goes over all of them, then for each of them it gets all the executions with a matching vendor_id/vendor_type tuple.
Focusing only on the EXECUTION_SWEEP one (the one with id 3218):

2024-08-13T14:31:58Z [ERROR] [/jobservice/sync/schedule.go:238]: no execution found for schedule: EXECUTION_SWEEP:-1

You can also find 6 executions with that tuple:

harbor=> select * from execution where vendor_id=-1 and vendor_type='EXECUTION_SWEEP';
    id     |   vendor_type   | vendor_id | status  | status_message | trigger  |                                                                                                                       extra_attrs                                                                                                                       |         start_time         |      end_time       | revision |     update_time     
-----------+-----------------+-----------+---------+----------------+----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------+---------------------+----------+---------------------
 241475809 | EXECUTION_SWEEP |        -1 | Success |                | SCHEDULE | {"execution_retain_counts":{"EXECUTION_SWEEP":10,"GARBAGE_COLLECTION":50,"IMAGE_SCAN":1,"P2P_PREHEAT":50,"PURGE_AUDIT_LOG":10,"REPLICATION":50,"RETENTION":50,"SCAN_ALL":1,"SCAN_DATA_EXPORT":50,"SLACK":50,"SYSTEM_ARTIFACT_CLEANUP":50,"WEBHOOK":50}} | 2024-01-28 13:00:38.729526 | 2024-01-29 13:10:38 |        1 | 2024-01-29 13:11:33
 242293361 | EXECUTION_SWEEP |        -1 | Success |                | SCHEDULE | {"execution_retain_counts":{"EXECUTION_SWEEP":10,"GARBAGE_COLLECTION":50,"IMAGE_SCAN":1,"P2P_PREHEAT":50,"PURGE_AUDIT_LOG":10,"REPLICATION":50,"RETENTION":50,"SCAN_ALL":1,"SCAN_DATA_EXPORT":50,"SLACK":50,"SYSTEM_ARTIFACT_CLEANUP":50,"WEBHOOK":50}} | 2024-01-29 10:00:42.098661 | 0001-01-01 00:00:00 |        1 | 2024-01-29 13:13:48
 242371241 | EXECUTION_SWEEP |        -1 | Success |                | SCHEDULE | {"execution_retain_counts":{"EXECUTION_SWEEP":10,"GARBAGE_COLLECTION":50,"IMAGE_SCAN":1,"P2P_PREHEAT":50,"PURGE_AUDIT_LOG":10,"REPLICATION":50,"RETENTION":50,"SCAN_ALL":1,"SCAN_DATA_EXPORT":50,"SLACK":50,"SYSTEM_ARTIFACT_CLEANUP":50,"WEBHOOK":50}} | 2024-01-29 12:00:34.410626 | 0001-01-01 00:00:00 |        1 | 2024-01-29 13:13:48
 242332298 | EXECUTION_SWEEP |        -1 | Success |                | SCHEDULE | {"execution_retain_counts":{"EXECUTION_SWEEP":10,"GARBAGE_COLLECTION":50,"IMAGE_SCAN":1,"P2P_PREHEAT":50,"PURGE_AUDIT_LOG":10,"REPLICATION":50,"RETENTION":50,"SCAN_ALL":1,"SCAN_DATA_EXPORT":50,"SLACK":50,"SYSTEM_ARTIFACT_CLEANUP":50,"WEBHOOK":50}} | 2024-01-29 11:00:40.478287 | 0001-01-01 00:00:00 |        1 | 2024-01-29 13:13:48
 242410187 | EXECUTION_SWEEP |        -1 | Success |                | SCHEDULE | {"execution_retain_counts":{"EXECUTION_SWEEP":10,"GARBAGE_COLLECTION":50,"IMAGE_SCAN":1,"P2P_PREHEAT":50,"PURGE_AUDIT_LOG":10,"REPLICATION":50,"RETENTION":50,"SCAN_ALL":1,"SCAN_DATA_EXPORT":50,"SLACK":50,"SYSTEM_ARTIFACT_CLEANUP":50,"WEBHOOK":50}} | 2024-01-29 13:00:40.303609 | 0001-01-01 00:00:00 |        1 | 2024-01-29 13:13:48
 242449127 | EXECUTION_SWEEP |        -1 | Success |                | SCHEDULE | {"execution_retain_counts":{"EXECUTION_SWEEP":10,"GARBAGE_COLLECTION":50,"IMAGE_SCAN":1,"P2P_PREHEAT":50,"PURGE_AUDIT_LOG":10,"REPLICATION":50,"RETENTION":50,"SCAN_ALL":1,"SCAN_DATA_EXPORT":50,"SLACK":50,"SYSTEM_ARTIFACT_CLEANUP":50,"WEBHOOK":50}} | 2024-01-29 14:00:37.550226 | 0001-01-01 00:00:00 |        1 | 2024-01-29 14:51:38
(6 rows)

This is what the harbor code is doing to get executions:

executions, err := w.coreExecutionManager.List(ctx, &q.Query{
    Keywords: map[string]interface{}{
        "vendor_type": scheduler.JobNameScheduler,
        "vendor_id":   schedule.ID,
    },
})

As we can see from the above, they don't make use of schedule.vendor_id and schedule.vendor_type; what they use is the constant JobNameScheduler, whose value is SCHEDULER (I checked), as vendor_type, and schedule.ID as vendor_id.

The queries that prove this can be found in my first comment https://phabricator.wikimedia.org/T370843#10060402

Oh, I see, nice! That looks quite promising :)

The execution info for that vendor type has a policy id; we will have to figure that out too.

By looking at https://github.com/goharbor/harbor/blob/ccceacfa73db3cb26e2dd3ef8ffa8f706eef3030/src/jobservice/sync/schedule.go#L249, I suspect that the policy should be OK as long as it has a unique id (it seems to get regenerated if it's not there).

Yes, that is true; the redis side will probably autocorrect if it can find the execution. What we don't know yet is why those executions can't be found in the first place. We can manually recreate them and see what happens. For the retention schedules we may need to check whether they are still necessary or are relics from the time we were still creating everything at once. Finding out why they are not in the execution table, and what happens when they are created, seems to be the key to our problems.

They might have been removed the last time we did a cleanup (or lost when the DB crashed).

+1 to trying to create them manually. For the retention ones, worst case scenario we can remove them and recreate them from the API/UI, but the EXECUTION_SWEEP one can't be created from there.

Update:

  • I deleted the schedules from the schedule table whose records can't be found in the execution table. I did that after I was certain that system schedules will be recreated if they can't be found (part of this is handled by https://github.com/goharbor/harbor/blob/ccceacfa73db3cb26e2dd3ef8ffa8f706eef3030/src/controller/task/sweep.go#L102).

Doing this fixed the "no execution found for schedule: EXECUTION_SWEEP:-1" error.
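For reference, a query along these lines finds such orphan schedules (a sketch, not necessarily the exact statement I ran):

```sql
-- Schedules with no matching 'SCHEDULER' execution: these are the
-- rows that trigger "no execution found for schedule".
SELECT s.id
FROM schedule s
LEFT JOIN execution e
       ON e.vendor_type = 'SCHEDULER'
      AND e.vendor_id = s.id
WHERE e.id IS NULL;
```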

  • After doing the above, I started seeing a different set of errors (or maybe they were there before and we just weren't focusing on them):
root@tools-harbor-1:/home/raymond-ndibe# docker logs harbor-jobservice --since 2024-08-13T22:55:00Z --until 2024-08-13T23:00:10Z | grep -F '[ERROR]' | grep -E --color 'ERROR|$'
2024-08-13T22:55:00Z [ERROR] [/jobservice/runtime/bootstrap.go:168]: trigger hook event error: {"errors":[{"code":"NOT_FOUND","message":"task with job ID d511104b4c19cdf5f2a724d2 not found"}]}
2024-08-13T22:55:00Z [ERROR] [/jobservice/hook/hook_agent.go:155]: Retry: sending hook event error: {"errors":[{"code":"NOT_FOUND","message":"task with job ID d511104b4c19cdf5f2a724d2 not found"}]}
2024-08-13T22:55:02Z [ERROR] [/jobservice/runtime/bootstrap.go:168]: trigger hook event error: {"errors":[{"code":"NOT_FOUND","message":"task with job ID 28c540ee73e94163526eb898 not found"}]}
2024-08-13T22:55:02Z [ERROR] [/jobservice/hook/hook_agent.go:155]: Retry: sending hook event error: {"errors":[{"code":"NOT_FOUND","message":"task with job ID 28c540ee73e94163526eb898 not found"}]}
2024-08-13T22:55:02Z [ERROR] [/jobservice/runtime/bootstrap.go:168]: trigger hook event error: {"errors":[{"code":"NOT_FOUND","message":"task with job ID 28c540ee73e94163526eb898 not found"}]}
2024-08-13T22:55:02Z [ERROR] [/jobservice/hook/hook_agent.go:155]: Retry: sending hook event error: {"errors":[{"code":"NOT_FOUND","message":"task with job ID 28c540ee73e94163526eb898 not found"}]}
2024-08-13T22:55:02Z [ERROR] [/jobservice/runtime/bootstrap.go:168]: trigger hook event error: {"errors":[{"code":"NOT_FOUND","message":"task with job ID 28c540ee73e94163526eb898 not found"}]}
2024-08-13T22:55:02Z [ERROR] [/jobservice/hook/hook_agent.go:155]: Retry: sending hook event error: {"errors":[{"code":"NOT_FOUND","message":"task with job ID 28c540ee73e94163526eb898 not found"}]}
2024-08-13T23:00:08Z [ERROR] [/pkg/task/sweep_job.go:175]: [RETENTION] failed to batch clean candidates, error: failed to delete executions: ERROR: update or delete on table "execution" violates foreign key constraint "task_execution_id_fkey" on table "task" (SQLSTATE 23503)
2024-08-13T23:00:08Z [ERROR] [/pkg/task/sweep_job.go:110]: [RETENTION] failed to run sweep, error: failed to delete executions: ERROR: update or delete on table "execution" violates foreign key constraint "task_execution_id_fkey" on table "task" (SQLSTATE 23503)
2024-08-13T23:00:08Z [ERROR] [/jobservice/runner/redis.go:122]: Job 'EXECUTION_SWEEP:6520ae7a7c0d27a933aadb2b' exit with error: run error: {"errors":[{"code":"UNKNOWN","message":"failed to delete executions: ERROR: update or delete on table \"execution\" violates foreign key constraint \"task_execution_id_fkey\" on table \"task\" (SQLSTATE 23503)"}]}
root@tools-harbor-1:/home/raymond-ndibe#

The errors from /jobservice/hook/hook_agent.go:155 and /jobservice/runtime/bootstrap.go:168 look like the result of some race condition, and after a couple of retries they go away. The more interesting error, however, is the one from /pkg/task/sweep_job.go (it seems to be the same one reported by /jobservice/runner/redis.go:122).
This error is the reason executions and tasks are not getting deleted, and therefore the reason our database keeps blowing up.

The actual cleanup of tasks and executions is handled by https://github.com/goharbor/harbor/blob/ccceacfa73db3cb26e2dd3ef8ffa8f706eef3030/src/pkg/task/sweep_manager.go#L162.

In that file, tasks are always deleted before executions, to avoid the foreign-key constraint error we see above. But tasks can only be deleted when their status_code is 3. It turns out that we had some stale tasks in pending and stopped states that are also no longer in redis, and these tasks were preventing us from deleting the many task records in the task table, which in turn was preventing us from deleting the execution records in the execution table.
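Simplified, the sweep's delete order looks like this, which is exactly where the stale tasks get in the way (a sketch of the behavior, not Harbor's literal SQL; 12345 is a hypothetical candidate execution id):

```sql
-- Step 1: the sweep only deletes tasks that reached a terminal state
-- (status_code = 3), so stale pending/stopped tasks survive this.
DELETE FROM task
WHERE execution_id = 12345
  AND status_code = 3;

-- Step 2: deleting the execution then fails, because any surviving
-- stale task still references it through task_execution_id_fkey
-- (SQLSTATE 23503), and the whole sweep errors out.
DELETE FROM execution
WHERE id = 12345;
```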

The next thing I did was to find the culprits with select * from task where status_code != 3; and then delete them with delete from task where status_code != 3;

In the next few minutes we'll know whether this worked. I still don't know how it all began in the first place, so I can't confidently say that it won't happen again.

Seems to be working right now 💃🕺💃🕺💃🕺💃🕺.

  • The execution table went from 50k+ records to 4k+
  • Similar for the task table
root@tools-harbor-1:/home/raymond-ndibe# docker logs harbor-jobservice --since 2024-08-14T03:00:03Z | grep -F '[INFO] [/pkg/task/sweep_job.go'
2024-08-14T03:00:04Z [INFO] [/pkg/task/sweep_job.go:91]: start to run sweep job
2024-08-14T03:00:06Z [INFO] [/pkg/task/sweep_job.go:150]: [P2P_PREHEAT] start to sweep, retain latest 50 executions
2024-08-14T03:00:06Z [INFO] [/pkg/task/sweep_job.go:160]: [P2P_PREHEAT] listed 0 candidate executions for sweep
2024-08-14T03:00:06Z [INFO] [/pkg/task/sweep_job.go:180]: [P2P_PREHEAT] end to sweep, 0 executions were deleted in total, elapsed time: 2.52058ms
2024-08-14T03:00:06Z [INFO] [/pkg/task/sweep_job.go:150]: [PURGE_AUDIT_LOG] start to sweep, retain latest 10 executions
2024-08-14T03:00:06Z [INFO] [/pkg/task/sweep_job.go:160]: [PURGE_AUDIT_LOG] listed 1 candidate executions for sweep
2024-08-14T03:00:06Z [INFO] [/pkg/task/sweep_job.go:180]: [PURGE_AUDIT_LOG] end to sweep, 1 executions were deleted in total, elapsed time: 18.053031ms
2024-08-14T03:00:06Z [INFO] [/pkg/task/sweep_job.go:150]: [RETENTION] start to sweep, retain latest 50 executions
2024-08-14T03:00:07Z [INFO] [/pkg/task/sweep_job.go:160]: [RETENTION] listed 58109 candidate executions for sweep
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:180]: [RETENTION] end to sweep, 58109 executions were deleted in total, elapsed time: 2.768114649s
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:150]: [SCAN_ALL] start to sweep, retain latest 1 executions
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:160]: [SCAN_ALL] listed 0 candidate executions for sweep
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:180]: [SCAN_ALL] end to sweep, 0 executions were deleted in total, elapsed time: 1.258635ms
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:150]: [SCAN_DATA_EXPORT] start to sweep, retain latest 50 executions
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:160]: [SCAN_DATA_EXPORT] listed 0 candidate executions for sweep
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:180]: [SCAN_DATA_EXPORT] end to sweep, 0 executions were deleted in total, elapsed time: 860.929µs
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:150]: [SLACK] start to sweep, retain latest 50 executions
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:160]: [SLACK] listed 0 candidate executions for sweep
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:180]: [SLACK] end to sweep, 0 executions were deleted in total, elapsed time: 925.737µs
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:150]: [GARBAGE_COLLECTION] start to sweep, retain latest 50 executions
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:160]: [GARBAGE_COLLECTION] listed 1 candidate executions for sweep
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:180]: [GARBAGE_COLLECTION] end to sweep, 1 executions were deleted in total, elapsed time: 10.888631ms
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:150]: [IMAGE_SCAN] start to sweep, retain latest 1 executions
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:160]: [IMAGE_SCAN] listed 0 candidate executions for sweep
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:180]: [IMAGE_SCAN] end to sweep, 0 executions were deleted in total, elapsed time: 760.149µs
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:150]: [SYSTEM_ARTIFACT_CLEANUP] start to sweep, retain latest 50 executions
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:160]: [SYSTEM_ARTIFACT_CLEANUP] listed 0 candidate executions for sweep
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:180]: [SYSTEM_ARTIFACT_CLEANUP] end to sweep, 0 executions were deleted in total, elapsed time: 1.58897ms
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:150]: [WEBHOOK] start to sweep, retain latest 50 executions
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:160]: [WEBHOOK] listed 0 candidate executions for sweep
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:180]: [WEBHOOK] end to sweep, 0 executions were deleted in total, elapsed time: 655.146µs
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:150]: [EXECUTION_SWEEP] start to sweep, retain latest 10 executions
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:160]: [EXECUTION_SWEEP] listed 0 candidate executions for sweep
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:180]: [EXECUTION_SWEEP] end to sweep, 0 executions were deleted in total, elapsed time: 2.033957ms
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:150]: [REPLICATION] start to sweep, retain latest 50 executions
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:160]: [REPLICATION] listed 0 candidate executions for sweep
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:180]: [REPLICATION] end to sweep, 0 executions were deleted in total, elapsed time: 720.633µs
2024-08-14T03:00:08Z [INFO] [/pkg/task/sweep_job.go:115]: end to run sweep job
root@tools-harbor-1:/home/raymond-ndibe#

We should monitor this for a few more days/weeks before we can confidently close the task as resolved.
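A quick row-count check like this (assuming psql access to the harbor DB, as in the earlier sessions) should make any regrowth obvious over the coming days:

```sql
-- Row counts to watch; they should stay roughly flat after each
-- sweep run instead of growing without bound.
SELECT 'execution' AS tbl, count(*) AS row_count FROM execution
UNION ALL
SELECT 'task', count(*) FROM task;
```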

dcaro changed the task status from Open to In Progress.Aug 14 2024, 9:39 AM
dcaro reassigned this task from dcaro to Raymond_Ndibe.
dcaro moved this task from Next Up to In Progress on the Toolforge (Toolforge iteration 14) board.

This can be closed now; the current size is <100 MB for the biggest table:

harbor=> select schemaname as table_schema,
relname as table_name,
pg_size_pretty(pg_relation_size(relid)) as data_size
from pg_catalog.pg_statio_user_tables
order by pg_relation_size(relid) desc;
 table_schema |         table_name          | data_size  
--------------+-----------------------------+------------
 public       | execution                   | 89 MB
 public       | task                        | 81 MB
 public       | project                     | 832 kB
 public       | artifact_blob               | 824 kB
 public       | project_metadata            | 656 kB
 public       | project_member              | 648 kB
 public       | cve_allowlist               | 568 kB
 public       | blob                        | 440 kB
 public       | artifact                    | 336 kB
 public       | project_blob                | 216 kB
...