Wikifunctions CI not cleaning up envs correctly
Closed, Resolved | Public

Description

Today I noticed a number of CI envs that hadn't been successfully removed:

$ kubectl -n cat-env get pod | grep aw-wf
aw-wf-func-orch-ci-769349-4942-py-evaluator-6789649cdc-6k7rv   1/1     Running   0             10h
aw-wf-func-orch-ci-769349-4942-js-evaluator-76495bc74d-94txs   1/1     Running   0             10h
aw-wf-func-orch-ci-769349-4942-mariadb-85d9b7f75c-xf47p        1/1     Running   0             10h
aw-wf-func-orch-ci-769349-4942-artifact-warehouse              1/1     Running   0             10h
aw-wf-func-orch-ci-769374-4944-py-evaluator-75887c5545-zxrfq   1/1     Running   0             10h
aw-wf-func-orch-ci-769374-4944-js-evaluator-549cc4d68b-6bbmp   1/1     Running   0             10h
aw-wf-func-orch-ci-769374-4944-mariadb-6f5d88b6d9-k6z2h        1/1     Running   0             10h
aw-wf-func-orch-ci-769374-4944-artifact-warehouse              1/1     Running   0             10h
aw-wf-func-orch-ci-769349-4942-mediawiki-77b9f66d7d-xnp2t      4/4     Running   0             10h
aw-wf-func-orch-ci-769374-4944-mediawiki-666f6d9c7-dl4sw       4/4     Running   0             10h
aw-wf-func-orch-ci-769392-4945-js-evaluator-7c6fb8b76c-zxx68   1/1     Running   0             9h
aw-wf-func-orch-ci-769392-4945-mariadb-76c4ff4c87-4r9pt        1/1     Running   0             9h
aw-wf-func-orch-ci-769392-4945-artifact-warehouse              1/1     Running   0             9h
aw-wf-func-orch-ci-769392-4945-py-evaluator-cfb7c9568-nlfzn    1/1     Running   0             9h
aw-wf-func-orch-ci-769392-4945-mediawiki-586545d9db-27pzj      4/4     Running   0             9h
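To see at a glance which envs are left over, the listing above can be grouped by env id. This is only a rough sketch: it assumes the pod names follow the pattern aw-wf-func-orch-ci-<job id>-<env id>-<component>, which is inferred from the listing and the job links in this task rather than from any documented naming scheme. The function name group_pods_by_env is made up for illustration.

```python
import re
from collections import defaultdict

# Assumed naming scheme, inferred from the pod listing in this task:
# aw-wf-func-orch-ci-<job id>-<env id>-<component>[-<hash>-<suffix>]
POD_NAME = re.compile(r"^aw-wf-func-orch-ci-(?P<job>\d+)-(?P<env>\d+)-\S+")


def group_pods_by_env(kubectl_output):
    """Map env id -> {"job": job id, "pods": [pod names]} from `kubectl get pod` text."""
    envs = defaultdict(lambda: {"job": None, "pods": []})
    for line in kubectl_output.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        match = POD_NAME.match(fields[0])
        if match is None:
            continue  # skip headers and unrelated pods
        entry = envs[int(match["env"])]
        entry["job"] = int(match["job"])
        entry["pods"].append(fields[0])
    return dict(envs)


# Three lines copied from the listing above.
sample = """\
aw-wf-func-orch-ci-769349-4942-mariadb-85d9b7f75c-xf47p        1/1     Running   0             10h
aw-wf-func-orch-ci-769392-4945-artifact-warehouse              1/1     Running   0             9h
aw-wf-func-orch-ci-769392-4945-mediawiki-586545d9db-27pzj      4/4     Running   0             9h
"""

leftover = group_pods_by_env(sample)
for env_id, info in sorted(leftover.items()):
    print(env_id, info["job"], len(info["pods"]))
```

Run against the full listing, this should report three leftover envs (4942, 4944, 4945), each tied to the job that created it.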

Looking at e.g. the pipeline for env 4945 shows something weird: https://gitlab.wikimedia.org/repos/abstract-wiki/wikifunctions/function-orchestrator/-/pipelines/172692

The pipeline created env 4945: https://gitlab.wikimedia.org/repos/abstract-wiki/wikifunctions/function-orchestrator/-/jobs/769392
But then it cleaned up env 4943: https://gitlab.wikimedia.org/repos/abstract-wiki/wikifunctions/function-orchestrator/-/jobs/769370
Both jobs belong to pipeline 172692.

Details

Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
Set deleteExpiredWikis schedule | repos/test-platform/catalyst/catalyst-api!177 | jhuneidi | jhuneidi-main-patch-47419 | main

Event Timeline

The environment that got cleaned up, 4943, was from the earlier failed job in that pipeline; 4945 was created when the job was retried. After the first job failed, catalyst-cleanup ran and deleted 4943. When the job was retried, 4945 was created, but since catalyst-cleanup had already passed once, it didn't run again. I looked into getting it to run again after a retry of the deploy job, but I haven't found any GitLab CI configuration that does this. There might be a hack around it if we really want the subsequent jobs to run again. In any case, I tested it, and a manual rerun of catalyst-cleanup would have deleted 4945.
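The sequence of events can be replayed as a toy model to show why 4945 was left behind. This is a sketch, not how Catalyst or GitLab actually implement anything: it assumes a cleanup run removes every env the pipeline has created up to that point, and the function name leaked_envs and the event tuples are invented for illustration.

```python
def leaked_envs(events):
    """Replay pipeline events in order and return the env ids still alive.

    events: list of ("deploy", env_id) or ("cleanup", None) tuples.
    Assumption: a cleanup run removes every env created so far.
    """
    alive = []
    for kind, env_id in events:
        if kind == "deploy":
            alive.append(env_id)
        elif kind == "cleanup":
            alive.clear()
        else:
            raise ValueError(f"unknown event kind: {kind!r}")
    return alive


timeline = [
    ("deploy", 4943),   # first deploy job, which failed
    ("cleanup", None),  # catalyst-cleanup runs once and removes 4943
    ("deploy", 4945),   # deploy job retried; cleanup is not triggered again
]

print(leaked_envs(timeline))                        # [4945]
print(leaked_envs(timeline + [("cleanup", None)]))  # []
```

The second call corresponds to the manual rerun of catalyst-cleanup, after which nothing is left over.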

The cleanup job should be running hourly, though, so those environments should have been cleaned up. We should look into that.

I found that the helm values for catalyst-api still had the old default and were overriding the changes I made to the schedule in catalyst-api.
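The override can be pictured as a plain precedence merge, where values supplied at deploy time win over defaults set in the application. This is a minimal sketch under stated assumptions: the key name deleteExpiredWikisSchedule, the cron-style hourly value, and the "<old default>" placeholder are all hypothetical, since the real keys and values are not shown in this task.

```python
def effective_config(app_defaults, helm_values):
    """Return the merged config; deploy-time values take precedence over app defaults."""
    return {**app_defaults, **helm_values}


# Hypothetical key and values, for illustration only.
app_defaults = {"deleteExpiredWikisSchedule": "0 * * * *"}      # assumed hourly schedule, cron syntax
helm_values = {"deleteExpiredWikisSchedule": "<old default>"}   # stale value left in the chart

print(effective_config(app_defaults, helm_values))
# {'deleteExpiredWikisSchedule': '<old default>'}

# Once the stale chart value is removed, the application default applies.
print(effective_config(app_defaults, {}))
# {'deleteExpiredWikisSchedule': '0 * * * *'}
```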

thcipriani assigned this task to jeena.
thcipriani edited projects, added Catalyst (Luka Ijo Pimeja Jan); removed Catalyst.
thcipriani subscribed.

Summary:

  • When the jobs in a pipeline pass, environments get cleaned up at the end of the run
  • If folks retry the deploy and don't manually retry the cleanup afterwards (which... will probably happen), the environments will be cleaned up in about an hour