
Clean up Wikilambda Catalyst environments regardless of test failures
Open, Needs Triage, Public, 3 Estimated Story Points

Description

We noticed a lot of environments running for Wikilambda while no tests were running. Our suspicion is that these are lingering environments that are not cleaned up when tests fail.

Details

Related Changes in Gerrit:
Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
Check hourly for expired wikis | repos/test-platform/catalyst/patchdemo!272 | jhuneidi | T416391 | main
Delete expired wikis hourly | repos/test-platform/catalyst/catalyst-api!162 | jhuneidi | T416391 | main
Write to artifact file directly after env creation | repos/test-platform/catalyst/catalyst-ci-client!2 | jhuneidi | jhuneidi-main-patch-cf68 | main

Event Timeline

It looks like the environments for Wikilambda CI that aren't getting cleaned up are those that did not get a response from Catalyst in time. When the create request to Catalyst times out, we never receive an environment ID back, so even though the post-build script still runs, it has nothing to delete.

+ exec docker run --entrypoint=/deploy_env.py -e WIKILAMBDA_REF=16/1229116/19 -e ZUUL_CHANGE=1229116 -e ENV_API_PATH=https://api.catalyst.wmcloud.org/api/environments -e NPM_ARGS=selenium-test -e MEDIAWIKI_USER=Admin -e MEDIAWIKI_PASSWORD=dockerpass -e MW_SCRIPT_PATH=/w --volume /srv/jenkins/workspace/wikilambda-catalyst-end-to-end/src:/src --volume /srv/jenkins/workspace/wikilambda-catalyst-end-to-end/cache:/cache --volume /srv/jenkins/workspace/wikilambda-catalyst-end-to-end/log:/log --security-opt seccomp=unconfined --init --rm --label jenkins.job=wikilambda-catalyst-end-to-end --label jenkins.build=1920 --env-file /dev/fd/63 docker-registry.wikimedia.org/releng/catalyst:1.3.0-s1
++ set +x
ERROR: Failed to create Wikifunctions environment: 504 Server Error: Gateway Time-out for url: https://api.catalyst.wmcloud.org/api/environments
Build step 'Execute shell' marked build as failure
PostBuildScript
Archiving artifacts
[PostBuildScript] - [INFO] Executing post build scripts.
[wikilambda-catalyst-end-to-end] $ /bin/bash -xe /tmp/jenkins4184931845332480205.sh
++ cat log/envid
cat: log/envid: No such file or directory
+ eval ''
+ curl -X DELETE https://api.catalyst.wmcloud.org/api/environments/ -H 'Authorization: ApiToken ****'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
...

See https://integration.wikimedia.org/ci/view/All%20jobs/job/wikilambda-catalyst-end-to-end/1920/console
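A minimal way to harden the post-build cleanup against this failure mode could look like the sketch below. `log/envid` and `ENV_API_PATH` come from the log above; `CATALYST_API_TOKEN` is an assumed name for the masked token, and this is a sketch, not the actual job script.

```shell
# Sketch: only attempt the DELETE when an environment id was actually
# recorded; otherwise the bare "DELETE .../environments/" call seen in the
# log above can't do anything useful.
ENV_API_PATH="${ENV_API_PATH:-https://api.catalyst.wmcloud.org/api/environments}"
if [ -s log/envid ]; then
  ENV_ID="$(cat log/envid)"
  curl -fsS -X DELETE "${ENV_API_PATH}/${ENV_ID}" \
    -H "Authorization: ApiToken ${CATALYST_API_TOKEN}"
  CLEANUP="deleted"
else
  # The create call timed out before returning an id, so there is nothing
  # the job side can delete; server-side expiry has to catch the env.
  CLEANUP="skipped"
fi
echo "cleanup: ${CLEANUP}"
```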

The timeout period is 2 minutes, which seems generous to me.

> It looks like the environments for Wikilambda CI that aren't getting cleaned up are those that did not get a response from Catalyst in time. When the create request to Catalyst times out, we never receive an environment ID back, so even though the post-build script still runs, it has nothing to delete.
> [...]
> See https://integration.wikimedia.org/ci/view/All%20jobs/job/wikilambda-catalyst-end-to-end/1920/console

That failed run and other similar ones line up with the times for yesterday's incident: https://wikitech.wikimedia.org/wiki/Catalyst/Incidents/2026-02-10

The cluster was already flailing when those pipelines ran, which is probably why they were seeing time-outs (and other errors: https://integration.wikimedia.org/ci/view/All%20jobs/job/wikilambda-catalyst-end-to-end/1927/console)

I caught one in the wild, this time from gitlab: https://gitlab.wikimedia.org/repos/test-platform/catalyst/catalyst-ci-client/-/pipelines/164402

The pipeline left behind a (now deleted) env. The creation job succeeded in creating the env and got an ID back (3975), but then timed out waiting for the logs: https://gitlab.wikimedia.org/repos/test-platform/catalyst/catalyst-ci-client/-/jobs/739151

The pipeline didn't run the deletion job, so the env was left behind. That points to a fix we can make right away: always run the deletion job, unconditionally.
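In GitLab CI terms, that fix is a `when: always` rule on the deletion job. A sketch only: the job name, stage name, and helper script here are assumptions, not taken from the actual pipeline config.

```yaml
# Hypothetical deletion job: "when: always" makes it run even if the
# creation or test jobs fail or time out earlier in the pipeline.
delete-environment:
  stage: cleanup
  when: always
  script:
    - ./delete_env.sh   # assumed helper that reads the recorded env id
```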

Yes, that's correct. I remember having a conversation about it so when I saw this task I thought there must be another task for it, but I don't see one.

Another data point. Today the Jenkins Wikilambda CI left a bunch of envs behind:

mw-ext-wl-ci-1238784-24724-4109-py-evaluator-547786dd5c-snpqv   1/1     Running   0               85m
mw-ext-wl-ci-1238784-24724-4109-js-evaluator-97fff64f4-trkbd    1/1     Running   0               85m
mw-ext-wl-ci-1238784-24724-4109-artifact-warehouse              1/1     Running   0               85m
mw-ext-wl-ci-1238784-24724-4109-mariadb-5b4685c7b9-t6qgz        1/1     Running   0               85m
mw-ext-wl-ci-1238784-24724-4109-mediawiki-88494f449-jszph       4/4     Running   0               85m
mw-ext-wl-ci-1239152-82482-4110-js-evaluator-7dd6465b6f-bshbn   1/1     Running   0               79m
mw-ext-wl-ci-1239152-82482-4110-py-evaluator-c756459cd-5njgb    1/1     Running   0               79m
mw-ext-wl-ci-1239344-35683-4111-js-evaluator-7cc5cc6dc4-qfsfd   1/1     Running   0               79m
mw-ext-wl-ci-1239344-35683-4111-py-evaluator-54797b96f7-q75zx   1/1     Running   0               79m
mw-ext-wl-ci-1239152-82482-4110-artifact-warehouse              1/1     Running   0               79m
mw-ext-wl-ci-1239344-35683-4111-artifact-warehouse              1/1     Running   0               79m
mw-ext-wl-ci-1239152-82482-4110-mariadb-689dbdc869-t6dm7        1/1     Running   0               79m
mw-ext-wl-ci-1239344-35683-4111-mariadb-f95cb96cc-tp55n         1/1     Running   0               79m
mw-ext-wl-ci-1239345-10059-4112-py-evaluator-65b68bb7c5-hw8xd   1/1     Running   0               78m
mw-ext-wl-ci-1239345-10059-4112-js-evaluator-69c4755b7b-lqnnw   1/1     Running   0               78m
mw-ext-wl-ci-1239345-10059-4112-artifact-warehouse              1/1     Running   0               78m
mw-ext-wl-ci-1239345-10059-4112-mariadb-68bd8bc87c-6bjzp        1/1     Running   0               78m
mw-ext-wl-ci-1239152-59560-4113-py-evaluator-587bc78bc8-gksvs   1/1     Running   0               78m
mw-ext-wl-ci-1239152-59560-4113-artifact-warehouse              1/1     Running   0               78m
mw-ext-wl-ci-1239152-59560-4113-mariadb-68f895bf85-469ls        1/1     Running   0               78m
mw-ext-wl-ci-1239152-59560-4113-js-evaluator-67689db7c7-wpxgn   1/1     Running   0               78m
mw-ext-wl-ci-1239344-62262-4114-js-evaluator-55cc55dff8-kmd6x   1/1     Running   0               77m
mw-ext-wl-ci-1239344-62262-4114-artifact-warehouse              1/1     Running   0               77m
mw-ext-wl-ci-1239344-62262-4114-mariadb-65c79bbb57-k24gs        1/1     Running   0               77m
mw-ext-wl-ci-1239344-62262-4114-py-evaluator-75c4f5cff7-b8778   1/1     Running   0               77m
mw-ext-wl-ci-1239344-35683-4111-mediawiki-6cf876bb58-4vmtz      4/4     Running   0               79m
mw-ext-wl-ci-1239344-62262-4114-mediawiki-5d54bcdcc9-gqmpt      4/4     Running   0               77m
mw-ext-wl-ci-1239152-82482-4110-mediawiki-5fb7ddb8cb-b9jhs      4/4     Running   0               79m
mw-ext-wl-ci-1239345-10059-4112-mediawiki-5c64d87fb7-9x9j6      4/4     Running   0               78m
mw-ext-wl-ci-1239152-59560-4113-mediawiki-668b75b6c5-444qf      4/4     Running   0               78m

When you look at one of the corresponding jobs you see timeouts, e.g. https://integration.wikimedia.org/ci/view/All/job/wikilambda-catalyst-end-to-end/2030/console. Note that the env creation actually succeeded.

This seems to be a combination of two things:

  1. Our env creation times have been slowly climbing; a quick glance at https://patchdemo.wmcloud.org/ bears that out.
  2. An engineer from Abstract Wiki pushed a big batch of patches in one go, which ended up piling up on Catalyst.

I think we should (a) reduce the env cleanup time for CI envs to just a couple of hours, to avoid accumulating dead envs, and (b) increase the timeout for the CI jobs.
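Proposal (a) could be expressed at creation time rather than as post-build cleanup. Below is a sketch of what the create-request payload might carry; the `expiry` field name is an assumption, not a confirmed Catalyst API parameter.

```shell
# Hypothetical: ask Catalyst for a short-lived CI env up front, instead of
# relying on post-build cleanup that can miss (timeouts, canceled jobs).
EXPIRY="2h"   # a couple of hours is plenty for a CI run
PAYLOAD="{\"expiry\": \"${EXPIRY}\"}"
echo "create payload: ${PAYLOAD}"
# The create call would then be roughly:
#   curl -X POST "$ENV_API_PATH" -H "Authorization: ApiToken ..." -d "$PAYLOAD"
```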

> I think we should (a) reduce the env cleanup time for CI envs to just a couple of hours, to avoid accumulating dead envs, and (b) increase the timeout for the CI jobs.

+1 for (a): a short env cleanup time should be the default. If folks need an environment for debugging, letting them extend it manually would be better than defaulting to a long cleanup time.

Yet another interesting situation. Two patches were pushed in quick succession: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikiLambda/+/1240009

This triggered two jobs; the first was canceled and left an env behind: https://integration.wikimedia.org/ci/view/All/job/wikilambda-catalyst-end-to-end/2090/console

mw-ext-wl-ci-1240009-71236-4198-js-evaluator-6b858c44cc-xkvj7   1/1     Running   0               39m
mw-ext-wl-ci-1240009-71236-4198-py-evaluator-5b7dc55654-hvtpk   1/1     Running   0               39m
mw-ext-wl-ci-1240009-71236-4198-artifact-warehouse              1/1     Running   0               39m
mw-ext-wl-ci-1240009-71236-4198-mariadb-b678d48f-6cmnq          1/1     Running   0               39m
mw-ext-wl-ci-1240009-71236-4198-mediawiki-86f5b84c58-jq6jx      4/4     Running   0               39m
thcipriani set the point value for this task to 3.

Assigning to @jeena to set timeout limits for these environments to 1 hour in Gerrit's CI & GitLab.

Change #1245498 had a related patch set uploaded (by Jeena Huneidi; author: Jeena Huneidi):

[integration/config@master] catalyst jobs: Add expiry time env var

https://gerrit.wikimedia.org/r/1245498

Change #1247087 had a related patch set uploaded (by Hashar; author: Jeena Huneidi):

[integration/config@master] catalyst jobs: update Jenkins job to support expiry time

https://gerrit.wikimedia.org/r/1247087

Change #1245498 merged by jenkins-bot:

[integration/config@master] catalyst jobs: Add expiry time env var

https://gerrit.wikimedia.org/r/1245498

Change #1247087 merged by jenkins-bot:

[integration/config@master] catalyst jobs: update Jenkins job to support expiry time

https://gerrit.wikimedia.org/r/1247087

OK, this is deployed and the expiry seems to be passed through. Can we declare this Resolved?