We noticed a lot of Wikilambda environments running while no tests were running. Our suspicion is that these are lingering environments that are not cleaned up when tests fail.
Description
Details
| Subject | Repo | Branch | Lines +/- |
|---|---|---|---|
| catalyst jobs: update Jenkins job to support expiry time | integration/config | master | +6 -1 |
| catalyst jobs: Add expiry time env var | integration/config | master | +11 -2 |
| Title | Reference | Author | Source Branch | Dest Branch |
|---|---|---|---|---|
| Check hourly for expired wikis | repos/test-platform/catalyst/patchdemo!272 | jhuneidi | T416391 | main |
| Delete expired wikis hourly | repos/test-platform/catalyst/catalyst-api!162 | jhuneidi | T416391 | main |
| Write to artifact file directly after env creation | repos/test-platform/catalyst/catalyst-ci-client!2 | jhuneidi | jhuneidi-main-patch-cf68 | main |
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Open | BUG REPORT | None | T415952 Intermittent catalyst build failures for wikilambda with ERROR: Environment logs are still not ready. |
| Open | | jeena | T416391 Clean up Wikilambda Catalyst environments regardless of test failures |
Event Timeline
It looks like the wikilambda CI environments that aren't getting cleaned up are those that did not get a response from Catalyst in time. When the create request times out, we never receive an environment ID back from Catalyst, so even though the post-build script runs, it has no ID to delete.
```
+ exec docker run --entrypoint=/deploy_env.py -e WIKILAMBDA_REF=16/1229116/19 -e ZUUL_CHANGE=1229116 -e ENV_API_PATH=https://api.catalyst.wmcloud.org/api/environments -e NPM_ARGS=selenium-test -e MEDIAWIKI_USER=Admin -e MEDIAWIKI_PASSWORD=dockerpass -e MW_SCRIPT_PATH=/w --volume /srv/jenkins/workspace/wikilambda-catalyst-end-to-end/src:/src --volume /srv/jenkins/workspace/wikilambda-catalyst-end-to-end/cache:/cache --volume /srv/jenkins/workspace/wikilambda-catalyst-end-to-end/log:/log --security-opt seccomp=unconfined --init --rm --label jenkins.job=wikilambda-catalyst-end-to-end --label jenkins.build=1920 --env-file /dev/fd/63 docker-registry.wikimedia.org/releng/catalyst:1.3.0-s1
++ set +x
ERROR: Failed to create Wikifunctions environment: 504 Server Error: Gateway Time-out for url: https://api.catalyst.wmcloud.org/api/environments
Build step 'Execute shell' marked build as failure
PostBuildScript
Archiving artifacts
[PostBuildScript] - [INFO] Executing post build scripts.
[wikilambda-catalyst-end-to-end] $ /bin/bash -xe /tmp/jenkins4184931845332480205.sh
++ cat log/envid
cat: log/envid: No such file or directory
+ eval ''
+ curl -X DELETE https://api.catalyst.wmcloud.org/api/environments/ -H 'Authorization: ApiToken ****'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
```
...See https://integration.wikimedia.org/ci/view/All%20jobs/job/wikilambda-catalyst-end-to-end/1920/console
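One way to make the post-build cleanup degrade gracefully is to only attempt the DELETE when the envid artifact exists and is non-empty. A minimal sketch (the file path and API URL come from the log above; the function name and the commented-out curl call are illustrative, not the actual Jenkins script):

```shell
#!/bin/sh
# Guarded cleanup: only issue a DELETE when an environment ID was recorded.
cleanup_env() {
  envid_file="$1"
  if [ -s "$envid_file" ]; then
    env_id=$(cat "$envid_file")
    echo "deleting $env_id"
    # curl -X DELETE "https://api.catalyst.wmcloud.org/api/environments/$env_id" \
    #      -H "Authorization: ApiToken $CATALYST_API_TOKEN"
  else
    echo "no envid recorded, skipping delete"
  fi
}

cleanup_env "log/envid"
```

With this guard the post-build step no longer fires a malformed `DELETE .../environments/` request when env creation timed out before an ID was written.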
That failed run and other similar ones line up with the times for yesterday's incident: https://wikitech.wikimedia.org/wiki/Catalyst/Incidents/2026-02-10
The cluster was already flailing when those pipelines ran and that's probably why they were seeing time-outs (and other errors: https://integration.wikimedia.org/ci/view/All%20jobs/job/wikilambda-catalyst-end-to-end/1927/console)
I caught one in the wild, this time from gitlab: https://gitlab.wikimedia.org/repos/test-platform/catalyst/catalyst-ci-client/-/pipelines/164402
The pipeline left behind a (now deleted) env. The creation job succeeded in creating the env and got an ID back (3975), but then timed out waiting for the logs: https://gitlab.wikimedia.org/repos/test-platform/catalyst/catalyst-ci-client/-/jobs/739151
The pipeline didn't run the deletion job, so the env was left behind. That's a fix we can make already: always run the deletion job unconditionally.
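In GitLab CI, running the deletion job unconditionally is typically done with `when: always`, which makes the job run even when earlier stages fail. A hypothetical sketch (the stage layout, job name, and teardown script are assumptions, not the actual catalyst-ci-client pipeline):

```yaml
stages:
  - create
  - test
  - delete

delete-environment:
  stage: delete
  when: always          # run even if the create or test stages failed
  script:
    - ./cleanup-env.sh  # hypothetical teardown script
```

Note that `when: always` covers failed upstream jobs but not every scenario (e.g. a manually canceled pipeline can still skip later stages), so the server-side expiry cleanup discussed below remains a useful backstop.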
Yes, that's correct. I remember having a conversation about it so when I saw this task I thought there must be another task for it, but I don't see one.
Another data point. Today the Jenkins Wikilambda CI left a bunch of envs behind:
```
mw-ext-wl-ci-1238784-24724-4109-py-evaluator-547786dd5c-snpqv   1/1   Running   0   85m
mw-ext-wl-ci-1238784-24724-4109-js-evaluator-97fff64f4-trkbd    1/1   Running   0   85m
mw-ext-wl-ci-1238784-24724-4109-artifact-warehouse              1/1   Running   0   85m
mw-ext-wl-ci-1238784-24724-4109-mariadb-5b4685c7b9-t6qgz        1/1   Running   0   85m
mw-ext-wl-ci-1238784-24724-4109-mediawiki-88494f449-jszph       4/4   Running   0   85m
mw-ext-wl-ci-1239152-82482-4110-js-evaluator-7dd6465b6f-bshbn   1/1   Running   0   79m
mw-ext-wl-ci-1239152-82482-4110-py-evaluator-c756459cd-5njgb    1/1   Running   0   79m
mw-ext-wl-ci-1239344-35683-4111-js-evaluator-7cc5cc6dc4-qfsfd   1/1   Running   0   79m
mw-ext-wl-ci-1239344-35683-4111-py-evaluator-54797b96f7-q75zx   1/1   Running   0   79m
mw-ext-wl-ci-1239152-82482-4110-artifact-warehouse              1/1   Running   0   79m
mw-ext-wl-ci-1239344-35683-4111-artifact-warehouse              1/1   Running   0   79m
mw-ext-wl-ci-1239152-82482-4110-mariadb-689dbdc869-t6dm7        1/1   Running   0   79m
mw-ext-wl-ci-1239344-35683-4111-mariadb-f95cb96cc-tp55n         1/1   Running   0   79m
mw-ext-wl-ci-1239345-10059-4112-py-evaluator-65b68bb7c5-hw8xd   1/1   Running   0   78m
mw-ext-wl-ci-1239345-10059-4112-js-evaluator-69c4755b7b-lqnnw   1/1   Running   0   78m
mw-ext-wl-ci-1239345-10059-4112-artifact-warehouse              1/1   Running   0   78m
mw-ext-wl-ci-1239345-10059-4112-mariadb-68bd8bc87c-6bjzp        1/1   Running   0   78m
mw-ext-wl-ci-1239152-59560-4113-py-evaluator-587bc78bc8-gksvs   1/1   Running   0   78m
mw-ext-wl-ci-1239152-59560-4113-artifact-warehouse              1/1   Running   0   78m
mw-ext-wl-ci-1239152-59560-4113-mariadb-68f895bf85-469ls        1/1   Running   0   78m
mw-ext-wl-ci-1239152-59560-4113-js-evaluator-67689db7c7-wpxgn   1/1   Running   0   78m
mw-ext-wl-ci-1239344-62262-4114-js-evaluator-55cc55dff8-kmd6x   1/1   Running   0   77m
mw-ext-wl-ci-1239344-62262-4114-artifact-warehouse              1/1   Running   0   77m
mw-ext-wl-ci-1239344-62262-4114-mariadb-65c79bbb57-k24gs        1/1   Running   0   77m
mw-ext-wl-ci-1239344-62262-4114-py-evaluator-75c4f5cff7-b8778   1/1   Running   0   77m
mw-ext-wl-ci-1239344-35683-4111-mediawiki-6cf876bb58-4vmtz      4/4   Running   0   79m
mw-ext-wl-ci-1239344-62262-4114-mediawiki-5d54bcdcc9-gqmpt      4/4   Running   0   77m
mw-ext-wl-ci-1239152-82482-4110-mediawiki-5fb7ddb8cb-b9jhs      4/4   Running   0   79m
mw-ext-wl-ci-1239345-10059-4112-mediawiki-5c64d87fb7-9x9j6      4/4   Running   0   78m
mw-ext-wl-ci-1239152-59560-4113-mediawiki-668b75b6c5-444qf      4/4   Running   0   78m
```
When you look at one of the corresponding jobs you see timeouts, e.g. https://integration.wikimedia.org/ci/view/All/job/wikilambda-catalyst-end-to-end/2030/console. Note that the env creation actually succeeded.
This seems to be a combination of two things:
- Our env creation times have been slowly climbing; a quick glance at https://patchdemo.wmcloud.org/ shows the trend.
- An engineer from Abstract Wiki pushed a big batch of patches in one go that ended up piling up on Catalyst:
- https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikiLambda/+/1239152
- https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikiLambda/+/1239344
- https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikiLambda/+/1239345
I think we should a) reduce the env cleanup time for CI envs to just a couple of hours, so dead envs don't accumulate, and b) increase the timeout for the CI jobs.
+1 for a—short env cleanup time should be the default. If folks need an environment for debugging, allowing them to adjust that manually would be better than defaulting to a long cleanup time.
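The short-expiry approach can be sketched as a periodic sweep that deletes any environment whose expiry time has passed. This is only an illustration of the idea behind the "Delete expired wikis hourly" changes; the field names and data structure are hypothetical, not the actual Catalyst API:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical hourly sweep: find environments whose expiry has passed.
# In the real system each deletion would call the Catalyst delete endpoint.

def expired(envs, now=None):
    """Return the environments whose 'expires_at' is in the past."""
    now = now or datetime.now(timezone.utc)
    return [e for e in envs if e["expires_at"] <= now]

envs = [
    {"id": 3975, "expires_at": datetime.now(timezone.utc) - timedelta(hours=2)},
    {"id": 4198, "expires_at": datetime.now(timezone.utc) + timedelta(hours=1)},
]
print([e["id"] for e in expired(envs)])  # → [3975]
```

A sweep like this catches every leak path (timeouts, canceled pipelines, crashed runners) because it doesn't depend on the creating job surviving long enough to clean up after itself.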
Yet another interesting situation. Two patches were pushed in quick succession: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikiLambda/+/1240009
This triggered two jobs, the first job got canceled and left an env behind: https://integration.wikimedia.org/ci/view/All/job/wikilambda-catalyst-end-to-end/2090/console
```
mw-ext-wl-ci-1240009-71236-4198-js-evaluator-6b858c44cc-xkvj7   1/1   Running   0   39m
mw-ext-wl-ci-1240009-71236-4198-py-evaluator-5b7dc55654-hvtpk   1/1   Running   0   39m
mw-ext-wl-ci-1240009-71236-4198-artifact-warehouse              1/1   Running   0   39m
mw-ext-wl-ci-1240009-71236-4198-mariadb-b678d48f-6cmnq          1/1   Running   0   39m
mw-ext-wl-ci-1240009-71236-4198-mediawiki-86f5b84c58-jq6jx      4/4   Running   0   39m
```
Assigning to @jeena to set timeout limits for these environments to 1 hour in Gerrit's CI & GitLab.
jhuneidi opened https://gitlab.wikimedia.org/repos/test-platform/catalyst/catalyst-ci-client/-/merge_requests/2
Write to artifact file directly after env creation
jhuneidi merged https://gitlab.wikimedia.org/repos/test-platform/catalyst/catalyst-ci-client/-/merge_requests/2
Write to artifact file directly after env creation
Change #1245498 had a related patch set uploaded (by Jeena Huneidi; author: Jeena Huneidi):
[integration/config@master] catalyst jobs: Add expiry time env var
jhuneidi opened https://gitlab.wikimedia.org/repos/test-platform/catalyst/catalyst-api/-/merge_requests/162
Delete expired wikis hourly
jhuneidi opened https://gitlab.wikimedia.org/repos/test-platform/catalyst/patchdemo/-/merge_requests/272
Check hourly for expired wikis
Change #1247087 had a related patch set uploaded (by Hashar; author: Jeena Huneidi):
[integration/config@master] catalyst jobs: update Jenkins job to support expiry time
Change #1245498 merged by jenkins-bot:
[integration/config@master] catalyst jobs: Add expiry time env var
Change #1247087 merged by jenkins-bot:
[integration/config@master] catalyst jobs: update Jenkins job to support expiry time
jhuneidi merged https://gitlab.wikimedia.org/repos/test-platform/catalyst/catalyst-api/-/merge_requests/162
Delete expired wikis hourly
jhuneidi merged https://gitlab.wikimedia.org/repos/test-platform/catalyst/patchdemo/-/merge_requests/272
Check hourly for expired wikis