
Automatically notify Catalyst when an env deployment's state changes
Closed, Resolved | Public | 5 Estimated Story Points

Description

Our GET /environments endpoint accepts an optional latestStatus=true query parameter that can be used to make Catalyst reach out to the cluster and refresh the environment's status with the most recent real state.

This parameter is useful e.g. when Patchdemo retrieves all of its Catalyst wikis on a page load. However, Patchdemo is not using the parameter at the moment as it adds ~12s to the request in production. After a couple of failed attempts at optimizing the operation, it turned out the main bottleneck was that the multiple requests from Catalyst to K3s are eventually throttled by the cluster.

A better approach than polling for changes would be to use push events. The client-go library we are using to talk to K3s has the concept of "informers", which it seems we can use to add lifecycle hooks to our deployments. Those hooks can then notify Catalyst once a deployment is up and running, or when it goes down: https://medium.com/@jeevanragula/demystifying-kubernetes-informer-streamlining-event-driven-workflows-955285166993. At that point we can remove latestStatus from GET /environments.
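To make the push idea concrete, here is a minimal stdlib-only sketch of the notify-on-transition logic such hooks could run. It does not use client-go: `DeploymentState`, `Watcher`, `OnAddOrUpdate`, `OnDelete` and the two status values are hypothetical stand-ins for whatever an informer's add/update/delete handlers would receive, and the "all desired replicas are ready" rule is an assumption about what "up and running" means.

```go
package main

import "fmt"

// Status is a hypothetical Catalyst-side view of an environment.
type Status string

const (
	StatusRunning Status = "running"
	StatusDown    Status = "down"
)

// DeploymentState is a hypothetical, trimmed-down view of a deployment
// as seen on each watch event.
type DeploymentState struct {
	EnvID           string
	DesiredReplicas int
	ReadyReplicas   int
}

// statusOf assumes a deployment is up once all desired replicas are ready.
func statusOf(d DeploymentState) Status {
	if d.DesiredReplicas > 0 && d.ReadyReplicas >= d.DesiredReplicas {
		return StatusRunning
	}
	return StatusDown
}

// Watcher remembers the last status per environment and calls notify
// only when that status changes.
type Watcher struct {
	last   map[string]Status
	notify func(envID string, s Status)
}

func NewWatcher(notify func(envID string, s Status)) *Watcher {
	return &Watcher{last: map[string]Status{}, notify: notify}
}

// OnAddOrUpdate would be called from the add and update hooks.
func (w *Watcher) OnAddOrUpdate(d DeploymentState) {
	w.transition(d.EnvID, statusOf(d))
}

// OnDelete would be called from the delete hook.
func (w *Watcher) OnDelete(envID string) {
	w.transition(envID, StatusDown)
	delete(w.last, envID)
}

func (w *Watcher) transition(envID string, next Status) {
	prev, seen := w.last[envID]
	if !seen {
		prev = StatusDown // an env we have never seen counts as down
	}
	w.last[envID] = next
	if prev != next {
		w.notify(envID, next)
	}
}

func main() {
	w := NewWatcher(func(envID string, s Status) {
		fmt.Printf("notify Catalyst: env %s is %s\n", envID, s)
	})
	w.OnAddOrUpdate(DeploymentState{EnvID: "env-1", DesiredReplicas: 1, ReadyReplicas: 0})
	w.OnAddOrUpdate(DeploymentState{EnvID: "env-1", DesiredReplicas: 1, ReadyReplicas: 1})
	w.OnDelete("env-1")
}
```

Because only transitions trigger a notification, repeated update events for an unchanged deployment cost Catalyst nothing, which is the point of moving away from per-request polling.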

Note that GET /environments/:id also accepts latestStatus, but in that case we will probably keep the parameter around as a fallback mechanism to get the actual running state of a particular environment in case there are issues with the cluster.
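As an illustration of that fallback, a client could build the single-environment request along these lines. This is only a sketch: the base URL, the `environmentURL` helper and the string-typed id are assumptions, and only the path and the latestStatus=true parameter come from the description above.

```go
package main

import (
	"fmt"
	"net/url"
)

// environmentURL builds the URL for GET /environments/:id and, when
// latestStatus is true, asks Catalyst to refresh the status from the cluster.
// The base URL and the string id are hypothetical.
func environmentURL(base, id string, latestStatus bool) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	u = u.JoinPath("environments", id)
	if latestStatus {
		q := u.Query()
		q.Set("latestStatus", "true")
		u.RawQuery = q.Encode()
	}
	return u.String(), nil
}

func main() {
	u, err := environmentURL("https://catalyst.example/api", "42", true)
	if err != nil {
		panic(err)
	}
	fmt.Println(u)
}
```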

Details

Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
catalyst API: drop parameter `latestStatus` from queries | repos/test-platform/catalyst/patchdemo!203 | jnuche | T396871 | main
introduce environment watcher | repos/test-platform/catalyst/catalyst-api!133 | jnuche | T396871 | main

Event Timeline

thcipriani set the point value for this task to 5. (Jul 21 2025, 4:28 PM)
thcipriani edited projects, added: Catalyst (akesi npa); removed: Catalyst (Kulupu Olin).
thcipriani moved this task from Backlog to Ready on the Catalyst (akesi npa) board.
jnuche added a subscriber: jeena.

@jeena I was already working on this. I hope it's ok

Couple of notes about @jnuche's implementation (from meeting today):

  • The chart needs to add support for this feature. The initial implementation will only cover the MediaWiki chart; wikifunctions is not included.
  • If an environment fails, it will be marked as such in Catalyst, but no TTL is set; the TTL is only set when creation fails. This is on purpose: if all environments fall over, we don't want to wipe them all by accident.

Changes are in production.

A few points worth noting:

  • latestStatus has been removed from GET /environments but kept for GET /environments/:id. The latter endpoint can still be used to get updates for an env in case there's a problem with the watcher
  • The 3-day TTL for failed envs now applies only to envs that fail during creation. This means:
    • It won't be applied to envs that were being rebuilt. This is done so that users don't lose any data they may have in an env that was originally created successfully and may still be recoverable
    • It won't be applied to envs that were running when a cluster-side failure occurred. This prevents a cluster-wide failure that happens when we are not around, e.g. on a Friday evening, from wiping out most or all user envs in the cluster
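The TTL rule above can be summed up in a few lines. This is a sketch under assumed names, not the code that was deployed: the `Phase` type, its values and `ttlOnFailure` are hypothetical, and only the 3-day value and the "creation failures only" rule come from the notes above.

```go
package main

import (
	"fmt"
	"time"
)

// Phase is a hypothetical label for what an env was doing when it failed.
type Phase string

const (
	PhaseCreating   Phase = "creating"
	PhaseRebuilding Phase = "rebuilding"
	PhaseRunning    Phase = "running"
)

// failedEnvTTL is the 3-day TTL for failed envs.
const failedEnvTTL = 3 * 24 * time.Hour

// ttlOnFailure reports the TTL to set when an env fails in the given phase.
// Only failures during creation get a TTL; rebuild failures and failures of
// running envs leave the env in place so it can still be recovered.
func ttlOnFailure(p Phase) (time.Duration, bool) {
	if p == PhaseCreating {
		return failedEnvTTL, true
	}
	return 0, false
}

func main() {
	for _, p := range []Phase{PhaseCreating, PhaseRebuilding, PhaseRunning} {
		ttl, ok := ttlOnFailure(p)
		fmt.Printf("failed while %s: ttl set=%v ttl=%v\n", p, ok, ttl)
	}
}
```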