
Catalyst API fails to get log during rebuild
Closed, ResolvedPublic3 Estimated Story Points

Description

While rebuilding an environment @DLynch got this stack trace:

Fatal error: Uncaught Symfony\Component\HttpClient\Exception\ServerException: HTTP/2 503 returned for "https://api.catalyst.wmcloud.org/api/environments/1584/logs?stream=mediawiki/install-mediawiki". in /var/www/html/vendor/symfony/http-client/Response/CommonResponseTrait.php:169
Stack trace:
#0 /var/www/html/vendor/symfony/http-client/Response/CommonResponseTrait.php(47): Symfony\Component\HttpClient\Response\AsyncResponse->checkStatusCode()
#1 /var/www/html/Catalyst.php(52): Symfony\Component\HttpClient\Response\AsyncResponse->getContent()
#2 /var/www/html/create.php(563): Catalyst->streamLogs()
#3 {main}
  thrown in /var/www/html/vendor/symfony/http-client/Response/CommonResponseTrait.php on line 169

Inside the Catalyst API logs I see:

[GIN] 2025/09/18 - 19:51:10 | 503 |    97.42768ms |       10.42.0.1 | GET      "/api/environments/1584/logs?stream=mediawiki/install-mediawiki"
Error #01: operation not available for environment with id '1584'. Current env status is 'failed'
2025/09/18 19:51:14 Environment 'wiki-5ad535f7f7' failed to start up due to:
*errors.errorString something failed on the cluster side for environment 'wiki-5ad535f7f7'
/srv/app/service/environment.go:221 (0x1fafe85)
/srv/app/service/environment.go:364 (0x1fb0e65)
/srv/app/persistence/repository.go:111 (0x1f8b973)
/srv/app/persistence/repository.go:119 (0x1f8b862)
/go/pkg/mod/gorm.io/gorm@v1.30.0/finisher_api.go:654 (0x85e85a)
/srv/app/persistence/repository.go:117 (0x1f8b62e)
/srv/app/persistence/repository.go:106 (0x1f8b5e2)
/srv/app/service/environment.go:349 (0x1fb0ce5)
/srv/app/service/environment.go:337 (0x1fafab3)
/srv/app/service/environment.go:337 (0x1fafaa3)
/usr/local/go/src/runtime/asm_amd64.s:1700 (0x47bb81)

But I also see that the environment is running and happy:

thcipriani@k3s:~$ sudo KUBECONFIG=/etc/rancher/k3s/k3s.yaml kubectl -n cat-env get po | grep -P '(NAME|1584)'
NAME                                              READY   STATUS     RESTARTS        AGE
wiki-5ad535f7f7-1584-mediawiki-6f44994d9f-jf5mh   2/2     Running    0               20m
thcipriani@k3s:~$ date --utc -Is
2025-09-18T20:11:31+00:00

Details

Related Changes in GitLab:
Title: improve handling of K8s deployment events in the environment watcher
Reference: repos/test-platform/catalyst/catalyst-api!135
Author: jnuche
Source Branch: T405035
Dest Branch: main

Event Timeline

Tagging in @jnuche here since I'm wondering if this is related to the recent deployment of T396871: Automatically notify Catalyst when an env deployment's state changes. The reasons I suspect a connection:

  • That change was deployed today
  • The logs show operation not available for environment with id '1584'. Current env status is 'failed', and I remember that change had something to do with environment status

Interesting, it looks like during the rebuild the environment transitioned to failed temporarily before recovering.

I can't reproduce the issue locally or in prod; it's possible this was a hiccup in the K8s cluster itself. @DLynch would it be ok if I rebuilt your wiki-5ad535f7f7 env again?

This also happened to me when rebuilding e34a5ae872 yesterday, around 18:10 UTC. Attempting to rebuild the wiki a second time worked.

I've been able to reproduce the issue locally and I think I know what's happening.

The new env watcher relies on labels on the environments to track them; older envs don't have those labels, so the API adds them during a rebuild. But it turns out the label change triggers a deployment deletion event from the cluster, which the API interprets as the env becoming unavailable. Once the new, labelled version of the deployment comes to life, the API detects that and restores the env.
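To illustrate the failure mode, here is a minimal sketch (not the actual catalyst-api code) of a deployment watcher built on client-go. A rebuild that replaces the deployment, or a label change that makes it drop out of or re-enter the watch's label selector, surfaces as a DELETED event, and the env gets marked as failed even though a replacement deployment is already on its way. The namespace, the catalyst/env-id label, and the markEnvFailed/markEnvRunning calls are hypothetical names:

```
package watcher

import (
	"context"
	"log"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
)

func watchEnvDeployments(ctx context.Context, client kubernetes.Interface) error {
	// Only deployments carrying the tracking label are watched; an object that
	// gains or loses the label enters or leaves this stream as ADDED/DELETED.
	w, err := client.AppsV1().Deployments("cat-env").Watch(ctx, metav1.ListOptions{
		LabelSelector: "catalyst/env-id", // hypothetical tracking label
	})
	if err != nil {
		return err
	}
	defer w.Stop()

	for ev := range w.ResultChan() {
		dep, ok := ev.Object.(*appsv1.Deployment)
		if !ok {
			continue
		}
		switch ev.Type {
		case watch.Deleted:
			// Problematic branch: a rebuild that replaces the deployment produces
			// a DELETED event here, and the env is marked as failed even though
			// its replacement is already being created.
			log.Printf("deployment %s deleted, marking env as failed", dep.Name)
			// markEnvFailed(dep) // hypothetical
		case watch.Added, watch.Modified:
			if dep.Status.AvailableReplicas > 0 {
				log.Printf("deployment %s available, marking env as running", dep.Name)
				// markEnvRunning(dep) // hypothetical
			}
		}
	}
	return nil
}
```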

If my theory is right, the problem will affect any env created before we rolled out the env watcher, and it will reproduce the first time that env is rebuilt. It's a transitory state: the environment itself is fine and will recover, but it's a bit scary to see that failure the first time you try to rebuild your env.

@A_smart_kitten Can I ask you to try rebuilding one of your old envs and see if the same pattern repeats: a failure followed by a successful rebuild? In fact, if I'm correct, the second rebuild shouldn't even be necessary; the env will eventually recover after a few minutes, and simply reloading the Patchdemo UI should show the env running again.

Copying IRC conversation from #wikimedia-releng (log):

2025-09-19 11:10:25  <A_smart_kitten> jnuche: do I understand your "older envs don't have those labels" comment in https://phabricator.wikimedia.org/T405035#11196930 correctly, to mean that this error should only occur when rebuilding an older environment?
2025-09-19 11:11:11 <jnuche> A_smart_kitten: yeah, that's my current working theory
2025-09-19 11:12:27 <jnuche> "older" as in created before yesterday at 13:29 UTC :)
2025-09-19 11:12:29 <A_smart_kitten> jnuche: huh, bc the environment i rebuilt had only been created initially less than 20mins or so beforehand. (I had to rebuild it to take in a DNM patch I created to enable a couple of config vars, in lieu of T372980 :P)
2025-09-19 11:13:39 <jnuche> ah, I didn't mean to imply old as in "it's been running for a while". Anything created before the time I mention above should be affected
2025-09-19 11:14:41 <jnuche> A_smart_kitten: I think I have a fix, but I'd like to confirm my theory first
2025-09-19 11:15:03 <A_smart_kitten> ah, i didn't see your second reply before i sent mine; but as it happens, the env this happened to me on yesterday was initially created at 2025-09-18 17:56:36
2025-09-19 11:15:33 <A_smart_kitten> according to the patch demo UI, which seems accurate to my memory
2025-09-19 11:15:51 <jnuche> hum, yeah, that wouldn't fit then
2025-09-19 11:17:06 <A_smart_kitten> FWIW i just rebuilt 3387e46cc4 without changing any of the config and I didn't get this error

FWIW though @jnuche, I did reproduce the error just now with f9030db6db by:

Thanks @A_smart_kitten, I could reproduce the issue using your steps. I also saw that the env recovered on its own without the need for a second rebuild.

I think my original idea may still be on the right track, even if it's not the full explanation. Something besides envs without labels is triggering transient cluster-side deletion events.

At this point I'm going to add some code to the API to prevent the symptoms and monitor. In the code path that transitions an env to failed during a rebuild, I'll add a synchronous check of the env's actual status, along with some logging. This should stop the exception from being thrown back to users, whatever the actual root cause is, and give us logs for the next time it triggers.
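A rough sketch of what that guard could look like, assuming a client-go handle on the cluster. The function name, the namespace/deployment parameters, and the choice to keep the old behaviour when the re-check itself fails are my own illustration, not the actual patch:

```
package watcher

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// shouldFailEnv re-checks the deployment synchronously before an env is marked
// as failed in response to a deletion event. It returns false when the
// deployment (or its labelled replacement) is still available, i.e. the event
// was transient and the env should be left alone.
func shouldFailEnv(ctx context.Context, client kubernetes.Interface, namespace, deployment string, envID int64) bool {
	dep, err := client.AppsV1().Deployments(namespace).Get(ctx, deployment, metav1.GetOptions{})
	if err != nil {
		log.Printf("env %d: could not re-check deployment %q: %v", envID, deployment, err)
		return true // nothing to fall back on, keep the original behaviour
	}
	if dep.Status.AvailableReplicas > 0 {
		log.Printf("env %d: ignoring transient delete event, deployment %q still has %d available replicas",
			envID, deployment, dep.Status.AvailableReplicas)
		return false
	}
	log.Printf("env %d: deployment %q present but unavailable, marking env as failed", envID, deployment)
	return true
}
```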

jnuche claimed this task.
jnuche edited projects, added Catalyst (noka); removed Catalyst (PatchDemo).
jnuche set the point value for this task to 3.

As I was trying to decide which info would be useful to log for future troubleshooting, I went down a rabbit hole trying to find out the exact meaning and behavior of the fields being returned in the cluster events.

In the end I fixed two issues when processing deployment events:

  1. The deployment's generation needs to be taken into account
  2. The Available status alone is not enough; the update status of the associated replica set also needs to be considered (both checks are sketched below)
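For reference, a minimal Go sketch of what those two checks can look like against the standard Kubernetes Deployment status fields (observedGeneration vs. metadata.generation, and updated vs. desired replicas); this is an illustration of the idea, not the code from the merge request:

```
package watcher

import appsv1 "k8s.io/api/apps/v1"

// deploymentReady applies the two checks described above (assumed semantics):
//  1. the controller must have observed the latest generation of the Deployment,
//     otherwise the status fields may still describe the previous spec;
//  2. "Available" alone is not enough: all replicas must also have been updated
//     to the new template, i.e. the new ReplicaSet has fully rolled out.
func deploymentReady(dep *appsv1.Deployment) bool {
	if dep.Status.ObservedGeneration < dep.Generation {
		return false // status is stale, it refers to an older spec
	}
	var desired int32 = 1
	if dep.Spec.Replicas != nil {
		desired = *dep.Spec.Replicas
	}
	return dep.Status.UpdatedReplicas == desired &&
		dep.Status.AvailableReplicas == desired
}
```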

That seems to have fixed the original errors. I can't be 100% sure there aren't any other subtle pitfalls lurking around, but I think this is enough for now.

I'm closing this tentatively; please reopen if we see the same errors in the API logs again.

To confirm: I rebuilt the specific patchdemo that was causing reproducible issues for me, and it's definitely fixed for that one.