
[components-api,beta] CI pipelines should wait until Toolforge deployment is 100% successful
Closed, Resolved · Public · Feature

Description

What: My CI pipeline should only succeed after Toolforge has completed the requested deployment. If Toolforge ran into some problem, I would like my CI pipeline to fail.

Problem: Currently, the API reports success as soon as it accepts a deployment request into its internal queue. For example, this pipeline run succeeded immediately after Toolforge received the deployment request. However, at that point Toolforge had not actually deployed anything yet. This makes the GitLab pipeline status rather misleading: even if there is a nice green checkmark in GitLab, GitHub, etc., the Toolforge deployment might still have failed.

Proposal: In the Toolforge Components API server, expose an API that allows clients to poll the status of their deployment. The HTTP status code of this request would be one of the following:

  • 200 OK — Toolforge has completely finished the deployment. The container was built successfully, it was successfully uploaded to the container registry, all jobs and webservices have successfully been re-started, and any configured health checks have passed at least once.
  • 429 Too Many Requests (or some other status code that curl recognizes as retryable) — Toolforge is still working on the deployment.
  • another HTTP error (which curl does not consider retryable) – Toolforge deployment has failed for some reason.

Given this, users could use curl --retry-max-time 3600 to wait for up to 1 hour until the deployment has either finished or failed, using the exponential back-off built into curl.
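A minimal sketch of how a CI job might invoke curl under this proposal. The status URL is a hypothetical placeholder, since the endpoint proposed above is not defined in this task. Two details are easy to miss: --retry-max-time only has an effect together with --retry (the count of 20 here is an arbitrary upper bound), and --fail is needed so that curl exits non-zero when the final response is an HTTP error.

```shell
# Hypothetical endpoint; replace with the real status URL once one exists.
STATUS_URL="${STATUS_URL:-https://example.invalid/deployment/status}"

# Wait for up to one hour, relying on curl's built-in retry back-off.
# curl treats HTTP 408, 429, 500, 502, 503 and 504 as retryable.
wait_for_deployment() {
    curl --fail --silent --show-error \
         --retry 20 --retry-max-time 3600 \
         "$1"
}

# In a CI job this would be called as:
#   wait_for_deployment "$STATUS_URL"
```

The exit status of wait_for_deployment is curl's exit status, so the CI job fails when the deployment does.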

Optimization: To reduce polling traffic and make CI pipelines finish faster, let the server wait until either deployment status is final, or some timeout (perhaps 3 minutes or so) has expired. Of course, this is only realistic if you've implemented the server in a language/framework that can handle many parallel long-standing requests; I wouldn't know if that is the case. But this is just an optimization.

Event Timeline

JJMC89 changed the subtype of this task from "Task" to "Feature Request". (Jul 2 2025, 6:44 PM)

I agree that it should be possible to block the CI result on either a deployment success or a hard failure.

From a protocol design point of view, 202 Accepted for the POST that kicks things off with a response body giving some identifier to use for polling makes sense to me. I would then imagine a GET endpoint that returned 404 Not Found with a Retry-After header suggesting a polling interval until it eventually returns a 200 OK once the deployment has been fully realized. In a flow where the POST is blocking, returning a 201 Created on success would make sense. Clients that don't care can stop at the 202 and avoid polling.

I don't think that curl --retry-max-time 3600 is the ideal polling mechanism, simply because the status codes that --retry applies to are not semantically well suited to a RESTful workflow. A "poll; while response_status == 404: sleep; poll; end" looping structure takes a bit more work on the client's part, but makes for a more semantically well-defined protocol flow in my head.
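The looping structure described above could be sketched roughly like this. It assumes the GET endpoint suggested in the previous paragraph (404 while the deployment is in progress, 200 once it is done), which is a proposal and not an existing API. The function names, STATUS_URL, and the fixed delay are illustrative; a real client would presumably take the delay from the Retry-After header.

```shell
# poll_until_ready POLL_CMD [MAX_ATTEMPTS] [DELAY_SECONDS]
# POLL_CMD is any command that prints the HTTP status code of one poll.
# Prints the first non-404 status code; succeeds only if that code is 200.
poll_until_ready() {
    poll_cmd=$1
    max_attempts=${2:-360}
    delay=${3:-10}
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        code=$($poll_cmd)
        if [ "$code" != "404" ]; then
            printf '%s\n' "$code"
            if [ "$code" = "200" ]; then return 0; else return 1; fi
        fi
        sleep "$delay"
        attempt=$((attempt + 1))
    done
    echo "gave up after $max_attempts polls" >&2
    return 1
}

# One poll against the hypothetical endpoint; prints only the status code.
poll_once() {
    curl --silent --output /dev/null --write-out '%{http_code}' "$STATUS_URL"
}

# Usage in a CI job:
#   poll_until_ready poll_once
```

Passing the poll command as a parameter keeps the loop independent of the transport, so it can be exercised with a stub instead of a live server.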

I suspect that there's too many possible states for a deployment to express with an HTTP return code. Hmm....

I think the simplest approach (maybe as a first iteration?) might be to let the GET endpoint for a deployment also authenticate with the deploy token, and just parse the returned object, which contains a more specific state for the deployment (and information about what went wrong).

Currently the creation of the deployment already returns the object with the id to poll (on success, with 200 OK):

{
    "data": {
        "deploy_id": "20250703-075322-5zxh2hip26",    <--- this one
        "creation_time": "20250703-075322",
        "builds": {
...
        },
        "runs": {
...
        },
        "status": "pending",
        "long_status": "",
        "force_build": false,
        "force_run": false
    },
    "messages": {
        "info": [
            "Deployment for sample-complex-app created successfully."
        ],
        "warning": [
            "You are using a beta feature of Toolforge."
        ],
        "error": []
    }
}

So the polling would just be a loop that curls the endpoint GET /components/v1/tool/sample-complex-app/deployment/${deploy_id}?token=$TOOLS_DEPLOY_TOKEN and parses .data.status until it is no longer pending or running (note that all the possible states are [ pending, running, failed, timed_out, successful ]).

We can return a "special" HTTP code for deployments that are not yet finished, though I don't find any of them too appropriate, as they all already have some usage/meaning (404 is already used when the given deployment id does not exist, 408 for timeouts, 429 for rate limiting). Reusing them for this would force the other flows that use those codes to check not only the code, but also other bits of the response, to tell one case from the other :/

We could use a custom one (e.g. 418 I'm a teapot!), though that would not help much with curl, as it's not one of the "retryable" codes.

Another option might be to have a special API call that waits for a deployment to finish, to simplify that workflow. This could be, for example, a follow parameter on the GET deployment endpoint, or its own URL (maybe easier to set up if there needs to be special timeout treatment). That could be a "second iteration" too :)

Sounds like people would prefer a custom client instead of curl. In that case, the HTTP status code doesn't really matter much; the client can do whatever is needed. (Of course it would still make sense to follow REST conventions.) From @dcaro's reply above, it sounds like the client should continue to poll as long as .data.status is either pending or running, using the Retry-After value in the headers of the HTTP response to compute the wait time. If the status is successful, the client should exit with a UNIX status code of zero; for anything else, the client should exit with a non-zero status so that the CI system (e.g. GitLab) knows the deployment has failed or timed out. While at it, this client could also write nice error messages.
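The decision logic described above is small enough to sketch as two shell helpers. Both function names are made up for illustration. Whether the API actually sends a Retry-After header is an assumption here, as is the 10-second fallback, and the parser only handles the delay-in-seconds form of the header, not the HTTP-date form.

```shell
# Map a .data.status value to an outcome:
#   0 = finished successfully, 1 = failed or timed out, 2 = keep polling.
deployment_exit_code() {
    case "$1" in
        successful)      return 0 ;;
        pending|running) return 2 ;;
        *)               return 1 ;;
    esac
}

# Read HTTP response headers on stdin and print the Retry-After delay
# in seconds, or 10 (an arbitrary fallback) if the header is absent.
retry_after_seconds() {
    awk -F': *' '
        tolower($1) == "retry-after" { gsub(/\r/, "", $2); print $2; found = 1; exit }
        END { if (!found) print 10 }
    '
}
```

A client loop would sleep for the printed number of seconds while deployment_exit_code returns 2, and otherwise exit with the returned code.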

Question: If the client is not curl but a custom tool, how would it get to the GitLab runner? Would Wikimedia be open to hosting a special container image for this tool, so it can be loaded via image: like in this config? If there are no other volunteers, I'll gladly write the client. (In that case, I'd probably use golang or rust so the tool does not need a Linux distribution; would that be OK for you?)

We are working on getting a client that's easily "installable" (e.g. a single binary, T356262: [jobs-cli,builds-cli,toolforge-cli,webservice] Consolidate the Toolforge CLIs), and we have several Python clients installable through pip that can be used in scripts, though currently there's no way to authenticate to the APIs from the outside besides the deploy token (T363983: [toolforge] Investigate authentication).

So the first step here is to enable auth on the API itself so clients can check the deployment status.

Might be interesting to have a mini-client that only does the deployment with tokens and that's easily installable on CI systems, for simplification until we have the single-binary full client (or even after). It might be simple to do with bash+curl+jq or similar, though; something like (untested):

#!/usr/bin/env bash
set -o errexit
set -o pipefail

# jq -r prints raw strings (no surrounding quotes), so the comparisons below work
deployment_id=$(curl https://... | jq -r '.data.deploy_id')
deployment_status=$(curl "https://.../$deployment_id" | jq -r '.data.status')
# keep polling while the deployment is still in progress
while [[ "$deployment_status" == "pending" || "$deployment_status" == "running" ]]; do
    sleep 10
    deployment_status=$(curl "https://.../$deployment_id" | jq -r '.data.status')
done

# the final states are: successful, failed, timed_out
if [[ "$deployment_status" != "successful" ]]; then
    echo "Deployment failed! Full deployment state:"
    curl "https://.../$deployment_id" | jq .
    exit 1
else
    echo "Deployment successful"
fi

group_203_bot_f4d95069bb2675e4ce1fff090c1c1620 opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/879

components-api: bump to 0.0.135-20250710134503-c7e0923f

This should be already available: https://wikitech.wikimedia.org/wiki/Help:Toolforge/Deploy_your_tool#Retrieve_the_deployment_status_using_the_deploy_token

I'm working on a reusable script for GitLab CI, both as a sample and so people can use it directly. I will update here when I have it working.

dcaro changed the task status from Open to In Progress. (Jul 14 2025, 10:44 AM)
dcaro moved this task from Next Up to In Progress on the Toolforge (Toolforge iteration 22) board.

I just added a reusable GitLab script to deploy tools that will wait for the success of the deployment (or report otherwise): https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Changelog#2025

I'll close this task with this for now. If you still want a single endpoint/REST workflow to track the deployment status, feel free to open a new, more specific task for it (you can copy the details from here for that specific solution :) ).

Let me know how it goes!

dcaro moved this task from In Progress to Done on the Toolforge (Toolforge iteration 22) board.

@dcaro many thanks! I tried, but I still couldn't make it work.

  1. Looking at the sources of toolforge-cd/deploy-to-toolforge.yaml, the API invocations have a hardcoded string sample-complex-app. For example, on line 30; in total there are four places where this ID appears in the shared script. Perhaps this is left over from a previous version of your code?
  2. How to correctly invoke your shared script from CI? I tried here following your example. However, the CI pipeline only executed the test stage, it did not even try to deploy. Do you see what I'm doing wrong?

> Looking at the sources of toolforge-cd/deploy-to-toolforge.yaml, the API invocations have a hardcoded string sample-complex-app. For example, on line 30; in total there are four places where this ID appears in the shared script. Perhaps this is left over from a previous version of your code?

Oops, changed it. Now you can set the tool name, or it will use the GitLab project name by default :)

> How to correctly invoke your shared script from CI? I tried here following your example. However, the CI pipeline only executed the test stage, it did not even try to deploy. Do you see what I'm doing wrong?

For the included config to work, you have to set the TOOLS_DEPLOY_TOKEN variable, otherwise it will not deploy. I'll try to make that clearer in the docs. In the meantime, can you try with that and see if it helps? (You might need to set the TOOL_NAME variable too if your tool name is different from the GitLab repo name; re-check the example for reference.)

dcaro changed the task status from Open to In Progress. (Jul 22 2025, 12:10 PM)
dcaro moved this task from Done to In Review on the Toolforge (Toolforge iteration 22) board.

@Sascha I'll close again, but do reopen if you still see the issue, thanks!