
Ease toolforge api usage from within pods
Closed, Duplicate · Public · Feature

Description

Feature summary (what you would like to be able to do and where):

From within a Kubernetes pod, interact with the API gateway (e.g. the jobs API) using the tool's credentials.

The equivalent from within the account on a bastion host would be something like

from toolforge_weld.api_client import ToolforgeClient
from toolforge_weld.config import load_config
from toolforge_weld.kubernetes_config import Kubeconfig

def main():
    config = load_config("test-job")
    client = ToolforgeClient(server=config.api_gateway.url, kubeconfig=Kubeconfig.load(), user_agent="Example Job")
    print(client.get("/jobs/v1/tool/cluebotng-trainer/jobs/"))


if __name__ == '__main__':
    main()

This however doesn't work nicely from within a pod:

(1)
Context: within a job, no NFS mount
Fails: `Kubeconfig.load()` due to no kubeconfig file
Fails: `Kubeconfig.from_container_service_account()` due to an invalid SSL cert & a 403 while using the service account

(2)
Context: within a build service pod, with NFS mount
Fails: `Kubeconfig.load()` due to the kubeconfig file not being found (HOME = /workspace; TOOL_DATA_DIR is not checked for the file)
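Pending a fix, tool code can at least try each credential source in order and report why every one of them was rejected. A minimal sketch — the loader callables here are placeholders standing in for e.g. `Kubeconfig.load` and `Kubeconfig.from_container_service_account`, not part of any real API:

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def first_working(loaders: Sequence[Callable[[], T]]) -> T:
    """Try each credential loader in order, returning the first that succeeds.

    Collects the per-loader failures so a pod log shows *why* every source
    was rejected (missing kubeconfig, bad service-account cert, ...).
    """
    errors = []
    for loader in loaders:
        try:
            return loader()
        except Exception as exc:  # each loader raises its own error type
            errors.append(f"{getattr(loader, '__name__', repr(loader))}: {exc}")
    raise RuntimeError("no usable credentials:\n" + "\n".join(errors))
```

In a pod this would be called with the real loaders, e.g. `first_working([Kubeconfig.load, Kubeconfig.from_container_service_account])`.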

Additionally, there is no environment variable for the tool name/user name/namespace, which makes using the functions/API clunky (we can parse the name out of $TOOL_DATA_DIR, but should that even be set when there is no NFS mount?).
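Until a dedicated variable exists, the tool name can be resolved with a fallback chain. A sketch — note TOOL_NAME is the variable *proposed* in this task, not one that is set today:

```python
import os
from pathlib import PurePosixPath


def resolve_tool_name(environ=os.environ) -> str:
    """Best-effort tool name resolution from the pod environment."""
    # Preferred: a dedicated variable (proposed in this task, not set today).
    if name := environ.get("TOOL_NAME"):
        return name
    # Fallback: the last path component of TOOL_DATA_DIR,
    # e.g. /data/project/cluebotng-trainer -> cluebotng-trainer.
    if data_dir := environ.get("TOOL_DATA_DIR"):
        return PurePosixPath(data_dir).name
    raise RuntimeError(
        "cannot determine tool name: neither TOOL_NAME nor TOOL_DATA_DIR is set"
    )
```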

Use case(s) (list the steps that you performed to discover that problem, and describe the actual underlying problem which you want to solve. Do not describe only a solution):

I have a job (https://github.com/cluebotng/trainer) which walks a number of sub-jobs through a sequence of steps; each "step" gets a clean environment (seeded with the artefacts of the previous "step").

You can imagine this is similar to what something like Airflow would provide with a DAG based workflow model:

cbng-trainer run-edit-sets
`-> creates coord-xxx job via jobs api
    `-> creates xx-download job
    `-> creates xx-train job
    `-> creates xx-trial job

`-> creates coord-xxx job via jobs api
    `-> creates xx-download job
    `-> creates xx-train job
    `-> creates xx-trial job

(The coordination jobs are created via the jobs API and the steps directly in Kubernetes; however, the goal is to do everything via the jobs API, to be able to use e.g. logs.)

The results of each "step" are stored under https://cluebotng-trainer.toolforge.org (providing a very basic "object store") for later usage.

The api calls are constructed using toolforge_weld along the following lines:

def _client_config():
    config = load_config("cluebotng-trainer")
    return ToolforgeClient(
        server=f"{config.api_gateway.url}",
        kubeconfig=Kubeconfig.load(),
        user_agent="ClueBot NG Trainer",
    )

To get a (minimal) Kubernetes config (for Kubeconfig.load()) without relying on NFS, we have a wrapper script that writes out the file from envvars.

Relevant code snippet:

if [ ! -f "$HOME/.kube/config" ];
then
  mkdir -p /workspace/.kube

  echo "$K8S_CLIENT_CRT" > /workspace/.kube/client.crt
  echo "$K8S_CLIENT_KEY" > /workspace/.kube/client.key

  cat > /workspace/.kube/config <<EOF
apiVersion: v1
clusters:
- cluster:
    insecure-skip-tls-verify: true
    server: ${K8S_SERVER}
  name: toolforge
contexts:
- context:
    cluster: toolforge
    namespace: tool-cluebotng-trainer
    user: tf-cluebotng-trainer
  name: toolforge
current-context: toolforge
kind: Config
users:
- name: tf-cluebotng-trainer
  user:
    client-certificate: /workspace/.kube/client.crt
    client-key: /workspace/.kube/client.key
EOF

  export KUBECONFIG="/workspace/.kube/config"
fi

This is a bit clunky, as it requires the client crt/key to be loaded into envvars, and it will break whenever the credentials are recreated.
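The same workaround can be expressed in Python (stdlib only). This is a sketch, not part of any Toolforge tooling: the env var names match the shell snippet above, and the kubeconfig layout is the one it writes.

```python
import os
from pathlib import Path

# Mirrors the kubeconfig the shell wrapper writes; tool/cert paths are
# filled in per call.
KUBECONFIG_TEMPLATE = """\
apiVersion: v1
clusters:
- cluster:
    insecure-skip-tls-verify: true
    server: {server}
  name: toolforge
contexts:
- context:
    cluster: toolforge
    namespace: tool-{tool}
    user: tf-{tool}
  name: toolforge
current-context: toolforge
kind: Config
users:
- name: tf-{tool}
  user:
    client-certificate: {base}/client.crt
    client-key: {base}/client.key
"""


def write_kubeconfig(base_dir: str, tool: str, environ=os.environ) -> str:
    """Materialise a minimal kubeconfig from env vars, mirroring the shell wrapper."""
    base = Path(base_dir)
    base.mkdir(parents=True, exist_ok=True)
    (base / "client.crt").write_text(environ["K8S_CLIENT_CRT"])
    (base / "client.key").write_text(environ["K8S_CLIENT_KEY"])
    config = base / "config"
    config.write_text(
        KUBECONFIG_TEMPLATE.format(server=environ["K8S_SERVER"], tool=tool, base=base)
    )
    return str(config)
```

A caller would then point KUBECONFIG at the returned path, e.g. `os.environ["KUBECONFIG"] = write_kubeconfig("/workspace/.kube", "cluebotng-trainer")`.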

Thus this request falls into 2 parts:

(1)
In addition to the variables listed at https://wikitech.wikimedia.org/wiki/Help:Toolforge/Envvars#Globally_set_environment_variables, can an environment variable be set for the tool name (suitable either for constructing the username/namespace or for querying tool data, e.g. via the jobs API)?

(2)
Expose credentials that can be used to access the internal APIs (e.g. jobs). This could be done either via the service account, or by also loading the same client cert/key that is written to NFS into envvars (similar to the database credentials).

"Normally" this would be done via the service account or ephemeral credentials, but given the exposure of the home dir via NFS and database credentials via envvars, re-using the tool credentials seems acceptable.

On the flip side, allowing the service account would save the effort of having to maintain the envvar entries.

Conceptually, something like this should work within a clean (Python + toolforge_weld) pod:

import os

from toolforge_weld.api_client import ToolforgeClient
from toolforge_weld.config import load_config
from toolforge_weld.kubernetes_config import Kubeconfig

tool_name = os.environ.get('TOOL_NAME')
config = load_config(tool_name)
client = ToolforgeClient(
    server=f"{config.api_gateway.url}",
    kubeconfig=Kubeconfig.from_container_service_account(namespace=f'tool-{tool_name}'),
    user_agent="ClueBot NG Trainer",
)
print(client.get(f"/jobs/v1/tool/{tool_name}/jobs/"))

(Today this fails the SSL cert check, and returns a 403 with SSL verification turned off.)

Benefits (why should this be implemented?):

This would make scheduling of pods via the "supported method" significantly easier for users.

It would enable more flexible usage of toolforge jobs, supporting more diverse workloads.

"Native" support reduces the overhead and complexity of needing to manage duplicating credentials on the maintainers end.

Event Timeline

Hi @DamianZaremba, can you please associate one or more active project tags with this task (via the Add Action... → Change Project Tags dropdown)? That will allow others to see the task when looking at project workboards or searching for tasks in certain projects, and to get notified about the task when watching a related project tag. Thanks!

Aklapper renamed this task from Ease `toolfoge_weld` usage from within pods to Ease `toolforge_weld` usage from within pods.Aug 3 2025, 3:54 PM

Sorry, I forgot that it doesn't pick up the context of the workboard currently being viewed... added Toolforge.

The cert does appear to get rotated? The trigger job started failing with:

Exception: Failed to create coord-legacy-report-interface-import: [400] <html>
<head><title>400 The SSL certificate error</title></head>
<body>
<center><h1>400 Bad Request</h1></center>
<center>The SSL certificate error</center>
<hr><center>nginx/1.21.0</center>
</body>
</html>

Copying the crt/key into envvars allowed it to work again.

Note that toolforge_weld was created for internal Toolforge services, so it's not meant to be used by users (that's what it's tuned for); it's internal to Toolforge and might get changed or go away without notice. We have a plan to generate a proper client using the OpenAPI spec, but for now there's no official Toolforge Python library.

The only supported interfaces are using the API directly (e.g. with requests, though you have to load the certs yourself), or using the clients from the shell (toolforge ...), which currently require you to mount NFS.
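"Using the API directly" with client-cert auth can be sketched with the stdlib alone (urllib rather than requests). The gateway URL here is a placeholder, and the cert/key paths would come from the tool's NFS home or envvars; only the endpoint path is taken from the examples in this task:

```python
import ssl
import urllib.request

# Placeholder -- the real gateway URL comes from the Toolforge config.
API_GATEWAY = "https://api.example.invalid"


def jobs_url(tool_name: str) -> str:
    # Endpoint path as used elsewhere in this task: /jobs/v1/tool/<tool>/jobs/
    return f"{API_GATEWAY}/jobs/v1/tool/{tool_name}/jobs/"


def make_opener(cert_path: str, key_path: str) -> urllib.request.OpenerDirector:
    # Client-certificate TLS auth, i.e. "load the certs yourself".
    ctx = ssl.create_default_context()
    ctx.load_cert_chain(certfile=cert_path, keyfile=key_path)
    return urllib.request.build_opener(urllib.request.HTTPSHandler(context=ctx))


# Usage (requires real credentials and the real gateway URL):
#   opener = make_opener(".kube/client.crt", ".kube/client.key")
#   body = opener.open(jobs_url("cluebotng-trainer")).read()
```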

That said, allowing people to use the API without having to mount NFS would be really great :)

Something we can try is:

  • store the certs in secrets, and mount them on the pods automatically
  • store the tool name in an envvar (e.g. $TOOL_NAME)

With that, the clients would have access to everything they need without NFS.

I'm fine with using requests directly and handling what the shim layer does myself.

Perhaps there should be a note on https://pypi.org/project/toolforge-weld/ if this is not intended to be user facing.

This task is more about how to get the secrets without NFS (otherwise I can just mount nfs and call the toolforge cli command).

I'll tweak the description to be a bit more generic.

DamianZaremba renamed this task from Ease `toolforge_weld` usage from within pods to Ease toolforge api usage from within pods.Aug 7 2025, 4:34 PM

Perhaps there should be a note on https://pypi.org/project/toolforge-weld/ if this is not intended to be user facing.

Yep, we can make it clearer; "Shared Python code for Toolforge infrastructure components." is probably not enough. I'll send a quick patch.