[jobs-api] Create storage layer, and save business models in persistent storage
Open, In Progress, High, Public

Description

What we are going with

K8s-backed storage, using shapeless custom resources, with versioning/migration handled by the Python side of the application (TBD).

NOTE: these custom resources are not an interface with the system. Users, other APIs and admins should not access them directly except for debugging (same as if they were stored in a MariaDB instance instead).
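
Since the versioning/migration approach is still TBD, the following is only a minimal sketch of what Python-side migration of a shapeless resource could look like. The `schema_version` field, the `MIGRATIONS` table and the `cmd` to `command` rename are all hypothetical and not taken from jobs-api.

```python
from typing import Any, Callable

# Hypothetical: latest schema version the application understands.
CURRENT_VERSION = 2


def _v1_to_v2(data: dict[str, Any]) -> dict[str, Any]:
    # Hypothetical change: v2 renames "cmd" to "command".
    migrated = {k: v for k, v in data.items() if k != "cmd"}
    migrated["command"] = data["cmd"]
    migrated["schema_version"] = 2
    return migrated


# Maps a stored version to the function that upgrades it by one step.
MIGRATIONS: dict[int, Callable[[dict[str, Any]], dict[str, Any]]] = {1: _v1_to_v2}


def migrate(data: dict[str, Any]) -> dict[str, Any]:
    """Upgrade a stored payload step by step until it reaches CURRENT_VERSION."""
    # Payloads saved before versioning existed are treated as version 1.
    version = data.get("schema_version", 1)
    while version < CURRENT_VERSION:
        data = MIGRATIONS[version](data)
        version = data["schema_version"]
    return data
```

With something like this, the custom resource itself can stay shapeless, and the application upgrades old payloads when it reads them.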

Tasks:

  • Create the layer in the code. This includes creating its own models and using them to save the business/core models, while still using k8s as the source of truth (the data served to users). Add a config switch for changing to the DB as source of truth
    • It should create the storage job if only the k8s one is found
  • Add a counter for "job found in k8s but not in storage", so we can track whether there are many tools (and which) that create jobs directly in k8s
  • Deploy that version
  • Create a script for the initial import and test it in toolsbeta several times if needed (it should not be destructive, it should just create the custom resources from the existing jobs). This is potentially just a matter of listing all the jobs on each tool
  • Test thoroughly the interaction between existing jobs (any job version we have in tools) and the new DB-based ones (what happens if I create a job manually, will it show up? What if I modify one I created through the API? ...)
  • Maybe wait a week or so to see if there are any issues
  • Flip the config value and start using the storage as source of truth (there should be no functional changes)
  • Do some more tests (functional + whatever comes to mind)
  • Send an email to cloud-announce saying that toolforge job creation through k8s will stop working (as in, users will not see those jobs when running toolforge jobs list or through the API), and give a deadline (3 months?) for the change to happen
  • Monitor the Prometheus counter "job found in k8s but not in storage". If any tool creates such jobs, reach out and help migrate to API-based creation
  • Change the code so that, when listing, it does not create the storage job when the job is only found in k8s, and instead just ignores it
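
The listing behaviour described in the tasks above could be sketched roughly as follows. This is an illustration with in-memory dictionaries standing in for k8s and the storage layer. The class name, the two flags and the plain integer counter are assumptions, not the actual jobs-api code.

```python
from dataclasses import dataclass, field


@dataclass
class JobLister:
    """Merges jobs seen in k8s with jobs saved in the storage layer."""

    k8s_jobs: dict[str, dict]  # stand-in for jobs read from k8s objects
    storage: dict[str, dict] = field(default_factory=dict)  # stand-in for the custom resources
    storage_is_source_of_truth: bool = False  # hypothetical config switch
    backfill_missing: bool = True  # switched off in the final task
    missing_in_storage: int = 0  # stand-in for the Prometheus counter

    def list_jobs(self) -> dict[str, dict]:
        for name, job in self.k8s_jobs.items():
            if name in self.storage:
                continue
            # Job found in k8s but not in storage: it bypassed the API.
            self.missing_in_storage += 1
            if self.backfill_missing:
                self.storage[name] = dict(job)
        if self.storage_is_source_of_truth:
            return dict(self.storage)
        return dict(self.k8s_jobs)
```

A real counter would probably carry a label for the tool, so that the tools creating jobs directly in k8s can be identified. With `backfill_missing=False` and `storage_is_source_of_truth=True`, jobs that exist only in k8s are counted but no longer listed, which matches the last task.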

Initial brainstorming

This could be:

  • A new database (would need some kind of store, or be in trove)
  • As a custom resource in k8s (that would be using k8s/etcd as database)
    • This should be read-only for users, as we want to only modify it through the API (that way we don't need admission controllers or controllers at all)

More questions and likely answers (feel free to edit the below section if you have other opinions or ideas)

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  # name must match the spec fields below, and be in the form: <plural>.<group>
  name: toolforge-scheduled-jobs.jobs-api.toolforge.org
spec:
  # group name to use for REST API: /apis/<group>/<version>
  group: jobs-api.toolforge.org
  # list of versions supported by this CustomResourceDefinition
  versions:
    - name: v1
      # Each version can be enabled/disabled by Served flag.
      served: true
      # One and only one version must be marked as the storage version.
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                cmd:
                  type: string
                cpu:
                  type: string
                ...

  # either Namespaced or Cluster
  scope: Namespaced
  names:
    # plural name to be used in the URL: /apis/<group>/<version>/<plural>
    plural: toolforge-scheduled-jobs
    # singular name to be used as an alias on the CLI and for display
    singular: toolforge-scheduled-job
    # kind is normally the CamelCased singular type. Your resource manifests use this.
    kind: ToolforgeScheduledJob
    # shortNames allow shorter string to match your resource on the CLI
    shortNames:
    - tsj
  • in what namespace should the database be?
    • Each tool's namespace

For example, inside tool-tf-test namespace there would be a bunch of ToolforgeScheduledJob resources defining each scheduled job.
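
As an illustration, the body of one such resource could be built like this. It is a sketch only: the job name `my-job` and the helper function are hypothetical, and only the `cmd` and `cpu` fields shown in the definition above are filled in.

```python
def build_scheduled_job(namespace: str, name: str, cmd: str, cpu: str) -> dict:
    """Build the body of a ToolforgeScheduledJob custom resource."""
    return {
        # <group>/<version> from the CustomResourceDefinition above.
        "apiVersion": "jobs-api.toolforge.org/v1",
        "kind": "ToolforgeScheduledJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"cmd": cmd, "cpu": cpu},
    }


# "my-job" is a made-up job name inside the tool's namespace.
manifest = build_scheduled_job("tool-tf-test", "my-job", "./run.sh", "500m")
```

If the official `kubernetes` Python client were used, a body like this could be passed to `CustomObjectsApi.create_namespaced_custom_object` together with the group, version, namespace and plural name.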

  • what are we putting in this database?
    • All the information needed to rebuild the user's jobs if needed (that means the stuff we keep in labels plus anything else needed to start that job. It does not include the status, for example whether it's running/stopped/etc.)
    • (component config) - dc: @Raymond_Ndibe what do you mean by this? ans: just wondering whether we should save component configs (for component-api) as configmaps or custom resources. We might need to look at custom resources if we need structured versioning.
  • what do we do in case of versioning and migration?
    • k8s custom resources provide a way to do this using k8s resource versioning
  • how do we ensure that the database entries are in sync with the kubernetes objects they represent?
    • we don't. If someone manually makes changes to the underlying kubernetes objects of a job, something will probably go wrong. For this reason the database should only be editable by the APIs and otherwise be read-only (haven't thought about how to enforce that yet). If I am not mistaken, this is also the current situation, give or take
    • on second thought, since the resourceVersion changes on any edit to an object in kubernetes, we can save the resourceVersion of the underlying objects (deployments, cronjobs, jobs, services) in the custom resource. With that, we can detect when the underlying objects are out of sync with the custom resource object.
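
The resourceVersion idea from the last point could look roughly like this. It is a minimal sketch that assumes the saved and live versions have already been collected into dictionaries. The function name and the `kind/name` keys are made up for illustration.

```python
def find_out_of_sync(saved_versions: dict[str, str], live_versions: dict[str, str]) -> list[str]:
    """Return the underlying objects whose resourceVersion changed, or that are gone."""
    # resourceVersion is an opaque string, so it is only compared for equality.
    return sorted(
        name
        for name, saved in saved_versions.items()
        if live_versions.get(name) != saved
    )
```

One caveat: the resourceVersion also changes when a controller updates an object's status, so a check like this would likely flag changes that are not manual edits and would need some filtering on top.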

Related Objects

Status       Subtype  Assigned          Task
Resolved              LucasWerkmeister
Resolved              matmarex
Resolved              Legoktm
Resolved              Legoktm
In Progress           dcaro
Resolved              dcaro
In Progress           komla
Resolved              dcaro
Resolved              dcaro
Resolved              Raymond_Ndibe
Open                  None
Open                  None
Stalled      Feature  Raymond_Ndibe
Resolved     Feature  Raymond_Ndibe
Stalled               Raymond_Ndibe
Resolved              Raymond_Ndibe
In Progress           dcaro

Event Timeline

I would be happy to talk about this re-architecture idea. I can share a bit more info about what I tested in the past, and what architecture I had in mind when I first created this, although the code is maybe self-explanatory already.

That'd be useful yes, I've added an entry in the toolforge meeting tomorrow to have a chat there, happy to chat somewhere else too though if you prefer

made some attempt to define some things and answer some important questions in the task description, based on our discussion @dcaro. Input and possible modifications are welcome

thanks! I think there's still some confusion xd, feel free to send me an invite for a quick chat and we can clarify further

Raymond_Ndibe changed the task status from Open to In Progress. Jul 2 2024, 1:14 PM

raymond-ndibe opened https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/101

[jobs-api] custom resource definitions deployment templates

I saw this patch, and I don't like the tradeoff with the additional complexity that this approach introduces.

What is the problem we are trying to solve? I don't remember and I can't find where that has been recorded.

If it is just that we would like to store the original user command somewhere, we can use an annotation.

Beware, label values and similar have limitations on what characters they can store.

Annotation values don't have the same limitations as label values.

  • they can include pretty much any character
  • all the annotations on an object can add up to 256KB in size
  • we could even store a YAML- or JSON-like string inside an annotation. This is what the kubectl.kubernetes.io/last-applied-configuration annotation contains

Example:

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    toolforge-jobs-framework-data: |
      {
        "original-command": "some-command.sh --with 'arguments'"
      }
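
To make the difference between labels and annotations concrete, here is a small sketch that checks the example command against the Kubernetes label-value rules and serializes it as JSON for an annotation instead. The helper names are made up for illustration and are not part of jobs-api.

```python
import json
import re

# Kubernetes label values: at most 63 characters, either empty or alphanumeric
# at both ends, with only alphanumerics, "-", "_" and "." in between.
LABEL_VALUE_RE = re.compile(r"(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?")


def is_valid_label_value(value: str) -> bool:
    return len(value) <= 63 and LABEL_VALUE_RE.fullmatch(value) is not None


def to_annotation(data: dict) -> str:
    """Serialize job data as a JSON string usable as an annotation value."""
    return json.dumps(data, sort_keys=True)


command = "some-command.sh --with 'arguments'"
annotation = to_annotation({"original-command": command})
```

The command contains spaces and quotes, so it is rejected as a label value, while the JSON string round-trips through the annotation unchanged.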

I saw this patch, and I don't like the tradeoff with the additional complexity that this approach introduces.

What is the problem we are trying to solve? I don't remember and I can't find where that has been recorded.

From a previous comment in this task:

I would be happy to talk about this re-architecture idea. I can share a bit more info about what I tested in the past, and what architecture I had in mind when I first created this, although the code is maybe self-explanatory already.

That'd be useful yes, I've added an entry in the toolforge meeting tomorrow to have a chat there, happy to chat somewhere else too though if you prefer

https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Monthly_meeting/2024-03-12

The notes are not very detailed, but that was a long meeting (with you involved) in which we decided to go ahead with the separation of concerns. See also the parent task T359804: [jobs-api] Refactor before webservice support. As I said in the meeting and in this task afterwards, I'm happy to have another chat if you want to clarify things further.

If it is just that we would like to store the original user command somewhere, we can use an annotation.

It's not. That's only one of several issues with trying to embed jobs-api models into existing k8s API resources (the biggest one being that they don't match 1:1, another one being that it's hard to keep track of their versioning).

dcaro renamed this task from [jobs-api] Save business models in a DB to [jobs-api] Create storage layer, and save business models in persistent storage. Mar 27 2025, 8:52 AM
dcaro updated the task description. (Show Details)
Raymond_Ndibe reopened this task as In Progress.
Raymond_Ndibe moved this task from In Review to Done on the Toolforge (Toolforge iteration 19) board.
Raymond_Ndibe moved this task from Done to In Review on the Toolforge (Toolforge iteration 19) board.
dcaro changed the task status from In Progress to Stalled. May 6 2025, 12:33 PM
dcaro changed the task status from Stalled to In Progress. Oct 8 2025, 1:45 PM

group_203_bot_f4d95069bb2675e4ce1fff090c1c1620 opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1169

jobs-api: bump to 0.0.474-20260305094134-618549fd

group_203_bot_f4d95069bb2675e4ce1fff090c1c1620 opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1170

jobs-api: bump to 0.0.475-20260305111008-ede9b4bf

dcaro changed the task status from Stalled to In Progress. Mar 5 2026, 2:56 PM
dcaro added a project: tools-platform-team.
dcaro moved this task from Todos to In progress on the tools-platform-team board.
dcaro added a subscriber: Raymond_Ndibe.