What we are going with
K8s backed storage, using shapeless custom resources, with versioning/migration handled by the python side of the application (TBD).
Tasks:
- Create a storage layer in the code. This includes creating its own models and using them to save the business/core models, while still using k8s as the source of truth (the data served to users). Add a config switch for changing to the DB as the source of truth
- when listing, it should create the storage job if only the k8s one is found
- Add a counter for "job found in k8s but not in storage", so we can try to track whether many tools (and which ones) create jobs directly in k8s
- Deploy that version
- Create a script for the initial import; test it as needed in toolsbeta several times (it should not be destructive, it should just create the custom resources from the existing jobs). This is potentially just listing all the jobs on each tool
- Thoroughly test the interaction between existing jobs (any job version we have in tools) and the new DB-based ones (what happens if I create a job manually, will it show up? What if I modify one I created through the API? ...)
- Maybe wait a week or so to see if there's any issues
- Flip the config value, and start using the storage as the source of truth (there should be no functional changes)
- Do some more tests (functional + whatever comes to mind)
- Send an email to cloud-announce that toolforge job creation directly through k8s will stop working (as in, those jobs will not show up when running `toolforge jobs list` or through the API), and give a deadline (3 months?) for the change to happen.
- Monitor the prometheus counter "job found in k8s but not in storage"; if any tool creates them, reach out and help them migrate to API-based creation.
- Change the code so that, when listing, it no longer creates the storage job when one is only found in k8s, and instead just ignores it.
Initial brainstorming
This could be:
- A new database (would need some kind of store, or be in trove)
- As a custom resource in k8s (that would be using k8s/etcd as database)
- This should be read-only for users, as we want to only modify it through the API (that way we don't need admission controllers or controllers at all)
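One possible way to get the read-only behavior without admission controllers is plain RBAC: grant tool users only read verbs on the custom resource, and give write verbs only to the API's service account. A hedged sketch (the Role name and namespace are illustrative, not actual Toolforge RBAC):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: toolforge-scheduled-jobs-readonly
  namespace: tool-tf-test
rules:
  - apiGroups: ["jobs-api.toolforge.org"]
    resources: ["toolforge-scheduled-jobs"]
    verbs: ["get", "list", "watch"]  # no create/update/patch/delete
```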
More questions and likely answers (feel free to edit the below section if you have other opinions or ideas)
- which database to use?
- k8s etcd (by defining a custom resource), for example (modified from https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/):

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  # name must match the spec fields below, and be in the form: <plural>.<group>
  name: toolforge-scheduled-jobs.jobs-api.toolforge.org
spec:
  # group name to use for REST API: /apis/<group>/<version>
  group: jobs-api.toolforge.org
  # list of versions supported by this CustomResourceDefinition
  versions:
    - name: v1
      # Each version can be enabled/disabled by the served flag.
      served: true
      # One and only one version must be marked as the storage version.
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                cmd:
                  type: string
                cpu:
                  type: string
                # ...
  # either Namespaced or Cluster
  scope: Namespaced
  names:
    # plural name to be used in the URL: /apis/<group>/<version>/<plural>
    plural: toolforge-scheduled-jobs
    # singular name to be used as an alias on the CLI and for display
    singular: toolforge-scheduled-job
    # kind is normally the CamelCased singular type. Your resource manifests use this.
    kind: ToolforgeScheduledJob
    # shortNames allow shorter strings to match your resource on the CLI
    shortNames:
      - tsj
```
- in what namespace should the database be?
- Each tool's namespace
For example, inside tool-tf-test namespace there would be a bunch of ToolforgeScheduledJob resources defining each scheduled job.
- what are we putting in this database?
- All the information needed to rebuild the user's jobs if needed (that means the stuff we keep in labels plus anything else needed to start that job; it does not include the status, e.g. whether it's running/stopped/etc.)
- (component config) - dc: @Raymond_Ndibe what do you mean by this? ans: just wondering whether we are to save component configs (for component-api) as configmaps or custom resources. We might need to look at custom resources if we need structured versioning.
- what do we do in case of versioning and migration?
- k8s custom resources provide a way to do this using k8s resource versioning
- how do we ensure that the database entries are in sync with the kubernetes objects they represent?
- we don't. If someone manually makes changes to the underlying kubernetes objects of a job, something will probably go wrong. For this reason the database should only be editable by the APIs and otherwise be read-only (haven't thought about how to enforce that yet). If I am not mistaken this is also the current situation, give or take
- on second thought, since the resourceVersion changes on any edit to an object in kubernetes, we can save the resourceVersion of the underlying objects (deployments, cronjobs, jobs, services) in the custom resource. If we have this, we can detect when the underlying objects are out of sync with the custom resource.
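The drift check from the last bullet can be sketched as a simple comparison between the resourceVersions recorded in the custom resource and those of the live objects. The field name `underlyingResourceVersions` and the data shapes here are hypothetical, just to show the idea:

```python
def find_out_of_sync(custom_resource: dict, live_objects: dict[str, dict]) -> list[str]:
    """Return names of underlying objects whose current resourceVersion no
    longer matches the one saved in the custom resource."""
    saved = custom_resource["spec"].get("underlyingResourceVersions", {})
    out_of_sync = []
    for name, obj in live_objects.items():
        live_rv = obj["metadata"]["resourceVersion"]
        if saved.get(name) != live_rv:
            # The object was edited (or created) outside the API.
            out_of_sync.append(name)
    return out_of_sync
```

Note that resourceVersions are opaque strings, so this can only detect that something changed, not what or when; the API would still have to decide how to reconcile.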