Docs on how to operate the cluster seem to be missing, for example how to restart a given core component in case of suspected misbehavior.
An example of why this is important is {T380832}, in which an operator responding to an incident did not know how to restart the jobs-api.
We could:
* create docs
** [x] builds-service https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Build_Service
** [x] envvars-service https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Envvars_Service
** [ ] jobs-service
* create alerts
** [x] builds-service https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/21
** [x] envvars-service https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/22
** [x] jobs-service
*** [x] jobs-api https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/20
*** [x] jobs-emailer https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-emailer/-/merge_requests/10
* create automation
* have some training for team members