Documentation on how to operate the cluster seems to be missing, for example how to restart a given core component when it is suspected of misbehaving.
One example of why this matters is ticket T380832: [jobs-api] crashing, in which an operator responding to an incident did not know how to restart the jobs-api.
We could:
- create docs
  - builds-service https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Build_Service
  - envvars-service https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Envvars_Service
  - jobs-service (review https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Jobs_Service)
- create alerts
  - builds-service https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/21
  - envvars-service https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/22
  - jobs-service
- create automation
- provide training for team members