Currently, tools just default to writing log files onto NFS. While simple, this causes a number of problems:
- It adds load to our NFS server, which is already struggling
- Logrotate is a PITA over NFS
- The logs are not accessible from the API, which prevents using them in other places (like the Toolforge UI)
- K8s container logs are only available for a limited time
What
A service, exposed through the Toolforge API (https://api-docs.toolforge.org) and implemented for now as a CLI subcommand (toolforge logs), that allows you to:
- Get the logs for one of your jobs
- Get the logs for your webservice
- Get the logs for your scheduled jobs (have to think how to expose different runs)
- Get the logs for your one-off jobs
- Get the logs for your builds
- In addition to persisting the logs, we need to record /which logs/ are persisted and where they came from, maybe a table relating a given log to a given container/job plus its access status.
- There should be a retention policy, probably based on size rather than age of the logs (or both, but if possible not age alone).
- Should be extensible enough to allow future use cases like:
- Add platform-level logs (‘your webservice was deployed’ or ‘your build failed’ kind of things), maybe some kind of ‘activity’ log for your tool (similar to the Horizon activity tab for an instance)
- Ideally it should be easy/transparent for users to adopt (if we can just pipe the k8s pods’ output there, awesome), but it should also allow pushing custom logs (for future use cases, like the platform-level ones mentioned above)
- It should only allow members of a given tool to access that tool’s logs (see how other services use the api-gateway for user<->API authentication)
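The “which logs are persisted and where they came from” table and the size-based retention policy above could be sketched roughly as follows. All names and fields here are illustrative assumptions, not an agreed schema:

```python
# Hypothetical sketch of per-stream metadata for the log-index table, plus a
# size-first retention pass. Field names are assumptions for illustration.
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class LogStream:
    tool: str            # owning tool; drives access control
    source: str          # e.g. "webservice", "job", "scheduled-job", "build"
    run_id: str          # distinguishes individual runs of a scheduled job
    container: str       # originating k8s container, if any
    size_bytes: int      # running total, drives size-based retention
    created_at: datetime


def streams_to_evict(streams: list[LogStream], max_total_bytes: int) -> list[LogStream]:
    """Size-first retention: evict the oldest streams until under the quota."""
    total = sum(s.size_bytes for s in streams)
    evict: list[LogStream] = []
    for stream in sorted(streams, key=lambda s: s.created_at):
        if total <= max_total_bytes:
            break
        evict.append(stream)
        total -= stream.size_bytes
    return evict
```

A real implementation would likely combine this with an age cap, but keeping size as the primary trigger bounds storage growth regardless of how chatty a tool is.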
What not
- Not handling logs persisted in the tool’s home directory (NFS), only logs generated from runs in k8s
- Supporting access to the logs as a filesystem (be that NFS or anything else) is out of scope, only through the cli
Access through any other UI (ex. Grafana, Kibana, Graylog, Elasticsearch…) is secondary, and not necessary for this implementation. If one is available and easy to provide, it should be clearly shown (notice banner or something) that it is not stable and should not be relied on. Using the API and toolforge logs is the future-proof way of accessing your tool’s logs (and sufficient for most use cases, our current focus). Like kubectl logs, it might be that we upgrade k8s and the command changes, gets deprecated, or simply stops working without notice.
API structure
There are some “common standards” that we use for Toolforge APIs:
- Generate the code from the OpenAPI definition, or the definition from the code, but don’t maintain it manually (essentially, copy-paste the boilerplate from one of the existing ones, components-api/fastapi or envvars-api/golang); that will also bring the CI stuff and deployment boilerplate
- All APIs have the endpoints
- /v1/healthz endpoint used from the api-gateway
- /v1/metrics endpoint used for prometheus metrics
- /openapi.json endpoint used for the api-gateway
- All endpoints have the prefix /<version>/tool/<toolname>/…, for example /v1/tool/wm-lol/builds (see https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/blob/main/openapi/openapi.yaml?ref_type=heads), the api-gateway adds then another prefix /<api_name> to it (see https://api-docs.toolforge.org)
- All endpoints except /openapi.json and /v1/metrics return a JSON response, wrapped in an object {“messages”: {“info”: [], “warning”: [], “error”: []}, …} where the messages are meta-messages (‘deployment started correctly’, ‘api endpoint deprecated’, ‘unable to find deployment’, …); the cli automatically shows those in colors and such.
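The messages envelope above can be sketched framework-agnostically; the example payload and info message are assumptions, not part of any existing API:

```python
# Sketch of the shared {"messages": {...}, ...} response envelope described
# above. Everything beyond the envelope shape itself is illustrative.
def wrap_response(payload, info=None, warning=None, error=None) -> dict:
    """Wrap an API payload in the common meta-messages envelope."""
    return {
        "messages": {
            "info": info or [],
            "warning": warning or [],
            "error": error or [],
        },
        **payload,
    }


# Example: a hypothetical logs response for /v1/tool/wm-lol/logs
response = wrap_response(
    {"logs": [{"line": "Started webservice", "timestamp": "2024-01-01T00:00:00Z"}]},
    info=["api endpoint is beta"],
)
```

In the real services this wrapping lives in the shared boilerplate (components-api/fastapi or envvars-api/golang), so a new API would copy it rather than reimplement it.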
How
It should be part of the Toolforge set of services; this means it should be deployed in a similar way (using toolforge-deploy), using repos under the toolforge group in GitLab, with CI/testing/etc. similar to what the other Toolforge services use.
It should preferably be deployable in lima-kilo too, as much as it makes sense (up to the implementer where to draw that line, ex. just adding local storage vs. adding local s3 integration, ...).
K8s specifics
K8s stores logs per container: a dir per pod, a dir per container, with log files inside /var/log/pods. Users can currently only see these via ‘toolforge webservice logs’/‘toolforge jobs logs’, which run ‘kubectl logs’ under the hood.
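A collector that pipes these files into the new service could discover them by walking that directory layout. This is only a sketch of the traversal, assuming the dir-per-pod, dir-per-container layout described above:

```python
# Illustrative sketch: map "<pod>/<container>" to its log files under the
# kubelet's /var/log/pods layout. Directory naming details are assumptions.
from pathlib import Path


def discover_container_logs(root: str = "/var/log/pods") -> dict[str, list[str]]:
    """Walk root (dir per pod, dir per container) and collect *.log files."""
    logs: dict[str, list[str]] = {}
    for pod_dir in sorted(Path(root).iterdir()):
        if not pod_dir.is_dir():
            continue
        for container_dir in sorted(p for p in pod_dir.iterdir() if p.is_dir()):
            key = f"{pod_dir.name}/{container_dir.name}"
            logs[key] = sorted(str(f) for f in container_dir.glob("*.log"))
    return logs
```

In practice an off-the-shelf log shipper would do this (and handle rotation, tailing, etc.); the sketch just shows why the per-container layout makes transparent collection feasible.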
Other related tasks:
- T50846: Provide a central logging service for tools (now defunct)
- T97861: [toolforge,infra] Centralized logging for Toolforge infrastructure logs
- T122508: Prevent overly-large log files
- T127368: Estimate hardware requirements for Toolforge logging elastic cluster
- T293672: [tbs.cli] Create an easy way to extract/tail logs from buildpack based webservices