Problem
T386480: [o11y,logging,infra] Deploy Loki to store Toolforge tool log data covers setting up a Grafana Loki instance to store logs from Toolforge tool-owned jobs. Loki was chosen for that problem because of its multi-tenant capabilities and otherwise generally simple deployment; that part is not controversial here.
However, T97861: [toolforge,infra] Centralized logging for Toolforge infrastructure logs then raises the question of whether Loki should also be deployed to collect logs from the Toolforge infrastructure running in Kubernetes.
Constraints and risks
TBD.
Decision record
In progress
Options
Option A
Tackle the infra centralized logging as its own problem, and leave the Loki solution to tool logging (for now).
This means delaying the implementation of the infra centralized logging and evaluating it on its own, probably as another decision request with its options and the research on them.
This also means persisting ephemeral logs on the workers to avoid losing them in the meantime (T383081: Persist important toolforge k8s components logs).
Pros:
- No risk of creating a future where Toolforge infrastructure logs are divided in two places depending on whether things run in or out of Kubernetes
- Less effort and less additional system complexity compared to other options
- Less complexity on the toolforge setup (no extra components, no extra buckets/s3 dependencies, etc.)
- Gives time to evaluate the stability/suitability/operability of Loki while it is used for tool logs, before using it for the infra
Cons:
- Delays centralized logging for Kubernetes components, making searching logs harder (as you might have to go through multiple pods' logs) and requiring temporary hacks like T383081: Persist important toolforge k8s components logs
- Delays centralized logging for logs from individual VMs
- Kubernetes logs require small workarounds to read when the K8s API is unavailable (though this is a simpler setup than the other options)
Option B
Implement a second Loki instance for Kubernetes infrastructure logs as a part of the wider Loki-for-tools project.
In particular, this requires implementing:
- a second Loki deployment (this can mostly share the Helm values of the first, with some names swapped)
- slightly more complicated Alloy log collector configuration to push infrastructure logs to the second deployment
- a secure way for admins to access the Loki API to query logs
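As a rough illustration of the Alloy side, the split could look something like the sketch below. This is a hypothetical configuration: the component layout follows Grafana Alloy's Kubernetes log-collection components, but the endpoint URL, tenant ID, and namespace list are made-up placeholders rather than actual Toolforge values.

```alloy
// Hypothetical sketch for option B: collect logs only from infrastructure
// namespaces and push them to a second Loki deployment.

// Discover pods in (placeholder) infrastructure namespaces only.
discovery.kubernetes "infra_pods" {
  role = "pod"
  namespaces {
    names = ["kube-system", "some-infra-namespace"]
  }
}

// Tail logs from those pods and forward them to the infra Loki writer.
loki.source.kubernetes "infra" {
  targets    = discovery.kubernetes.infra_pods.targets
  forward_to = [loki.write.infra.receiver]
}

// Push to the second (infra) Loki deployment under its own tenant.
loki.write "infra" {
  endpoint {
    url       = "http://loki-infra.example.svc:3100/loki/api/v1/push"
    tenant_id = "toolforge-infra"
  }
}
```

A matching pipeline pointed at the tools Loki deployment would cover the non-infrastructure namespaces; selecting by namespace at discovery time keeps the two write paths independent.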
Pros:
- Centralized logs for everything that runs inside the Kubernetes cluster
- Less initial engineering effort compared to implementing the full solution for all the Toolforge infrastructure logs
- Low additional system complexity, reusing much of the tools-only Loki setup
Cons:
- No centralized logging for logs from outside Kubernetes
- Risk of creating a non-unified solution for logs from different sources
- Kubernetes logs require various workarounds to read when the K8s API is unavailable, and are unreadable when S3 is not available
Option C
Immediately prioritize building a separate solution to collect all Toolforge infrastructure logs.
Pros:
- A single, unified system for all Toolforge infrastructure logs
- Past Kubernetes logs still accessible during a K8s API outage
Cons:
- Unknown-sized but still clearly the largest amount of engineering effort of the options. Immediately implementing this will likely require de-prioritizing some other work.
- Full metadata about Kubernetes pods generating logs might not be available during cluster API outages [1] (pod and namespace are ok, labels might be missing)
[1]: for example the rsyslog Kubernetes metadata source mentioned in T97861#10957908 still requires access to the Kubernetes API to function properly
Option D
Like option C, but using Loki as the central store for the logs.
Pros:
- A single, unified system for all Toolforge infrastructure logs
- Lower overall system complexity (and required engineering effort?) compared to option C, though considerably higher than option A (dc: I think that there's a high chance that a non-Loki-based setup will end up simpler than a Loki-based one)
Cons:
- All logs require various workarounds to read when the K8s API is unavailable, and are unreadable when S3 is not available
- Higher engineering effort and additional system complexity than option B, considerably higher than option A