
Decision request - Reuse toolforge user tools central logging for toolforge infrastructure logging
Closed, Resolved · Public

Description

Problem

T386480: [o11y,logging,infra] Deploy Loki to store Toolforge tool log data is setting up a Grafana Loki instance to store logs from Toolforge tool-owned jobs. Loki was chosen for that problem because of its multi-tenant capabilities and otherwise generally simple deployment; that part is not controversial here.

However, T97861: [toolforge,infra] Centralized logging for Toolforge infrastructure logs then raises the question of whether Loki should also be deployed to collect logs from Toolforge infrastructure running in Kubernetes.

Constraints and risks

TBD.

Decision record

In progress

Options

Option A

Tackle the infra centralized logging as its own problem, and leave the Loki solution for tools logging (for now).

This means delaying the implementation of the infra centralized logging and evaluating it on its own, probably as another decision request with the options and the research on them.

This also means persisting ephemeral logs on the workers to avoid losing them in the meantime (T383081: Persist important toolforge k8s components logs).

Pros:

  • No risk of creating a future where Toolforge infrastructure logs are split across two places depending on whether things run inside or outside of Kubernetes
  • Less effort and less additional system complexity compared to the other options
  • Less complexity in the Toolforge setup (no extra components, no extra buckets/S3 dependencies, etc.)
  • Gives time to evaluate the stability/suitability/operability of Loki while it is used for tools, before using it for the infra

Cons:

  • Delays centralized logging for Kubernetes components, making searching logs harder (you might have to go through multiple pods' logs) and requiring temporary hacks like T383081: Persist important toolforge k8s components logs
  • Delays centralized logging for logs from individual VMs
  • Kubernetes logs require small workarounds to read when the K8s API is unavailable (though this is a simpler setup than in the other options)

Option B

Implement a second Loki instance for Kubernetes infrastructure logs as a part of the wider Loki-for-tools project.

In particular, this requires implementing:

  • a second Loki deployment (this can more or less share the Helm values, with some names swapped)
  • a slightly more complicated Alloy log collector configuration to push infrastructure logs to the second deployment
  • a secure way for admins to access the Loki API to query and read logs
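As a rough illustration of the routing this option implies, the sketch below builds a Loki push API (/loki/api/v1/push) payload and picks one of two Loki deployments per log line. This is only a minimal Python sketch of the API shape; the endpoint URLs and the set of "infra" namespaces are hypothetical, and in practice Alloy would do this declaratively rather than via hand-written code:

```python
import json
import time
import urllib.request

# Hypothetical endpoints for the two Loki deployments (names are assumptions).
TOOLS_LOKI = "http://loki-tools.example.internal/loki/api/v1/push"
INFRA_LOKI = "http://loki-infra.example.internal/loki/api/v1/push"

# Namespaces treated as Toolforge infrastructure (illustrative set only).
INFRA_NAMESPACES = {"kube-system", "maintain-kubeusers", "api-gateway"}


def build_push_payload(namespace: str, pod: str, line: str) -> dict:
    """Build a Loki push API payload for a single log line."""
    ts_ns = str(time.time_ns())  # Loki expects nanosecond timestamps as strings
    return {
        "streams": [
            {
                "stream": {"namespace": namespace, "pod": pod},
                "values": [[ts_ns, line]],
            }
        ]
    }


def target_for(namespace: str) -> str:
    """Route infra namespaces to the second Loki deployment, everything else to the tools one."""
    return INFRA_LOKI if namespace in INFRA_NAMESPACES else TOOLS_LOKI


def push(namespace: str, pod: str, line: str) -> None:
    """Send one log line to the appropriate Loki deployment."""
    req = urllib.request.Request(
        target_for(namespace),
        data=json.dumps(build_push_payload(namespace, pod, line)).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # only works with reachable Loki endpoints
```

The point of the sketch is that the "second deployment" cost is mostly configuration: the payload format is identical, only the target endpoint differs.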

Pros:

  • Centralized logs for everything that runs inside the Kubernetes cluster
  • Less initial engineering effort compared to implementing the full solution for all the Toolforge infrastructure logs
  • Low additional system complexity, reusing much of the tools-only Loki setup

Cons:

  • No centralized logging for logs outside k8s
    • Risk of creating a non-unified solution for logs from different sources
  • Kubernetes logs require various workarounds to read when the K8s API is unavailable, and are unreadable when S3 is not available

Option C

Immediately prioritize building a separate solution to collect all Toolforge infrastructure logs.

Pros:

  • A single, unified system for all Toolforge infrastructure logs
  • Past Kubernetes logs still accessible during a K8s API outage

Cons:

  • Unknown-sized but clearly the largest amount of engineering effort of the options. Immediately implementing this will likely require de-prioritizing some other work.
  • Full metadata about Kubernetes pods generating logs might not be available during cluster API outages [1] (pod and namespace are ok, labels might be missing)

[1]: for example the rsyslog Kubernetes metadata source mentioned in T97861#10957908 still requires access to the Kubernetes API to function properly

Option D

Like option B, but using Loki as the central store for all Toolforge infrastructure logs, including those from outside Kubernetes.

Pros:

  • A single, unified system for all Toolforge infrastructure logs
  • Lower overall system complexity (and required engineering effort?) compared to option C, but considerably higher than option A (dc: I think there's a high chance that a non-Loki-based setup will end up simpler than a Loki-based one)

Cons:

  • All logs require various workarounds to read when the K8s API is unavailable, and are unreadable when S3 is not available
  • Higher engineering effort and additional system complexity than option B, considerably higher than option A

Event Timeline

Restricted Application added a subscriber: Aklapper.

[1]: for example the rsyslog solution mentioned in T97861#10957908 still requires access to the Kubernetes API to function properly

I'm not sure it does: it pulls the logs directly from the container log files on the filesystem and does not talk to k8s. It does not depend (as is, at least) on S3 or any other k8s pieces either (e.g. if the Loki deployment can't restart pods because it can't pull images, because a webhook messes it up, because nodes are unhealthy, or essentially any other k8s-related issue).

Can we add another option, E?

Prioritize persisting the logs of the most critical k8s components (that is, just piping them to the journal on the workers, not centralizing yet), and do the infra logs centralization later. That will give us a quick fix for the most critical logs for now (not awesome, but good enough imo), and let us learn from the Loki setup for tools so we can make a better decision for the infra centralized setup later.
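For a sense of how lightweight this stopgap is: on a systemd host, anything written through the standard syslog interface is captured by journald, so "piping to the journal" needs nothing more than the sketch below. This is a minimal Python illustration, not the actual patch in T383081; the maintain-harbor tag is just an illustrative choice:

```python
import syslog


def persist_lines(lines, tag="maintain-harbor"):
    """Forward log lines to syslog; journald captures these on systemd hosts,
    so they survive pod/container deletion. Returns the number of lines sent."""
    syslog.openlog(ident=tag, facility=syslog.LOG_DAEMON)
    count = 0
    try:
        for line in lines:
            syslog.syslog(syslog.LOG_INFO, line.rstrip("\n"))
            count += 1
    finally:
        syslog.closelog()
    return count
```

The persisted lines would then be readable on the worker with `journalctl -t maintain-harbor`, independently of the Kubernetes API.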

[1]: for example the rsyslog solution mentioned in T97861#10957908 still requires access to the Kubernetes API to function properly

I'm not sure it does: it pulls the logs directly from the container log files on the filesystem and does not talk to k8s. It does not depend (as is, at least) on S3 or any other k8s pieces either (e.g. if the Loki deployment can't restart pods because it can't pull images, because a webhook messes it up, because nodes are unhealthy, or essentially any other k8s-related issue).

The documentation I linked says that the rsyslog Kubernetes metadata module seems to require connection details to the Kubernetes API. Is that documentation wrong? Or is that metadata somehow not necessary?

Can we add another option, E?

Prioritize persisting the logs of the most critical k8s components (that is, just piping them to the journal on the workers, not centralizing yet), and do the infra logs centralization later. That will give us a quick fix for the most critical logs for now (not awesome, but good enough imo), and let us learn from the Loki setup for tools so we can make a better decision for the infra centralized setup later.

So by this do you mean merging the patches in T383081 to persist maintain-harbor logs? (As far as I'm aware, that's the only infrastructure cronjob we have.) Or do you see value in duplicating all infrastructure container logs to the journal regardless of whether they're kept by the container runtime already?

The documentation I linked says that the rsyslog Kubernetes metadata module seems to require connection details to the Kubernetes API. Is that documentation wrong? Or is that metadata somehow not necessary?

Missed that link. In my comment on the other task I was not specifically suggesting that solution though, just a generic one. In those patches you can see that the namespace and the pod can easily be extracted from the path to the log itself, so as a start I think that might be good enough. We can investigate how to get the rest of the metadata, if we want anything more. It might also be secondary, as in "get it if you can, but fall back to just adding namespace+pod", to avoid depending on the k8s API at all.

So by this do you mean merging the patches in T383081 to persist maintain-harbor logs? (As far as I'm aware, that's the only infrastructure cronjob we have.) Or do you see value in duplicating all infrastructure container logs to the journal regardless of whether they're kept by the container runtime already?

Those for sure, and any others we might want to persist too if we fear losing them. Usually they are stored already for some time, so running deployments and such should not be critical imo (they also stay in the filesystem for a bit), but I'm happy to review which ones we want persisted if you have other concerns.

Currently I don't see any value in duplicating them, no. It would be quite convenient though if we end up pulling the logs from the journal into whichever central place we store them in, whenever we have that central place (though we could also send them directly from the container logs instead).

The documentation I linked says that the rsyslog Kubernetes metadata module seems to require connection details to the Kubernetes API. Is that documentation wrong? Or is that metadata somehow not necessary?

Missed that link. In my comment on the other task I was not specifically suggesting that solution though, just a generic one. In those patches you can see that the namespace and the pod can easily be extracted from the path to the log itself, so as a start I think that might be good enough. We can investigate how to get the rest of the metadata, if we want anything more. It might also be secondary, as in "get it if you can, but fall back to just adding namespace+pod", to avoid depending on the k8s API at all.

Thanks, fixed in T398285#10963016.

So by this do you mean merging the patches in T383081 to persist maintain-harbor logs? (As far as I'm aware, that's the only infrastructure cronjob we have.) Or do you see value in duplicating all infrastructure container logs to the journal regardless of whether they're kept by the container runtime already?

Those for sure, and any others we might want to persist too if we fear losing them. Usually they are stored already for some time, so running deployments and such should not be critical imo (they also stay in the filesystem for a bit), but I'm happy to review which ones we want persisted if you have other concerns.

Currently I don't see any value in duplicating them, no. It would be quite convenient though if we end up pulling the logs from the journal into whichever central place we store them in, whenever we have that central place (though we could also send them directly from the container logs instead).

Ack. I don't see much value in persisting logs of existing deployments. Please do feel free to add option E to the description, or just fold it into A if the stopgap solution for maintain-harbor is already being worked on.

dcaro renamed this task from Decision request - Toolforge infrastructure logging to Decision request - Reuse toolforge user tools central logging for toolforge infrastructure logging. Jul 1 2025, 2:15 PM
dcaro updated the task description.
dcaro updated the task description.

Updated the task, and reworded some things to reflect the new option A differences and details.

taavi triaged this task as Medium priority. Jul 9 2025, 1:23 PM

I don't have a strong opinion on this; I would probably go for Option A, so that we can use Loki only for tools for a while and later consider extending its usage to infrastructure logs.

Like option B, using Loki as the central store for the logs.

I'm not sure I understand option D: a single Loki instance for both tools logs and infra logs, whereas option B would have two separate Loki instances?

Given finite Taavi availability, A seems like the best option since it's the easiest (and pretty much done). If there's time to do more, then B seems like the obvious next step since it's nice and incremental.

Having separate log solutions for 'prod' and 'user' logs seems pretty normal to me, more likely a feature than a bug. But I may be missing something; does that imply that we eventually have N log solutions, or just 2?

The result from the decision meeting is that we are happy with option A for now but I might also implement option B if I can find the availability for that.

There's no decision record for this?
Do we have the meeting minutes? (I can't see them in Google Calendar.)