
[toolforge,infra] Centralized logging for Toolforge infrastructure logs
Open · Medium · Public · Feature

Description

One log to rule them all.

It would be good to have Logstash for at least the tools-ops logs, which include:

  • basic system logging (dmesg / systemd)
  • k8s logs (maybe not tools, but kube-system/toolforge components at least)
  • toolforge component deployment actions
  • maintain-harbor (currently this may be logging to the worker's local filesystem; we might not need that anymore if we can pull the logs directly from k8s, see T383081: Persist important toolforge k8s components logs)

Systemd would cover most (if not all) of the VM services (ssh, puppet, redis, nginx, ...).

Being able to then search through it (Kibana style) would help enormously when debugging issues, especially for k8s components whose logs may exist only in the pods or locally on the workers.

Details

Related Changes in GitLab:
Title: logging: Deploy tools Loki buckets
Reference: repos/cloud/toolforge/tofu-provisioning!55
Author: taavi
Source branch: main-If5631649c079906d83e832a7a92149039289c0e8
Dest branch: main

Event Timeline

valhallasw raised the priority of this task to Medium.
valhallasw updated the task description. (Show Details)
valhallasw added a project: Toolforge.
valhallasw added subscribers: coren, scfc, Aklapper, yuvipanda.

10:09 valhallasw`cloud: created toolsbeta-logstash to play around with logstash and figure out what we need for tools (phab:T97861)
10:25 valhallasw`cloud: set Hiera variable "elasticsearch::cluster_name": toolsbeta-logstash-eqiad
10:30 valhallasw`cloud: pulled new changes into puppetmaster to get https://github.com/wikimedia/operations-puppet/commit/4afd23d8e2905a84ef211ad92e8314173eb743ba in
10:37 valhallasw`cloud: that doesn't seem to be applied... setting has_ganglia: false manually in wikitech hiera
11:11 valhallasw`cloud: commenting out include ::elasticsearch::ganglia in role::logstash seems to work. I think we have to write our own tools logstash roles anyway in the end, as the role::logstash code contains e.g. mediawiki specific code

Unfortunately, logstash doesn't actually start and crashes with

Errno::EBADF: Bad file descriptor - Bad file descriptor
          close at org/jruby/RubyIO.java:2097
        connect at /opt/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/connection.rb:173
           each at org/jruby/RubyArray.java:1613
        connect at /opt/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/connection.rb:139
        connect at /opt/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/agent.rb:406
           call at org/jruby/RubyProc.java:271
          fetch at /opt/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/pool.rb:48
        connect at /opt/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/agent.rb:403
        execute at /opt/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/agent.rb:319
           get! at /opt/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/agent.rb:217
       register at /opt/logstash/lib/logstash/outputs/elasticsearch_http.rb:117
           each at org/jruby/RubyArray.java:1613
   outputworker at /opt/logstash/lib/logstash/pipeline.rb:220
  start_outputs at /opt/logstash/lib/logstash/pipeline.rb:152

This turned out to be because Elasticsearch wasn't started. Starting it gets Logstash running, but that still doesn't give us an interface yet...

fgiunchedi renamed this task from Provide centralized logging (logstash) to Provide centralized logging (logstash) for Toolforge.Oct 1 2018, 1:13 PM
fgiunchedi removed a project: Cloud-Services.
fgiunchedi subscribed.

Unlinking from T198756 as Toolforge is out of scope for the current goals, though the design/implementation could equally be applied to Toolforge.

A side note: at this point, one of the only open-source multitenant solutions for this kind of thing seems to be Grafana Loki: https://grafana.com/docs/loki/latest/overview/

Re-opening this, as it isn't really a duplicate. Instead, both this and the other task should sit under a common parent task.

lmata moved this task from Radar to Inbox on the observability board.
lmata subscribed.

Hello, is there something for us (o11y) here, or should we just stay in the loop for potential collaboration? Subscribing and adding to our radar for now.

This is something to discuss and potentially collaborate on. I'll follow-up with you.

dcaro renamed this task from Provide centralized logging (logstash) for Toolforge to [toolforge.infra] Provide centralized logging (logstash) for Toolforge.Feb 21 2024, 10:20 AM
dcaro reopened this task as Open.
dcaro added subscribers: yuvipanda, EBernhardson.
bd808 changed the subtype of this task from "Task" to "Feature Request".Jan 13 2025, 11:13 PM
dcaro renamed this task from [toolforge.infra] Provide centralized logging (logstash) for Toolforge to [toolforge.infra] Provide centralized logging (logstash) for Toolforge platform.Jan 15 2025, 3:05 PM
taavi renamed this task from [toolforge.infra] Provide centralized logging (logstash) for Toolforge platform to [toolforge.infra] Provide centralized logging for Toolforge platform.Jan 15 2025, 3:10 PM

Just so I understand the relationship between this and T127367... this is specifically about logs for admins/infra components, and T127367 is about logs for tools themselves?

Yep, it might not have started like that, but that's the current split: this task is for the Toolforge platform itself (used by Toolforge roots), while the other task is for the tools themselves, used by Toolforge users.

dcaro renamed this task from [toolforge.infra] Provide centralized logging for Toolforge platform to [toolforge,infra] Provide centralized logging for Toolforge platform.Feb 12 2025, 10:15 PM
taavi renamed this task from [toolforge,infra] Provide centralized logging for Toolforge platform to [toolforge,infra] Cntralized logging for Toolforge infrastructure logs.May 23 2025, 2:32 PM
taavi removed a project: observability.
dcaro renamed this task from [toolforge,infra] Cntralized logging for Toolforge infrastructure logs to [toolforge,infra] Centralized logging for Toolforge infrastructure logs.Jun 25 2025, 8:15 AM

@taavi hey, can you update this task with your plans for using Loki for this, and how it fits into the overall picture for centralized infra logs?

I'm concerned that Loki may not be simple/stable enough to use for infra things, especially as we would want those logs when k8s fails/has issues xd

Yes. My plan is to feed the logs of everything running inside the Kubernetes cluster itself to a Loki instance hosted in there. That primarily covers the various components, which need the cluster to work in the first place, and which carry a bunch of metadata from the cluster API that we want to include in the logs (again placing a dependency on the k8s API for any solution).
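Whatever agent ends up shipping these logs, it ultimately produces Loki's push-API request body. A minimal Python sketch of that payload shape (the label names and the example log line are illustrative assumptions, not the actual deployment):

```python
import json
import time


def loki_push_payload(labels, lines):
    """Build a request body for Loki's HTTP push API (/loki/api/v1/push).

    Loki groups entries into streams by label set, and expects each entry
    as a [nanosecond-timestamp-as-string, log-line] pair.
    """
    now_ns = str(time.time_ns())
    return {
        "streams": [
            {
                "stream": labels,
                "values": [[now_ns, line] for line in lines],
            }
        ]
    }


# Hypothetical example: a log line from a kube-system component.
payload = loki_push_payload(
    {"namespace": "kube-system", "component": "kube-apiserver"},
    ["apiserver started"],
)
# This JSON body would then be POSTed to the in-cluster Loki endpoint.
print(json.dumps(payload, indent=2))
```

In practice an off-the-shelf collector (Promtail, Alloy, fluent-bit, ...) builds this for you and attaches the pod/namespace labels from the cluster API, which is exactly the metadata dependency described above.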

It doesn't give a general solution for logging things that run outside the cluster (nor for the cluster infrastructure itself), but it doesn't make the logging situation there any worse. And I think those two problems will need fairly different solutions anyway; for example, VM logs IMHO should be handled via the rsyslog daemon running on each VM (which in fact doesn't work with Loki).
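For the VM side, the rsyslog approach would boil down to a per-host forwarding rule. A rough sketch, assuming a hypothetical central aggregator host (the hostname, file name, and port are assumptions, not an actual deployment):

```conf
# Hypothetical /etc/rsyslog.d/30-central.conf on a Toolforge VM:
# forward everything to a central aggregator over TCP, with a disk-assisted
# queue so logs survive short aggregator outages.
*.* action(type="omfwd"
           target="logging-aggregator.example.wmcloud.org"
           port="514"
           protocol="tcp"
           queue.type="LinkedList"
           queue.filename="central_fwd"
           action.resumeRetryCount="-1")
```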

Hmm, does this mean that you foresee having two centralized places for logs? It's better than not having any, but I would prefer having only one place xd

As far as I understand, having container logs from k8s pushed into rsyslog is actually possible (probably as a nicer implementation of T383081: Persist important toolforge k8s components logs). Can't we use that to log everything we want there, instead of having two solutions?
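One way to do that would be rsyslog's imfile module tailing the container log files the kubelet writes on each worker. A rough sketch (the path is the conventional kubelet location; the tag and facility are assumptions):

```conf
# Hypothetical rsyslog input on a k8s worker: tail the container log files
# the kubelet writes and tag them for forwarding/filtering.
module(load="imfile")
input(type="imfile"
      file="/var/log/containers/*.log"
      tag="k8s-container:"
      severity="info"
      facility="local0")
```

Note this only captures what lands on the worker's filesystem; it would not carry the cluster-API metadata (labels, namespaces) that a Loki collector attaches.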

I'm not saying that we should not split it in two, but I would like to understand the rationale in more detail before doing so, as I think it loses some of its value and potentially complicates the system (by having two very different ways of collecting, storing and exposing logs).

I agree that in an ideal world with infinite engineering resources, it would be great to have some system to collect all the Toolforge infrastructure logs, both from hosts and from in K8s, instead of separate systems for that. Unfortunately we do not currently live in such an ideal world.

I would much prefer to re-use the work I'm doing for tool logging in Loki here to get at least a partial solution that gives us centralized logging for a major, well-defined part of our operational logs (which also happens to be the part where the logs are most quickly lost if you do nothing, as that ticket demonstrates), instead of doing nothing now and having absolutely no central logging until we engineer a full solution that catches all the logs.

(T383081: Persist important toolforge k8s components logs seems like a partial dupe of this, so unless you have objections I'll merge that here!)

I'm not convinced that we are in such a dire situation that the extra complexity is worth maintaining. Can we discuss it in a decision request?

That'd help both make sure we explore the major options and consequences, and align ourselves.

In the meantime, this does not have to stop or modify the work on the tools logging; it should be independent. So we can continue with that, and it will also help us get experience with Loki and the setup, and inform the infra decision.

T383081: Persist important toolforge k8s components logs is a temporary hack to persist, to some extent, the logs of some of the critical components we have. It's not meant to be a centralized solution in any way (it just persists the logs in the syslog of the worker/controller, nothing else), so I think it's not really a duplicate.

taavi removed taavi as the assignee of this task.Aug 29 2025, 1:37 PM
taavi subscribed.

Un-claiming for now.