
[toolforge,jobs-api,webservice,storage] Provide modern, non-NFS log solution for Toolforge tools
Open, High, Public

Description

Currently, tools just default to writing log files onto NFS. While simple, this causes a number of problems:

  • It adds additional load on our NFS server, which is already struggling
  • Logrotate is a PITA with NFS
  • The logs are not accessible from the API, which prevents using them in other places (like the Toolforge UI)
  • K8s container logs are only available for a limited time

What

A service, exposed through the Toolforge API (https://api-docs.toolforge.org) and implemented for now as a cli subcommand (toolforge logs), that allows you to:

  • Get the logs for one of your jobs
  • Get the logs for your webservice
  • Get the logs for your scheduled jobs (we have to think about how to expose different runs)
  • Get the logs for your one-off jobs
  • Get the logs for your builds
  • In addition to persisting the logs, we need to persist what logs are persisted and where they came from, maybe a table relating a given log to a given container/job + access status.
  • There should be a retention policy, probably based on size, more than age of the logs (or both, but if possible not only age).
  • Should be extensible enough to allow future use cases like:
    • Add platform-level logs ("your webservice was deployed" kind of thing, or "failed to build"), maybe some kind of 'activity' log for your tool (similar to the Horizon activity tab for an instance)
  • Ideally it should be easy/transparent for users to use (if we can just pipe the k8s pods' output there, awesome), but it should also allow pushing custom logs (for future use cases, like the platform-level ones mentioned above)
  • It should only allow members of a specific tool to access that tool's logs (see how other services use the api-gateway for user<->api authentication)
What not
  • Not handling logs persisted in the tools home (NFS), only logs generated from the run in k8s
  • Supporting access to the logs as a filesystem (be that NFS or anywhere else) is out of scope; access is only through the cli

Access to any other UI (ex. Grafana, Kibana, Graylog, Elasticsearch…) is secondary, and not necessary for this implementation. If available and easy to implement, it should be clearly shown (notice banner or something) that it is not stable and should not be relied on. Using the api and toolforge logs is the future-proof way of accessing your tool's logs (and sufficient for most use cases, our current focus). Like kubectl logs, those UIs might change, get deprecated, or just stop working without notice when we upgrade k8s.

API structure

There are some "common standards" that we use for Toolforge APIs:

  • Generate the code from the OpenAPI definition or the definition from the code, but don't maintain it manually (essentially, copy-paste the boilerplate from one of the existing ones, components-api/fastapi or envvars-api/golang); that will also bring the CI and deployment boilerplate
  • All APIs have the endpoints
    • /v1/healthz endpoint used from the api-gateway
    • /v1/metrics endpoint used for prometheus metrics
    • /openapi.json endpoint used for the api-gateway
  • All endpoints have the prefix /<version>/tool/<toolname>/…, for example /v1/tool/wm-lol/builds (see https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/blob/main/openapi/openapi.yaml?ref_type=heads), the api-gateway adds then another prefix /<api_name> to it (see https://api-docs.toolforge.org)
  • All endpoints except /openapi.json and /v1/metrics return a JSON response, wrapped in an object {"messages": {"info": [], "warning": [], "error": []}, …} where the messages are meta-messages ('deployment started correctly', 'api endpoint deprecated', 'unable to find deployment', …); the cli automatically shows those in colors and such.
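As a sketch, a call through the api-gateway and the envelope it returns might look like this (the "logs" api name, the endpoint path, and the example message are assumptions for illustration, not the real schema):

```shell
# Hypothetical request, following the /<api_name>/<version>/tool/<toolname>/...
# convention ("logs" as the api name is an assumption):
#   curl https://<api-gateway>/logs/v1/tool/wm-lol/logs
# Every response is wrapped in the meta-messages envelope:
response='{"messages": {"info": ["deployment started correctly"], "warning": [], "error": []}}'
# The cli extracts these and shows them in colors; the same extraction
# done by hand from a shell:
echo "$response" | python3 -c 'import json, sys; print(json.load(sys.stdin)["messages"]["info"][0])'
```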
How

It should be part of the toolforge set of services, this means it should be deployed in a similar way (using toolforge-deploy), using repos under the toolforge group in gitlab, similar CI/testing/etc. that the other toolforge services use.

It should preferably be deployable in lima-kilo too, as much as makes sense (up to the implementer on where to put that line, ex. just adding local storage vs adding local s3 integration, ...).

K8s specifics

K8s stores logs per container: a dir per pod, a dir per container, and a logfile inside /var/log/pods – users can only see these via 'toolforge webservice logs' / 'toolforge jobs logs', which run 'kubectl logs' under the hood.
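To make that layout concrete, here is a runnable sketch that fakes the on-node directory structure and reads it back the way the kubelet does for kubectl logs (all names are made up, and the exact path pattern can vary by k8s version):

```shell
# Fake the /var/log/pods layout described above:
#   /var/log/pods/<namespace>_<pod>_<uid>/<container>/<restart-count>.log
root=$(mktemp -d)
mkdir -p "$root/tool-example_mypod-abc123_0000-1111/webservice"
printf '2025-01-01T00:00:00Z stdout F hello from the tool\n' \
    > "$root/tool-example_mypod-abc123_0000-1111/webservice/0.log"

# 'kubectl logs' (and thus 'toolforge webservice logs') ultimately
# serves the content of these per-container files:
line=$(cat "$root"/tool-example_*/webservice/0.log)
echo "$line"
```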

Other related tasks:

Event Timeline

dcaro renamed this task from Provide modern, non-NFS error log solution for Toolforge webservices and bots to [toolforge] Provide modern, non-NFS log solution for Toolforge webservices and bots.Feb 21 2024, 10:18 AM
dcaro renamed this task from [toolforge] Provide modern, non-NFS log solution for Toolforge webservices and bots to [toolforge,jobs-api,webservice,storage] Provide modern, non-NFS log solution for Toolforge webservices and bots.Mar 5 2024, 4:10 PM

For anyone curious

for namespace in $(kubectl get ns | tail -n +2 | awk '{print $1}') ;
do
    for pod in $(kubectl get pods -n ${namespace} | tail -n +2 | awk '{print $1}') ;
    do
        kubectl -n ${namespace} logs ${pod} --all-containers --since=24h
    done
done

Suggests that there are about 500 megabytes of logs in the last twenty-four hours, which in turn suggests that a monolithic Loki could work.
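For an actual number rather than an eyeballed one, the same loop can accumulate byte counts; a runnable sketch with a stand-in for the kubectl call (so it needs no cluster):

```shell
total=0
for pod in pod-a pod-b ; do
    # stand-in for: kubectl -n ${namespace} logs ${pod} --all-containers --since=24h
    bytes=$(printf 'line one\nline two\n' | wc -c)
    total=$((total + bytes))
done
echo "${total} bytes in the last 24h"
```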

I fear that this query is missing most of the logs. It ignores most toolforge-jobs jobs that are configured to log onto an NFS file, as well as any cron jobs that are not running at that specific time.

Fair enough, do you have any estimate on how much those logs would account for in a day?

dcaro renamed this task from [toolforge,jobs-api,webservice,storage] Provide modern, non-NFS log solution for Toolforge webservices and bots to [toolforge,jobs-api,webservice,storage] Provide modern, non-NFS log solution for Toolforge tools.Jan 8 2025, 9:11 AM
dcaro updated the task description. (Show Details)

Just updated the task requirements a bit to reflect the current status of Toolforge. We might want to create a proposal on how to expose the service to users (what cli commands to implement, api calls, what to show, etc.); let me know if you want me to bootstrap it.

Just out of curiosity, what's wrong with setting up a syslog server and letting all the tools log to it?

Toolforge is a multi-user, multi-tenant environment where there is no trust guarantee between the users. If everyone could see the logs from every tool there is a non-trivial chance that secrets would be exposed. It is not uncommon for secrets to end up in stack traces from application crashes as an example.

hey @rook I see some activity happening on T386480: [o11y,logging,infra] Deploy Loki to store Toolforge tool log data, and I'm curious about the architecture of this system, and have a few questions.

  • Most notably promtail is now deprecated in favor of grafana alloy
    • this implies a significant architecture change: from a sidecar container approach, to a centralized "log discovery & scraping" approach
    • this may be a good thing, as the notes from Taavi mentioned the need to create a mutating webhook, which would no longer exist :-)
  • Storage: I'm assuming openstack S3 will be used as the destination for loki to store logs
    • Would you create a bucket per tool? or a single bucket for all tools? Do you have any thoughts on this?

I had read those notes, though I haven't considered them too much at this point. Currently I'm only working to get Loki into lima-kilo to see what happens in a live environment.

>   • Most notably promtail is now deprecated in favor of grafana alloy
>     • this implies a significant architecture change: from a sidecar container approach, to a centralized "log discovery & scraping" approach
>     • this may be a good thing, as the notes from Taavi mentioned the need to create a mutating webhook, which would no longer exists :-)
>   • Storage: I'm assuming openstack S3 will be used as the destination for loki to store logs
>     • Would you create a bucket per tool? or a single bucket for all tools? Do you have any thoughts on this?

Haven't considered this part thus far no. Current focus is getting loki into lima-kilo which I don't believe supports s3, no?

Indeed I was working off of a different set of Loki documentation, https://grafana.com/docs/loki/latest/get-started/deployment-modes/, which suggests monolithic is fine up to 20 GB/day, and it is unclear to me that we process more than that. Though at this point the intention is only to get a prototype working in lima-kilo, so it can easily be changed should the real deployment have different requirements.

All in all my hope is to get the project moving, as there has been sporadic discussion over the nine years that this ticket has been open, but of course things have changed in that time and fresh opinions are needed. Considering that my patches caught your attention and commentary, it appears that I was successful :)

> Haven't considered this part thus far no. Current focus is getting loki into lima-kilo which I don't believe supports s3, no?

It would be possible to connect to s3 from loki in lima-kilo if we wanted (ex. creating buckets on toolsbeta repo, or a specific project for lima-kilo), though I'd start without that connection and add it only later if needed.
Maybe test it manually once before deploying on toolsbeta if it makes it easier to develop, but not enabled by default in lima-kilo.

> Indeed I was working off of a different set of loki documentation https://grafana.com/docs/loki/latest/get-started/deployment-modes/ that suggests monolithic is fine up to 20gb/day, which it is unclear to me that we process more of that. Though at this point the intention is only to get a prototype working in lima-kilo, as such can easily be changed should the deployment install have different requirements.

+1 on starting simple and moving to the distributed deployment when we find the simple one is not enough (given our current guess), unless you find that using the monolithic one prevents us from using the distributed one somehow.

> Haven't considered this part thus far no. Current focus is getting loki into lima-kilo which I don't believe supports s3, no?

We don't have S3 in lima-kilo, but there are a number of options to emulate an S3 endpoint for development purposes (minio, openstack swift, localstack, https://s3ninja.net/, etc.)

I guess the right choice for lima-kilo depends on how the store will be used, and how we can best represent the final production Toolforge architecture (i.e., if we are going to use openstack swift in production, it may make sense to introduce swift to lima-kilo).

I think I'm feeling inclined towards each tool having a dedicated S3 bucket for logs. Feels easier for us, to control quotas, to implement whatever multi-tenancy controls, to do management etc.

> All in all my hope is to get the project moving. As there has been sporadic discussion over the last nine years that this ticket has been open, but of course things have changed in that time and recent opinions are needed. Considering that my patches caught your attention and commentary it appears that I was successful :)

As you figured, installing loki was the easy part. Doing the rest of the architecture may be a bit more challenging :-P

fwiw, while I haven't read all of the recent discussion, I'm still very interested in this project and would be happy to spend some time on it. I'm not touching anything atm because I don't want to step on anyone's toes, since you all seem to already be working on it, but please do let me know if there's some way I can be helpful here.

We discussed loki storage a bit, it seems like we can get a decent prototype in lima-kilo with https://grafana.com/docs/loki/latest/operations/storage/tsdb/ and a local filesystem; does that seem like an OK place to start?

Then we can either create a toy s3 implementation in lima-kilo or just add a switch that uses a different backend in tools/toolsbeta.
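A minimal sketch of what that could look like, assuming a monolithic Loki with the TSDB index and a local filesystem object store (all values are illustrative and untested, not tied to any actual deployment):

```yaml
# Sketch: monolithic Loki, TSDB index, local filesystem storage
auth_enabled: true            # multi-tenant mode, e.g. one tenant per tool (assumption)
server:
  http_listen_port: 3100
common:
  path_prefix: /loki
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory
schema_config:
  configs:
    - from: "2025-01-01"
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h
storage_config:
  tsdb_shipper:
    active_index_directory: /loki/tsdb-index
    cache_location: /loki/tsdb-cache
  filesystem:
    directory: /loki/chunks
```

Swapping the filesystem object store for S3 later would mostly mean changing object_store and storage_config, which keeps the lima-kilo vs tools/toolsbeta switch small.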

I have been thinking about this. What if a design & architecture document is created before building a prototype?

This can be a wikitech page (inspiration), which describes every aspect of the logging system we are going to build, and how the different pieces of software will be glued and work together.

Some of the questions I think are important:

  • How is storage going to be implemented (s3 per-tool bucket, per-job bucket? shared bucket for all tools?)
  • How will multi-tenancy security & isolation be implemented in the ingestion, storage and consumption sides? Ex. What controls will be in place to prevent tool-X from reading logs owned by tool-Y.
  • How ingestion is going to be configured. I mentioned here https://phabricator.wikimedia.org/T127367#10562687 the move from a sidecar container approach to a centralized "log discovery & scraping" approach
  • Will we need any new admission webhooks?
  • What changes will be needed for other Toolforge components? For example, will jobs-api need to inject additional config for jobs being created? if so, what?

In my opinion, this design & architecture phase should be completed before any implementation work is started. The work that needs to be done at this stage is this design.

There's a balance to be struck between defining too much and not defining enough. Currently we have some idea of what tools can be used for this (ex. loki) to start trying to create a POC, even if implementation details are not yet clear.

In that regard and at this point, I think that going forward with the POC will answer those questions (and more!) more effectively than just reading documentation and gluing all the theoretical pieces together in a wiki.

Similar to what was done by Taavi to figure out what's in the wiki page he wrote.

That does not mean that everything is figured out, or that the POC will be shipped to production as-is; it just means that the POC will help us learn by doing, and avoid some theoretical back-and-forth by leaving only the possibilities that are actually doable (iterate instead of making a big investment up front).

Note that that is just my opinion, I'm ok if @rook prefers going with the extra definition route instead.

Per the last cloud-services-team meeting, the rough plan here is:

  • Finish deploying Loki to store all the Kubernetes pod output created by tools (this is T386480)
  • Swap jobs-api to query logs from Loki instead of from Kubernetes (this is T398645)
  • Then we can remove support for file logging from jobs-api and migrate existing tools using it to Loki.

The unified logging service currently described in the task description is not in scope for the initial implementation.

Restoring the unified logging service description; it should not be lost, and should be tackled soon after (not in the initial implementation, but as the next step)

group_203_bot_f4d95069bb2675e4ce1fff090c1c1620 opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1001

api-gateway: bump to 0.0.81-20251016082112-1c4f5a64

Change #1197587 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] p:toolforge::prometheus: add logs api

https://gerrit.wikimedia.org/r/1197587

Change #1197587 merged by David Caro:

[operations/puppet@production] p:toolforge::prometheus: add logs api

https://gerrit.wikimedia.org/r/1197587