
Add commands to `webservice` and `jobs` to query logs from Kubernetes
Closed, ResolvedPublic

Description

`toolforge webservice` and `toolforge jobs` should have commands that read the pod output logs from Kubernetes, either directly or via the Jobs API.

Event Timeline

Change 916791 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:toolforge: install toolforge-logs-cli

https://gerrit.wikimedia.org/r/916791

define $ITEM = [webservice, job, build, etc.]

In https://gitlab.wikimedia.org/repos/cloud/toolforge/logs-cli/-/merge_requests/1 there was a comment by @dcaro questioning where to introduce/couple the logs command: basically, whether the "show me the logs" action should be a property of each item, or an action on its own.
I think this is the right moment to think twice about the semantics we would be exposing to users.

The initial implementation from @taavi is something like:

  • toolforge logs -j <job> [<someoptionalfilter>]
  • toolforge logs -w <webservice> [<someoptionalfilter>]

If I understand correctly what @dcaro was suggesting, this would be a bit different, in the direction of:

  • toolforge job logs <jobname> [<someoptionalfilter>]
  • toolforge webservice logs [<someoptionalfilter>]
  • toolforge build logs [<someoptionalfilter>]

I get @dcaro's point. For example, for the job case it would make sense to hit <tab> and see all the subcommands (or actions) available for a job: run/show/list/delete/logs, etc.

However, one of the ideas I had for a top-level toolforge logs entry point was to collapse the logs for all items into a single place. We don't have to implement everything today, but imagine something similar to journalctl, in which you see everything log-related for every item in your tool (webservice, build, jobs) in a time-sorted fashion. At that point, the log is no longer the property of an individual item, but a tool-level thing.

Moreover, we could have both! As I'm writing this, I'm realizing that, for example, logs can be seen in both systemctl status <service> and journalctl, and the world is happy with that. So we can have a per-item logs action (implicitly filtered to that item), and a tool-level logs action (that shows everything unless explicitly filtered).

Thoughts?

BTW: the Kubernetes API can show logs with timestamps (https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-v1/#http-request-2), which means that merging log events from different items and sorting them by timestamp should be doable, even in the --follow case.
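
For example, something like this rough sketch (using the upstream kubernetes Python client; the pod and namespace names below are made up):

from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config()
core = client.CoreV1Api()

def timestamped_lines(pod: str, namespace: str) -> list[tuple[str, str]]:
    # With timestamps=True, each returned line starts with an RFC3339 timestamp.
    raw = core.read_namespaced_pod_log(pod, namespace, timestamps=True)
    lines = []
    for line in raw.splitlines():
        ts, _, msg = line.partition(" ")
        lines.append((ts, msg))
    return lines

# Merging log events from different items is then just a sort on the
# timestamp prefix (RFC3339 UTC timestamps sort lexicographically):
merged = sorted(
    timestamped_lines("myjob-28383-abcde", "tool-mytool")
    + timestamped_lines("mytool-web-5d9f8", "tool-mytool")
)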

The initial implementation from @taavi is something like:

  • toolforge logs -j <job> [<someoptionalfilter>]
  • toolforge logs -w <webservice> [<someoptionalfilter>]

Minor correction: -w does not take an argument. It's theoretically equivalent to -j $TOOL_NAME but that feels like an implementation detail we want to hide from our users.

But otherwise you're pretty much capturing my thoughts about this. I like the idea of a single "give me all the logs" command, but I do see the value in having a logs action available in the toolforge jobs and toolforge webservice hierarchies where we already have other actions available. And even for the latter option, I think it makes more sense to centralize the code in a single repository (which can then be imported by the other CLI components) than to develop separate implementations.

What about making all this log logic an API from the get-go, so that it would be trivial to fetch logs from whatever command line (or other client, like a web page)?

I had missed this task!

The initial implementation from @taavi is something like:

  • toolforge logs -j <job> [<someoptionalfilter>]
  • toolforge logs -w <webservice> [<someoptionalfilter>]

Minor correction: -w does not take an argument. It's theoretically equivalent to -j $TOOL_NAME but that feels like an implementation detail we want to hide from our users.

But otherwise you're pretty much capturing my thoughts about this. I like the idea of a single "give me all the logs" command, but I do see the value in having a logs action available in the toolforge jobs and toolforge webservice hierarchies where we already have other actions available. And even for the latter option, I think it makes more sense to centralize the code in a single repository (which can then be imported by the other CLI components) than to develop separate implementations.

What does "give me all the logs" mean here?

All the logs for all the jobs + all the webservices + all the builds + (any other service we might implement)? Only the logs for the pods (something the users should not need to understand, as that's an implementation detail)? The logs of the jobs (file and pod)?

That seems a bit too broad to me, but I could see that working if you could filter by "application" (job instance/webservice instance/build instance)

I like the idea of the top-level logs command, but I think it should be opened up to users only after we have central logging implemented, allowing each subcommand to reuse that logic.

Until then, each subcommand has its own way of doing this (possibly only slightly different), and it's tied to the implementation of the service underneath (pods + file logs + triggering events for jobs, proxy + pod logs for webservice, Tekton + pod logs for build), so I think those belong within each subcommand's code.

In the future, when we start building the central logging, we can just alias those subcommand logs actions (e.g. toolforge webservice logs) to the generic logs command (toolforge logs), transparently for the user.

If we start the other way around, we have a period of intertwined code where we are force-coupling the general logs subcommand (toolforge logs) to each and every service it fetches logs for (build, jobs, webservice), without getting much in return.

And even for the latter option, I think it makes more sense to centralize the code in a single repository (which can then be imported by the other CLI components) than to develop separate implementations.

Most of the shared code here is just how to do the equivalent of kubectl logs, and maybe how to format the output.

I think that should be OK to put in libraries, either our own or (I strongly encourage) upstream ones (e.g. the Kubernetes client libraries).

What about making all this log logic an API from the get-go, so that it would be trivial to fetch logs from whatever command line (or other client, like a web page)?

For now, the way to fetch logs for each service is quite specific to that service's implementation, so any change to a service's inner workings will force a paired change to the log service, deployed in an orchestrated manner (the two will be coupled).

Emphasizing once again: once we have a way to just send logs to the service and filter them later (so the log service does not depend on the implementation of the others), that would be the way to go.

Moreover, we could have both! As I'm writing this, I'm realizing that, for example, logs can be seen in both systemctl status <service> and journalctl, and the world is happy with that. So we can have a per-item logs action (implicitly filtered to that item), and a tool-level logs action (that shows everything unless explicitly filtered).

This is a good example of that implementation process. Imagine if journalctl had to know in advance how and where each service stores its logs; it would be a mess of code! Instead, the implementation creates a logging system that every service uses, and journalctl just polls that system through a common filter interface.

Now each service uses a common API to send its logs to the journal system, and journalctl easily uses that same system to pull them out.
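
To make the analogy concrete, a toy sketch (illustrative only, not real systemd or Toolforge code) of that decoupling: every service writes through one common interface, and the reader only ever knows that interface:

import time
from dataclasses import dataclass, field

@dataclass
class Journal:
    records: list = field(default_factory=list)

    def write(self, unit: str, message: str) -> None:
        # Every service sends its logs in through the same call...
        self.records.append((time.time(), unit, message))

    def read(self, unit: str | None = None) -> list:
        # ...and the "journalctl" side is one common filter interface.
        return [r for r in self.records if unit is None or r[1] == unit]

journal = Journal()
journal.write("webservice", "GET / 200")
journal.write("job.migrate-db", "migration finished")
journal.read()                    # everything, like a bare journalctl
journal.read(unit="webservice")   # like journalctl -u webservice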

What does "give me all the logs" mean here?

All the logs for all the jobs + all the webservices + all the builds + (any other service we might implement)? Only the logs for the pods (something the users should not need to understand, as that's an implementation detail)? The logs of the jobs (file and pod)?

That seems a bit too broad to me, but I could see that working if you could filter by "application" (job instance/webservice instance/build instance)

I'd start with all logs created by the tool code (= jobs and web services) that are not stored elsewhere, like in files. That leaves us with container output logs, i.e. what the tool can currently query from running pods.

I feel like we should not integrate reading log files from NFS into any logs (sub)commands. They don't have metadata like the timestamps that other log stores have, and there are already standard, non-Kubernetes-specific tools (tail, less, cat, etc.) to work with them.

Build logs feel different enough from the logs generated by the tool when it's running that I don't think they should be displayed unless the user explicitly asks for them.

This pretty much means we expose an interface like "send logs to container stdout and have it visible in toolforge logs". Why should it matter whether those logs are stored and queried from the Kubernetes nodes directly or from a separate log storage service?
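
To illustrate the tool side of that contract (nothing Toolforge-specific, just standard Python):

import logging

# Python's default logging handler writes to stderr; Kubernetes captures both
# stdout and stderr per container, regardless of where the logs end up stored.
logging.basicConfig(level=logging.INFO)
logging.info("processed 42 items")  # later visible via the logs command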

Until then, each subcommand has its own way of doing this (possibly only slightly different), and it's tied to the implementation of the service underneath (pods + file logs + triggering events for jobs, proxy + pod logs for webservice, Tekton + pod logs for build), so I think those belong within each subcommand's code.

Pretty much the only difference at this point is which selectors to use to find the Kubernetes pods:

from typing import Any, Dict

def _create_k8s_selector(selector: LogSelector) -> Dict[str, Any]:
    # Webservice pods are labeled by the webservice machinery itself;
    # job pods carry the job's name as a label.
    if selector.webservice:
        return {"app.kubernetes.io/managed-by": "webservice"}
    return {"app.kubernetes.io/name": selector.job_name}

What about making all this log logic an API from the get-go, so it should be trivial to just fetch logs from whatever command line (or other client, like a web page).

By an API, do you mean a Python library that can be used by all the CLI tools, or an HTTP API (like jobs-api)?

This also makes me wonder if we should make a more general Toolforge Python library to be shared across all of the CLI tools, including stuff like the K8s client class which we've been copy-pasting everywhere.
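
Something as small as this, say (names and paths here are just illustrative; tool accounts authenticate to the Kubernetes API with per-tool client certificates):

from pathlib import Path

import requests

class K8sClient:
    """Thin wrapper around the Kubernetes HTTP API using the tool's client cert."""

    def __init__(self, server: str, cert_dir: Path):
        self.server = server
        self.session = requests.Session()
        # requests accepts a (cert, key) tuple for TLS client authentication.
        self.session.cert = (
            str(cert_dir / "client.crt"),
            str(cert_dir / "client.key"),
        )

    def get(self, path: str, **params) -> dict:
        response = self.session.get(f"{self.server}{path}", params=params)
        response.raise_for_status()
        return response.json()

# e.g. K8sClient(server, cert_dir).get("/api/v1/namespaces/tool-mytool/pods")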

By an API, do you mean a Python library that can be used by all the CLI tools, or an HTTP API (like jobs-api)?

A REST API, so HTTP like in jobs-api.

This also makes me wonder if we should make a more general Toolforge Python library to be shared across all of the CLI tools, including stuff like the K8s client class which we've been copy-pasting everywhere.

Yes, and moreover @dcaro would suggest using the upstream k8s library to build that.

In T336057#8832987, @taavi wrote:
I'd start from all logs created from the tool code (= jobs and web services) that is not stored elsewhere like in files. That leaves us with container output logs, i.e. what the tool can currently query from running pods.

Yes, that is good; that is the feature the users don't have anymore, so let's build this!
But let's build it in a way that will be easier to maintain and evolve.

In T336057#8832987, @taavi wrote:
This pretty much means we expose an interface like "send logs to container stdout and have it visible in toolforge logs". Why should it matter whether those logs are stored and queried from the Kubernetes nodes directly or from a separate log storage service?

It does not, and it should not; that is good, and I like the abstraction. The problem is that we are coupling that abstraction with the others, when we could instead decouple it so they can evolve independently. If for whatever reason we change the way webservices work, and instead of one pod they have many, or none, then we will have to change this service too; that is how I measure coupling.

So in a way, by having a top-level logs service that is aware of the jobs and webservice implementations, we are depending on those logs being in Kubernetes in a very specific way, something we could avoid.

By an API do you mean a Python library that can be used by all the CLI tools or a HTTP API (like jobs-api)?

A REST API, so HTTP like in jobs-api.

I mean any system that is not coupled with the other services (so a system by itself). If it's an HTTP API, REST or not does not matter that much.
To the point that if the jobs API changes the way it's implemented (more pods, fewer pods, no pods, etc.), the logs system should not have to change.

This means that it would be the jobs API that uses the logs system, not the logs system that uses the jobs API.

In T336057#8832987, @taavi wrote:
Pretty much the only difference at this point is which selectors to use to find the Kubernetes pods:

Only if you do not take into account any logs generated by any other service or system besides running pods (which, again, is a good start, but it's just a start; when we evolve it, that will have to change).

Side note: a central log system to collect all logs and later expose them all behind a common interface won't happen anytime soon... (or at least, to my knowledge, we in WMCS haven't planned for it this Q or even next FY).

This also makes me wonder if we should make a more general Toolforge Python library to be shared across all of the CLI tools, including stuff like the K8s client class which we've been copy-pasting everywhere.

Yes, and moreover @dcaro would suggest using the upstream k8s library to build that.

The main problem with that is keeping the Debian packages of the upstream library and its dependencies up to date on stable or oldstable. But that's starting to go very off-topic here...

Side note: a central log system to collect all logs and later expose them all behind a common interface won't happen anytime soon... (or at least, to my knowledge, we in WMCS haven't planned for it this Q or even next FY).

I'm hoping to work on that once we have an object storage service. But yes, we should for now assume that it'll take a while.


To summarize (please correct me if I'm interpreting anyone or anything wrong):

  • We want to implement a logging system for buildpack based tools that does not involve logging to files on NFS. The primary way to interact with this system should not be manually using kubectl logs.
  • Buildpack based tools can/will output their logs to standard output of the process running in the pod. Currently Kubernetes exposes an API to read those logs from running containers, and it is possible to build a more persistent logging storage solution later that will use the same interface for tools to output their logs.
  • We like logs subcommands in the existing CLIs, for example webservice logs for querying logs for the web service, or toolforge jobs logs <job-name> for logs for a specific toolforge-jobs managed job.
  • Having a single command to query the logs from all jobs and webservices is a nice-to-have, but it is not a requirement for the initial implementation.
  • We want to avoid any major changes to the interface (both for tool code to write logs, and for the user reading the logs) after the initial implementation, even if/when implementing a new system for log storage (like Loki).
  • We want to avoid code duplication if possible.

Based on those I propose the following way to move forward:

  • We create a new Python library, toolforgelib, for shared code across Toolforge infrastructure components.
  • We move the Kubernetes API client class that we've been copy-pasting around to the new library.
  • In the new library, we also create an interface to query logs based on Kubernetes pod labels. Initially it's implemented to query the logs from the Kubernetes API, but it will be possible to swap the implementation underneath when we implement a different log storage system (a sketch of such an interface follows this list).
  • We implement a webservice logs subcommand to query logs from any webservice managed pods using the new library.
  • We implement a similar subcommand to toolforge-jobs to query logs from pods belonging to a specific toolforge-jobs managed job. I think this means implementing a new method in jobs-api for these requests, instead of having the CLI query Kubernetes directly.
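
As a sketch of what that library interface could look like (all names here are hypothetical, not a final API), the idea is that callers ask for logs by pod labels and never care which backend answers:

from typing import Dict, Iterator, Protocol

class LogSource(Protocol):
    def query(self, labels: Dict[str, str], follow: bool = False) -> Iterator[str]:
        """Yield log lines for anything matching the given pod labels."""

class KubernetesLogSource:
    """Initial backend: read container output straight from the Kubernetes API.
    A future log-store backend can replace this without touching the CLIs."""

    def __init__(self, k8s_client):
        self.k8s = k8s_client

    def query(self, labels: Dict[str, str], follow: bool = False) -> Iterator[str]:
        # pods_matching() and pod_log_lines() stand in for whatever the shared
        # Kubernetes client class ends up exposing.
        for pod in self.k8s.pods_matching(labels):
            yield from self.k8s.pod_log_lines(pod, follow=follow)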

Thoughts?

The main problem with that is keeping the Debian packages of the upstream library and its dependencies up to date on stable or oldstable. But that's starting to go very off-topic here...

That is only if you keep the logic in the client; if you move the logic to the APIs, we can use pip to install the upstream library at build time ;)
(and potentially our own libs too!)

To summarize (please correct me if I'm interpreting anyone or anything wrong):

  • We want to implement a logging system for buildpack based tools that does not involve logging to files on NFS. The primary way to interact with this system should not be manually using kubectl logs.
  • Buildpack based tools can/will output their logs to standard output of the process running in the pod. Currently Kubernetes exposes an API to read those logs from running containers, and it is possible to build a more persistent logging storage solution later that will use the same interface for tools to output their logs.
  • We like logs subcommands in the existing CLIs, for example webservice logs for querying logs for the web service, or toolforge jobs logs <job-name> for logs for a specific toolforge-jobs managed job.

This is mainly a temporary way of avoiding moving service-specific logic from the subcommands to a top-level command. That is also why the next move for toolforge-cli is to become a thin, agnostic client, and to move toolforge-build out of it into its own component, as we discussed several times.

  • Having a single command to query the logs from all jobs and webservices is a nice-to-have, but it is not a requirement for the initial implementation.
  • We want to avoid any major changes to the interface (both for tool code to write logs, and for the user reading the logs) after the initial implementation, even if/when implementing a new system for log storage (like Loki).
  • We want to avoid code duplication if possible.

Based on those I propose the following way to move forward:

  • We create a new Python library, toolforgelib, for shared code across Toolforge infrastructure components.
  • We move the Kubernetes API client class that we've been copy-pasting around to the new library.
  • In the new library, we also create an interface to query logs based on Kubernetes pod labels. It's currently implemented to query the logs from the Kubernetes API, but it will be possible to swap the implementation underneath when we implement a different log storage system.
  • We implement a webservice logs subcommand to query logs from any webservice managed pods using the new library.
  • We implement a similar subcommand to toolforge-jobs to query logs from pods belonging to a specific toolforge-jobs managed job. I think this means implementing a new method in jobs-api for these requests, instead of having the CLI query Kubernetes directly.

Thoughts?

This sounds good to me yes 👍

Based on those I propose the following way to move forward:

I agree with the proposal. I think I still want the toolforge logs interface to behave as a bare journalctl and show everything, but I guess the work as proposed won't prevent that from happening in the future.

aborrero triaged this task as Medium priority. May 8 2023, 2:26 PM

A suggestion/request: I don't like the toolforgelib name. It is too generic and may confuse people (including ourselves).

I'd say either pick a more specific name like toolforge-internal-system-library (or similar) or introduce some random keyword of your liking (a planet name, an object, your favorite band, whatever)

I'd say either pick a more specific name like toolforge-internal-system-library (or similar) or introduce some random keyword of your liking (a planet name, an object, your favorite band, whatever)

My 2c: I find random stuff way more confusing than the current name.

Some other options, if you want to consider new names (I'm OK with the current one):

  • toolforge_plumbing
  • toolforge_common
  • toolforge_internal
  • toolforge_cli_common
  • toolforge_cli_libs

Good point about the current one being maybe too confusing, considering there's already a Python library called toolforge. I'm fine with anything that's not toolforge-internal-system-library (way too long, sorry!). I like toolforge_plumbing, although I wonder if we could pick something related to the forging/blacksmith theme that Striker already uses; that would combine the 'descriptive name' approach (a toolforge_ prefix) with a random semi-related word, to prevent confusion and also prevent the name becoming outdated in the future. Some ideas:

  • toolforge_weld (you can combine the library with your code to more easily build things)

+1 to this one.

I like the weld one; maybe toolforge_welding is better? (things you use to weld stuff)

For whatever reason I like plain toolforge_weld better so I'm going to go with it.

Change 917824 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/software/tools-webservice@master] Use toolforge_weld

https://gerrit.wikimedia.org/r/917824

Change 917825 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/software/tools-webservice@master] Add logs action

https://gerrit.wikimedia.org/r/917825

Change 917824 merged by jenkins-bot:

[operations/software/tools-webservice@master] Use toolforge_weld

https://gerrit.wikimedia.org/r/917824

Change 916791 abandoned by Majavah:

[operations/puppet@production] P:toolforge: install toolforge-logs-cli

Reason:

going with a different approach, see task

https://gerrit.wikimedia.org/r/916791

Change 917825 merged by jenkins-bot:

[operations/software/tools-webservice@master] Add logs action

https://gerrit.wikimedia.org/r/917825

Change 921412 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/toolforge/jobs-framework-cli@master] jobs-framework-cli: use toolforge-weld

https://gerrit.wikimedia.org/r/921412

Change 921412 merged by Arturo Borrero Gonzalez:

[cloud/toolforge/jobs-framework-cli@master] jobs-framework-cli: use toolforge-weld

https://gerrit.wikimedia.org/r/921412

Change 963295 had a related patch set uploaded (by Majavah; author: Majavah):

[cloud/toolforge/jobs-framework-cli@master] Add support for querying logs

https://gerrit.wikimedia.org/r/963295

taavi renamed this task from Build CLI tool for querying tool logs to Add commands to `webservice` and `jobs` to query logs from Kubernetes. Oct 4 2023, 12:20 PM
taavi updated the task description.
taavi moved this task from Next Up to In Review on the Toolforge (Toolforge iteration 00) board.

Change 963295 merged by jenkins-bot:

[cloud/toolforge/jobs-framework-cli@master] Add support for querying logs

https://gerrit.wikimedia.org/r/963295