Explore monitoring for the GitLab runners k8s cluster
Open, Needs Triage, Public

Assigned To

None

Authored By

	thcipriani
	May 1 2024, 4:05 PM

Description

The goal of this task is to investigate the Prometheus and Grafana instances running within the GitLab runners k8s cluster on DigitalOcean. As an output, we document:

What's running
How to access it
What metrics prometheus is collecting
What's missing, with a recommendation
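
As a starting point for the "what metrics is Prometheus collecting" item, here is a minimal sketch that lists metric names through the Prometheus HTTP API endpoint `/api/v1/label/__name__/values`. The base URL (for example a local port-forward on `http://localhost:9090`) and the helper names are assumptions for illustration, not details of the actual cluster.

```python
import json


def metric_names_url(base_url):
    """Build the URL of the Prometheus endpoint that lists all metric names.

    base_url is hypothetical, e.g. a local port-forward to the in-cluster
    Prometheus service.
    """
    return base_url.rstrip("/") + "/api/v1/label/__name__/values"


def parse_metric_names(body):
    """Parse the JSON body returned by the label-values endpoint."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query failed: %r" % payload.get("error"))
    return sorted(payload["data"])


def group_by_prefix(names):
    """Group metric names by the text before the first underscore.

    This is only a rough way to see which exporters are present
    (node_*, kube_*, container_*, ...).
    """
    groups = {}
    for name in names:
        prefix = name.split("_", 1)[0]
        groups.setdefault(prefix, []).append(name)
    return groups
```

Fetching the body is left out on purpose; any HTTP client pointed at `metric_names_url(...)` would do, and the grouped result gives a quick inventory to paste into the documentation.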

Related Objects

Status	Assigned	Task
Open	None	T363919 Explore monitoring for the GitLab runners k8s cluster
In Progress	• brennen	T373548 Alerts for disk space issues in GitLab's Digital Ocean cluster
Open	None	T380680 Random GitLab CI failures with ContainersNotReady

Event Timeline

thcipriani created this task. · May 1 2024, 4:05 PM

Restricted Application added a subscriber: Aklapper. · May 1 2024, 4:05 PM

Aklapper added a project: observability. · May 1 2024, 4:30 PM

thcipriani moved this task from Backlog to Ready on the Release-Engineering-Team (Yakisfaction) board. · May 1 2024, 4:39 PM

lmata moved this task from Inbox to Radar on the observability board. · May 2 2024, 2:04 PM

thcipriani edited projects, added Release-Engineering-Team (Priority Backlog 📥); removed Release-Engineering-Team (Yakisfaction). · May 22 2024, 4:48 PM

• brennen moved this task from Inbox to Infrastructure on the GitLab board. · Jun 26 2024, 8:36 PM

• brennen edited projects, added GitLab (Infrastructure); removed GitLab.

• brennen moved this task from Infrastructure to CI & Job Runners on the GitLab board.

• brennen edited projects, added GitLab (CI & Job Runners); removed GitLab (Infrastructure).

thcipriani added a subtask: T373548: Alerts for disk space issues in GitLab's Digital Ocean cluster. · Sep 11 2024, 5:18 PM

@lmata I can't seem to find it now, but I remember someone mentioning Prometheus federation as a possible way to fold the monitoring we're doing into the nice things o11y has. Is anyone on your team already doing that? Do you have thoughts about doing it in this case?

• brennen changed the status of subtask T373548: Alerts for disk space issues in GitLab's Digital Ocean cluster from Open to In Progress. · Sep 11 2024, 10:51 PM

Hi @thcipriani, I've checked with the team, and we're familiar with federation but are unsure about the use case in question. Could you share some more context about the need? If it's useful, we can schedule a quick discussion and dig in further.

In T363919#10142129, @lmata wrote:

Hi @thcipriani, I've checked with the team, and we're familiar with federation but are unsure about the use case in question. Could you share some more context about the need? If it's useful, we can schedule a quick discussion and dig in further.

Sure! There are two main types of GitLab runners:

Trusted runners: physical hosts in our prod network
Shared runners: a mix of VMs/Kubernetes hosts running untrusted test workloads
- WMCS runners: VMs on WMCS using Docker
- DigitalOcean runners: a Kubernetes cluster that uses the GitLab Kubernetes runner

The DigitalOcean runners are running on managed Kubernetes. That cluster has a monitoring stack with Prometheus, Grafana, and Alertmanager.

The goal of this task is to be able to closely monitor testing activity on untrusted runners. Right now, our monitoring is fairly blunt, but we have much of the data needed to build more granular dashboards. We're struggling a bit with alerting from Alertmanager (duplicating what I imagine you already have set up with SMTP/IRC/etc.). I was curious about your opinion on federating our Prometheus and whether that would let us tie this information into our standard tooling: is that possible and desirable, or does it create more work than it saves?
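
For context on what federation would involve mechanically: a central Prometheus scrapes the `/federate` endpoint of another Prometheus, selecting series with one or more `match[]` parameters. The sketch below only builds such a URL; the base URL and the example selectors are hypothetical and would need to be replaced with whatever the DigitalOcean cluster actually exposes.

```python
from urllib.parse import urlencode


def federate_url(base_url, matchers):
    """Build a /federate URL selecting series by instant-vector selectors.

    base_url and the matchers passed in are placeholders; at least one
    matcher is required, since the endpoint returns nothing without one.
    """
    if not matchers:
        raise ValueError("at least one match[] selector is required")
    # Each selector becomes its own repeated match[] query parameter.
    query = urlencode([("match[]", m) for m in matchers])
    return "%s/federate?%s" % (base_url.rstrip("/"), query)


if __name__ == "__main__":
    # Hypothetical selectors: one job label and one metric-name pattern.
    print(federate_url(
        "http://prometheus.example.invalid:9090",
        ['{job="gitlab-runner"}', '{__name__=~"kube_pod_.*"}'],
    ))
```

Keeping the selector list narrow matters here, since everything matched is re-ingested by the scraping side.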

Jelto subscribed. · Sep 27 2024, 8:53 AM

Jelto added a subtask: T380680: Random GitLab CI failures with ContainersNotReady. · Nov 25 2024, 4:17 PM