Explore monitoring for the GitLab runners k8s cluster
Open, Needs Triage, Public

Description

The goal of this task is to investigate the Prometheus and Grafana instances running in the GitLab runners k8s cluster on DigitalOcean. As an output, we document:

  1. What's running
  2. How to access it
  3. What metrics prometheus is collecting
  4. What's missing, with a recommendation for filling the gaps
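
For items 1 and 3, one possible way to take the inventory is the Prometheus HTTP API, which can list every metric name the server currently knows about. The following is a minimal sketch, not a description of how the cluster is actually set up: it assumes the in-cluster Prometheus is reachable at a hypothetical `http://localhost:9090` (for example through a port-forward), and the helper names are illustrative.

```python
import json
import urllib.request

# Assumption: Prometheus is reachable here, e.g. via a local port-forward.
PROMETHEUS_URL = "http://localhost:9090"


def metric_names_url(base_url):
    """Return the HTTP API URL that lists the values of the __name__ label."""
    return base_url.rstrip("/") + "/api/v1/label/__name__/values"


def parse_metric_names(payload):
    """Extract the sorted metric names from a label-values API response."""
    if payload.get("status") != "success":
        raise RuntimeError("Prometheus query failed: %r" % (payload,))
    return sorted(payload["data"])


def fetch_metric_names(base_url=PROMETHEUS_URL):
    """Fetch and parse the metric names (needs network access to Prometheus)."""
    with urllib.request.urlopen(metric_names_url(base_url), timeout=10) as resp:
        return parse_metric_names(json.load(resp))
```

Calling `fetch_metric_names()` against a reachable server returns a sorted list of names that can be pasted into the documentation for item 3. The parsing step is kept separate from the network call so it can be checked against a saved response.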

Event Timeline

brennen edited projects, added GitLab (Infrastructure); removed GitLab.
brennen moved this task from Infrastructure to CI & Job Runners on the GitLab board.

@lmata I can't seem to find it now, but I remember someone mentioning Prometheus federation as a possible way to fold the monitoring we're doing into the nice things o11y has. Is anyone already doing that? Do you have thoughts about doing that in this case?

Hi @thcipriani, I've checked with the team, and we're familiar with federation but are unsure about the use case in question. Could you share some more context about the need? If it's useful, we can schedule a quick discussion and dig in further.

Sure! There are two main types of GitLab runners:

  • Trusted runners – physical hosts in our prod network
  • Shared runners – a mix of VMs/Kubernetes hosts running untrusted test workloads
    • WMCS runners – VMs on WMCS using Docker
    • DigitalOcean runners – a Kubernetes cluster that uses the GitLab Kubernetes runner

The DigitalOcean runners are running on managed Kubernetes. That cluster has a monitoring stack with Prometheus, Grafana, and Alertmanager.

The goal of this task is to be able to closely monitor testing activity on untrusted runners. Right now, our monitoring is fairly blunt, but we have much of the data needed to build more granular dashboards. We're struggling a bit with alerting from Alertmanager (duplicating what I imagine you already have set up with SMTP/IRC/etc.). I was curious about your opinions on federating our Prometheus: whether that would let us tie this information into our standard tooling, whether that's possible or desirable, and whether it creates more work than it saves.
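
For context on what federation would involve: a central Prometheus scrapes the `/federate` endpoint of the source Prometheus, passing one or more `match[]` selectors that pick which series to pull. A rough sketch of building such a request follows. The host name and the `job` label value are hypothetical placeholders, not the real cluster's settings.

```python
from urllib.parse import urlencode

# Assumption: address of the runner cluster's Prometheus (placeholder host).
SOURCE_PROMETHEUS = "http://prometheus.example.invalid:9090"


def federate_url(base_url, selectors):
    """Build a /federate URL; each selector becomes one match[] parameter."""
    if not selectors:
        raise ValueError("federation needs at least one match[] selector")
    query = urlencode([("match[]", s) for s in selectors])
    return base_url.rstrip("/") + "/federate?" + query


# Hypothetical selectors: the job label value is a placeholder.
url = federate_url(SOURCE_PROMETHEUS, ['{job="gitlab-runner"}', "up"])
print(url)
```

Fetching the resulting URL by hand is a cheap way to see which series a given selector would expose before anyone commits to a federation setup.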