
Create a custom GitLab Prometheus exporter
Closed, Resolved (Public)

Description

GitLab Omnibus packages an integrated Prometheus exporter. It exposes basic GitLab metrics but lacks more detailed ones. This gap was also mentioned in T347038, which asks for additional gitlab-runner metrics.

Besides additional gitlab-runner metrics (T347038), we identified further use cases, such as alerting on project sizes or on Trusted Runner config changes (T353271).

So we should evaluate how to create our own custom exporter that exports a small set of roughly 5-10 metrics. We can start with:

  • Protection flag of runners (ref_protected or unprotected)
  • Size of projects (repo, artifacts, packages)

We will likely find additional use cases or events that should trigger alerts, so some kind of extensible architecture would be good.
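
One possible shape for such an architecture, sketched here in Python with only the standard library, is a small registry of collector functions: each new use case becomes one more function, and the scrape handler only iterates over the registry. This is an illustration under assumptions, not the design of the actual exporter. The names `collector`, `COLLECTORS`, `render`, the metric `gitlab_project_size_bytes` and all sample values are hypothetical, and label-value escaping is omitted for brevity.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# A sample is (metric name, labels, value); a collector is any callable
# that returns an iterable of samples.
Sample = Tuple[str, Dict[str, str], float]
Collector = Callable[[], Iterable[Sample]]

# Hypothetical registry: every registered collector is run on each scrape.
COLLECTORS: List[Collector] = []


def collector(func: Collector) -> Collector:
    """Register a collector function and return it unchanged."""
    COLLECTORS.append(func)
    return func


@collector
def runner_protection() -> Iterable[Sample]:
    # Hypothetical static data; a real collector would query the GitLab API.
    yield ("gitlab_runners_up", {"id": "77", "access_level": "ref_protected"}, 1.0)


@collector
def project_sizes() -> Iterable[Sample]:
    # Hypothetical metric name and value, for illustration only.
    yield ("gitlab_project_size_bytes", {"project": "example", "kind": "repo"}, 1024.0)


def render() -> str:
    """Render the samples of all registered collectors as text-format lines."""
    lines = []
    for collect in COLLECTORS:
        for name, labels, value in collect():
            # Sort labels so the output is stable between scrapes.
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render(), end="")
```

With this layout, adding a metric means writing one decorated function; the rendering and serving code stays untouched.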

Event Timeline

LSobanski moved this task from Incoming to Backlog on the collaboration-services board.

A first metric is fetched successfully in the exporter:

curl -s 127.0.0.1:8000 | grep gitlab_projects_total
# HELP gitlab_projects_total Total number of GitLab project
# TYPE gitlab_projects_total gauge
gitlab_projects_total{instance="gitlab.devtools.wmcloud.org"} 89.0

See https://gitlab.wikimedia.org/repos/sre/gitlab-exporter.

GitLab Runner configuration values are available now in the exporter:

# HELP gitlab_runners Total number of GitLab runners
# TYPE gitlab_runners gauge
gitlab_runners{instance="gitlab.devtools.wmcloud.org"} 3.0
# HELP gitlab_runners_up Status of gitlab-runner
# TYPE gitlab_runners_up gauge
gitlab_runners_up{access_level="not_protected",description="Shared Runners, running in Wikimedia Cloud Services",id="76",instance="gitlab.devtools.wmcloud.org",locked="False",runner_type="group_type",tag_list="['wmcs']"} 1.0
gitlab_runners_up{access_level="ref_protected",description="Trusted Runners",id="77",instance="gitlab.devtools.wmcloud.org",locked="True",runner_type="project_type",tag_list="['trusted']"} 1.0
gitlab_runners_up{access_level="ref_protected",description="Trusted Dockerfile Runners",id="79",instance="gitlab.devtools.wmcloud.org",locked="True",runner_type="project_type",tag_list="['trusted', 'dockerfile']"} 1.0

The most interesting labels are access_level, locked and tag_list, which we can use to alert when a runner's config changes.
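
To make the idea concrete, here is a minimal Python sketch of the kind of check these labels allow. It assumes a policy that every runner whose tag_list contains 'trusted' must have access_level="ref_protected" and locked="True"; that policy, the function name `misconfigured_trusted_runners` and the drifted runner in the usage example are assumptions for illustration. The real alert would more likely be expressed as a Prometheus alerting rule than in Python.

```python
from typing import Dict, List


def misconfigured_trusted_runners(runners: List[Dict[str, str]]) -> List[str]:
    """Return the ids of trusted runners that violate the assumed policy.

    Each item in `runners` is the label set of one gitlab_runners_up sample.
    """
    bad = []
    for labels in runners:
        # tag_list is exported as a string such as "['trusted', 'dockerfile']".
        if "'trusted'" not in labels.get("tag_list", ""):
            continue  # not a trusted runner, the policy does not apply
        protected = labels.get("access_level") == "ref_protected"
        locked = labels.get("locked") == "True"
        if not (protected and locked):
            bad.append(labels["id"])
    return bad


if __name__ == "__main__":
    runners = [
        {"id": "76", "access_level": "not_protected", "locked": "False", "tag_list": "['wmcs']"},
        {"id": "77", "access_level": "ref_protected", "locked": "True", "tag_list": "['trusted']"},
        # Hypothetical drift: a trusted runner that lost its protection flag.
        {"id": "79", "access_level": "not_protected", "locked": "True",
         "tag_list": "['trusted', 'dockerfile']"},
    ]
    print(misconfigured_trusted_runners(runners))
```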

Change #1027234 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: add option to run a custom exporter

https://gerrit.wikimedia.org/r/1027234

Change #1027238 had a related patch set uploaded (by Jelto; author: Jelto):

[labs/private@master] gitlab: add dummy token for exporter

https://gerrit.wikimedia.org/r/1027238

Change #1027238 merged by Jelto:

[labs/private@master] gitlab: add dummy token for exporter

https://gerrit.wikimedia.org/r/1027238

Change #1027234 merged by Jelto:

[operations/puppet@production] gitlab: add option to run a custom exporter

https://gerrit.wikimedia.org/r/1027234

The exporter runs on the test instance now. I'll enable the exporter on the prod machines and add them to Prometheus next week.

Change #1029168 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: enable custom exporter on all instances

https://gerrit.wikimedia.org/r/1029168

Change #1029169 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] prometheus::ops: scrape custom gitlab exporter

https://gerrit.wikimedia.org/r/1029169

Change #1029168 merged by Dzahn:

[operations/puppet@production] gitlab: enable custom exporter on all instances

https://gerrit.wikimedia.org/r/1029168

Change #1031822 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: bump exporter version to v1.0.3

https://gerrit.wikimedia.org/r/1031822

Change #1031822 merged by Jelto:

[operations/puppet@production] gitlab: bump exporter version to v1.0.3

https://gerrit.wikimedia.org/r/1031822

Change #1032414 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: bump exporter version to v1.0.4

https://gerrit.wikimedia.org/r/1032414

Change #1032414 merged by Jelto:

[operations/puppet@production] gitlab: bump exporter version to v1.0.4

https://gerrit.wikimedia.org/r/1032414

Change #1034494 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: bump exporter version to v1.0.5

https://gerrit.wikimedia.org/r/1034494

Change #1034494 merged by Jelto:

[operations/puppet@production] gitlab: bump exporter version to v1.0.5

https://gerrit.wikimedia.org/r/1034494

The GitLab production instances use an old python3-gitlab package. The newest version available in bullseye is v2.5.0, whereas the newest upstream version is v4.5.0. This causes some issues with the runners API endpoints, especially the new ones introduced by the runner API refactoring. So a different API endpoint and token are needed to get all Trusted Runners. Fortunately this endpoint does not require admin privileges, so a token with fewer privileges can be used. I'll revoke the current token and add a new one to private puppet soon.

The exporter contains metrics for all Trusted, WMCS and Cloud Runners now:

# HELP gitlab_runners_up Status of gitlab-runner
# TYPE gitlab_runners_up gauge
gitlab_runners_up{access_level="not_protected",description="Shared Runners, running in Wikimedia Cloud Services",id="1479",instance="gitlab-replica.wikimedia.org",locked="False",runner_type="group_type",tag_list="['wmcs']"} 1.0
gitlab_runners_up{access_level="not_protected",description="Cloud Runners, running in Digital Ocean K8s",id="1480",instance="gitlab-replica.wikimedia.org",locked="False",runner_type="instance_type",tag_list="['kubernetes', 'cloud']"} 1.0
gitlab_runners_up{access_level="not_protected",description="Memory optimized Cloud Runners, running in Digital Ocean K8s",id="1481",instance="gitlab-replica.wikimedia.org",locked="False",runner_type="instance_type",tag_list="['kubernetes', 'cloud', 'memory-optimized']"} 1.0
gitlab_runners_up{access_level="not_protected",description="Staging Cloud Runners, running in Digital Ocean K8s",id="1482",instance="gitlab-replica.wikimedia.org",locked="False",runner_type="instance_type",tag_list="['staging']"} 1.0
gitlab_runners_up{access_level="not_protected",description="Memory optimized staging Cloud Runners, running in Digital Ocean K8s",id="1483",instance="gitlab-replica.wikimedia.org",locked="False",runner_type="instance_type",tag_list="['staging-memory-optimized']"} 1.0
gitlab_runners_up{access_level="ref_protected",description="Trusted Runners, running in production",id="1484",instance="gitlab-replica.wikimedia.org",locked="True",runner_type="project_type",tag_list="['trusted']"} 1.0
gitlab_runners_up{access_level="ref_protected",description="Trusted Dockerfile Runners, running in production",id="1504",instance="gitlab-replica.wikimedia.org",locked="True",runner_type="project_type",tag_list="['trusted', 'dockerfile']"} 1.0
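
When checking such output by hand or in a test, note that the description label contains commas and that tag_list holds the string form of a Python list. The following Python sketch reads one labelled sample line back into a name, a label dictionary and a value, and decodes tag_list with `ast.literal_eval`. The helper names `parse_sample` and `runner_tags` are hypothetical, and the regular expressions are a simplification that covers lines like the ones above, not the full Prometheus text format (escaped characters in label values are not unescaped).

```python
import ast
import re
from typing import Dict, List, Tuple

# One label pair: name="value", where the value may contain commas.
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')
# A labelled sample line: metric_name{labels} value
SAMPLE_RE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')


def parse_sample(line: str) -> Tuple[str, Dict[str, str], float]:
    """Split a labelled sample line into (name, labels, value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a labelled sample line: {line!r}")
    name, raw_labels, value = match.groups()
    # A naive split on "," would break the description label, so match
    # complete name="value" pairs instead.
    labels = dict(LABEL_RE.findall(raw_labels))
    return name, labels, float(value)


def runner_tags(labels: Dict[str, str]) -> List[str]:
    """Decode the tag_list label, e.g. "['trusted', 'dockerfile']"."""
    return list(ast.literal_eval(labels["tag_list"]))


if __name__ == "__main__":
    line = (
        'gitlab_runners_up{access_level="ref_protected",'
        'description="Trusted Dockerfile Runners, running in production",'
        'id="1504",instance="gitlab-replica.wikimedia.org",locked="True",'
        'runner_type="project_type",tag_list="[\'trusted\', \'dockerfile\']"} 1.0'
    )
    name, labels, value = parse_sample(line)
    print(name, labels["id"], runner_tags(labels), value)
```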

The next step is to enable the Prometheus scraping: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1029169

Change #1029169 merged by Jelto:

[operations/puppet@production] prometheus::ops: scrape custom gitlab exporter

https://gerrit.wikimedia.org/r/1029169

Change #1035370 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/alerts@master] sre: add alert for trusted gitlab-runner config

https://gerrit.wikimedia.org/r/1035370

Change #1035418 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: bump exporter version to v1.0.6

https://gerrit.wikimedia.org/r/1035418

Change #1035418 merged by Jelto:

[operations/puppet@production] gitlab: bump exporter version to v1.0.6

https://gerrit.wikimedia.org/r/1035418

Change #1035370 merged by jenkins-bot:

[operations/alerts@master] sre: add alert for trusted gitlab-runner config

https://gerrit.wikimedia.org/r/1035370

Change #1035746 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/alerts@master] gitlab: fix collaboration-services team name

https://gerrit.wikimedia.org/r/1035746

Unfortunately, the Python clients for the GitLab API and Prometheus are outdated in the Debian Bullseye packages, so I had to implement some workarounds in the exporter for missing features and API endpoints.

An alert that fires when a Trusted Runner has a wrong config is also live. While testing the new alert, I noticed we were still using serviceops-collab as our team name in the alerts repo, so the alert was not routed to our IRC channel or Phabricator board. The change above should fix this issue.

Change #1035746 merged by jenkins-bot:

[operations/alerts@master] gitlab: fix collaboration-services team name

https://gerrit.wikimedia.org/r/1035746

A quick test of the alert on one of the replicas worked, see T365802. I'll write some more documentation and then close the task.

Change #1036609 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: bump exporter version to v1.0.9

https://gerrit.wikimedia.org/r/1036609

Change #1036609 merged by Jelto:

[operations/puppet@production] gitlab: bump exporter version to v1.0.9

https://gerrit.wikimedia.org/r/1036609

The exporter is now running on all GitLab instances, and an alert for missing config on the Trusted Runners is active. I added some visualizations for the new metrics:

https://grafana-rw.wikimedia.org/d/Chb-gC07k/gitlab-ci-overview?forceLogin&orgId=1&viewPanel=29
https://grafana-rw.wikimedia.org/d/R_1IvBZnz/gitlab-omnibus-overview?orgId=1&refresh=1m&viewPanel=49
https://grafana-rw.wikimedia.org/d/R_1IvBZnz/gitlab-omnibus-overview?orgId=1&refresh=1m

And updated the docs here: https://wikitech.wikimedia.org/wiki/GitLab/Monitoring.

The exporter could be improved by adding proper CI tests and linting checks. There are also a lot of TODOs in the code due to the old client API versions (GitLab and Prometheus), which can be removed once the GitLab hosts are on Bookworm.

Additional improvements to the exporter can happen outside of this task, I think. We now have a working foundation for adding further custom metrics for GitLab, so I'm closing this task.

Change #1036987 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: bump exporter version to v1.0.10

https://gerrit.wikimedia.org/r/1036987

Change #1036987 merged by Jelto:

[operations/puppet@production] gitlab: bump exporter version to v1.0.10

https://gerrit.wikimedia.org/r/1036987