
Define monitoring for gitlab
Closed, Resolved (Public)

Event Timeline

FWIW here's a quick review of current gerrit alerting in case it helps when thinking about checks to include in gitlab monitoring.

HTTPS health check https://gerrit.wikimedia.org/r/config/server/healthcheck~status
HTTPS size check if https://gerrit.wikimedia.org/r/changes/?n=25&O=81 is larger than 10K
HTTPS certificate check
SSHD check gerrit.wikimedia.org on port 29418
Check if gerrit process is running on host
Ping check gerrit.wikimedia.org

There is also a grafana dashboard here https://grafana.wikimedia.org/d/000000063/releng-gerrit
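
For comparison, several of the checks above (HTTPS health, certificate expiry, SSH banner, ping) could also be expressed as Prometheus blackbox_exporter modules. A minimal sketch, assuming standard blackbox_exporter probers and made-up module names:

# Illustrative blackbox_exporter modules only; module names are placeholders.
modules:
  http_2xx:                    # HTTPS health check; the http prober also exports
    prober: http               # probe_ssl_earliest_cert_expiry for certificate alerts
    timeout: 10s
    http:
      fail_if_not_ssl: true
  ssh_banner:                  # SSHD check, e.g. against port 29418
    prober: tcp
    timeout: 10s
    tcp:
      query_response:
        - expect: "^SSH-2.0-"
  icmp:                        # ping check
    prober: icmp
    timeout: 5s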

greg triaged this task as Medium priority. Feb 24 2021, 5:38 PM

For Prometheus/Grafana:

I believe jbond can add gitlab1001's exporters to our prometheus via puppet. He needs to know the ports that provide gitlab information.

Then we need Grafana dashboards that show GitLab functioning, i.e. what makes sense to graph here, and are there thresholds that we should alert on?

For icinga: should we do similar checks?

Commenting to bring together a few threads of conversation.

Via email I asked the following questions:

  1. Is there any special network access requirement? I imagine since prometheus is a pull system that this is not the case; however, there may be emergent patterns for setting up prometheus.
  2. Is there any complication from having a service running on a Ganeti VM not behind load balancing? My understanding (which is limited) is that this is somewhat novel compared to the rest of our production services.
  3. Gitlab comes with a pretty extensive set of rules that seem larger than anything I found in our puppet repo; what are your thoughts/concerns about using third-party rules/alerts?

Quoting responses from @fgiunchedi and @lmata (please correct me if I leave out important detail)

  1. Not a problem
  2. Not a problem
  3. Mostly a direct quote from @fgiunchedi > I think for sure gitlab.rules applies, node.rules is likely redundant though as we have similar rules already in place. WRT alerts defined in gitlab.rules we have the operations/alerts.git repository available for self-service alerts (via AlertManager), although committing the alerts+rules in puppet for now I think will work as well.

As soon as we have a running production Gitlab instance I will post an update here with the list of Prometheus exporter endpoints to pull.

Besides default monitors of instance health (disk space, CPU, memory), for monitoring GitLab health we suggest starting with the set of GitLab-curated dashboards created to monitor Omnibus installs: https://gitlab.com/gitlab-org/grafana-dashboards/-/tree/master/omnibus
(These dashboards assume a datasource named GitLab Omnibus. They can be imported via the Import button or via the provisioning APIs.)
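
For example, with Grafana's provisioning mechanism a matching datasource could be declared roughly like this (a sketch only; the Prometheus URL is a placeholder):

# Grafana datasource provisioning sketch; the url below is a placeholder.
apiVersion: 1
datasources:
  - name: GitLab Omnibus
    type: prometheus
    access: proxy
    url: http://prometheus.example.org:9090
    isDefault: false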

For inspiration for later stages, here's a (very) long list of dashboards GitLab uses to monitor gitlab.com: https://gitlab.com/gitlab-org/grafana-dashboards/-/tree/master/dashboards

This may already be in hand, however I noticed that the GitLab built-in node-exporter runs on the same port as the node-exporter we already have installed on production hosts, so I think the following is required in gitlab.rb:

node_exporter['enable'] = false

Further, as we already have an external Prometheus, Alertmanager and Grafana set up in production, I think it makes sense to also apply the following:

prometheus['enable'] = false
grafana['enable'] = false
alertmanager['enable'] = false

Along with the other settings for an eternal prometheus server

Makes sense. We were using the internal dashboards to assess performance etc. on the dev instance, but we will disable them for the production environment.

What would be the monitoring IP range(s) to whitelist, at least for a start?

(I like the idea of eternal Prometheus though, jk)

From now on this is going to be the default:

# Monitoring configuration
gitlab_prometheus_enable: "false"
gitlab_grafana_enable: "false"
gitlab_alertmanager_enable: "false"
gitlab_gitlab_exporter_enable: "false"
gitlab_node_exporter_enable: "false"
gitlab_postgres_exporter_enable: "false"
gitlab_redis_exporter_enable: "false"

We'll put whitelist ranges in when available.

What would be the monitoring IP range(s) to whitelist, at least for a start?

I think 10/8 or 10.64/16 should be good, @herron can you confirm? (We have iptables blocking these ports regardless, so we don't need to be too strict here.)

This is configurable now: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/gitlab-ansible/+/8c0007586e4b25da076a8ce624411050bd9abe9b%5E%21/#F0

It can be updated when more info is available.

Note this one: apparently a specific Prometheus endpoint has to be configured for GitLab.

+#gitlab_prometheus_address: "prometheus:9090"

gitlab_prometheus_address: "prometheus:9090"

Can you give information about what this parameter actually configures? We have 7 Prometheus servers; without more context it's hard to know what to put here.

On closer inspection, this setting is not directly monitoring-related in the context of this ticket. It configures the Prometheus server for the GitLab UI Prometheus integration: https://docs.gitlab.com/ee/user/project/integrations/prometheus.html

Ahh, that makes a bit more sense, although I'm still not sure what the correct value would be. I will leave that to observability (@lmata); however, I suspect it's out of scope for the MVP (@thcipriani, @wkandek).

I think we should enable the following exporters:

gitlab_gitlab_exporter_enable: "true"
gitlab_postgres_exporter_enable: "true"
gitlab_redis_exporter_enable: "true"

Then we should get the ports they are listening on and open the firewall.

Is this something you guys will be changing (and deploying)? Just to verify that we are on the same page here.

On my test machine I see the following:

  • port 9168 gitlab_exporter
  • port 9187 postgres_exporter
  • port 9121 redis exporter
  • port 9100 is the node_exporter that runs already on gitlab1001, so no need to activate that

And the firewall rules on gitlab1001 already account for port 9100 (and 9105?):
ACCEPT tcp -- prometheus1003.eqiad.wmnet anywhere tcp dpt:9100
ACCEPT tcp -- prometheus1004.eqiad.wmnet anywhere tcp dpt:9100
ACCEPT tcp -- prometheus1003.eqiad.wmnet anywhere tcp dpt:9105
ACCEPT tcp -- prometheus1004.eqiad.wmnet anywhere tcp dpt:9105

vagrant@omnibus:~$ ps -ef | grep -exporter
root 1361 1348 0 20:46 ? 00:00:00 runsv node-exporter
root 1362 1348 0 20:46 ? 00:00:00 runsv redis-exporter
root 1363 1348 0 20:46 ? 00:00:00 runsv gitlab-exporter
root 1364 1348 0 20:46 ? 00:00:00 runsv postgres-exporter
root 1377 1361 0 20:46 ? 00:00:00 svlogd -tt /var/log/gitlab/node-exporter
root 1378 1363 0 20:46 ? 00:00:00 svlogd -tt /var/log/gitlab/gitlab-exporter
root 1391 1364 0 20:46 ? 00:00:00 svlogd -tt /var/log/gitlab/postgres-exporter
root 1393 1362 0 20:46 ? 00:00:00 svlogd -tt /var/log/gitlab/redis-exporter
git 1447 1363 1 20:46 ? 00:00:48 /opt/gitlab/embedded/bin/ruby /opt/gitlab/embedded/bin/gitlab-exporter web -c /var/opt/gitlab/gitlab-exporter/gitlab-exporter.yml
gitlab-+ 1449 1364 0 20:46 ? 00:00:07 /opt/gitlab/embedded/bin/postgres_exporter --web.listen-address=localhost:9187 --extend.query-path=/var/opt/gitlab/postgres-exporter/queries.yaml
gitlab-+ 1450 1362 0 20:46 ? 00:00:02 /opt/gitlab/embedded/bin/redis_exporter --web.listen-address=localhost:9121 --redis.addr=unix:///var/opt/gitlab/redis/redis.socket
gitlab-+ 1456 1361 0 20:46 ? 00:00:13 /opt/gitlab/embedded/bin/node_exporter --web.listen-address=localhost:9100 --collector.mountstats --collector.runit --collector.runit.servicedir=/opt/gitlab/sv --collector.textfile.directory=/var/opt/gitlab/node-exporter/textfile_collector
vagrant 21143 20485 0 22:06 pts/1 00:00:00 grep --color=auto -exporter
vagrant@omnibus:~$ sudo netstat -antp | grep 1447
tcp 0 0 127.0.0.1:9168 0.0.0.0:* LISTEN 1447/ruby
tcp 0 0 127.0.0.1:9168 127.0.0.1:51630 ESTABLISHED 1447/ruby
tcp 0 0 127.0.0.1:9168 127.0.0.1:51628 ESTABLISHED 1447/ruby
vagrant@omnibus:~$ sudo netstat -antp | grep 1449
tcp 0 0 127.0.0.1:9187 0.0.0.0:* LISTEN 1449/postgres_expor
tcp 0 0 127.0.0.1:9187 127.0.0.1:34866 ESTABLISHED 1449/postgres_expor
vagrant@omnibus:~$ sudo netstat -antp | grep 1450
tcp 0 0 127.0.0.1:9121 0.0.0.0:* LISTEN 1450/redis_exporter
tcp 0 0 127.0.0.1:9121 127.0.0.1:54700 ESTABLISHED 1450/redis_exporter

I checked the current exporter configuration.
Host/node metrics are provided by node_exporter. The node_exporter bundled with GitLab is disabled because we are using the stand-alone node_exporter, and the host metrics are already available in Grafana. So host/node metrics should be fine.

Regarding application specific metrics, the production GitLab instance has the following exporters enabled (see host_vars/gitlab-server-prod):

gitlab_gitlab_exporter_enable: "true"
gitlab_postgres_exporter_enable: "true"
gitlab_redis_exporter_enable: "true"

I checked on gitlab1001 and all the exporters produce metrics.

So the next step would be to add the GitLab application-specific endpoints to the Prometheus scrape config. The docs contain an example. I adapted the scrape config to our instance and removed the unused/duplicate entries:

scrape_configs:
  - job_name: nginx
    static_configs:
      - targets:
        - gitlab1001.wikimedia.org:8060
  - job_name: redis
    static_configs:
      - targets:
        - gitlab1001.wikimedia.org:9121
  - job_name: postgres
    static_configs:
      - targets:
        - gitlab1001.wikimedia.org:9187
  - job_name: gitlab-workhorse
    static_configs:
      - targets:
        - gitlab1001.wikimedia.org:9229
  - job_name: gitlab-rails
    metrics_path: "/-/metrics"
    static_configs:
      - targets:
        - gitlab1001.wikimedia.org:8080
  - job_name: gitlab-sidekiq
    static_configs:
      - targets:
        - gitlab1001.wikimedia.org:8082
  - job_name: gitlab
    static_configs:
      - targets:
        - gitlab1001.wikimedia.org:9168
  - job_name: gitaly
    static_configs:
      - targets:
        - gitlab1001.wikimedia.org:9236

GitLab offers multiple metrics endpoints (rails, workhorse, sidekiq, gitaly). I would recommend scraping all of them so that the prebuilt dashboards (https://gitlab.com/gitlab-org/grafana-dashboards/-/tree/master/omnibus) have the needed metrics. Most dashboards should work without the additional recording rules (https://gitlab.com/gitlab-org/omnibus-gitlab/tree/master/files/gitlab-cookbooks/monitoring/templates/rules), so we don't have to touch the rules and alerts at first.
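
(For reference, the upstream recording rules are roughly of this shape; the rule below is an illustrative sketch, not one of the actual upstream rules, and the metric name is only assumed:)

# Illustrative sketch of a recording rule; not taken from the upstream rule set.
groups:
  - name: gitlab-recording-rules-example
    rules:
      - record: job:gitlab_workhorse_http_requests_total:rate5m
        expr: sum by (job) (rate(gitlab_workhorse_http_requests_total[5m]))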

I'm not 100% familiar with the current Prometheus setup, so I'm not sure which Prometheus instance is the correct one in which to configure the additional targets (prometheus1003 and prometheus1004?). Maybe somebody can point me to the right repo with the Prometheus scrape config.

Indeed, I can confirm that's the case for the host/node metrics.

We do indeed have prometheus hosts (e.g. prometheus100[34]), and those run multiple Prometheus servers. The instance that makes the most sense to me is ops (i.e. SRE's former name). The core of the puppetization relevant for this is modules/profile/manifests/prometheus/ops.pp, and the configuration follows this pattern in most cases:

  1. Configure jobs within ops.pp to use file_sd
  2. Let puppet autogenerate target files (via puppetdb) based on e.g. prometheus::class_config to be able to say "target all hosts running this class/role/etc"

The define prometheus::class_config is essentially used to query puppetdb and fetch the metadata we need (hostnames, clusters, etc). In ops.pp you'll find some existing jobs (e.g. for redis, nginx, postgres), and IMHO it makes sense to append the gitlab targets to those where appropriate (or define new jobs). For "brand new" software/jobs you'll have to create both the target files and the job definitions themselves (e.g. the $postgresql_jobs array). In all cases you'll need some kind of "hook class" in puppet to query puppetdb (this could be the role, or the exporter's class, depending on what makes sense).

Hope that makes sense! Please reach out with questions/code reviews/etc
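
To make the pattern above concrete, here is a minimal sketch of a file_sd-based job and of the kind of target file puppet could autogenerate from puppetdb; the file path and label values are hypothetical:

# Sketch of a job in the ops Prometheus server config using file-based service discovery.
scrape_configs:
  - job_name: gitlab
    file_sd_configs:
      - files:
          - /srv/prometheus/ops/targets/gitlab_*.yaml   # hypothetical path

# Sketch of a target file that puppet (e.g. via prometheus::class_config) could generate
# for every host carrying the relevant class/role; the label values are illustrative.
- targets:
    - gitlab1001.wikimedia.org:9168
  labels:
    cluster: misc
    site: eqiad

The point of the indirection is that puppet only rewrites the target files, while the job definitions themselves stay static.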

Change 704503 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] prometheus::ops add jobs to scrape gitlab metrics

https://gerrit.wikimedia.org/r/704503

Change 704503 merged by Jelto:

[operations/puppet@production] prometheus::ops add jobs and ferm rule to scrape gitlab metrics

https://gerrit.wikimedia.org/r/704503

Change 705761 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] Revert "prometheus::ops add jobs and ferm rule to scrape gitlab metrics"

https://gerrit.wikimedia.org/r/705761

Change 705761 merged by Jelto:

[operations/puppet@production] Revert "prometheus::ops add jobs to scrape gitlab metrics"

https://gerrit.wikimedia.org/r/705761

Change 705930 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/gitlab-ansible@master] make prometheus exporters reachable

https://gerrit.wikimedia.org/r/705930

Change 705930 merged by Brennen Bearnes:

[operations/gitlab-ansible@master] make prometheus exporters reachable

https://gerrit.wikimedia.org/r/705930

Mentioned in SAL (#wikimedia-releng) [2021-07-21T19:06:41Z] <brennen> gitlab1001: running ansible to deploy nginx logging and status changes (T274462, T275170)

Change 706396 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/gitlab-ansible@master] fix puma and sidekiq exporter listen address

https://gerrit.wikimedia.org/r/706396

Change 706396 merged by Brennen Bearnes:

[operations/gitlab-ansible@master] fix puma and sidekiq exporter listen address

https://gerrit.wikimedia.org/r/706396

Mentioned in SAL (#wikimedia-operations) [2021-07-22T16:56:26Z] <brennen> gitlab1001: running ansible to deploy [[gerrit:706396]] (T275170)

Change 707236 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/gitlab-ansible@master] fix puma exporter listen address

https://gerrit.wikimedia.org/r/707236

Change 707236 merged by Brennen Bearnes:

[operations/gitlab-ansible@master] fix puma exporter listen address

https://gerrit.wikimedia.org/r/707236

Mentioned in SAL (#wikimedia-operations) [2021-07-23T14:16:30Z] <brennen> gitlab1001: running ansible to deploy [[gerrit:707236|fix puma exporter listen address]] (T275170)

Change 707859 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] move gitlab rails exporter to port 8083

https://gerrit.wikimedia.org/r/707859

Change 707860 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] prometheus::ops add job to scrape gitlab metrics

https://gerrit.wikimedia.org/r/707860

Change 707859 merged by Jelto:

[operations/puppet@production] move gitlab rails exporter to port 8083

https://gerrit.wikimedia.org/r/707859

Change 707860 merged by Jelto:

[operations/puppet@production] prometheus::ops add job to scrape gitlab metrics

https://gerrit.wikimedia.org/r/707860

Change 708291 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] prometheus::ops fix gitlab rails metric path

https://gerrit.wikimedia.org/r/708291

Change 708291 merged by Jelto:

[operations/puppet@production] prometheus::ops fix gitlab rails metric path

https://gerrit.wikimedia.org/r/708291

The scrape configuration for GitLab is in place and Prometheus collects metrics.

I imported and adapted the upstream GitLab Grafana dashboards and created a dedicated GitLab folder in Grafana: https://grafana.wikimedia.org/dashboards/f/mtrpIBZ7z/gitlab . Some dashboards are still empty. I will go through all dashboards and panels as soon as we have some more metrics (~1 day).

Change 708530 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] icinga::monitor::gitlab add alerts for https and ssh for gitlab

https://gerrit.wikimedia.org/r/708530

Change 708530 merged by Jelto:

[operations/puppet@production] icinga::monitor::gitlab add alerts for https and ssh for gitlab

https://gerrit.wikimedia.org/r/708530

Basic Icinga alerts for the public https and SSH endpoints of GitLab are in place now: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=gitlab.wikimedia.org

I also fixed the empty Grafana dashboards: https://grafana.wikimedia.org/dashboards/f/mtrpIBZ7z/gitlab

Some dashboards and alerts indicate that the GitLab Rails service occasionally has reduced availability (~99.5%).

https://grafana.wikimedia.org/d/R_1IvBZnz/gitlab-omnibus-overview?viewPanel=6&orgId=1&from=1627516800000&to=1627646400000

https://grafana.wikimedia.org/d/R_1IvBZnz/gitlab-omnibus-overview?viewPanel=15&orgId=1&from=1627516800000&to=1627646400000

I will try to dig a bit deeper and check whether this is an exporter-only issue or whether GitLab's availability is reduced in general during this time.

The GitLab Rails service has reduced availability roughly every ~24 hours (plus some offset). See this Grafana dashboard.

The reduced availability is also reported to #wikimedia-operations as

PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=rails site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
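
(For context, checks of this kind are typically an aggregation over the up metric; the rule below is only an illustrative sketch, not the production alert:)

# Illustrative sketch only, not the production "reduced availability" rule:
# fire if, averaged over the last 30 minutes, less than 90% of a job's targets were up.
groups:
  - name: job-availability-example
    rules:
      - alert: PrometheusJobReducedAvailability
        expr: avg by (job) (avg_over_time(up[30m])) < 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'Prometheus job {{ $labels.job }} has reduced availability'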

After some debugging on gitlab1001 I found that reduced service availability correlates with the automated restart of puma/rails workers:

root@gitlab1001:/var/log/gitlab/puma# zgrep "Rolling Restart" puma_stdout.log.1.gz 
{"timestamp":"2021-08-01T09:24:44.338Z","pid":21789,"message":"PumaWorkerKiller: Rolling Restart. 8 workers consuming total: 7339.90234375 mb. Sending TERM to pid 1390."}
{"timestamp":"2021-08-01T09:25:44.338Z","pid":21789,"message":"PumaWorkerKiller: Rolling Restart. 8 workers consuming total: 7339.90234375 mb. Sending TERM to pid 1983."}
{"timestamp":"2021-08-01T09:26:44.339Z","pid":21789,"message":"PumaWorkerKiller: Rolling Restart. 8 workers consuming total: 7339.90234375 mb. Sending TERM to pid 2823."}
{"timestamp":"2021-08-01T09:27:44.340Z","pid":21789,"message":"PumaWorkerKiller: Rolling Restart. 8 workers consuming total: 7339.90234375 mb. Sending TERM to pid 1833."}
{"timestamp":"2021-08-01T09:28:44.341Z","pid":21789,"message":"PumaWorkerKiller: Rolling Restart. 8 workers consuming total: 7339.90234375 mb. Sending TERM to pid 1046."}
{"timestamp":"2021-08-01T09:29:44.341Z","pid":21789,"message":"PumaWorkerKiller: Rolling Restart. 8 workers consuming total: 7339.90234375 mb. Sending TERM to pid 1558."}
{"timestamp":"2021-08-01T09:30:44.342Z","pid":21789,"message":"PumaWorkerKiller: Rolling Restart. 8 workers consuming total: 7339.90234375 mb. Sending TERM to pid 893."}
{"timestamp":"2021-08-01T09:31:44.343Z","pid":21789,"message":"PumaWorkerKiller: Rolling Restart. 8 workers consuming total: 7339.90234375 mb. Sending TERM to pid 1690."}

As described here, restarting of the puma workers happens every 12h, or when a memory limit for the workers is exceeded. I could not see any user-facing reduced availability during these restarts.

However, the restarts happen every 12h while the alert fires only once every 24h. I am currently trying to find out why one restart has no impact on service availability whereas the other does.

Interesting. Monitoring shows workers hitting some memory limit (https://grafana.wikimedia.org/d/R_1IvBZnz/gitlab-omnibus-overview?viewPanel=15&orgId=1&from=1627897995403&to=1627901984407), which degrades service availability (https://grafana.wikimedia.org/d/R_1IvBZnz/gitlab-omnibus-overview?viewPanel=15&orgId=1&from=1627897995403&to=1627901984407). There don't seem to be user requests at this time; I'm curious what's filling worker memory and what work is happening during these restarts.

The puma workers get killed either when they hit the memory limit (puma['per_worker_max_memory_mb'] = 1024, i.e. ~1 GB by default) or automatically after 12 hours. I can't see any of the workers hitting the limit in the past; they are being killed because their uptime is above 12 hours, and during the restart service availability is reduced. My assumption is that some parts of the Puma web server/GitLab have memory leaks. This is the reason memory fills up and why GitLab uses tools like puma_worker_killer. I will try to debug and troubleshoot this problem on gitlab2001 a little further; I would like to have a smooth restart of the puma workers without reduced availability.

Speaking of alerting, we currently have basic VM checks in Icinga (which are configured by default for all machines). Additionally, checks for the SSH and HTTPS endpoints are configured for gitlab.wikimedia.org (see icinga). Furthermore, we have basic alerting from alertmanager in case multiple services are "down" (rails, postgres, redis, sidekiq, workhorse and gitaly).
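
For illustration, the "multiple services down" style of alert could be written along these lines (a sketch only; the job names are taken from the example scrape config earlier in this task, and the actual configured rule may differ):

# Illustrative sketch only; threshold and job names are assumptions.
groups:
  - name: gitlab-services-down-example
    rules:
      - alert: GitLabMultipleServicesDown
        expr: count(up{job=~"nginx|redis|postgres|gitlab-workhorse|gitlab-rails|gitlab-sidekiq|gitlab|gitaly"} == 0) >= 2
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: 'Two or more GitLab component exporters are down'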

Do you think we need any additional alerts for the current setup before we can finish this task? There are at least 11 alerts in the upstream alertmanager config, but in my opinion these alerts have little use for #wikimedia-operations. I would prefer to add alerts later on if we notice something is missing.

I would like to either finish this task or add additional requirements. Currently we are collecting metrics for all GitLab components on gitlab1001 and gitlab2001, and we have Grafana dashboards and basic alerts in Icinga.

@brennen are you missing any additional alerts? Maybe something more application-specific? As I mentioned, there are at least 11 alerts in the upstream alertmanager config, but I don't see a major benefit in using them in the current setup.

I think we're good for the moment. Let's go ahead and close this one and file new tasks for anything else that comes up. Thanks!

Adding the history of changes so that we have them all linked here.

Also, this is a way to share information with @thcipriani, because we talked a little about the new prometheus::blackbox monitoring today, and so that I have something to link to when sharing with others.

Here is the history of what we added so far.

https://gerrit.wikimedia.org/r/c/operations/puppet/+/807201 - alertmanager: create receivers for serviceops-collab (creates a team / notification methods)

https://gerrit.wikimedia.org/r/c/operations/puppet/+/809300 - alertmanager: change phab project for automated tasks for serviceops (picks the right tag for automatic Phab tickets)

https://gerrit.wikimedia.org/r/c/operations/puppet/+/809303 - alertmanager: replace email address for service-ops-collab notifications (pick a more specific subteam)


Creating the actual check using the receivers:

https://gerrit.wikimedia.org/r/c/operations/puppet/+/806476 - gitlab: add prometheus blackbox http monitor (creates a check using the receivers)

https://gerrit.wikimedia.org/r/c/operations/puppet/+/810899 - Revert "gitlab: add prometheus blackbox http monitor" (broke puppet at first)

https://gerrit.wikimedia.org/r/c/operations/puppet/+/811276 - revert revert, debugging

https://gerrit.wikimedia.org/r/c/operations/puppet/+/811349 - gitlab/prometheus: 'body_regex_matches' expects an Array value, got String

https://gerrit.wikimedia.org/r/c/operations/puppet/+/811882 - gitlab: fix IPs, hostname and regex for blackbox check
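
For reference, at the Prometheus level a blackbox HTTP check of this kind boils down to a probe scrape job like the following; this is an illustrative sketch of the standard blackbox_exporter pattern, not the puppet-generated config, and the module name and exporter address are assumptions:

# Sketch of the standard blackbox_exporter probe pattern; the real config is
# generated by the prometheus::blackbox puppetization referenced above.
scrape_configs:
  - job_name: blackbox_http_gitlab
    metrics_path: /probe
    params:
      module: [http_2xx]                  # assumed module name
    static_configs:
      - targets:
          - https://gitlab.wikimedia.org
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target      # the probed URL becomes the ?target= parameter
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115       # assumed blackbox_exporter address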

One currently open issue is that we need to downtime alerts while the GitLab restore process runs daily for about 15 minutes on the gitlab-replica machine.
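
One way to handle that without manual downtimes would be a recurring Alertmanager mute window; a minimal sketch (a fragment of the routing config), with the matcher and time window as placeholders for the real replica host and restore schedule:

# Alertmanager fragment, illustrative only; matcher and times are placeholders.
route:
  routes:
    - matchers:
        - 'instance=~"gitlab-replica.*"'
      mute_time_intervals:
        - gitlab_restore_window
time_intervals:
  - name: gitlab_restore_window
    time_intervals:
      - times:
          - start_time: "02:00"   # placeholder for the daily ~15 minute restore window
            end_time: "02:20"

That would silence the replica's alerts only during the restore window instead of requiring a manual downtime every day.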

Optionally we could reopen this ticket for just a short time until we declare it done and link the alert dashboard.

Change 812427 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gitlab: for now, only monitor the active host, not the replica

https://gerrit.wikimedia.org/r/812427

Change 812427 merged by Dzahn:

[operations/puppet@production] gitlab: for now, only monitor the active host, not the replica

https://gerrit.wikimedia.org/r/812427