
Export zuul metrics to Prometheus
Closed, Resolved, Public

Description

See the parent task for more context. The tl;dr is that we want zuul / gerrit metrics in Prometheus, ideally switching off statsd completely, or using statsd_exporter if Prometheus-native support isn't available.

For zuul here's the dashboard/alert audit:

monitoring::graphite_threshold{ 'zuul_gearman_wait_queue':
    ensure          => $ensure,
    description     => 'Work requests waiting in Zuul Gearman server',
    dashboard_links => ['https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1'],
    metric          => 'zuul.geard.queue.waiting',
    contact_group   => 'contint',
    from            => '15min',
    percentage      => 30,
    warning         => 90,
    critical        => 140,
    notes_link      => 'https://www.mediawiki.org/wiki/Continuous_integration/Zuul',
}
Matched db/continuous-integration (Continuous Integration)
[ '(zuul.pipeline.*.current_changes',
  '(zuul.pipeline.*.all_jobs.count',
  '"zuul.geard.queue.waiting',
  '(zuul.pipeline.*.resident_time.mean' ]
Matched db/releng-kpis (RelEng :: KPIs)
[ '(zuul.pipeline.*.label.ci-*.wait_time.upper',
  '(zuul.pipeline.*.label.ci-*.wait_time.upper',
  '(zuul.pipeline.*.job.*.wait_time.mean',
  '(zuul.pipeline.*.job.*.wait_time.upper',
  '(zuul.pipeline.test-prio.operations.puppet.resident_time.p95, ',
  '(zuul.pipeline.test-prio.job.operations-puppet-tests-stretch-docker.wait_time.p95, ',
  '(zuul.pipeline.*.*.mean',
  '(zuul.pipeline.*.*.mean',
  '(zuul.pipeline.*.job.*.*.mean',
  '(zuul.pipeline.*.current_changes',
  '(zuul.pipeline.gate-and-submit.mediawiki.core.resident_time.upper, ',
  '(zuul.pipeline.gate-and-submit.mediawiki.core.resident_time.upper, ',
  '(zuul.pipeline.*.job.*.FAILURE.sum' ]
Matched db/zuul (Zuul)
[ '(zuul.pipeline.gate-and-submit.resident_time.upper, ',
  '(zuul.pipeline.gate-and-submit.resident_time.upper, ',
  '(zuul.pipeline.gate-and-submit.resident_time.median, ',
  '(zuul.pipeline.test-prio.resident_time.upper, ',
  '(zuul.pipeline.test-prio.resident_time.upper, ',
  '(zuul.pipeline.test-prio.resident_time.median, ',
  '(zuul.pipeline.*.all_jobs.count',
  '(zuul.pipeline.*.total_changes.count',
  '(zuul.pipeline.*.current_changes',
  '(zuul.pipeline.*.label.*.wait_time.count, 4, ',
  '(zuul.pipeline.*.label.*Docker.wait_time.count, 4, ',
  '(zuul.pipeline.*.label.*Docker.wait_time.count',
  '(zuul.pipeline.*.label.*Docker.wait_time.count, ',
  '(zuul.pipeline.*.label.*Docker.wait_time.upper, ',
  '(zuul.pipeline.*.label.*Docker.wait_time.upper, 4, ',
  '(zuul.pipeline.*.label.{Ubuntu*,DebianJessie',
  '(zuul.pipeline.*.label.{Ubuntu*,DebianJessie',
  '(zuul.pipeline.*.label.{Ubuntu*,DebianJessie,phpflavor-php*',
  '(zuul.pipeline.*.label.{Ubuntu*,DebianJessie',
  '(zuul.pipeline.*.label.{Ubuntu*,DebianJessie',
  '(zuul.pipeline.gate-and-submit.resident_time.upper, ',
  '(zuul.pipeline.gate-and-submit.resident_time.count, ',
  '(zuul.pipeline.gate-and-submit.resident_time.upper, ',
  '(zuul.pipeline.gate-and-submit.resident_time.upper, ',
  '(zuul.pipeline.gate-and-submit.resident_time.upper, ',
  '(zuul.pipeline.gate-and-submit.mediawiki.core.resident_time.upper, ',
  '(zuul.pipeline.gate-and-submit.mediawiki.core.resident_time.count, ',
  '(zuul.pipeline.test-prio.operations.puppet.resident_time.upper, ',
  '(zuul.pipeline.test-prio.operations.puppet.resident_time.count, ',
  '(zuul.pipeline.test-prio.operations.puppet.resident_time.median, ',
  '(zuul.pipeline.gate-and-submit.operations.mediawiki-config.resident_time.upper, ',
  '(zuul.pipeline.gate-and-submit.operations.mediawiki-config.resident_time.count, ',
  '(zuul.pipeline.gate-and-submit.operations.mediawiki-config.resident_time.mean, ',
  '(zuul.pipeline.test-prio.operations.puppet.resident_time.sum, ',
  '(zuul.pipeline.test-prio.operations.mediawiki-config.resident_time.sum, ',
  '(zuul.pipeline.gate-and-submit.resident_time.mean, ',
  '(zuul.pipeline.gate-and-submit.mediawiki.extensions.MobileFrontend.resident_time.count, ',
  '(zuul.pipeline.gate-and-submit.resident_time.count, ',
  '(zuul.pipeline.*.current_changes',
  '(zuul.pipeline.*.current_changes' ]
Checking ..............
Matched db/zuul-gearman (Zuul :: Gearman)
[ '"zuul.geard.queue.running',
  '"zuul.geard.queue.total',
  '"zuul.geard.queue.waiting',
  '(zuul.geard.queue.running, ',
  '(zuul.geard.queue.waiting, ',
  '(zuul.geard.queue.total, ',
  '"zuul.geard.queue.running',
  '"zuul.geard.queue.total',
  '"zuul.geard.queue.waiting',
  '(zuul.geard.queue.running, ',
  '(zuul.geard.queue.waiting, ',
  '(zuul.geard.queue.waiting, ',
  '(zuul.geard.queue.running, ',
  '(zuul.geard.workers,\\',
  '(zuul.geard.workers,\\',
  '(zuul.geard.workers, ',
  '(zuul.geard.workers, ',
  '(zuul.geard.queue.waiting',
  '(zuul.geard.queue.waiting',
  '"zuul.geard.queue.running',
  '"zuul.geard.queue.total',
  '"zuul.geard.queue.waiting',
  '(zuul.geard.queue.waiting, ',
  '(zuul.geard.queue.waiting, 5',
  '(zuul.geard.packet.CAN_DO.count, ',
  '(zuul.geard.packet.SUBMIT*.count, ' ]
Checking ...............
Matched db/zuul-pipeline (Zuul :: Pipeline)
[ '(zuul.pipeline.$pipeline.all_jobs.count, ',
  '"zuul.pipeline.*' ]
Checking ................
Matched db/zuul-job (Zuul job)
[ '(zuul.pipeline.$pipeline.job.$job.$status.count',
  '(zuul.pipeline.$pipeline.job.$job.$status.upper, 75',
  '(zuul.pipeline.$pipeline.job.$job.$status.upper, 95',
  '(zuul.pipeline.$pipeline.job.$job.$status.upper, 98',
  '(zuul.pipeline.$pipeline.job.$job.$status.count',
  '(zuul.pipeline.$pipeline.job.$job.$status.count',
  '(zuul.pipeline.*.job.*.$status.count, 2, 5',
  '"zuul.pipeline.*',
  '"zuul.pipeline.*.job.*' ]
Checking .................
Matched db/zuul-top-jobs (Zuul top jobs)
[ '(zuul.pipeline.*.job.*.wait_time.count, 15' ]

Zuul also exports gerrit metrics, according to the parent task:

Matched db/releng-gerrit (RelEng :: Gerrit)
[ '(gerrit.event.change-merged.sum, 10',
  '(gerrit.event.patchset-created.sum, 10',
  '(gerrit.event.comment-added.sum, 10',
  '(gerrit.event.comment-added.sum, 10' ]
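For illustration, here is a rough sketch of the kind of translation a statsd-to-Prometheus mapping has to perform on the metric names audited above: wildcard components of the dotted statsd name become labels on a single Prometheus metric. This is only a Python approximation of the idea, not the statsd_exporter configuration used in any patch; the Prometheus metric names and label names (`state`, `pipeline`, `job`, `type`) are assumptions made up for the example.

```python
import re

# Hypothetical rules: (glob over dotted statsd name, Prometheus name, label per "*").
# The Prometheus names and label names are illustrative assumptions.
MAPPINGS = [
    ("zuul.geard.queue.*", "zuul_geard_queue", ["state"]),
    ("zuul.pipeline.*.job.*.wait_time", "zuul_pipeline_job_wait_time", ["pipeline", "job"]),
    ("zuul.pipeline.*.current_changes", "zuul_pipeline_current_changes", ["pipeline"]),
    ("gerrit.event.*", "gerrit_event", ["type"]),
]


def glob_to_regex(pattern):
    """Compile a glob where each "*" matches exactly one dot-separated component."""
    parts = [r"([^.]+)" if part == "*" else re.escape(part) for part in pattern.split(".")]
    return re.compile(r"\.".join(parts))


COMPILED = [(glob_to_regex(glob), name, labels) for glob, name, labels in MAPPINGS]


def map_metric(statsd_name):
    """Return (prometheus_name, labels) for the first matching rule, or None."""
    for regex, name, label_names in COMPILED:
        match = regex.fullmatch(statsd_name)
        if match:
            return name, dict(zip(label_names, match.groups()))
    return None


print(map_metric("zuul.geard.queue.waiting"))
print(map_metric("zuul.pipeline.test-prio.job.operations-puppet-tests-stretch-docker.wait_time"))
print(map_metric("gerrit.event.change-merged"))
```

Note that per-project names such as `zuul.pipeline.gate-and-submit.mediawiki.core.resident_time` contain dots inside the project component, so a one-component-per-wildcard rule like the above does not match them; they would need a different kind of rule.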

Event Timeline

Change 537362 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] ci: add statsd_exporter for zuul/gerrit

https://gerrit.wikimedia.org/r/537362

herron triaged this task as Medium priority. Sep 18 2019, 7:06 PM

Change 537362 abandoned by Filippo Giunchedi:
ci: add statsd_exporter for zuul/gerrit

Reason:
Duplicate of I27b3c86fbeb266, will followup on that instead

https://gerrit.wikimedia.org/r/537362

Change 479139 had a related patch set uploaded (by Filippo Giunchedi; owner: Cwhite):
[operations/puppet@production] ci: define statsd prometheus exporter mappings

https://gerrit.wikimedia.org/r/479139

Is there anything we in RelEng can do to help with this work?

@Jdforrester-WMF Absolutely! If someone would be willing to review and give a stamp of (approval|disapproval) on https://gerrit.wikimedia.org/r/479139, that would unblock us.

Dzahn subscribed.

If we had this, we could maybe remove the Icinga process checks we have for zuul and zuul-merger (T334250) and replace them with Prometheus monitoring.

Do we have a solution for this yet? I fear that by ignoring it dashboards will break when Graphite is decommissioned.

thcipriani added subscribers: hashar, thcipriani.

Do we have a solution for this yet? I fear that by ignoring it dashboards will break when Graphite is decommissioned.

As far as I know, we don't? It still seems like the statsd exporter is the right path here (unless we re-do zuul's metrics, which we don't have on a roadmap anywhere).

@hashar, was declining this task an indication that we don't need these metrics?

I must have declined this as part of a task triage, since I usually leave a comment when closing a task. For the Zuul metrics, yes, we still need them to monitor the service and analyze what is going on when it fails.

Maybe some kinds of metrics could be filtered out at the source, but honestly, after seven years I'd prefer not to touch Zuul anymore and to have it replaced entirely. That might be why I declined the task in the first place: there is no replacement solution for now.

colewhite changed the task status from Open to In Progress. Sep 6 2024, 9:56 PM
colewhite claimed this task.

Change #1072632 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] zuul: set statsd-exporter to relay to local statsite instance

https://gerrit.wikimedia.org/r/1072632

Change #1072633 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] zuul: send stats to prometheus-statsd-exporter

https://gerrit.wikimedia.org/r/1072633

Change #1072632 merged by Cwhite:

[operations/puppet@production] zuul: set statsd-exporter to relay to local statsite instance

https://gerrit.wikimedia.org/r/1072632

Change #1072633 merged by Cwhite:

[operations/puppet@production] zuul: send stats to prometheus-statsd-exporter

https://gerrit.wikimedia.org/r/1072633

Mentioned in SAL (#wikimedia-operations) [2024-09-26T16:09:05Z] <cwhite> systemctl restart zuul on contint1002 T233089

Change #1080400 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] ci: capture job completion timer metrics

https://gerrit.wikimedia.org/r/1080400

Change #1080400 merged by Cwhite:

[operations/puppet@production] ci: capture job completion timer metrics

https://gerrit.wikimedia.org/r/1080400

All dashboards in the Release Engineering folder have a Prometheus version, with the exception of a few panels in RelEng :: KPIs (waiting on T377273).

Zuul is effectively migrated at this point and the legacy dashboards have been retired. Calling this done.

FYI, the links at the bottom of https://integration.wikimedia.org/zuul/ (Job Stats section) still point to deprecated dashboards:

https://grafana.wikimedia.org/d/000000321/zuul?orgId=1
https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1

Change #1105779 had a related patch set uploaded (by Cwhite; author: Cwhite):

[integration/docroot@master] update dashboard links to new panels

https://gerrit.wikimedia.org/r/1105779

Change #1105779 merged by jenkins-bot:

[integration/docroot@master] update dashboard links to new panels

https://gerrit.wikimedia.org/r/1105779

Mentioned in SAL (#wikimedia-operations) [2024-12-19T19:51:13Z] <jforrester@deploy2002> Started deploy [integration/docroot@4701376]: I1ea9f34dc6176da4cca5da50c293bd5ff62661b8 for T233089

Mentioned in SAL (#wikimedia-operations) [2024-12-19T19:51:24Z] <jforrester@deploy2002> Finished deploy [integration/docroot@4701376]: I1ea9f34dc6176da4cca5da50c293bd5ff62661b8 for T233089 (duration: 00m 10s)

Thanks @Volans for pointing those out! With the latest deploy, those links are corrected.