
Export zuul metrics to Prometheus
Open, Medium, Public

Description

See the parent task for more context. tl;dr: we want zuul / gerrit metrics in Prometheus, ideally switching off statsd completely, or using statsd_exporter if Prometheus-native support isn't available.

For zuul here's the dashboard/alert audit:

monitoring::graphite_threshold{ 'zuul_gearman_wait_queue':
    ensure          => $ensure,
    description     => 'Work requests waiting in Zuul Gearman server',
    dashboard_links => ['https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1'],
    metric          => 'zuul.geard.queue.waiting',
    contact_group   => 'contint',
    from            => '15min',
    percentage      => 30,
    warning         => 90,
    critical        => 140,
    notes_link      => 'https://www.mediawiki.org/wiki/Continuous_integration/Zuul',
}
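For reference when porting this alert, here is a minimal Python sketch of the threshold logic the resource above appears to describe. It assumes the check fires when at least `percentage` percent of the datapoints in the `from` window exceed a threshold. That reading of `percentage`, and the function name `threshold_state`, are assumptions for illustration, not taken from the check's implementation.

```python
from typing import Optional, Sequence


def threshold_state(
    datapoints: Sequence[Optional[float]],
    warning: float,
    critical: float,
    percentage: float,
) -> str:
    """Classify a window of datapoints as OK, WARNING, CRITICAL or UNKNOWN.

    Assumption: a level fires when at least `percentage` percent of the
    non-null datapoints are strictly above that level's threshold.
    """
    values = [v for v in datapoints if v is not None]
    if not values:
        return "UNKNOWN"

    def percent_above(limit: float) -> float:
        return 100.0 * sum(1 for v in values if v > limit) / len(values)

    # Check the most severe level first.
    if percent_above(critical) >= percentage:
        return "CRITICAL"
    if percent_above(warning) >= percentage:
        return "WARNING"
    return "OK"


# Example with the values from the resource: warning=90, critical=140, 30%.
window = [10, 20, 150, 150, 150, 10, 10, 10, 10, 10]
state = threshold_state(window, warning=90, critical=140, percentage=30)
```

A Prometheus alerting rule would need to express the same "fraction of samples over the limit" idea over a 15 minute range, rather than comparing only the latest sample.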
Matched db/continuous-integration (Continuous Integration)
[ '(zuul.pipeline.*.current_changes',
  '(zuul.pipeline.*.all_jobs.count',
  '"zuul.geard.queue.waiting',
  '(zuul.pipeline.*.resident_time.mean' ]
Matched db/releng-kpis (RelEng :: KPIs)
[ '(zuul.pipeline.*.label.ci-*.wait_time.upper',
  '(zuul.pipeline.*.label.ci-*.wait_time.upper',
  '(zuul.pipeline.*.job.*.wait_time.mean',
  '(zuul.pipeline.*.job.*.wait_time.upper',
  '(zuul.pipeline.test-prio.operations.puppet.resident_time.p95, ',
  '(zuul.pipeline.test-prio.job.operations-puppet-tests-stretch-docker.wait_time.p95, ',
  '(zuul.pipeline.*.*.mean',
  '(zuul.pipeline.*.*.mean',
  '(zuul.pipeline.*.job.*.*.mean',
  '(zuul.pipeline.*.current_changes',
  '(zuul.pipeline.gate-and-submit.mediawiki.core.resident_time.upper, ',
  '(zuul.pipeline.gate-and-submit.mediawiki.core.resident_time.upper, ',
  '(zuul.pipeline.*.job.*.FAILURE.sum' ]
Matched db/zuul (Zuul)
[ '(zuul.pipeline.gate-and-submit.resident_time.upper, ',
  '(zuul.pipeline.gate-and-submit.resident_time.upper, ',
  '(zuul.pipeline.gate-and-submit.resident_time.median, ',
  '(zuul.pipeline.test-prio.resident_time.upper, ',
  '(zuul.pipeline.test-prio.resident_time.upper, ',
  '(zuul.pipeline.test-prio.resident_time.median, ',
  '(zuul.pipeline.*.all_jobs.count',
  '(zuul.pipeline.*.total_changes.count',
  '(zuul.pipeline.*.current_changes',
  '(zuul.pipeline.*.label.*.wait_time.count, 4, ',
  '(zuul.pipeline.*.label.*Docker.wait_time.count, 4, ',
  '(zuul.pipeline.*.label.*Docker.wait_time.count',
  '(zuul.pipeline.*.label.*Docker.wait_time.count, ',
  '(zuul.pipeline.*.label.*Docker.wait_time.upper, ',
  '(zuul.pipeline.*.label.*Docker.wait_time.upper, 4, ',
  '(zuul.pipeline.*.label.{Ubuntu*,DebianJessie',
  '(zuul.pipeline.*.label.{Ubuntu*,DebianJessie',
  '(zuul.pipeline.*.label.{Ubuntu*,DebianJessie,phpflavor-php*',
  '(zuul.pipeline.*.label.{Ubuntu*,DebianJessie',
  '(zuul.pipeline.*.label.{Ubuntu*,DebianJessie',
  '(zuul.pipeline.gate-and-submit.resident_time.upper, ',
  '(zuul.pipeline.gate-and-submit.resident_time.count, ',
  '(zuul.pipeline.gate-and-submit.resident_time.upper, ',
  '(zuul.pipeline.gate-and-submit.resident_time.upper, ',
  '(zuul.pipeline.gate-and-submit.resident_time.upper, ',
  '(zuul.pipeline.gate-and-submit.mediawiki.core.resident_time.upper, ',
  '(zuul.pipeline.gate-and-submit.mediawiki.core.resident_time.count, ',
  '(zuul.pipeline.test-prio.operations.puppet.resident_time.upper, ',
  '(zuul.pipeline.test-prio.operations.puppet.resident_time.count, ',
  '(zuul.pipeline.test-prio.operations.puppet.resident_time.median, ',
  '(zuul.pipeline.gate-and-submit.operations.mediawiki-config.resident_time.upper, ',
  '(zuul.pipeline.gate-and-submit.operations.mediawiki-config.resident_time.count, ',
  '(zuul.pipeline.gate-and-submit.operations.mediawiki-config.resident_time.mean, ',
  '(zuul.pipeline.test-prio.operations.puppet.resident_time.sum, ',
  '(zuul.pipeline.test-prio.operations.mediawiki-config.resident_time.sum, ',
  '(zuul.pipeline.gate-and-submit.resident_time.mean, ',
  '(zuul.pipeline.gate-and-submit.mediawiki.extensions.MobileFrontend.resident_time.count, ',
  '(zuul.pipeline.gate-and-submit.resident_time.count, ',
  '(zuul.pipeline.*.current_changes',
  '(zuul.pipeline.*.current_changes' ]
Checking ..............
Matched db/zuul-gearman (Zuul :: Gearman)
[ '"zuul.geard.queue.running',
  '"zuul.geard.queue.total',
  '"zuul.geard.queue.waiting',
  '(zuul.geard.queue.running, ',
  '(zuul.geard.queue.waiting, ',
  '(zuul.geard.queue.total, ',
  '"zuul.geard.queue.running',
  '"zuul.geard.queue.total',
  '"zuul.geard.queue.waiting',
  '(zuul.geard.queue.running, ',
  '(zuul.geard.queue.waiting, ',
  '(zuul.geard.queue.waiting, ',
  '(zuul.geard.queue.running, ',
  '(zuul.geard.workers,\\',
  '(zuul.geard.workers,\\',
  '(zuul.geard.workers, ',
  '(zuul.geard.workers, ',
  '(zuul.geard.queue.waiting',
  '(zuul.geard.queue.waiting',
  '"zuul.geard.queue.running',
  '"zuul.geard.queue.total',
  '"zuul.geard.queue.waiting',
  '(zuul.geard.queue.waiting, ',
  '(zuul.geard.queue.waiting, 5',
  '(zuul.geard.packet.CAN_DO.count, ',
  '(zuul.geard.packet.SUBMIT*.count, ' ]
Checking ...............
Matched db/zuul-pipeline (Zuul :: Pipeline)
[ '(zuul.pipeline.$pipeline.all_jobs.count, ',
  '"zuul.pipeline.*' ]
Checking ................
Matched db/zuul-job (Zuul job)
[ '(zuul.pipeline.$pipeline.job.$job.$status.count',
  '(zuul.pipeline.$pipeline.job.$job.$status.upper, 75',
  '(zuul.pipeline.$pipeline.job.$job.$status.upper, 95',
  '(zuul.pipeline.$pipeline.job.$job.$status.upper, 98',
  '(zuul.pipeline.$pipeline.job.$job.$status.count',
  '(zuul.pipeline.$pipeline.job.$job.$status.count',
  '(zuul.pipeline.*.job.*.$status.count, 2, 5',
  '"zuul.pipeline.*',
  '"zuul.pipeline.*.job.*' ]
Checking .................
Matched db/zuul-top-jobs (Zuul top jobs)
[ '(zuul.pipeline.*.job.*.wait_time.count, 15' ]

According to the parent task, Zuul also exports gerrit metrics:

Matched db/releng-gerrit (RelEng :: Gerrit)
[ '(gerrit.event.change-merged.sum, 10',
  '(gerrit.event.patchset-created.sum, 10',
  '(gerrit.event.comment-added.sum, 10',
  '(gerrit.event.comment-added.sum, 10' ]
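If we go the statsd_exporter route, the dotted names above would need to be mapped to Prometheus metric names plus labels. The Python sketch below only illustrates that translation. The mapping table, the output metric names and the label names are all hypothetical, and it assumes a glob `*` matches exactly one dot-separated component.

```python
import re
from typing import Dict, Optional, Tuple

# Hypothetical mappings: (glob, Prometheus metric name, label -> capture index).
MAPPINGS = [
    ("zuul.geard.queue.*", "zuul_geard_queue", {"state": 1}),
    ("zuul.pipeline.*.resident_time", "zuul_pipeline_resident_time", {"pipeline": 1}),
    ("zuul.pipeline.*.job.*.wait_time", "zuul_pipeline_job_wait_time",
     {"pipeline": 1, "job": 2}),
    ("gerrit.event.*", "gerrit_event", {"type": 1}),
]


def glob_to_regex(glob: str) -> "re.Pattern[str]":
    """Turn a dotted glob into a regex where '*' captures one component."""
    parts = ["([^.]+)" if part == "*" else re.escape(part) for part in glob.split(".")]
    return re.compile(r"\.".join(parts))


def translate(statsd_name: str) -> Optional[Tuple[str, Dict[str, str]]]:
    """Return (metric name, labels) for the first matching mapping, else None."""
    for glob, metric, labels in MAPPINGS:
        match = glob_to_regex(glob).fullmatch(statsd_name)
        if match:
            return metric, {label: match.group(idx) for label, idx in labels.items()}
    return None


print(translate("zuul.geard.queue.waiting"))
print(translate("gerrit.event.change-merged"))
# Project names such as mediawiki.core span two components, so this
# single-component glob does not match and the name falls through.
print(translate("zuul.pipeline.gate-and-submit.mediawiki.core.resident_time"))
```

The last case matters for the audit above: per-project names like `zuul.pipeline.gate-and-submit.mediawiki.core.resident_time` have a variable number of components, so they would probably need regex-style mappings or a different naming scheme.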

Event Timeline

Change 537362 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] ci: add statsd_exporter for zuul/gerrit

https://gerrit.wikimedia.org/r/537362

herron triaged this task as Medium priority. Sep 18 2019, 7:06 PM

Change 537362 abandoned by Filippo Giunchedi:
ci: add statsd_exporter for zuul/gerrit

Reason:
Duplicate of I27b3c86fbeb266, will follow up on that instead

https://gerrit.wikimedia.org/r/537362

Change 479139 had a related patch set uploaded (by Filippo Giunchedi; owner: Cwhite):
[operations/puppet@production] ci: define statsd prometheus exporter mappings

https://gerrit.wikimedia.org/r/479139

Is there anything we in RelEng can do to help with this work?

@Jdforrester-WMF Absolutely! If someone would be willing to review and give a stamp of (approval|disapproval) on https://gerrit.wikimedia.org/r/479139, that would unblock us.

Dzahn subscribed.

If we had this, we could maybe remove the Icinga process checks we have for zuul and zuul-merger (T334250), or replace them with Prometheus monitoring.

Do we have a solution for this yet? I fear that by ignoring it dashboards will break when Graphite is decommissioned.

thcipriani added subscribers: hashar, thcipriani.

Do we have a solution for this yet? I fear that by ignoring it dashboards will break when Graphite is decommissioned.

As far as I know, we don't? It still seems like the statsd exporter is the right path here (unless we re-do zuul's metrics, which we don't have on a roadmap anywhere).

@hashar, was declining this an indication that we don't need these metrics?