Export zuul metrics to Prometheus
Open, Medium, Public

Description

See the parent task for more context. In short, we want Zuul and Gerrit metrics in Prometheus, ideally switching off statsd completely, or using statsd_exporter if Prometheus-native support isn't available.

For Zuul, here's the dashboard/alert audit:

monitoring::graphite_threshold{ 'zuul_gearman_wait_queue':
    ensure          => $ensure,
    description     => 'Work requests waiting in Zuul Gearman server',
    dashboard_links => ['https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1'],
    metric          => 'zuul.geard.queue.waiting',
    contact_group   => 'contint',
    from            => '15min',
    percentage      => 30,
    warning         => 90,
    critical        => 140,
    notes_link      => 'https://www.mediawiki.org/wiki/Continuous_integration/Zuul',
}
Matched db/continuous-integration (Continuous Integration)
[ '(zuul.pipeline.*.current_changes',
  '(zuul.pipeline.*.all_jobs.count',
  '"zuul.geard.queue.waiting',
  '(zuul.pipeline.*.resident_time.mean' ]
Matched db/releng-kpis (RelEng :: KPIs)
[ '(zuul.pipeline.*.label.ci-*.wait_time.upper',
  '(zuul.pipeline.*.label.ci-*.wait_time.upper',
  '(zuul.pipeline.*.job.*.wait_time.mean',
  '(zuul.pipeline.*.job.*.wait_time.upper',
  '(zuul.pipeline.test-prio.operations.puppet.resident_time.p95, ',
  '(zuul.pipeline.test-prio.job.operations-puppet-tests-stretch-docker.wait_time.p95, ',
  '(zuul.pipeline.*.*.mean',
  '(zuul.pipeline.*.*.mean',
  '(zuul.pipeline.*.job.*.*.mean',
  '(zuul.pipeline.*.current_changes',
  '(zuul.pipeline.gate-and-submit.mediawiki.core.resident_time.upper, ',
  '(zuul.pipeline.gate-and-submit.mediawiki.core.resident_time.upper, ',
  '(zuul.pipeline.*.job.*.FAILURE.sum' ]
Matched db/zuul (Zuul)
[ '(zuul.pipeline.gate-and-submit.resident_time.upper, ',
  '(zuul.pipeline.gate-and-submit.resident_time.upper, ',
  '(zuul.pipeline.gate-and-submit.resident_time.median, ',
  '(zuul.pipeline.test-prio.resident_time.upper, ',
  '(zuul.pipeline.test-prio.resident_time.upper, ',
  '(zuul.pipeline.test-prio.resident_time.median, ',
  '(zuul.pipeline.*.all_jobs.count',
  '(zuul.pipeline.*.total_changes.count',
  '(zuul.pipeline.*.current_changes',
  '(zuul.pipeline.*.label.*.wait_time.count, 4, ',
  '(zuul.pipeline.*.label.*Docker.wait_time.count, 4, ',
  '(zuul.pipeline.*.label.*Docker.wait_time.count',
  '(zuul.pipeline.*.label.*Docker.wait_time.count, ',
  '(zuul.pipeline.*.label.*Docker.wait_time.upper, ',
  '(zuul.pipeline.*.label.*Docker.wait_time.upper, 4, ',
  '(zuul.pipeline.*.label.{Ubuntu*,DebianJessie',
  '(zuul.pipeline.*.label.{Ubuntu*,DebianJessie',
  '(zuul.pipeline.*.label.{Ubuntu*,DebianJessie,phpflavor-php*',
  '(zuul.pipeline.*.label.{Ubuntu*,DebianJessie',
  '(zuul.pipeline.*.label.{Ubuntu*,DebianJessie',
  '(zuul.pipeline.gate-and-submit.resident_time.upper, ',
  '(zuul.pipeline.gate-and-submit.resident_time.count, ',
  '(zuul.pipeline.gate-and-submit.resident_time.upper, ',
  '(zuul.pipeline.gate-and-submit.resident_time.upper, ',
  '(zuul.pipeline.gate-and-submit.resident_time.upper, ',
  '(zuul.pipeline.gate-and-submit.mediawiki.core.resident_time.upper, ',
  '(zuul.pipeline.gate-and-submit.mediawiki.core.resident_time.count, ',
  '(zuul.pipeline.test-prio.operations.puppet.resident_time.upper, ',
  '(zuul.pipeline.test-prio.operations.puppet.resident_time.count, ',
  '(zuul.pipeline.test-prio.operations.puppet.resident_time.median, ',
  '(zuul.pipeline.gate-and-submit.operations.mediawiki-config.resident_time.upper, ',
  '(zuul.pipeline.gate-and-submit.operations.mediawiki-config.resident_time.count, ',
  '(zuul.pipeline.gate-and-submit.operations.mediawiki-config.resident_time.mean, ',
  '(zuul.pipeline.test-prio.operations.puppet.resident_time.sum, ',
  '(zuul.pipeline.test-prio.operations.mediawiki-config.resident_time.sum, ',
  '(zuul.pipeline.gate-and-submit.resident_time.mean, ',
  '(zuul.pipeline.gate-and-submit.mediawiki.extensions.MobileFrontend.resident_time.count, ',
  '(zuul.pipeline.gate-and-submit.resident_time.count, ',
  '(zuul.pipeline.*.current_changes',
  '(zuul.pipeline.*.current_changes' ]
Checking ..............
Matched db/zuul-gearman (Zuul :: Gearman)
[ '"zuul.geard.queue.running',
  '"zuul.geard.queue.total',
  '"zuul.geard.queue.waiting',
  '(zuul.geard.queue.running, ',
  '(zuul.geard.queue.waiting, ',
  '(zuul.geard.queue.total, ',
  '"zuul.geard.queue.running',
  '"zuul.geard.queue.total',
  '"zuul.geard.queue.waiting',
  '(zuul.geard.queue.running, ',
  '(zuul.geard.queue.waiting, ',
  '(zuul.geard.queue.waiting, ',
  '(zuul.geard.queue.running, ',
  '(zuul.geard.workers,\\',
  '(zuul.geard.workers,\\',
  '(zuul.geard.workers, ',
  '(zuul.geard.workers, ',
  '(zuul.geard.queue.waiting',
  '(zuul.geard.queue.waiting',
  '"zuul.geard.queue.running',
  '"zuul.geard.queue.total',
  '"zuul.geard.queue.waiting',
  '(zuul.geard.queue.waiting, ',
  '(zuul.geard.queue.waiting, 5',
  '(zuul.geard.packet.CAN_DO.count, ',
  '(zuul.geard.packet.SUBMIT*.count, ' ]
Checking ...............
Matched db/zuul-pipeline (Zuul :: Pipeline)
[ '(zuul.pipeline.$pipeline.all_jobs.count, ',
  '"zuul.pipeline.*' ]
Checking ................
Matched db/zuul-job (Zuul job)
[ '(zuul.pipeline.$pipeline.job.$job.$status.count',
  '(zuul.pipeline.$pipeline.job.$job.$status.upper, 75',
  '(zuul.pipeline.$pipeline.job.$job.$status.upper, 95',
  '(zuul.pipeline.$pipeline.job.$job.$status.upper, 98',
  '(zuul.pipeline.$pipeline.job.$job.$status.count',
  '(zuul.pipeline.$pipeline.job.$job.$status.count',
  '(zuul.pipeline.*.job.*.$status.count, 2, 5',
  '"zuul.pipeline.*',
  '"zuul.pipeline.*.job.*' ]
Checking .................
Matched db/zuul-top-jobs (Zuul top jobs)
[ '(zuul.pipeline.*.job.*.wait_time.count, 15' ]
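For anyone porting the `zuul_gearman_wait_queue` alert at the top of the audit, here is a rough sketch of the logic it appears to encode. It assumes that `percentage` means "the share of datapoints in the `from` window that exceed the threshold" and that a share equal to `percentage` already fires; both are assumptions about the check, and the function name `threshold_state` is made up for illustration.

```python
from typing import Optional, Sequence


def threshold_state(
    points: Sequence[Optional[float]],
    warning: float,
    critical: float,
    percentage: float,
) -> str:
    """Classify one window of datapoints as OK, WARNING, CRITICAL or UNKNOWN.

    Assumption: the alert fires when at least `percentage` percent of the
    non-null datapoints in the window are strictly above the threshold.
    """
    # Graphite reports gaps as null; ignore them rather than count them as 0.
    valid = [p for p in points if p is not None]
    if not valid:
        return "UNKNOWN"

    def share_above(limit: float) -> float:
        # Percentage of valid datapoints strictly above `limit`.
        return 100.0 * sum(1 for p in valid if p > limit) / len(valid)

    # Check the more severe level first so it wins when both apply.
    if share_above(critical) >= percentage:
        return "CRITICAL"
    if share_above(warning) >= percentage:
        return "WARNING"
    return "OK"


# Values from the alert definition: warning 90, critical 140, percentage 30.
# The window below stands in for 15 one-minute datapoints (made-up data).
window = [50.0] * 10 + [100.0] * 5
state = threshold_state(window, warning=90, critical=140, percentage=30)
```

With the made-up window above, 5 of 15 datapoints exceed 90 and none exceed 140, so the sketch reports a warning. A Prometheus alert would need to express the same "share of samples over the window" idea rather than a plain instantaneous comparison.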

According to the parent task, Zuul also exports Gerrit metrics:

Matched db/releng-gerrit (RelEng :: Gerrit)
[ '(gerrit.event.change-merged.sum, 10',
  '(gerrit.event.patchset-created.sum, 10',
  '(gerrit.event.comment-added.sum, 10',
  '(gerrit.event.comment-added.sum, 10' ]
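To make the statsd_exporter option more concrete, here is a small sketch of the kind of glob mapping involved: dotted statsd names are matched against patterns, and each `*` is captured into a Prometheus label. The rules, Prometheus metric names, and label names below are hypothetical and are not taken from any of the patches linked on this task. It also assumes that suffixes such as `.count`, `.upper` and `.sum` in the audit are aggregations added on the statsd/Graphite side, so the raw names do not carry them.

```python
import re
from typing import Dict, List, Optional, Tuple

# Hypothetical rules: (statsd glob, Prometheus metric name, label templates).
# "$1", "$2", ... refer to the first, second, ... "*" in the glob.
RULES: List[Tuple[str, str, Dict[str, str]]] = [
    ("zuul.geard.queue.*", "zuul_geard_queue", {"state": "$1"}),
    (
        "zuul.pipeline.*.job.*.wait_time",
        "zuul_pipeline_job_wait_time",
        {"pipeline": "$1", "job": "$2"},
    ),
    ("gerrit.event.*", "gerrit_event", {"type": "$1"}),
]


def glob_to_regex(pattern: str) -> "re.Pattern[str]":
    """Turn a dotted glob into a regex where "*" matches one dot-free component."""
    parts = [r"([^.]+)" if part == "*" else re.escape(part) for part in pattern.split(".")]
    return re.compile(r"\.".join(parts))


def map_metric(statsd_name: str) -> Optional[Tuple[str, Dict[str, str]]]:
    """Return (prometheus_name, labels) for the first matching rule, else None."""
    for pattern, name, label_templates in RULES:
        match = glob_to_regex(pattern).fullmatch(statsd_name)
        if match is None:
            continue
        # Substitute each "$N" in the template with the N-th captured component.
        labels = {
            key: re.sub(r"\$(\d+)", lambda ref: match.group(int(ref.group(1))), template)
            for key, template in label_templates.items()
        }
        return name, labels
    return None


queue = map_metric("zuul.geard.queue.waiting")
job = map_metric(
    "zuul.pipeline.test-prio.job.operations-puppet-tests-stretch-docker.wait_time"
)
event = map_metric("gerrit.event.change-merged")
```

One thing the sketch makes visible: because `*` matches a single dot-free component, per-project names such as `zuul.pipeline.gate-and-submit.mediawiki.core.resident_time` (where the project itself contributes several components) would need either one rule per depth or a regex-style match.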

Event Timeline

Restricted Application added a subscriber: Aklapper. · Sep 17 2019, 8:47 AM
fgiunchedi updated the task description. · Sep 17 2019, 8:54 AM

Change 537362 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] ci: add statsd_exporter for zuul/gerrit

https://gerrit.wikimedia.org/r/537362

herron triaged this task as Medium priority. · Sep 18 2019, 7:06 PM

Change 537362 abandoned by Filippo Giunchedi:
ci: add statsd_exporter for zuul/gerrit

Reason:
Duplicate of I27b3c86fbeb266, will followup on that instead

https://gerrit.wikimedia.org/r/537362

Change 479139 had a related patch set uploaded (by Filippo Giunchedi; owner: Cwhite):
[operations/puppet@production] ci: define statsd prometheus exporter mappings

https://gerrit.wikimedia.org/r/479139

Is there anything we in RelEng can do to help with this work?

@Jdforrester-WMF Absolutely! If someone would be willing to review https://gerrit.wikimedia.org/r/479139 and give it a stamp of (approval|disapproval), that would unblock us.

fgiunchedi moved this task from Inbox to Backlog on the observability board. · Apr 6 2020, 12:35 PM
lmata moved this task from Externally blocked to Radar on the observability board. · Sep 21 2020, 8:28 PM