Leverage Grafana annotations to show events in graphs
Closed, Resolved; Public

Description

Investigate possibility of using logging cluster events as grafana annotations (https://grafana.com/docs/reference/annotations/)

Some ideas for possibly useful events:

  • Puppet merges and runs
  • Icinga alerts
  • SAL entries
  • Deploys
  • Downtimes

Details

Repo                                        Branch      Lines +/-
operations/puppet                           production  +110 -0
operations/puppet                           production  +25 -3
operations/puppet                           production  +24 -11
operations/puppet                           production  +207 -0
schemas/event/secondary                     master      +222 -0
labs/tools/stashbot                         master      +21 -0
operations/puppet                           production  +119 -1
operations/puppet                           production  +14 -2
operations/puppet                           production  +54 -0
operations/puppet                           production  +4 -1
operations/puppet                           production  +6 -0
operations/puppet                           production  +4 -0
operations/puppet                           production  +62 -2
operations/puppet                           production  +15 -3
operations/puppet                           production  +5 -0
operations/puppet                           production  +103 -27
operations/puppet                           production  +88 -0
operations/puppet                           production  +6 -1
operations/puppet                           production  +2 -1
operations/puppet                           production  +6 -0
operations/puppet                           production  +12 -37
operations/puppet                           production  +81 -50
operations/software/ecs                     master      +56 -0
operations/software/ecs                     master      +51 -1
operations/puppet                           production  +9 -1
operations/debs/prometheus-es-exporter      debian/sid  +376 -0
integration/config                          master      +1 -1
integration/config                          master      +7 -0
operations/puppet                           production  +32 -0
operations/docker-images/production-images  master      +98 -0

Event Timeline

colewhite triaged this task as Medium priority. May 8 2019, 5:12 PM
colewhite lowered the priority of this task from Medium to Low.

Loki looks like a feasible option to try given the resource constraints on the Grafana VM. It appears there is headroom on the host as long as we keep events reasonably low-traffic.

Change 597317 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/docker-images/production-images@master] add loki 1.4.1

https://gerrit.wikimedia.org/r/597317

For what it's worth, and for Kubernetes deploys specifically, we have an annotation in Grafana that works most of the time but can easily fail us. It's a simple query:

resets((sum(service_runner_request_duration_seconds_count{service="$service"}))[1m:]) > bool 0

It has at least 2 drawbacks I've identified in the short time I've been using it:

  • It is service-runner specific (and in a very specific configuration)
  • Counters can be reset multiple times during one deploy: when a deploy restarts multiple pods over a timeframe longer than 1m, a single deploy produces multiple annotations, which is not ideal.
  • The inverse also holds: if a counter is reset multiple times within 1m (the polling interval of Prometheus), we don't catch those deploys either.
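To make the second drawback concrete, here is a minimal sketch that mimics `resets(...) > bool 0` on an already-summed counter. It is not how Prometheus or Grafana are implemented: the sample values, the two-sample window, and the assumption that each contiguous run of 1s shows up as one annotation are all illustrative.

```python
from typing import List, Sequence


def count_resets(samples: Sequence[float]) -> int:
    """Count decreases between consecutive samples, similar to PromQL resets()."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur < prev)


def annotation_flags(series: Sequence[float], window: int) -> List[int]:
    """For each step, emit 1 if the trailing window of samples contains a reset.

    This mirrors `resets(...[window]) > bool 0`; `window` is counted in
    samples here, which is a simplification of a time-based range.
    """
    flags = []
    for i in range(len(series)):
        start = max(0, i - window + 1)
        flags.append(1 if count_resets(series[start:i + 1]) > 0 else 0)
    return flags


def count_annotations(flags: Sequence[int]) -> int:
    """Count contiguous runs of 1s (assumed to render as one annotation each)."""
    return sum(
        1 for i, flag in enumerate(flags)
        if flag == 1 and (i == 0 or flags[i - 1] == 0)
    )


# Hypothetical summed counter for one deploy that restarts two pods
# several samples apart: the sum drops at index 3 and again at index 7.
series = [100, 110, 120, 70, 80, 90, 100, 60, 70, 80]
flags = annotation_flags(series, window=2)
print(flags)                     # [0, 0, 0, 1, 0, 0, 0, 1, 0, 0]
print(count_annotations(flags))  # 2
```

One deploy yields two separate annotation runs because the restarts are further apart than the window.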

An approach I have been thinking about: since helmfile supports hooks, have a hook emit a statsd line to a local prometheus-statsd-exporter, scrape that from Prometheus, and use it as an annotation. That it is statsd is just an implementation detail, of course; it is simply the approach that is easy to try and has already been tried. Since we can run arbitrary commands in that hook [1] (albeit without always having all the info we would like in the environment), we can use other methods of sending that information.

[1] https://github.com/roboll/helmfile#hooks
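As a rough sketch of what such a hook command could run, the snippet below formats a statsd counter line and sends it over UDP. The metric name `deploy.<service>`, the service name, and the address `127.0.0.1:9125` (assumed to be where a local prometheus-statsd-exporter listens) are illustrative assumptions, not values taken from this task.

```python
import socket


def format_statsd_counter(name: str, value: int = 1) -> str:
    """Build a statsd counter line of the form `<name>:<value>|c`."""
    return f"{name}:{value}|c"


def emit_deploy_event(service: str, host: str = "127.0.0.1", port: int = 9125) -> str:
    """Send one counter increment for a deploy of `service` over UDP.

    The metric name scheme and the default host/port are assumptions
    made for this sketch.
    """
    line = format_statsd_counter(f"deploy.{service}")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))
    return line


if __name__ == "__main__":
    # "example-service" is a placeholder; a hook would pass the release name.
    print(emit_deploy_event("example-service"))
```

Because UDP is fire-and-forget, the hook would not fail the deploy if the exporter is down, but the event would be silently lost.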

The downside of using helmfile hooks is that we would catch the trigger, not the actual event. So deploys triggered by rollbacks, for example, would not be recognized. We should maybe ask the Kubernetes API; it should know best. :-)

There is a "kube-state-metrics" which exposes a lot of metrics about the state of various API objects. See https://github.com/kubernetes/kube-state-metrics/blob/master/docs/deployment-metrics.md for example.
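One way this could be used, sketched under assumptions: kube-state-metrics exposes `kube_deployment_status_observed_generation`, and a change in that value between scrapes indicates that the deployment spec changed, whatever triggered it. The function below finds those change points in a list of (timestamp, generation) samples; the sample data is made up. Non-deploy spec changes such as scaling would also be flagged.

```python
from typing import List, Sequence, Tuple


def generation_change_timestamps(samples: Sequence[Tuple[int, int]]) -> List[int]:
    """Return the timestamps at which the observed generation changed.

    `samples` is a time-ordered sequence of (unix_timestamp, generation)
    pairs, e.g. scraped values of kube_deployment_status_observed_generation.
    """
    events = []
    for (_, prev_gen), (ts, gen) in zip(samples, samples[1:]):
        if gen != prev_gen:
            events.append(ts)
    return events


# Hypothetical scrape results at 60s intervals.
samples = [(0, 4), (60, 4), (120, 5), (180, 5), (240, 6)]
print(generation_change_timestamps(samples))  # [120, 240]
```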

Change 597317 merged by Cwhite:
[operations/docker-images/production-images@master] add loki 1.5.0

https://gerrit.wikimedia.org/r/597317

Change 602490 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] profile: add loki output support to the logstash pipeline

https://gerrit.wikimedia.org/r/602490

Change 602729 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] profile: add loki_event filter script

https://gerrit.wikimedia.org/r/602729

Change 602730 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[integration/config@master] add filter_scripts volume mount to logstash-filter-verifier job

https://gerrit.wikimedia.org/r/602730

Change 602729 merged by Cwhite:
[operations/puppet@production] profile: add loki_event filter script

https://gerrit.wikimedia.org/r/602729

Change 605343 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] service::docker: enhance volume support

https://gerrit.wikimedia.org/r/605343

Change 602730 merged by jenkins-bot:
[integration/config@master] Add filter_scripts volume mount to logstash-filter-verifier job

https://gerrit.wikimedia.org/r/602730

Change 610119 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] jjb: volume mount for logstash must be absolute path

https://gerrit.wikimedia.org/r/610119

Change 610119 merged by jenkins-bot:
[integration/config@master] jjb: volume mount for logstash must be absolute path

https://gerrit.wikimedia.org/r/610119

Change 616811 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] hiera: specify tlsproxy configuration for grafana

https://gerrit.wikimedia.org/r/616811

Change 616851 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] provision loki on grafana-next

https://gerrit.wikimedia.org/r/616851

Change 617250 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/debs/prometheus-es-exporter@debian/sid] debianization

https://gerrit.wikimedia.org/r/617250

Change 617250 merged by Cwhite:
[operations/debs/prometheus-es-exporter@debian/sid] debianization

https://gerrit.wikimedia.org/r/617250

Change 719056 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/puppet@production] puppet_agent_stats: add catalog version to prom metricts

https://gerrit.wikimedia.org/r/719056

In relation to Puppet, I think we could look again at creating a Puppet logstash report. This was never pushed to production due to concerns about sending the full Puppet catalogue diff to logstash. However, I think we should be able to ensure we only send metadata and not the actual diffs.

Change 719056 merged by Jbond:

[operations/puppet@production] puppet_agent_stats: add catalog version to prom metrics

https://gerrit.wikimedia.org/r/719056

Change 719372 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/puppet@production] P:puppetmaster::common: Add back logstash support

https://gerrit.wikimedia.org/r/719372

Change 719368 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/puppet@production] puppetmaster: drop log messages from logstash reporter

https://gerrit.wikimedia.org/r/719368

Change 722580 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/software/ecs@master] git - schema: Add new schema for adding git information

https://gerrit.wikimedia.org/r/722580

Change 722873 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/software/ecs@master] schemas - metrics: Add puppet keys to the metrics name space

https://gerrit.wikimedia.org/r/722873

Change 722580 merged by jenkins-bot:

[operations/software/ecs@master] git - schema: Add new schema for adding git information

https://gerrit.wikimedia.org/r/722580

Change 722873 merged by jenkins-bot:

[operations/software/ecs@master] schemas - metrics: Add puppet keys to the metrics name space

https://gerrit.wikimedia.org/r/722873

Change 719368 merged by Jbond:

[operations/puppet@production] puppetmaster: drop log messages from logstash reporter

https://gerrit.wikimedia.org/r/719368

Change 719372 merged by Jbond:

[operations/puppet@production] P:puppetmaster::common: Add back logstash support

https://gerrit.wikimedia.org/r/719372

Change 734961 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/puppet@production] puppetmaster: enable logstash reports

https://gerrit.wikimedia.org/r/734961

Change 734961 merged by Jbond:

[operations/puppet@production] puppetmaster: enable logstash reports

https://gerrit.wikimedia.org/r/734961

Change 736233 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] P:rsyslog: ship puppetmaster logs to kafka

https://gerrit.wikimedia.org/r/736233

Change 736233 merged by Jbond:

[operations/puppet@production] P:rsyslog: ship puppetmaster logs to kafka

https://gerrit.wikimedia.org/r/736233

Change 804484 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: ship scap.announce channel to loki

https://gerrit.wikimedia.org/r/804484

Change 806349 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: duplicate alert logs for loki target

https://gerrit.wikimedia.org/r/806349

Change 806430 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: alertmanager use logsource as source for host.name field

https://gerrit.wikimedia.org/r/806430

Change 806430 merged by Cwhite:

[operations/puppet@production] logstash: alertmanager use logsource as source for host.name field

https://gerrit.wikimedia.org/r/806430

Change 809302 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] loki: add loki as an optional grafana component

https://gerrit.wikimedia.org/r/809302

Change 804484 merged by Cwhite:

[operations/puppet@production] logstash: duplicate scap.announce logs for loki target

https://gerrit.wikimedia.org/r/804484

Change 809302 merged by Cwhite:

[operations/puppet@production] loki: add loki as an optional grafana component

https://gerrit.wikimedia.org/r/809302

Change 809706 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] beta-logs: add minimal grafana config

https://gerrit.wikimedia.org/r/809706

Change 809706 merged by Cwhite:

[operations/puppet@production] beta-logs: add minimal grafana config

https://gerrit.wikimedia.org/r/809706

Change 809709 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] loki: add ferm rule to control api access

https://gerrit.wikimedia.org/r/809709

Change 809722 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: add loki output support

https://gerrit.wikimedia.org/r/809722

Change 809709 merged by Cwhite:

[operations/puppet@production] loki: add ferm service to control api access

https://gerrit.wikimedia.org/r/809709

Change 809722 merged by Cwhite:

[operations/puppet@production] logstash: add loki output support

https://gerrit.wikimedia.org/r/809722

Change 810064 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] beta-logs: set loki retention to 3d

https://gerrit.wikimedia.org/r/810064

Change 810064 merged by Cwhite:

[operations/puppet@production] beta-logs: set loki retention to 3d

https://gerrit.wikimedia.org/r/810064

Change 810110 had a related patch set uploaded (by Cwhite; author: Cwhite):

[labs/tools/stashbot@master] Add support for posting events to eventgate

https://gerrit.wikimedia.org/r/810110

Change 810115 had a related patch set uploaded (by Cwhite; author: Cwhite):

[schemas/event/secondary@master] Add logging/sal/1.0.0 schema

https://gerrit.wikimedia.org/r/810115

Change 813715 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] profile: make loki data directory configurable

https://gerrit.wikimedia.org/r/813715

Change 813724 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] hiera: deploy and enable loki on grafana hosts

https://gerrit.wikimedia.org/r/813724

Change 813985 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] loki-beta: increase grpc message size

https://gerrit.wikimedia.org/r/813985

Change 813985 merged by Cwhite:

[operations/puppet@production] loki-beta: increase grpc message size

https://gerrit.wikimedia.org/r/813985

Change 813715 merged by Cwhite:

[operations/puppet@production] profile: make loki data directory configurable

https://gerrit.wikimedia.org/r/813715

Change 813724 merged by Cwhite:

[operations/puppet@production] hiera: deploy and enable loki on grafana hosts

https://gerrit.wikimedia.org/r/813724

Change 814915 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: enable loki public output on production

https://gerrit.wikimedia.org/r/814915

Change 814915 merged by Cwhite:

[operations/puppet@production] logstash: enable loki public output on production

https://gerrit.wikimedia.org/r/814915

colewhite changed the task status from Open to In Progress. Aug 5 2022, 10:38 PM
colewhite added a subscriber: thcipriani.

We've enabled the Public Logs datasource in Grafana and forwarded scap.announce logs to it.
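For readers unfamiliar with how an event ends up in Loki, the sketch below builds a JSON body in the shape accepted by Loki's HTTP push endpoint (`/loki/api/v1/push`). It only illustrates the data format: the deployed pipeline uses the logstash Loki output plugin instead, and the label names and the log line here are hypothetical.

```python
import json
import time
from typing import Dict, Optional


def build_loki_push_payload(
    line: str, labels: Dict[str, str], ts_ns: Optional[int] = None
) -> str:
    """Serialize one log line as a Loki push request body.

    Loki expects a list of streams, each with a label set and a list of
    [timestamp_in_nanoseconds_as_string, line] pairs.
    """
    if ts_ns is None:
        ts_ns = time.time_ns()
    payload = {
        "streams": [
            {"stream": labels, "values": [[str(ts_ns), line]]},
        ]
    }
    return json.dumps(payload)


# Hypothetical label and message; the real label names are not given in this task.
body = build_loki_push_payload(
    "Finished deploy of example-service",
    {"channel": "scap.announce"},
    ts_ns=1659739080000000000,
)
print(body)
```

A Grafana annotation query against such a datasource would then select events by label.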

Change 806349 merged by Cwhite:

[operations/puppet@production] logstash: duplicate alert logs for loki target

https://gerrit.wikimedia.org/r/806349

Change 810110 abandoned by Cwhite:

[labs/tools/stashbot@master] Add support for posting events to eventgate

Reason:

https://gerrit.wikimedia.org/r/810110

Change 810115 abandoned by Cwhite:

[schemas/event/secondary@master] Add logging/sal/1.0.0 schema

Reason:

https://gerrit.wikimedia.org/r/810115

MVP achieved. Further iterations and features should come in separately.

Change 602490 abandoned by Cwhite:

[operations/puppet@production] profile: add loki output support to the logstash pipeline

Reason:

in favor of using the loki output plugin

https://gerrit.wikimedia.org/r/602490

Change 605343 abandoned by Cwhite:

[operations/puppet@production] service::docker: enhance volume support

Reason:

we packaged loki in a deb package instead

https://gerrit.wikimedia.org/r/605343

Change 616811 abandoned by Cwhite:

[operations/puppet@production] hiera: specify tlsproxy configuration for grafana

Reason:

https://gerrit.wikimedia.org/r/616811

Change 616851 abandoned by Cwhite:

[operations/puppet@production] provision loki on grafana-next

Reason:

https://gerrit.wikimedia.org/r/616851