Leverage Grafana annotations to show events in graphs
Closed, ResolvedPublic

Assigned To

Authored By

	colewhite
	May 8 2019, 5:11 PM

Description

Investigate possibility of using logging cluster events as grafana annotations (https://grafana.com/docs/reference/annotations/)

Some ideas for possibly useful events:

Puppet merges and runs
Icinga alerts
SAL entries
Deploys
Downtimes

Details

Subject	Repo	Branch	Lines +/-
provision loki on grafana-next	operations/puppet	production	+110 -0
hiera: specify tlsproxy configuration for grafana	operations/puppet	production	+25 -3
service::docker: enhance volume support	operations/puppet	production	+24 -11
profile: add loki output support to the logstash pipeline	operations/puppet	production	+207 -0
Add logging/sal/1.0.0 schema	schemas/event/secondary	master	+222 -0
Add support for posting events to eventgate	labs/tools/stashbot	master	+21 -0
logstash: duplicate alert logs for loki target	operations/puppet	production	+119 -1
logstash: enable loki public output on production	operations/puppet	production	+14 -2
hiera: deploy and enable loki on grafana hosts	operations/puppet	production	+54 -0
profile: make loki data directory configurable	operations/puppet	production	+4 -1
loki-beta: increase grpc message size	operations/puppet	production	+6 -0
beta-logs: set loki retention to 3d	operations/puppet	production	+4 -0
logstash: add loki output support	operations/puppet	production	+62 -2
loki: add ferm service to control api access	operations/puppet	production	+15 -3
beta-logs: add minimal grafana config	operations/puppet	production	+5 -0
loki: add loki as an optional grafana component	operations/puppet	production	+103 -27
logstash: duplicate scap.announce logs for loki target	operations/puppet	production	+88 -0
logstash: alertmanager use logsource as source for host.name field	operations/puppet	production	+6 -1
P:rsyslog: ship puppetmaster logs to kafka	operations/puppet	production	+2 -1
puppetmaster: enable logstash reports	operations/puppet	production	+6 -0
P:puppetmaster::common: Add back logstash support	operations/puppet	production	+12 -37
puppetmaster: drop log messages from logstash reporter	operations/puppet	production	+81 -50
schemas - metrics: Add puppet keys to the metrics name space	operations/software/ecs	master	+56 -0
git - schema: Add new schema for adding git information	operations/software/ecs	master	+51 -1
puppet_agent_stats: add catalog version to prom metrics	operations/puppet	production	+9 -1
debianization	operations/debs/prometheus-es-exporter	debian/sid	+376 -0
jjb: volume mount for logstash must be absolute path	integration/config	master	+1 -1
Add filter_scripts volume mount to logstash-filter-verifier job	integration/config	master	+7 -0
profile: add loki_event filter script	operations/puppet	production	+32 -0
add loki 1.5.0	operations/docker-images/production-images	master	+98 -0

Related Objects

Status	Assigned	Task
Resolved	colewhite	T222826 Leverage Grafana annotations to show events in graphs
Resolved	colewhite	T174172 unused grafana-dashboard indices on elasticsearch / logstash
Resolved	akosiaris	T257226 Please create operations/debs/grafana-loki gerrit repository
Resolved	colewhite	T257861 Pipe SAL entries into Logstash
Open	None	T350825 Loki: add a channel(s) for git commits

Event Timeline


Restricted Application added a subscriber: Aklapper. May 8 2019, 5:11 PM

colewhite triaged this task as Medium priority. May 8 2019, 5:12 PM

colewhite lowered the priority of this task from Medium to Low.

colewhite added a subtask: T174172: unused grafana-dashboard indices on elasticsearch / logstash.

Volans mentioned this in T223934: Add annotations from ops vendor maintenance calendar to Grafana. Jan 29 2020, 2:59 PM

fgiunchedi moved this task from Inbox to Backlog on the observability board. Apr 6 2020, 12:41 PM

colewhite added a subtask: T223934: Add annotations from ops vendor maintenance calendar to Grafana. May 15 2020, 4:03 PM

Loki looks like a feasible option to try given the resource constraints on the Grafana VM. It appears there is headroom on the host as long as we keep events reasonably low-traffic.

colewhite claimed this task. May 15 2020, 4:11 PM

Change 597317 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/docker-images/production-images@master] add loki 1.4.1

https://gerrit.wikimedia.org/r/597317

gerritbot added a project: Patch-For-Review. May 19 2020, 5:45 PM

akosiaris subscribed. May 20 2020, 1:57 PM

For what it's worth, and for kubernetes deploys specifically, we have an annotation in grafana that works most of the time, but can easily fail us. It's a simple

resets((sum(service_runner_request_duration_seconds_count{service="$service"}))[1m:]) > bool 0

It has at least 2 drawbacks I've identified in the short time I've been using it:

It is service-runner specific (and in a very specific configuration)
Counters can be reset multiple times during one deploy, e.g. when a deploy restarts multiple pods over a timeframe longer than 1m, so 1 deploy == multiple annotations, which is not ideal.
The inverse of the above: if a counter has been reset multiple times within 1m (which is the polling interval of prometheus), we don't catch those deploys either.
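
To make the two counter-reset drawbacks concrete, here is a minimal Python sketch that imitates a resets()-style check over a trailing window of samples. It is a simplified model, not Prometheus itself: the sample values, the one-sample-per-scrape assumption, and the helper names count_resets and annotation_flags are made up for illustration.

```python
def count_resets(samples):
    """Count decreases between consecutive samples, similar in spirit to
    what PromQL's resets() reports for a range of counter samples."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur < prev)


def annotation_flags(series, window):
    """For each evaluation step, return 1 if the trailing window saw a reset.

    `series` holds one (hypothetical) sample per scrape; `window` is the
    number of samples covered by the lookback range.
    """
    flags = []
    for end in range(1, len(series) + 1):
        start = max(0, end - window)
        flags.append(1 if count_resets(series[start:end]) > 0 else 0)
    return flags


# One deploy whose pods restart a few minutes apart: the summed counter
# drops twice, so the same deploy is flagged twice.
staggered = [100, 40, 60, 80, 30, 50]
print(annotation_flags(staggered, window=2))  # [0, 1, 0, 0, 1, 0]

# Restarts that happen between scrapes, with the counter climbing back past
# the previous sample, leave no visible decrease and are never flagged.
between_scrapes = [100, 120, 140]
print(annotation_flags(between_scrapes, window=2))  # [0, 0, 0]
```

The first series yields two separate flagged steps for a single deploy; the second yields none even if restarts happened between samples.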

An approach I have been thinking about: since helmfile supports hooks, have a hook emit a statsd line to a local prometheus-statsd-exporter, then scrape that from prometheus and use it as an annotation. The fact that it is statsd is just an implementation detail of course; it's simply the easy thing to try, and it has already been tried. Since we can run arbitrary commands in that hook [1] (albeit without always having all the info we would like in the environment), we can use other methods of sending that information.

[1] https://github.com/roboll/helmfile#hooks
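
As a rough sketch of what such a hook command could do, the Python below formats a statsd counter line and sends it over UDP. Everything specific here is an assumption for illustration: the metric name deploy.<service>, the helper names, and the target 127.0.0.1:9125 (believed to be the statsd exporter's default UDP port, but check the local configuration).

```python
import socket


def statsd_counter_line(metric, value=1):
    """Format a statsd counter increment, e.g. 'deploy.myservice:1|c'."""
    return f"{metric}:{value}|c"


def emit_deploy_event(service, host="127.0.0.1", port=9125):
    """Send one counter increment for `service` to a statsd listener.

    The host, port, and metric naming are hypothetical defaults.
    """
    line = statsd_counter_line(f"deploy.{service}")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))
    return line
```

A hook could then call emit_deploy_event("myservice") once per helmfile run. UDP is fire-and-forget, so a missing listener would not fail the deploy.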

Downside of using helmfile hooks would be that we catch the trigger, not the actual event. So deploys triggered by rollbacks for example would not be recognized. We should maybe ask the kubernetes API, it should know best. :-)

There is a "kube-state-metrics" which exposes a lot of metrics about the state of various API objects. See https://github.com/kubernetes/kube-state-metrics/blob/master/docs/deployment-metrics.md for example.
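
One way this could be used, sketched in Python under assumptions: if a per-deployment generation metric (for example kube_deployment_status_observed_generation from the linked docs) is scraped, a rollout shows up as a change in that value regardless of how many pods restart. The sample data and the rollout_events helper are hypothetical, and whether every rollback changes the generation should be verified.

```python
def rollout_events(samples):
    """Return timestamps at which the observed generation changed.

    `samples` is a list of (timestamp, generation) pairs ordered by time.
    """
    events = []
    for (_, prev_gen), (ts, gen) in zip(samples, samples[1:]):
        if gen != prev_gen:
            events.append(ts)
    return events


# Generation goes from 7 to 8 once, so one deploy yields one event.
samples = [(0, 7), (60, 7), (120, 8), (180, 8)]
print(rollout_events(samples))  # [120]
```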

Change 597317 merged by Cwhite:
[operations/docker-images/production-images@master] add loki 1.5.0

https://gerrit.wikimedia.org/r/597317

Change 602490 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] profile: add loki output support to the logstash pipeline

https://gerrit.wikimedia.org/r/602490

Change 602729 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] profile: add loki_event filter script

https://gerrit.wikimedia.org/r/602729

Change 602730 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[integration/config@master] add filter_scripts volume mount to logstash-filter-verifier job

https://gerrit.wikimedia.org/r/602730

fgiunchedi moved this task from Backlog to In progress on the observability board. Jun 8 2020, 2:16 PM

Change 602729 merged by Cwhite:
[operations/puppet@production] profile: add loki_event filter script

https://gerrit.wikimedia.org/r/602729

Change 605343 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] service::docker: enhance volume support

https://gerrit.wikimedia.org/r/605343

colewhite added a subtask: T257226: Please create operations/debs/grafana-loki gerrit repository. Jul 6 2020, 4:40 PM

Change 602730 merged by jenkins-bot:
[integration/config@master] Add filter_scripts volume mount to logstash-filter-verifier job

https://gerrit.wikimedia.org/r/602730

Change 610119 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] jjb: volume mount for logstash must be absolute path

https://gerrit.wikimedia.org/r/610119

Change 610119 merged by jenkins-bot:
[integration/config@master] jjb: volume mount for logstash must be absolute path

https://gerrit.wikimedia.org/r/610119

akosiaris closed subtask T257226: Please create operations/debs/grafana-loki gerrit repository as Resolved. Jul 8 2020, 10:44 AM

colewhite mentioned this in T257861: Pipe SAL entries into Logstash. Jul 13 2020, 5:18 PM

colewhite added a subtask: T257861: Pipe SAL entries into Logstash.

Change 616811 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] hiera: specify tlsproxy configuration for grafana

https://gerrit.wikimedia.org/r/616811

Change 616851 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] provision loki on grafana-next

https://gerrit.wikimedia.org/r/616851

Change 617250 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/debs/prometheus-es-exporter@debian/sid] debianization

https://gerrit.wikimedia.org/r/617250

Change 617250 merged by Cwhite:
[operations/debs/prometheus-es-exporter@debian/sid] debianization

https://gerrit.wikimedia.org/r/617250

JMeybohm mentioned this in T264625: Deploy kube-state-metrics. Oct 5 2020, 2:39 PM

lmata moved this task from In progress to Epics In Progress on the observability board. Jun 14 2021, 3:44 PM

lmata edited projects, added SRE Observability (FY2021/2022-Q1); removed observability. Jul 12 2021, 2:41 AM

lmata moved this task from Inbox to Epics In Progress on the SRE Observability (FY2021/2022-Q1) board.

Change 719056 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/puppet@production] puppet_agent_stats: add catalog version to prom metricts

https://gerrit.wikimedia.org/r/719056

In relation to puppet, I think we could look again at creating a puppet logstash report. This was never pushed to production due to concerns about sending the full puppet catalogue diff to logstash. However, I think we should be able to ensure we only send metadata and not the actual diffs.
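
A minimal sketch of the "metadata only" idea, written in Python purely for illustration (an actual Puppet report processor would be written in Ruby). The key names mirror fields commonly found in Puppet reports, but the exact set to keep, and the report_metadata helper itself, are assumptions.

```python
# Hypothetical allow-list: fields that describe the run without carrying diffs.
METADATA_KEYS = ("host", "time", "status", "environment", "configuration_version")


def report_metadata(report):
    """Return only allow-listed keys, dropping logs and per-resource details."""
    return {key: report[key] for key in METADATA_KEYS if key in report}


report = {
    "host": "example1001",
    "status": "changed",
    "configuration_version": "abc123",
    "logs": ["...full diff output..."],
    "resource_statuses": {"File[/etc/example]": {"changed": True}},
}
print(report_metadata(report))
```

Using an allow-list rather than a deny-list means any new report field is excluded until someone explicitly decides it is safe to ship.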

Change 719056 merged by Jbond:

[operations/puppet@production] puppet_agent_stats: add catalog version to prom metrics

https://gerrit.wikimedia.org/r/719056

Change 719372 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/puppet@production] P:puppetmaster::common: Add back logstash support

https://gerrit.wikimedia.org/r/719372

Change 719368 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/puppet@production] puppetmaster: drop log messages from logstash reporter

https://gerrit.wikimedia.org/r/719368

Change 722580 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/software/ecs@master] git - schema: Add new schema for adding git information

https://gerrit.wikimedia.org/r/722580

Change 722873 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/software/ecs@master] schemas - metrics: Add puppet keys to the metrics name space

https://gerrit.wikimedia.org/r/722873

colewhite edited projects, added Observability-Logging; removed SRE Observability (FY2021/2022-Q1). Oct 1 2021, 12:19 AM

Change 722580 merged by jenkins-bot:

[operations/software/ecs@master] git - schema: Add new schema for adding git information

https://gerrit.wikimedia.org/r/722580

jbond mentioned this in rOSECb548ed08163d: git - schema: Add new schema for adding git information. Oct 5 2021, 3:15 AM

Change 722873 merged by jenkins-bot:

[operations/software/ecs@master] schemas - metrics: Add puppet keys to the metrics name space

https://gerrit.wikimedia.org/r/722873

jbond mentioned this in rOSEC2dde2a50528e: schemas - metrics: Add puppet keys to the metrics name space. Oct 12 2021, 11:02 PM

Change 719368 merged by Jbond:

[operations/puppet@production] puppetmaster: drop log messages from logstash reporter

https://gerrit.wikimedia.org/r/719368

Change 719372 merged by Jbond:

[operations/puppet@production] P:puppetmaster::common: Add back logstash support

https://gerrit.wikimedia.org/r/719372

Change 734961 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/puppet@production] puppetmaster: enable logstash reports

https://gerrit.wikimedia.org/r/734961

Change 734961 merged by Jbond:

[operations/puppet@production] puppetmaster: enable logstash reports

https://gerrit.wikimedia.org/r/734961

Change 736233 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] P:rsyslog: ship puppetmaster logs to kafka

https://gerrit.wikimedia.org/r/736233

Change 736233 merged by Jbond:

[operations/puppet@production] P:rsyslog: ship puppetmaster logs to kafka

https://gerrit.wikimedia.org/r/736233

colewhite closed subtask T174172: unused grafana-dashboard indices on elasticsearch / logstash as Resolved. Feb 16 2022, 6:33 PM

Change 804484 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: ship scap.announce channel to loki

https://gerrit.wikimedia.org/r/804484

Change 806349 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: duplicate alert logs for loki target

https://gerrit.wikimedia.org/r/806349

Change 806430 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: alertmanager use logsource as source for host.name field

https://gerrit.wikimedia.org/r/806430

Change 806430 merged by Cwhite:

[operations/puppet@production] logstash: alertmanager use logsource as source for host.name field

https://gerrit.wikimedia.org/r/806430

Change 809302 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] loki: add loki as an optional grafana component

https://gerrit.wikimedia.org/r/809302

Change 804484 merged by Cwhite:

[operations/puppet@production] logstash: duplicate scap.announce logs for loki target

https://gerrit.wikimedia.org/r/804484

Change 809302 merged by Cwhite:

[operations/puppet@production] loki: add loki as an optional grafana component

https://gerrit.wikimedia.org/r/809302

Change 809706 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] beta-logs: add minimal grafana config

https://gerrit.wikimedia.org/r/809706

Change 809706 merged by Cwhite:

[operations/puppet@production] beta-logs: add minimal grafana config

https://gerrit.wikimedia.org/r/809706

Change 809709 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] loki: add ferm rule to control api access

https://gerrit.wikimedia.org/r/809709

Change 809722 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: add loki output support

https://gerrit.wikimedia.org/r/809722

Change 809709 merged by Cwhite:

[operations/puppet@production] loki: add ferm service to control api access

https://gerrit.wikimedia.org/r/809709

Change 809722 merged by Cwhite:

[operations/puppet@production] logstash: add loki output support

https://gerrit.wikimedia.org/r/809722

Change 810064 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] beta-logs: set loki retention to 3d

https://gerrit.wikimedia.org/r/810064

Change 810064 merged by Cwhite:

[operations/puppet@production] beta-logs: set loki retention to 3d

https://gerrit.wikimedia.org/r/810064

Change 810110 had a related patch set uploaded (by Cwhite; author: Cwhite):

[labs/tools/stashbot@master] Add support for posting events to eventgate

https://gerrit.wikimedia.org/r/810110

Change 810115 had a related patch set uploaded (by Cwhite; author: Cwhite):

[schemas/event/secondary@master] Add logging/sal/1.0.0 schema

https://gerrit.wikimedia.org/r/810115

Change 813715 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] profile: make loki data directory configurable

https://gerrit.wikimedia.org/r/813715

Change 813724 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] hiera: deploy and enable loki on grafana hosts

https://gerrit.wikimedia.org/r/813724

Change 813985 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] loki-beta: increase grpc message size

https://gerrit.wikimedia.org/r/813985

Change 813985 merged by Cwhite:

[operations/puppet@production] loki-beta: increase grpc message size

https://gerrit.wikimedia.org/r/813985

Change 813715 merged by Cwhite:

[operations/puppet@production] profile: make loki data directory configurable

https://gerrit.wikimedia.org/r/813715

Change 813724 merged by Cwhite:

[operations/puppet@production] hiera: deploy and enable loki on grafana hosts

https://gerrit.wikimedia.org/r/813724

Change 814915 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: enable loki public output on production

https://gerrit.wikimedia.org/r/814915

Change 814915 merged by Cwhite:

[operations/puppet@production] logstash: enable loki public output on production

https://gerrit.wikimedia.org/r/814915

colewhite moved this task from Inbox to Prioritized on the Observability-Logging board. Jul 20 2022, 9:11 PM

We've enabled the Public Logs datasource in Grafana and forwarded scap.announce logs to it.
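
For anyone wanting to inspect these events outside Grafana, here is a hedged Python sketch that builds a Loki query_range URL and turns a response body into (epoch milliseconds, text) pairs, the shape an annotation needs. The base URL, the channel label selector, and the helper names are assumptions for illustration; the labels actually applied to these logs may differ.

```python
import json
from urllib.parse import urlencode


def query_range_url(base_url, logql, start_ns, end_ns, limit=100):
    """Build a URL for Loki's /loki/api/v1/query_range endpoint."""
    params = urlencode(
        {"query": logql, "start": start_ns, "end": end_ns, "limit": limit}
    )
    return f"{base_url}/loki/api/v1/query_range?{params}"


def annotation_events(response_body):
    """Convert a streams-type Loki response into sorted (ms, line) pairs."""
    payload = json.loads(response_body)
    events = []
    for stream in payload["data"]["result"]:
        for ts_ns, line in stream["values"]:
            # Loki returns nanosecond timestamps as strings.
            events.append((int(ts_ns) // 1_000_000, line))
    return sorted(events)


# Hypothetical host and label selector.
url = query_range_url(
    "http://localhost:3100",
    '{channel="scap.announce"}',
    1658351400000000000,
    1658355000000000000,
)
print(url)
```

Fetching the URL (for example with urllib.request) and passing the body to annotation_events would give a list ready to overlay on a graph.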

colewhite changed the status of subtask T257861: Pipe SAL entries into Logstash from Open to In Progress. Aug 11 2022, 4:17 PM

Change 806349 merged by Cwhite:

[operations/puppet@production] logstash: duplicate alert logs for loki target

https://gerrit.wikimedia.org/r/806349

colewhite closed subtask T257861: Pipe SAL entries into Logstash as Resolved. Aug 24 2022, 4:27 PM

Change 810110 abandoned by Cwhite:

[labs/tools/stashbot@master] Add support for posting events to eventgate

Reason:

https://gerrit.wikimedia.org/r/810110

Change 810115 abandoned by Cwhite:

[schemas/event/secondary@master] Add logging/sal/1.0.0 schema

Reason:

https://gerrit.wikimedia.org/r/810115

colewhite removed a subtask: T223934: Add annotations from ops vendor maintenance calendar to Grafana. Sep 21 2022, 10:42 AM

MVP achieved. Further iterations and features should come in separately.

Change 602490 abandoned by Cwhite:

[operations/puppet@production] profile: add loki output support to the logstash pipeline

Reason:

in favor of using the loki output plugin

https://gerrit.wikimedia.org/r/602490

Change 605343 abandoned by Cwhite:

[operations/puppet@production] service::docker: enhance volume support

Reason:

we packaged loki in a deb package instead

https://gerrit.wikimedia.org/r/605343

Change 616811 abandoned by Cwhite:

[operations/puppet@production] hiera: specify tlsproxy configuration for grafana

Reason:

https://gerrit.wikimedia.org/r/616811

Change 616851 abandoned by Cwhite:

[operations/puppet@production] provision loki on grafana-next

Reason:

https://gerrit.wikimedia.org/r/616851

Maintenance_bot removed a project: Patch-For-Review. Apr 14 2023, 10:10 PM

herron added a subtask: T350825: Loki: add a channel(s) for git commits. Nov 8 2023, 7:32 PM
