
Leverage Grafana annotations to show events in graphs
Open, Low · Public

Description

Investigate the possibility of using logging cluster events as Grafana annotations (https://grafana.com/docs/reference/annotations/)

Some ideas for possibly useful events:

  • Puppet merges and runs
  • Icinga alerts
  • SAL entries
  • Deploys
  • Downtimes

Event Timeline

colewhite created this task. May 8 2019, 5:11 PM
Restricted Application added a subscriber: Aklapper. May 8 2019, 5:11 PM
colewhite triaged this task as Medium priority. May 8 2019, 5:12 PM
colewhite lowered the priority of this task from Medium to Low.
fgiunchedi moved this task from Inbox to Backlog on the observability board. Apr 6 2020, 12:41 PM

Loki looks like a feasible option to try given the resource constraints on the Grafana VM. There appears to be headroom on the host as long as we keep event traffic reasonably low.

Change 597317 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/docker-images/production-images@master] add loki 1.4.1

https://gerrit.wikimedia.org/r/597317

For what it's worth, and for Kubernetes deploys specifically, we have an annotation in Grafana that works most of the time but can easily fail us. It's a simple query:

resets((sum(service_runner_request_duration_seconds_count{service="$service"}))[1m:]) > bool 0

It has a few drawbacks I've identified in the short time I've been using it:

  • It is service-runner specific (and in a very specific configuration).
  • Counters can be reset multiple times during a single deploy, e.g. when a deploy restarts multiple pods more than 1m apart. Then 1 deploy == multiple annotations, which is not ideal.
  • The inverse of the above: if a counter is reset multiple times within 1m (the polling interval of Prometheus), we don't catch the individual deploys either.
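
To make the last two drawbacks concrete, here is a minimal sketch in Python. It is a simplified model and not how Prometheus evaluates the query: it assumes one sample per polling interval and places an annotation wherever the summed counter drops below the previous sample. The function name and the sample values are hypothetical.

```python
from typing import List


def annotation_points(samples: List[float]) -> List[int]:
    """Return the sample indices where the counter dropped (a reset).

    Simplified stand-in for "resets(...) > bool 0": one annotation per
    polling step at which the summed counter is lower than the step before.
    """
    return [
        i
        for i in range(1, len(samples))
        if samples[i] < samples[i - 1]
    ]


# Hypothetical data: ONE deploy that restarts two pods a couple of polling
# intervals apart. The summed counter drops twice, so we get two annotations.
one_deploy_two_restarts = [100, 120, 70, 90, 50, 70]

# Hypothetical data: TWO deploys that both happen between two scrapes.
# Only a single drop is visible, so we get one annotation for two deploys.
two_deploys_one_interval = [100, 120, 10, 30]

if __name__ == "__main__":
    print(annotation_points(one_deploy_two_restarts))
    print(annotation_points(two_deploys_one_interval))
```

In this model the first series yields two annotation points for a single deploy, and the second yields one annotation point for two deploys.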

An approach I have been thinking about: since helmfile supports hooks, have a hook emit a statsd line to a local prometheus-statsd-exporter, scrape that from Prometheus, and use it as an annotation. That it is statsd is just an implementation detail, of course; it is simply the easiest thing to try and something already tried. Since we can run arbitrary commands in that hook [1] (albeit not always with all the info we would like in the environment), we could use other methods of sending that information.

[1] https://github.com/roboll/helmfile#hooks
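
As a rough illustration of the hook idea, the sketch below shows a small Python script that a hook command could run to emit one statsd counter increment over UDP. Everything specific here is an assumption: the metric name `deploy.<service>`, the service name, and the exporter listening on 127.0.0.1:9125 are illustrative and not taken from any existing configuration.

```python
import socket


def statsd_counter_line(metric: str, value: int = 1) -> bytes:
    """Build a statsd counter line of the form "<metric>:<value>|c"."""
    return f"{metric}:{value}|c".encode("ascii")


def emit_deploy_event(service: str,
                      host: str = "127.0.0.1",  # assumed exporter address
                      port: int = 9125) -> bytes:  # assumed exporter UDP port
    """Send one counter increment for a deploy of `service`.

    UDP is fire-and-forget: no error is raised if nothing is listening.
    The metric name "deploy.<service>" is a hypothetical naming scheme.
    """
    line = statsd_counter_line(f"deploy.{service}")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line, (host, port))
    return line


if __name__ == "__main__":
    # "example_service" is a placeholder; a hook would pass the real name.
    emit_deploy_event("example_service")
```

The resulting counter could then be scraped by Prometheus and used in an annotation query in place of the reset-based one.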

A downside of using helmfile hooks is that we catch the trigger, not the actual event, so deploys triggered by rollbacks, for example, would not be recognized. We should maybe ask the Kubernetes API; it should know best. :-)

There is "kube-state-metrics", which exposes a lot of metrics about the state of various API objects. See https://github.com/kubernetes/kube-state-metrics/blob/master/docs/deployment-metrics.md for example.
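
One way such metrics might be turned into deploy events is sketched below in Python. It assumes we have samples of a per-deployment generation metric (for instance `kube_deployment_status_observed_generation` from the page linked above) and treats every increase as a candidate deploy. That interpretation is an assumption: the generation may also increase on spec changes that are not deploys, such as scaling. The function name and sample data are hypothetical.

```python
from typing import List, Tuple

# (unix timestamp in seconds, observed generation of one deployment)
Sample = Tuple[int, int]


def rollout_timestamps(samples: List[Sample]) -> List[int]:
    """Return timestamps at which the observed generation increased.

    Each increase is treated as one candidate deploy event, regardless of
    how many pods were restarted as a result of it.
    """
    events = []
    for (_, prev_gen), (ts, gen) in zip(samples, samples[1:]):
        if gen > prev_gen:
            events.append(ts)
    return events


if __name__ == "__main__":
    # Hypothetical samples taken once per minute.
    samples = [(0, 3), (60, 3), (120, 4), (180, 4), (240, 6)]
    print(rollout_timestamps(samples))
```

Unlike the counter-reset approach, this does not depend on service-runner metrics. A jump of more than one generation between two samples is still reported as a single event, so rapid successive changes would still be merged.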

Change 597317 merged by Cwhite:
[operations/docker-images/production-images@master] add loki 1.5.0

https://gerrit.wikimedia.org/r/597317

Change 602490 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] profile: add loki output support to the logstash pipeline

https://gerrit.wikimedia.org/r/602490

Change 602729 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] profile: add loki_event filter script

https://gerrit.wikimedia.org/r/602729

Change 602730 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[integration/config@master] add filter_scripts volume mount to logstash-filter-verifier job

https://gerrit.wikimedia.org/r/602730

fgiunchedi moved this task from Backlog to In progress on the observability board. Mon, Jun 8, 2:16 PM

Change 602729 merged by Cwhite:
[operations/puppet@production] profile: add loki_event filter script

https://gerrit.wikimedia.org/r/602729

Change 605343 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] service::docker: enhance volume support

https://gerrit.wikimedia.org/r/605343

Change 602730 merged by jenkins-bot:
[integration/config@master] Add filter_scripts volume mount to logstash-filter-verifier job

https://gerrit.wikimedia.org/r/602730

Change 610119 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] jjb: volume mount for logstash must be absolute path

https://gerrit.wikimedia.org/r/610119

Change 610119 merged by jenkins-bot:
[integration/config@master] jjb: volume mount for logstash must be absolute path

https://gerrit.wikimedia.org/r/610119