Prometheus resources in deployment-prep to create grafana graphs of EventLogging
Closed, Resolved. Public. 5 Estimated Story Points

Description

We've been trying to create a Grafana dashboard that captures EventLogging events in the beta cluster, to help sign off T202026. Strangely, we were not seeing any traffic for schemas that we knew were receiving events.

I went to Analytics to investigate and had a productive chat with @Ottomata:

<ottomata> Andrew Otto looking
2:16 PM OH
2:17 PM hmm
2:18 PM yeah, it's not enabled in labs
2:19 PM the prometheus exporters need to be explicitly declared
2:19 PM <jdlrobson> Jon Robson is that easy to fix?
2:19 PM <ottomata> Andrew Otto i'm not sure...
2:19 PM still looking
2:20 PM no
2:20 PM can't fix.
2:20 PM it uses exported resources
2:20 PM which aren't available for labs puppet
2:22 PM jdlrobson:  yeah, dunno this is not an easy one.  if you want a fix, you probably need a ticket with ops, tag filippo
2:22 PM puppet prometheus exporters don't work in labs because exported puppet resources are not queryable
2:22 PM in labs, because of security reasons (cross project puppet stuff i think)
2:24 PM ok i gotta run, sorry i couldn't help more than that
2:24 PM <jdlrobson> Jon Robson :( ok will do! will add you to help tweak the wording
2:24 PM day after tomorrow problem!

Per his suggestion I'm pinging @fgiunchedi

Event Timeline

Indeed, that's what's going on: exported resources are not available. I don't know, though, whether we could enable exported resources for the beta cluster only. cc Beta-Cluster-Infrastructure

For additional context: what Prometheus needs in order to work here is a list of host:port pairs to poll metrics from. For host-level metrics in labs (i.e. those generated by prometheus-node-exporter), that list is generated from Nova's metadata service by listing all instances in a project. In this case we'd need a list of all host:port pairs running the exporter that generates metrics about EL. In production this happens via exported resources.
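To make the host:port requirement concrete, here is a minimal sketch of building a target list in the shape Prometheus accepts for file-based service discovery (which reads JSON as well as YAML). The function name build_target_groups is hypothetical, the hosts, port and label are example values, and in deployment-prep the real target files are written by Puppet rather than by a script like this.

```python
import json
from collections import defaultdict


def build_target_groups(exporters):
    """Group (host, port, labels) tuples into file_sd-style target groups."""
    groups = defaultdict(list)
    for host, port, labels in exporters:
        # Exporters sharing the same label set end up in one target group.
        key = tuple(sorted(labels.items()))
        groups[key].append(f"{host}:{port}")
    return [
        {"labels": dict(key), "targets": sorted(targets)}
        for key, targets in sorted(groups.items())
    ]


# Example input only: hosts, port and label are illustrative values.
exporters = [
    ("deployment-kafka-jumbo-2", 7900, {"cluster": "misc"}),
    ("deployment-kafka-jumbo-1", 7900, {"cluster": "misc"}),
]
target_groups = build_target_groups(exporters)
print(json.dumps(target_groups, indent=2))
```

Whatever produces the list (Nova metadata, exported resources, or something else), the end result Prometheus consumes is just this: groups of host:port targets with optional labels.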

I'm guessing this is not something trivial we can fix?
Is there another task I should be following relating to this?

Our goal right now is to detect client-side errors being thrown on the beta cluster before they can hit production, so having this would be very useful for that (and, I'm sure, for a variety of other use cases).

"which aren't available for labs puppet"

What? We have exported resources working fine within the deployment-prep project, as far as I am aware.

@Krenair oh ya? Maybe I just didn't know that! So query_resources will work?! If so I can probably just enable the prometheus based kafka monitoring in deployment-prep.

I think so - I'm pretty sure that's the mechanism that ssh known hosts is using there. Give it a go and let me know if you run into any issues?

Change 462567 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Scrape Kafka jmx exporters in deployment-prep

https://gerrit.wikimedia.org/r/462567

I think ^ is what is needed (sorry was at our offsite last week!). Hopefully @fgiunchedi can confirm!

Change 462567 merged by Ottomata:
[operations/puppet@production] Scrape Kafka jmx exporters in deployment-prep

https://gerrit.wikimedia.org/r/462567

@Ottomata, @fgiunchedi hello!

We're still fiddling with our dashboard but not seeing anything show up. For the dashboard templating, we use the "Beta Prometheus" data source and the jumbo-deployment-prep cluster: label_values(kafka_server_BrokerTopicMetrics_MessagesIn_total{kafka_cluster="jumbo-deployment-prep"}, topic). We do the same for the dashboard graph: irate(kafka_server_BrokerTopicMetrics_MessagesIn_total{kafka_cluster="jumbo-deployment-prep", topic=~"eventlogging_$schema"}[5m]). Do you know if these values are correct?
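For readers unfamiliar with irate(), the following is a rough sketch of what it computes for a single counter series: the per-second rate between the last two samples inside the range window. It is a simplified model, not Prometheus code, and the sample values are made up. It also shows why a graph stays empty when the window holds fewer than two samples for a series.

```python
def irate(samples):
    """Per-second rate from the last two (timestamp_seconds, value) samples.

    Returns None when the window holds fewer than two samples, which is
    one reason a panel can stay empty even though the query is valid.
    """
    if len(samples) < 2:
        return None
    (t1, v1), (t2, v2) = samples[-2], samples[-1]
    if t2 <= t1:
        return None
    delta = v2 - v1
    if delta < 0:
        # Counter reset: the counter restarted from zero, so the latest
        # value itself is taken as the increase.
        delta = v2
    return delta / (t2 - t1)


# Made-up samples for one topic's MessagesIn counter, scraped every 60s.
samples = [(0, 100), (60, 160), (120, 280)]
rate = irate(samples)
print(rate)
```

Only the last two samples matter here: (280 - 160) messages over 60 seconds.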

They should be. Something is not working though. The prometheus server is not getting any results for the configured resource queries in beta. Not sure why, hoping to get some help from Filippo to troubleshoot.

BTW, I updated https://wikitech.wikimedia.org/wiki/Prometheus#Access_Prometheus_web_interface with instructions on how to access the Prometheus web interface in deployment-prep. That makes troubleshooting prometheus metrics much easier than doing it while building grafana dashboards.

(for context: modules/prometheus/manifests/jmx_exporter_config.pp / modules/prometheus/templates/jmx_exporter_config.erb use the get_clusters function before looking at the puppetdb results for prometheus stuff)
<Krenair> ottomata, found the problem
<Krenair> get_clusters checks everything with class profile::cumin::target
<Krenair> but guess what includes that
<Krenair> this block at the top of modules/standard/manifests/init.pp
<Krenair> if $::realm == 'production' {
<Krenair> include ::profile::cumin::target

With these puppet changes:

diff --git a/modules/profile/manifests/openstack/main/cumin/target.pp b/modules/profile/manifests/openstack/main/cumin/target.pp
index 0c85b04c6c..17e2f09d7c 100644
--- a/modules/profile/manifests/openstack/main/cumin/target.pp
+++ b/modules/profile/manifests/openstack/main/cumin/target.pp
@@ -15,6 +15,8 @@ class profile::openstack::main::cumin::target(
     $auth_group = hiera('profile::openstack::main::cumin::auth_group'),
     $project_masters = hiera('profile::openstack::main::cumin::project_masters'),
     $project_pub_key = hiera('profile::openstack::main::cumin::project_pub_key'),
+    $cluster = hiera('cluster', 'misc'),
+    $site = $::site,  # lint:ignore:wmf_styleguide
 ) {
     require ::network::constants
 
diff --git a/modules/wmflib/lib/puppet/parser/functions/get_clusters.rb b/modules/wmflib/lib/puppet/parser/functions/get_clusters.rb
index 54dcec379d..ac428a8696 100644
--- a/modules/wmflib/lib/puppet/parser/functions/get_clusters.rb
+++ b/modules/wmflib/lib/puppet/parser/functions/get_clusters.rb
@@ -41,10 +41,10 @@ module Puppet::Parser::Functions
       sites = false
     end
 
-    function_query_resources([false, 'Class["Profile::Cumin::Target"]', false])
+    function_query_resources([false, 'Class["Profile::Cumin::Target"] or Class["Profile::Openstack::Main::Cumin::Target"]', false])
       .sort_by{ |n| n['certname'] }.each do |node|
-      cluster = node['parameters']['cluster']
-      site = node['parameters']['site']
+      cluster = node['parameters']['cluster'] || 'misc'
+      site = node['parameters']['site'] || 'eqiad'
       fqdn = node['certname']
       next unless clusters.include?cluster
       next if sites && !sites.include?(site)

I've got it to do this:

krenair@deployment-prometheus01:~$ sudo cat /srv/prometheus/beta/targets/jmx_kafka_mirrormaker_beta_eqiad.yaml
# This file is managed by puppet
- labels:
    cluster: misc
    mirror_name: main-deployment-prep_to_jumbo-deployment-prep
  targets:
  - deployment-kafka-jumbo-1:7900
  - deployment-kafka-jumbo-2:7900

The || 'misc' and || 'eqiad' fallbacks can probably be removed again once all the deployment-prep hosts have values set. Note that I don't think the 'cluster' hieradata is really populated.
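The selection logic in the patched get_clusters can be modelled roughly as below. This is an illustrative Python sketch, not the Ruby function itself: the node dictionaries stand in for PuppetDB query results, their "classes" key is a hypothetical simplification, and the nesting of the return value (cluster, then site, then hosts) is an assumption. The host names are examples.

```python
from collections import defaultdict

# Classes whose presence marks a node as selectable (as in the patched query).
SELECTOR_CLASSES = {
    "Profile::Cumin::Target",
    "Profile::Openstack::Main::Cumin::Target",
}


def get_clusters(nodes, clusters, sites=None):
    """Group matching node names by cluster and site, with fallbacks."""
    result = defaultdict(lambda: defaultdict(list))
    for node in sorted(nodes, key=lambda n: n["certname"]):
        if not SELECTOR_CLASSES & set(node["classes"]):
            continue  # node includes neither class, so the query misses it
        params = node.get("parameters", {})
        cluster = params.get("cluster") or "misc"  # fallback from the diff
        site = params.get("site") or "eqiad"       # fallback from the diff
        if cluster not in clusters:
            continue
        if sites and site not in sites:
            continue
        result[cluster][site].append(node["certname"])
    return {cluster: dict(by_site) for cluster, by_site in result.items()}


# Hypothetical nodes: two labs hosts without parameters, one unrelated host.
nodes = [
    {"certname": "deployment-kafka-jumbo-2",
     "classes": ["Profile::Openstack::Main::Cumin::Target"], "parameters": {}},
    {"certname": "deployment-kafka-jumbo-1",
     "classes": ["Profile::Openstack::Main::Cumin::Target"],
     "parameters": {"cluster": None}},
    {"certname": "unrelated-host", "classes": [], "parameters": {}},
]
found = get_clusters(nodes, clusters=["misc"])
print(found)
```

With only the original Profile::Cumin::Target class in SELECTOR_CLASSES, the same input would yield an empty result, which matches the empty target files seen in beta.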

Change 462810 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] Try to make get_clusters work inside labs

https://gerrit.wikimedia.org/r/462810

@fgiunchedi Alex's change ^ should do it. Could you +1 also?

Krenair renamed this task from exported puppet resources are not queryable: cannot create grafana graphs of EventLogging running in beta cluster to Prometheus resources in deployment-prep to create grafana graphs of EventLogging. Sep 25 2018, 9:14 PM

"BTW, I updated https://wikitech.wikimedia.org/wiki/Prometheus#Access_Prometheus_web_interface with instructions on how to access the Prometheus web interface in deployment-prep. That makes troubleshooting prometheus metrics much easier than doing it while building grafana dashboards."

FWIW, the beta Prometheus instance is available at https://beta-prometheus.wmflabs.org/beta/graph, so there is no need to forward that one (though getting LDAP+HTTPS access for production Prometheus is on my radar!)
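For checking a metric outside Grafana, a query URL for the Prometheus HTTP API can be built as in the sketch below. It assumes the standard /api/v1/query endpoint is served under the same /beta prefix as the graph page, which this task does not confirm. The helper name instant_query_url is hypothetical.

```python
from urllib.parse import urlencode

# Base taken from the graph URL in this task. That the HTTP API lives under
# the same prefix is an assumption.
BASE_URL = "https://beta-prometheus.wmflabs.org/beta"


def instant_query_url(promql, base_url=BASE_URL):
    """Build a URL for the Prometheus instant-query endpoint."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


promql = (
    "kafka_server_BrokerTopicMetrics_MessagesIn_total"
    '{kafka_cluster="jumbo-deployment-prep"}'
)
url = instant_query_url(promql)
print(url)
```

If the endpoint exists at that path, an empty result list in the JSON response would suggest the series is not being scraped at all, as opposed to a dashboard templating problem.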

"@fgiunchedi Alex's change ^ should do it. Could you +1 also?"

Yes I'll take a look!

I didn't realize that either! Documenting.

We're seeing schemas show up now (YAY!), but still no events are showing up there.

Is the hope that https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/462810/ takes care of that, or is something else needed?
Thanks for looking into this. It's really appreciated <3

Change 462810 merged by Ottomata:
[operations/puppet@production] Introduce cumin::selector dummy class

https://gerrit.wikimedia.org/r/462810

Change 463966 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Use cumin::selector instead of profile::cumin::target in get_clusters

https://gerrit.wikimedia.org/r/463966

@Ottomata still not seeing them. Does that mean https://gerrit.wikimedia.org/r/462810 didn't work? (Sorry if this is a novice question, but I'm not familiar with puppet.)

There are two changes needed, including https://gerrit.wikimedia.org/r/463966. I need ops review before I merge, as this has the potential to touch all production hosts.

Change 463966 merged by Filippo Giunchedi:
[operations/puppet@production] Use cumin::selector instead of profile::cumin::target in get_clusters

https://gerrit.wikimedia.org/r/463966

Ottomata moved this task from Next Up to Done on the Analytics-Kanban board.
Ottomata set the point value for this task to 5.

That's awesome!!!! 🎉 🎉 🎉 🎉 🎉 🎉
Thank you @Ottomata, @Krenair and @fgiunchedi, this is going to be super useful for the team!