Maniphest T204088

Prometheus resources in deployment-prep to create grafana graphs of EventLogging
Closed, Resolved · Public · 5 Estimated Story Points

Authored By

	Jdlrobson
	Sep 11 2018, 9:36 PM

Description

We've been trying to create a dashboard on Grafana to capture EventLogging events in the beta cluster, to help sign off T202026. Strangely, we were not seeing any traffic to schemas that we knew were getting traffic.

I went to analytics to investigate and had a productive chat with @Ottomata :

<ottomata> Andrew Otto looking
2:16 PM OH
2:17 PM hmm
2:18 PM yeah, its not enabled in labs
2:19 PM the prometheus exporters need to be explicitly declared
2:19 PM <jdlrobson> Jon Robson is that easy to fix?
2:19 PM <ottomata> Andrew Otto i'm not sure...
2:19 PM still looking
2:20 PM no
2:20 PM can't fix.
2:20 PM it uses exported resources
2:20 PM which aren't available for labs puppet
2:22 PM jdlrobson:  yeah, dunno this is not an easy one.  if you want a fix, you probably need a ticket with ops, tag filippo
2:22 PM puppet prometheus exporters don't work in labs because exported puppet resources are not queryable
2:22 PM in labs, because of security reasons (cross project puppet stuff i think)
2:24 PM ok i gotta run, sorry i couldn't help more than that
2:24 PM <jdlrobson> Jon Robson :( ok will do! will add you to help tweak the wording
2:24 PM day after tomorrow problem!

Per his suggestion I'm pinging @fgiunchedi

Details

Subject	Repo	Branch	Lines +/-
Use cumin::selector instead of profile::cumin::target in get_clusters	operations/puppet	production	+1 -1
Introduce cumin::selector dummy class	operations/puppet	production	+26 -0
Scrape Kafka jmx exporters in deployment-prep	operations/puppet	production	+36 -1

Related Objects

Mentioned In: T211640: Grafana, icinga, prometheus in cloud-analytics project
T203814: Turn on MinervaErrorLogSamplingRate (Schema:WebClientError)
Mentioned Here: T202026: Report client-side JavaScript errors in MobileFrontend practically

Event Timeline

Jdlrobson created this task. Sep 11 2018, 9:36 PM

Restricted Application added a subscriber: Aklapper. Sep 11 2018, 9:36 PM

Jdlrobson added a project: Web-Team-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q1). Sep 11 2018, 9:36 PM

Jdlrobson moved this task from To Do to Blocked on Others on the Web-Team-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q1) board.

Indeed that's what's going on due to lack of exported resources. I don't know though if we could enable exported resources for the beta cluster only? cc Beta-Cluster-Infrastructure

For additional context: what Prometheus needs in this case is a list of host:port pairs to poll metrics from. For host-level metrics in labs (i.e. those generated by prometheus-node-exporter), that list is generated from nova's metadata service by listing all instances in a project. Here we'd need a list of all host:port pairs running the exporter that generates metrics about EL; in production this happens via exported resources.
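
To make the "list of host:port pairs" concrete: Prometheus can read scrape targets from files that hold a list of entries, each with `targets` and `labels`. The sketch below is only an illustration of producing such a list; `file_sd_targets` is a hypothetical helper, and the hostnames, port, and cluster label are example values, not the actual EventLogging exporter configuration.

```ruby
require 'yaml'

# Build file-based service discovery entries from a flat list of exporter
# endpoints. Each input hash needs 'host', 'port', and 'cluster' keys.
# (Hypothetical helper, not part of operations/puppet.)
def file_sd_targets(exporters)
  exporters
    .group_by { |e| e['cluster'] }
    .sort_by { |cluster, _members| cluster }
    .map do |cluster, members|
      {
        'labels'  => { 'cluster' => cluster },
        'targets' => members.map { |e| "#{e['host']}:#{e['port']}" }.sort
      }
    end
end

# Example inventory only; in production this information comes from
# exported resources rather than a hand-written list.
exporters = [
  { 'host' => 'deployment-kafka-jumbo-2', 'port' => 7900, 'cluster' => 'misc' },
  { 'host' => 'deployment-kafka-jumbo-1', 'port' => 7900, 'cluster' => 'misc' }
]

puts file_sd_targets(exporters).to_yaml
```

The point of the sketch is that something has to know every host running the exporter; without exported resources (or another inventory source), that list stays empty and Prometheus has nothing to scrape.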

Ottomata added a subscriber: elukey. Sep 12 2018, 3:51 PM

Jdlrobson updated the task description. Sep 12 2018, 4:13 PM

Jdlrobson added a project: Beta-Cluster-Infrastructure. Sep 17 2018, 6:18 PM

I'm guessing this is not something trivial we can fix?
Is there another task I should be following relating to this?

Our goal right now is to detect client-side errors being thrown on the beta cluster before they can hit production, so having this would be very useful for that (and I'm sure for a variety of other use cases).

"which aren't available for labs puppet"

what? We have exported resources working fine within the deployment-prep project as far as I am aware?

@Krenair oh ya? Maybe I just didn't know that! So query_resources will work?! If so I can probably just enable the prometheus based kafka monitoring in deployment-prep.

I think so - I'm pretty sure that's the mechanism that ssh known hosts is using there. Give it a go and let me know if you run into any issues?

Any luck @Ottomata ?

Ottomata added a project: Analytics-Kanban. Sep 24 2018, 8:00 PM

Change 462567 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Scrape Kafka jmx exporters in deployment-prep

https://gerrit.wikimedia.org/r/462567

gerritbot added a project: Patch-For-Review. Sep 24 2018, 8:00 PM

I think ^ is what is needed (sorry was at our offsite last week!). Hopefully @fgiunchedi can confirm!

Change 462567 merged by Ottomata:
[operations/puppet@production] Scrape Kafka jmx exporters in deployment-prep

https://gerrit.wikimedia.org/r/462567

@Ottomata, @fgiunchedi hello!

We're still fiddling with our dashboard but not seeing anything show up. For the dashboard templating, we use the "Beta Prometheus" data source and jumbo-deployment-prep cluster: label_values(kafka_server_BrokerTopicMetrics_MessagesIn_total{kafka_cluster="jumbo-deployment-prep"}, topic). We do the same for the dashboard graph: irate(kafka_server_BrokerTopicMetrics_MessagesIn_total{kafka_cluster="jumbo-deployment-prep", topic=~"eventlogging_$schema"}[5m]). Do you know if these values are correct?

They should be. Something is not working though. The prometheus server is not getting any results for the configured resource queries in beta. Not sure why, hoping to get some help from Filippo to troubleshoot.

BTW, I updated https://wikitech.wikimedia.org/wiki/Prometheus#Access_Prometheus_web_interface with instructions on how to access the Prometheus web interface in deployment-prep. That makes troubleshooting prometheus metrics much easier than doing it while building grafana dashboards.

(for context: modules/prometheus/manifests/jmx_exporter_config.pp / modules/prometheus/templates/jmx_exporter_config.erb use the get_clusters function before looking at the puppetdb results for prometheus stuff)
<Krenair> ottomata, found the problem
<Krenair> get_clusters checks everything with class profile::cumin::target
<Krenair> but guess what includes that
<Krenair> this block at the top of modules/standard/manifests/init.pp
<Krenair> if $::realm == 'production' {
<Krenair> include ::profile::cumin::target

With these puppet changes:

diff --git a/modules/profile/manifests/openstack/main/cumin/target.pp b/modules/profile/manifests/openstack/main/cumin/target.pp
index 0c85b04c6c..17e2f09d7c 100644
--- a/modules/profile/manifests/openstack/main/cumin/target.pp
+++ b/modules/profile/manifests/openstack/main/cumin/target.pp
@@ -15,6 +15,8 @@ class profile::openstack::main::cumin::target(
     $auth_group = hiera('profile::openstack::main::cumin::auth_group'),
     $project_masters = hiera('profile::openstack::main::cumin::project_masters'),
     $project_pub_key = hiera('profile::openstack::main::cumin::project_pub_key'),
+    $cluster = hiera('cluster', 'misc'),
+    $site = $::site,  # lint:ignore:wmf_styleguide
 ) {
     require ::network::constants
 
diff --git a/modules/wmflib/lib/puppet/parser/functions/get_clusters.rb b/modules/wmflib/lib/puppet/parser/functions/get_clusters.rb
index 54dcec379d..ac428a8696 100644
--- a/modules/wmflib/lib/puppet/parser/functions/get_clusters.rb
+++ b/modules/wmflib/lib/puppet/parser/functions/get_clusters.rb
@@ -41,10 +41,10 @@ module Puppet::Parser::Functions
       sites = false
     end
 
-    function_query_resources([false, 'Class["Profile::Cumin::Target"]', false])
+    function_query_resources([false, 'Class["Profile::Cumin::Target"] or Class["Profile::Openstack::Main::Cumin::Target"]', false])
       .sort_by{ |n| n['certname'] }.each do |node|
-      cluster = node['parameters']['cluster']
-      site = node['parameters']['site']
+      cluster = node['parameters']['cluster'] || 'misc'
+      site = node['parameters']['site'] || 'eqiad'
       fqdn = node['certname']
       next unless clusters.include?cluster
       next if sites && !sites.include?(site)

I've got it to do this:

krenair@deployment-prometheus01:~$ sudo cat /srv/prometheus/beta/targets/jmx_kafka_mirrormaker_beta_eqiad.yaml
# This file is managed by puppet
- labels:
    cluster: misc
    mirror_name: main-deployment-prep_to_jumbo-deployment-prep
  targets:
  - deployment-kafka-jumbo-1:7900
  - deployment-kafka-jumbo-2:7900

The || 'misc' and || 'eqiad' fallbacks can probably be reverted once all the deployment-prep hosts have values set. Note that I don't think the 'cluster' hieradata is really populated.
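
The effect of those fallbacks can be seen in isolation. The sketch below is a simplified stand-in for the loop in get_clusters, not the real function: `group_nodes` is a hypothetical name, the node hashes imitate the shape used in the diff (`certname` plus `parameters`), and the hostnames are invented.

```ruby
# Group nodes by cluster and site, defaulting missing parameters the same
# way the patch does ('misc' and 'eqiad'). Hypothetical helper that mirrors
# the filtering shown in the diff above.
def group_nodes(nodes, clusters, sites = nil)
  result = Hash.new { |h, k| h[k] = Hash.new { |h2, k2| h2[k2] = [] } }
  nodes.sort_by { |n| n['certname'] }.each do |node|
    cluster = node['parameters']['cluster'] || 'misc'
    site = node['parameters']['site'] || 'eqiad'
    next unless clusters.include?(cluster)
    next if sites && !sites.include?(site)
    result[cluster][site] << node['certname']
  end
  result
end

# Invented sample data: one host with no parameters set, as is the case
# for deployment-prep hosts that lack 'cluster' hieradata.
nodes = [
  { 'certname' => 'host-b.example', 'parameters' => {} },
  { 'certname' => 'host-a.example', 'parameters' => { 'cluster' => 'misc', 'site' => 'eqiad' } }
]

p group_nodes(nodes, ['misc'])
```

Without the `||` defaults, a host with empty parameters would have a `nil` cluster and be skipped by the `clusters.include?` check, which is why no targets were being generated.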

Change 462810 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] Try to make get_clusters work inside labs

https://gerrit.wikimedia.org/r/462810

@fgiunchedi Alex's change ^ should do it. Could you +1 also?

Krenair renamed this task from "exported puppet resources are not queryable: cannot create grafana graphs of EventLogging running in beta cluster" to "Prometheus resources in deployment-prep to create grafana graphs of EventLogging". Sep 25 2018, 9:14 PM

In T204088#4616379, @Ottomata wrote:

BTW, I updated https://wikitech.wikimedia.org/wiki/Prometheus#Access_Prometheus_web_interface with instructions on how to access the Prometheus web interface in deployment-prep. That makes troubleshooting prometheus metrics much easier than doing it while building grafana dashboards.

FWIW the beta Prometheus instance is available at https://beta-prometheus.wmflabs.org/beta/graph, so there is no need to forward that one (though getting LDAP+https access for production Prometheus is on my radar!)

In T204088#4616931, @Ottomata wrote:

@fgiunchedi Alex's change ^ should do it. Could you +1 also?

Yes I'll take a look!

I didn't realize that either! Documenting.

Jdlrobson mentioned this in T203814: Turn on MinervaErrorLogSamplingRate (Schema:WebClientError). Sep 26 2018, 9:05 PM

We're seeing Schemas show up now (YAY!) but still no events showing up there.

Is the hope that https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/462810/ takes care of that, or is something else needed?
Thanks for looking into this. It's really appreciated <3

MoritzMuehlenhoff triaged this task as Medium priority. Sep 28 2018, 9:40 AM

Is the hope that https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/462810/ takes care of that

yup!

Change 462810 merged by Ottomata:
[operations/puppet@production] Introduce cumin::selector dummy class

https://gerrit.wikimedia.org/r/462810

Change 463966 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Use cumin::selector instead of profile::cumin::target in get_clusters

https://gerrit.wikimedia.org/r/463966

ovasileva moved this task from Readers-Web-Kanbanana-Board-2018-19-Q1 to Readers-Web-Kanbanana-Board-2018-19-Q2 on the Web-Team-Backlog board. Oct 2 2018, 5:29 PM

ovasileva edited projects, added Web-Team-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q2); removed Web-Team-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q1).

ovasileva moved this task from To Do to Blocked on Others on the Web-Team-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q2) board. Oct 2 2018, 5:36 PM

@Ottomata still not seeing them. Does that mean https://gerrit.wikimedia.org/r/462810 didn't work? (Sorry if this is a novice question, but I'm not familiar with puppet.)

Tbayer subscribed. Oct 3 2018, 2:49 AM

There are two changes needed, including https://gerrit.wikimedia.org/r/463966. I need ops review before I merge, as this has the potential to touch all production hosts.

Change 463966 merged by Filippo Giunchedi:
[operations/puppet@production] Use cumin::selector instead of profile::cumin::target in get_clusters

https://gerrit.wikimedia.org/r/463966

FINALLY GOT IT!

https://beta-prometheus.wmflabs.org/beta/graph?g0.range_input=1h&g0.expr=kafka_server_BrokerTopicMetrics_MessagesIn_total%7Bkafka_cluster%3D%22jumbo-deployment-prep%22%2C+topic%3D~%22eventlogging_.*%22%7D&g0.tab=1
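
For reference, the link above carries a PromQL expression in its `g0.expr` query parameter. A rough sketch of assembling such a link follows; `graph_url` is a hypothetical helper, and Ruby's form encoder escapes a few characters (for example `~`) that the link above leaves bare, so the result should be equivalent but not byte-identical.

```ruby
require 'uri'

# Build a link to the Prometheus graph page for a PromQL expression.
# Hypothetical helper; the base URL and parameter names are taken from
# the link posted in this task.
def graph_url(expr, base: 'https://beta-prometheus.wmflabs.org/beta/graph', range: '1h')
  params = {
    'g0.range_input' => range,
    'g0.expr'        => expr,
    'g0.tab'         => '1'
  }
  "#{base}?#{URI.encode_www_form(params)}"
end

expr = 'kafka_server_BrokerTopicMetrics_MessagesIn_total' \
       '{kafka_cluster="jumbo-deployment-prep", topic=~"eventlogging_.*"}'

puts graph_url(expr)
```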

Ottomata claimed this task. Oct 3 2018, 5:38 PM

Ottomata moved this task from Next Up to Done on the Analytics-Kanban board.

Ottomata set the point value for this task to 5.

That's awesome!!!!🎉 🎉 🎉 🎉 🎉 🎉
Thank you @Ottomata @Krenair and @fgiunchedi, this is going to be super useful for the team!

Jdlrobson moved this task from Needs QA to Ready for Signoff on the Web-Team-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q2) board. Oct 3 2018, 5:48 PM

Krenair moved this task from To Triage to Backlog on the Beta-Cluster-Infrastructure board. Oct 3 2018, 5:49 PM

Krenair moved this task from Backlog to Done on the Beta-Cluster-Infrastructure board.

Thank you so much!!!

so this is resolved?

Yup!
I can see events here > https://grafana-labs-admin.wikimedia.org/dashboard/db/reading-web-beta-dashboard?orgId=1
Thanks all!

Krenair awarded a token. Oct 3 2018, 7:28 PM

Ottomata mentioned this in T211640: Grafana, icinga, prometheus in cloud-analytics project. Dec 11 2018, 7:18 PM
