
Deprecate cassandra-metrics-collector?
Closed, ResolvedPublic

Description

The RESTBase cluster has been converted to Prometheus metrics, and there seems to be some interest in doing the same with AQS. The collector is currently set up as part of the profile, on the assumption that every Cassandra node should have it configured. We need to either a) convert every cluster to Prometheus and then remove cassandra-metrics-collector entirely, or b) refactor Puppet to make the choice of one or the other configurable.

In addition to AQS, there is a maps cluster that would need to be taken into account as well.

Event Timeline

I'm in favor of deprecating cmc in favor of Prometheus, and going with a), as that seems simpler and is the end state we want to be in anyway.

Eevans triaged this task as Medium priority. Feb 22 2018, 4:44 PM

To summarize a meeting w/ @Gehel :

  • Maps cluster is running Cassandra v2.1 (AQS runs v2.2 and RESTBase 3.11)
  • Maps cluster hasn't received much attention, but hasn't needed much either; it has been trouble-free
  • Currently no development resources allocated to making changes
  • Some metric names changed between v2.1 and v2.2, and again between v2.2 and v3.11
    • Making dashboards work interchangeably for v2.2 and v3.11 clusters will require some work (may not be entirely practical)
    • Making dashboards work interchangeably for v2.1, 2.2, and 3.11 clusters will require even more work (could be even less practical)
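One rough way to gauge how much dashboard work the renames imply would be to diff the metric names exposed by two clusters. The following is only a sketch in Python: the function name and the sample metric names are made up for illustration and are not taken from real exporter output.

```python
from typing import Dict, Iterable, List


def diff_metric_names(old_names: Iterable[str],
                      new_names: Iterable[str]) -> Dict[str, List[str]]:
    """Split two collections of metric names into shared and version-specific sets."""
    old, new = set(old_names), set(new_names)
    return {
        "only_old": sorted(old - new),   # names a dashboard for the newer version would miss
        "only_new": sorted(new - old),   # names a dashboard for the older version would miss
        "shared": sorted(old & new),     # names that already work for both
    }


if __name__ == "__main__":
    # Hypothetical metric names, for illustration only.
    v22 = ["cassandra_columnfamily_readlatency", "cassandra_storage_load"]
    v311 = ["cassandra_table_readlatency", "cassandra_storage_load"]
    for bucket, names in diff_metric_names(v22, v311).items():
        print(bucket, names)
```

The smaller the "only_old" and "only_new" buckets, the more practical a shared dashboard is likely to be.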

I propose that we first start with the AQS cluster by installing the Prometheus exporter and cassandra-metrics-collector side-by-side, and see where we stand with the dashboards. If it turns out to be straightforward to make the dashboards work interchangeably, then we can consider either a) doing the same with Maps, or b) upgrading Maps to Cassandra 2.2.

We should probably also reconsider a Puppet refactoring to make the choice of metrics collection configurable (since a quick transition is seeming less and less tractable).

AQS is now running the JMX agent, and its metrics are available in the Prometheus analytics instance (not the services one). @Eevans I'd love to keep one set of dashboards shared between RESTBase and AQS/Maps to avoid duplicated effort; whenever you have time, let's chat about the differences between 2.2 and 3.x.

I haven't looked too deeply, but if we could s/columnfamily/table/ the metric names (either via prometheus_jmx_exporter.yaml, or on the Prometheus server side of things), I think we'd be 99% of the way there, without any calisthenics on the Grafana side.
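To illustrate the effect of that substitution, here is a minimal Python sketch. It is not how the rename would actually be deployed (that would be an exporter or Prometheus relabeling rule); the function name and the example metric names are hypothetical.

```python
def normalize_metric_name(name: str) -> str:
    """Rewrite 'columnfamily' to 'table' so older and newer names line up.

    Case-sensitive, and replaces every occurrence in the name.
    """
    return name.replace("columnfamily", "table")


if __name__ == "__main__":
    # Hypothetical metric names, for illustration only.
    for metric in ("cassandra_columnfamily_readlatency",
                   "cassandra_table_readlatency"):
        print(metric, "->", normalize_metric_name(metric))
```

Names that already use "table" pass through unchanged, so the same rule could be applied to every cluster regardless of version.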

After T193017, the AQS cluster shares the same cassandra dashboards with Restbase. I tried this morning to disable the cassandra-metrics-collector on AQS (touching /etc/cassandra-metrics-collector/disable) but it doesn't seem to work due to:

  1. the Puppet class cassandra::metrics, which hardcodes the collector version to use:
if $target_cassandra_version == '2.1' {
    $collector_version = '2.1.1-20160520.211019-1'
} elsif $target_cassandra_version == '2.2' {
    $collector_version = '3.1.4-20170427.001104-1'
} else {
    $collector_version = '4.1.0'
}
  2. the jar for cassandra-metrics-collector version 4.1.0 is not present on AQS nodes (they are not in the scap deploy list of cassandra-metrics-collector).
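To make the fall-through in that conditional explicit, here is the same selection logic transliterated into Python. This is only an illustration of the Puppet snippet quoted above (the function name is made up); the real logic lives in the cassandra::metrics class.

```python
def collector_version(target_cassandra_version: str) -> str:
    """Mirror the hardcoded version selection from the Puppet snippet above."""
    if target_cassandra_version == "2.1":
        return "2.1.1-20160520.211019-1"
    elif target_cassandra_version == "2.2":
        return "3.1.4-20170427.001104-1"
    # Every other value, including an unset or unexpected one, lands here.
    return "4.1.0"


if __name__ == "__main__":
    for version in ("2.1", "2.2", "3.11"):
        print(version, "->", collector_version(version))
```

Because there is no branch for "no collector at all", a version is always selected, regardless of whether the collector is meant to be running.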

While we wait for maps I think that a profile option to disable the metrics collector would be good for Restbase/AQS.

Change 431546 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::aqs: deprecate cassandra-metrics-collector

https://gerrit.wikimedia.org/r/431546

Change 431546 merged by Elukey:
[operations/puppet@production] role::aqs: deprecate cassandra-metrics-collector

https://gerrit.wikimedia.org/r/431546

Change 431710 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] cassandra::metrics: propagate ensure parameter to wmf-auto-restart

https://gerrit.wikimedia.org/r/431710

Change 431710 merged by Elukey:
[operations/puppet@production] cassandra::metrics: propagate ensure parameter to wmf-auto-restart

https://gerrit.wikimedia.org/r/431710

Cassandra metrics collector removed from AQS!

Change 444247 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[operations/puppet@production] RESTBase: Disable cassandra-metrics-collector

https://gerrit.wikimedia.org/r/444247

Change 444247 merged by Filippo Giunchedi:
[operations/puppet@production] RESTBase: Disable cassandra-metrics-collector

https://gerrit.wikimedia.org/r/444247

Eevans changed the task status from Open to Stalled. Jul 10 2018, 7:28 PM

This service is now disabled/removed on the RESTBase cluster; thanks a bunch for the assist!

However, the original objective (the deprecation of the collector) remains unresolved until (if?) the maps cluster is upgraded to Cassandra 2.2.x. :(

elukey@maps1001:~$ dpkg -l | grep cassandra
ii  cassandra                            2.2.6-wmf5                        all          distributed storage system for structured data
ii  cassandra-tools                      2.2.6-wmf5                        all          distributed storage system for structured data
ii  cassandra-tools-wmf                  1.0.2-1                           all          add-ons to make Wikimedia Cassandra operations easier
ii  python-cassandra                     3.7.1-2.1                         amd64        Python driver for Apache Cassandra

@Eevans I re-discovered this task while trying to deploy AQS in Cloud without a deployment server, since it failed due to cassandra::metrics (even if absented, it tries to ensure some scap config). Maps seems to be on 2.2 now; maybe we could think about moving the cluster as we did for AQS?

Change 662634 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] cassandra::metrics: extend ensure to scap target

https://gerrit.wikimedia.org/r/662634

Change 662634 abandoned by Elukey:
[operations/puppet@production] cassandra::metrics: extend ensure to scap target

Reason:

https://gerrit.wikimedia.org/r/662634

Change 710984 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] maps: disable cassandra metrics collector

https://gerrit.wikimedia.org/r/710984

Change 710985 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] cassandra: remove cassandra-metrics-collector

https://gerrit.wikimedia.org/r/710985

Change 710984 merged by Hnowlan:

[operations/puppet@production] maps: disable cassandra metrics collector

https://gerrit.wikimedia.org/r/710984

Change 710985 merged by Hnowlan:

[operations/puppet@production] cassandra: remove cassandra-metrics-collector

https://gerrit.wikimedia.org/r/710985