Port fundraising stats off Ganglia
Closed, Resolved (Public)

Description

In T145659: Port application-specific metrics from ganglia to prometheus I went over the recently-updated ganglia RRDs to audit what's left. P4571 lists some related to fundraising which will need porting to either graphite or prometheus. Besides the common basic machine metrics, most RRDs seem to be related to ActiveMQ and various queues. cc @Jgreen

Event Timeline

Fundraising uses mostly stock ganglia stuff, but there are a couple of simple collectors I've written or imported from production. It shouldn't be very difficult to refactor those into Prometheus collectors; I just need some review/clarification on Prometheus's collection model.

@Jgreen I've outlined some of the deployment and architecture at https://wikitech.wikimedia.org/wiki/Prometheus, plus docs at https://prometheus.io. By default metrics are polled/collected via HTTP from each machine by the Prometheus server. It is also possible to push metrics as described in https://prometheus.io/docs/instrumenting/pushing/, but only for limited use cases.
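For illustration, a minimal sketch of that pull model using the Python prometheus_client library; the port and metric name are made up, and real hosts would normally run a ready-made exporter such as prometheus-node-exporter instead:

```
# Pull model: the host only exposes /metrics over HTTP; the Prometheus server
# scrapes it on its own schedule, nothing is pushed out by default.
# (The push path via a pushgateway exists too, but for limited use cases.)
import time

from prometheus_client import Counter, start_http_server

DEMO = Counter('demo_events_total', 'Events seen by this demo process')

if __name__ == '__main__':
    start_http_server(8000)  # arbitrary port for the /metrics endpoint
    while True:
        DEMO.inc()
        time.sleep(10)
```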

I think we have at least two options depending on where the Prometheus server is hosted:

  1. Production, re-using the existing servers and collecting the metrics via HTTP, likely through a proxy, from the individual machines
  2. FR, one or two servers collecting from inside the FR network. Querying of data from the outside (e.g. grafana) happens via HTTP (or HTTPS, via a terminator) to the Prometheus server.

I'd recommend 2 because it is less reliant on production/FR communication, though there might be other requirements I'm not aware of.

By way of scoping:

My understanding is that we're running prometheus-node-exporter on each host to collect local metrics and listen on a TCP port for HTTP requests from the relevant Prometheus server.
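For illustration, a quick way to see what the Prometheus server pulls on each scrape is to fetch the exporter's /metrics page directly; a minimal sketch, assuming the node-exporter default port 9100:

```
# Fetch the node exporter's plaintext /metrics page, i.e. what Prometheus
# sees when it scrapes this host. Port 9100 is the node-exporter default.
import urllib.request

with urllib.request.urlopen('http://localhost:9100/metrics', timeout=5) as resp:
    text = resp.read().decode()

# Print a few familiar metrics, e.g. the load averages.
for line in text.splitlines():
    if line.startswith('node_load'):
        print(line)
```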

prometheus-node-exporter is available as a backport for jessie (version is 0.12.0+ds+really0.12.0-1~bpo8+1) but not available for trusty. To build a trusty package there's a formidable tree of dependencies, at least:

golang-github-julienschmidt-httprouter
golang-github-prometheus-log
golang-golang-x-crypto
golang-golang-x-net
golang-goprotobuf
golang-logrus
golang-procfs
golang-prometheus-client
golang-protobuf-extensions
golang-x-text
golang-github-bugsnag-bugsnag-go
golang-github-getsentry-raven-go
golang-github-stretchr-testify
golang-github-stvp-go-udp-testing
golang-github-tobi-airbrake-go

At least one of these is built against GNU C, so to do this sanely we'd have to be prepared to rebuild packages whenever there's another vulnerability.

So it may be preferable to wait until all fundraising hosts have been updated to jessie, so we get to take advantage of upstream jessie-backports security maintenance.

@Jgreen any idea when it will happen? (all FR to jessie, I mean).

We have six Precise boxes left to replace by the end of March; most are waiting on procurement tasks. Once these are done we'll have all metric-worthy services on Jessie.

There are also six Trusty boxes, including various log collectors, puppetmasters, and bastions. We can live without metrics for these.

@Jgreen any news/updates on having FR fully on jessie?

We're still waiting for the hardware install to replace the last Precise box (indium, a syslog collector), and we have roughly a half-dozen Trusty boxes to reimage.

We can live without the ganglia reporting for now if it's important to shut that service down. We probably won't get to Prometheus collection until this summer, given the amount of higher-priority stuff in the pipeline.

Awesome, thanks for the update @Jgreen! I think we can live with ganglia on life support until the summer; it is unmaintained and deprecated but otherwise working.

cwdent subscribed.

Hi @fgiunchedi - prometheus is now running on pay-lvs*:9090

They are only watching themselves and one eqiad host, which also has the mysqld exporter.

Wondering what the next step is to get the prod install scraping them and making the data available to grafana; can you point me in the right direction?

That's awesome @cwdent!

The easiest way to make the data available in grafana is to allow access from the machines running grafana server (ATM krypton.eqiad.wmnet).

To collect aggregated data into the Prometheus global instance you should allow access from prometheus100[34].eqiad.wmnet. Configuration-wise, the global instance will need to know about the pay-lvs Prometheus and what metrics to pull from there; the config is in puppet at modules/role/manifests/prometheus/global.pp.
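For illustration only (not the actual puppet configuration), federation boils down to the global instance periodically pulling matching series from the lower Prometheus's /federate endpoint, roughly like this; the host name and selector below are placeholders:

```
# Sketch of a federation pull: ask the pay-lvs Prometheus's /federate endpoint
# for all series matching the selector. Host and selector are placeholders;
# the real list lives in the puppet config mentioned above.
import urllib.parse
import urllib.request

base = 'http://pay-lvs-placeholder:9090'
params = urllib.parse.urlencode([('match[]', '{job="node"}')])

with urllib.request.urlopen(base + '/federate?' + params, timeout=10) as resp:
    print(resp.read().decode()[:1000])  # plaintext exposition format
```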

Thanks @fgiunchedi, I had assumed we'd be using the latter "federated" approach, so I did make those firewall holes already (but would still need to tweak iptables). Puppet config looks pretty simple, I will take a stab at a patch.

Do you think one way is better than the other? Any reason we'd want to feed the data directly to grafana instead?

I see them as two different use cases, mostly: accessing Prometheus directly from grafana lets you get at all the metrics and datapoints that the Prometheus server is collecting and retaining, whereas in the global instance we store aggregated data long-term (one year ATM) from the Prometheus servers lower in the hierarchy.

One reason is query efficiency: a query like "give me the aggregate memory from all clusters in all sites" would otherwise need to fetch metrics for every machine, whereas against the global instance it only looks at data already aggregated by cluster and by site. Another reason is scaling, as we wouldn't be able to store the exact same data from all sites in one big Prometheus.
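As a concrete, purely illustrative example of such an aggregate query against the Prometheus HTTP API; the server address, the metric name, and the cluster/site labels are assumed here rather than taken from the actual setup:

```
# Aggregate free memory by cluster and site via the Prometheus query API.
# Server address, metric and label names are illustrative only.
import json
import urllib.parse
import urllib.request

query = 'sum by (cluster, site) (node_memory_MemFree)'
url = ('http://prometheus-global-placeholder:9090/api/v1/query?'
       + urllib.parse.urlencode({'query': query}))

with urllib.request.urlopen(url, timeout=10) as resp:
    data = json.load(resp)

for series in data['data']['result']:
    print(series['metric'], series['value'][1])  # labels and current value
```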

I'd suggest a combined approach to replicate what's already happening in production: grafana gets direct access for drilling down to a single machine, and aggregated data from the fr cluster (e.g. aggregated metrics from all machines) is federated up to the global instance. On the latter point, I think the easiest approach is to map what is happening in Ganglia now and treat fr as another production cluster (e.g. mysql, appserver, swift, etc.).

Ejegg added subscribers: Ejegg, TerraCodes, Jay8g.

For CiviCRM code:

PHP client library: https://github.com/Jimdo/prometheus_client_php
Metric types: https://prometheus.io/docs/concepts/metric_types/

All we're using it for is to count donations per gateway and overall at the end of each donation queue consumer run. Sounds like the 'counter' metric is just fine.
The client library says it uses redis for client-side aggregation. Any reason we can't use the same instance as we do for queues?
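For illustration, the counting would have roughly the shape sketched below, written with the Python prometheus_client for brevity rather than the PHP client linked above; the metric and label names are invented:

```
# Count donations per gateway with a labelled counter; summing over the
# 'gateway' label gives the overall total. The real code would use
# prometheus_client_php and its Redis-backed storage instead.
from prometheus_client import CollectorRegistry, Counter, generate_latest

registry = CollectorRegistry()
donations = Counter('donations_processed_total',
                    'Donations processed by the queue consumer',
                    ['gateway'], registry=registry)

def record_batch(counts_by_gateway):
    """Call at the end of a donation queue consumer run."""
    for gateway, count in counts_by_gateway.items():
        donations.labels(gateway=gateway).inc(count)

record_batch({'gateway_a': 12, 'gateway_b': 30})
print(generate_latest(registry).decode())  # exposition-format output
```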

IMO if you can simply increment a counter in fr-queue redis, we can adjust the existing redis-prometheus collector to fetch this value too.
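A minimal sketch of that suggestion, assuming invented key names and connection details (the real key layout would be whatever the collector ends up expecting):

```
# Queue consumer side: just bump a per-gateway counter key in the fr-queue
# Redis instance; the existing redis->prometheus collector could then read
# keys with this prefix and expose them as counter metrics.
import redis

r = redis.StrictRedis(host='localhost', port=6379)  # placeholder connection

def count_donation(gateway, n=1):
    r.incrby('stats:donations:' + gateway, n)

count_donation('gateway_a')
```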

nginx and memcache metric collectors are done

@fgiunchedi as of now we're done with ganglia! I shut down the aggregators yesterday. I'm going to leave this task open while we finish some minor cleanup, but as far as production ganglia goes we are no longer using it.

Jgreen lowered the priority of this task from Medium to Low. Sep 20 2017, 1:40 AM

@Jgreen that's awesome news! I think we can finally shut down ganglia for good!

Closing this; there are a few subtasks left, but we have parity with ganglia.