Description

In T145659: Port application-specific metrics from ganglia to prometheus I went over the recently-updated ganglia rrds to audit what's left. In P4571 there are some related to fundraising which will need porting to either graphite or prometheus. Besides the common basic machine metrics, most rrds seem to be related to ActiveMQ and various queues. cc @Jgreen
Status | Subtype | Assigned | Task
--- | --- | --- | ---
Resolved | | Jgreen | T91508 [Epic] overhaul fundraising cluster monitoring
Resolved | | cwdent | T152562 Port fundraising stats off Ganglia
Resolved | | cwdent | T175044 refactor collect_frqueue_redis_via_gmetric to prometheus
Declined | | None | T175738 Long term storage for frack prometheus data
 | | | Unknown Object (Task)
Resolved | | cwdent | T186073 Rack/setup frmon1001
Resolved | | ayounsi | T198516 NAT and DNS for fundraising monitor host
Resolved | | Jgreen | T198648 Authentication for grafana
Resolved | Spike | Jgreen | T175850 Spike: Enumerate remaining unported stats
Resolved | | Jgreen | T176319 remove fundraising firewall rules related to ganglia
Declined | | None | T176494 prometheus collector or exporter for banner log pipeline
Resolved | | cwdent | T176495 prometheus collector or exporter for postfix metrics
Event Timeline
Fundraising uses mostly stock ganglia stuff, but there are a couple of simple collectors I've written or imported from production. It shouldn't be very difficult to refactor those into Prometheus collectors; I just need some review/clarification on Prometheus's collection model.
@Jgreen I've outlined some of the deployment and architecture at https://wikitech.wikimedia.org/wiki/Prometheus plus the docs at https://prometheus.io. By default metrics are polled/collected over HTTP from each machine by the Prometheus server. It is also possible to push metrics as described in https://prometheus.io/docs/instrumenting/pushing/, but only for limited use cases.
I think we have at least two options depending on where the Prometheus server is hosted:
- Production: re-use the existing servers and collect the metrics over HTTP, likely via a proxy to reach the individual machines
- FR: one or two servers collecting from inside the FR network. Queries from the outside (e.g. grafana) go over HTTP (or HTTPS, via a TLS terminator) to the Prometheus server.
I'd recommend option 2 because it is less reliant on production/FR communication, though there might be other requirements I'm not aware of.
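For context on what the pull model looks like in practice, a node-exporter scrape job on an FR-hosted Prometheus server would be roughly along these lines. This is a minimal sketch only: the job name, interval and target hostnames are placeholders, and node-exporter's default port of 9100 is assumed.

```yaml
# prometheus.yml fragment (sketch): the FR Prometheus server polls each
# host's node-exporter over HTTP; nothing is pushed from the hosts.
scrape_configs:
  - job_name: 'node'
    scrape_interval: 60s
    static_configs:
      - targets:
          - 'frhost1001.frack.example:9100'   # placeholder hostnames
          - 'frhost1002.frack.example:9100'
```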
By way of scoping:
My understanding is that we're using prometheus-node-exporter on each host to collect local metrics and listen on a TCP port for HTTP requests from the relevant Prometheus server.
prometheus-node-exporter is available as a backport for jessie (version is 0.12.0+ds+really0.12.0-1~bpo8+1) but not available for trusty. To build a trusty package there's a formidable tree of dependencies, at least:
golang-github-julienschmidt-httprouter
golang-github-prometheus-log
golang-golang-x-crypto
golang-golang-x-net
golang-goprotobuf
golang-logrus
golang-procfs
golang-prometheus-client
golang-protobuf-extensions
golang-x-text
golang-github-bugsnag-bugsnag-go
golang-github-getsentry-raven-go
golang-github-stretchr-testify
golang-github-stvp-go-udp-testing
golang-github-tobi-airbrake-go
At least one of these is built against the GNU C library, so to do this sanely we would have to be prepared to rebuild packages whenever there's another vulnerability.
So it may be preferable to wait until all fundraising hosts have been updated to jessie, so we get to take advantage of upstream jessie-backports security maintenance.
We have six Precise boxes left to replace by the end of March; most are waiting on procurement tasks. Once these are done we'll have all metric-worthy services on Jessie.

There are also six Trusty boxes, including various log collectors, puppetmasters, and bastions. We can live without metrics for these.
We're still waiting for the hardware install to replace the last Precise box (indium, a syslog collector), and we have roughly a half-dozen Trusty boxes to reimage.
We can live without the ganglia reporting for now if it's important to shut that service down. We probably won't get to Prometheus collection until this summer, given the amount of higher-priority stuff in the pipeline.
Awesome, thanks for the update @Jgreen! I think we can live with ganglia on life support until the summer; it is unmaintained and deprecated but otherwise working.
Hi @fgiunchedi - prometheus is now running on pay-lvs*:9090
They are only watching themselves and one eqiad host, which also has the mysqld exporter.
I'm wondering what the next step is to get the prod install scraping them and making the data available to grafana. Can you point me in the right direction?
That's awesome @cwdent !
The easiest way to make the data available in grafana is to allow access from the machines running grafana server (ATM krypton.eqiad.wmnet).
To collect aggregated data into the Prometheus global instance you should allow access from prometheus100[34].eqiad.wmnet. Configuration-wise, the global instance will need to know about the pay-lvs Prometheus and what metrics to pull from there; the config is in puppet at modules/role/manifests/prometheus/global.pp.
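For illustration (not the actual output of the puppet module), a federation job on the global instance ends up looking roughly like this; the job name, match[] expression and target below are placeholders.

```yaml
# Sketch of a federation scrape job on the global Prometheus instance,
# pulling selected series from the FR Prometheus over its /federate
# endpoint. Names and targets are illustrative only.
scrape_configs:
  - job_name: 'federate-frack'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"cluster:.*"}'   # e.g. only pre-aggregated series
    static_configs:
      - targets:
          - 'pay-lvs-placeholder:9090'  # the FR Prometheus host(s)
```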
Thanks @fgiunchedi, I had assumed we'd be using the latter "federated" approach, so I did make those firewall holes already (but would still need to tweak iptables). The puppet config looks pretty simple; I will take a stab at a patch.

Do you think one way is better than the other? Any reason we'd want to feed the data directly to grafana instead?
I see them as two different use cases, mostly: accessing the FR Prometheus directly from grafana will let you access all metrics and datapoints that the Prometheus server is collecting and retaining.
Within the global instance, by contrast, we store aggregated data long-term (one year ATM) from Prometheus servers lower in the hierarchy.
One reason is query efficiency (e.g. "give me the aggregate memory from all clusters in all sites"): instead of fetching metrics for all machines, such a query only looks at data already aggregated by cluster and by site. Another reason is scaling, as we wouldn't be able to store the exact same data from all sites in one big Prometheus.
I'd suggest a combined approach to replicate what's already happening in production: namely, direct access from grafana for drilling down to the single machine, plus aggregated data from the fr cluster (e.g. aggregated metrics from all machines) federated up to the global instance. On the latter point, I think the easiest approach is to map what is happening in Ganglia now and treat fr as another production cluster (e.g. mysql, appserver, swift, etc.).
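To make the "aggregated metrics from all machines" part concrete, the FR Prometheus could precompute cluster-level series with recording rules along these lines, and only those series would be federated up. This is a sketch: the rule and metric names are illustrative (and depend on the node-exporter version), not the exact rules production uses.

```yaml
# Sketch of recording rules that aggregate per-machine node-exporter
# metrics by cluster and site, so only the aggregates are federated.
groups:
  - name: cluster_aggregates
    rules:
      - record: cluster:node_memory_MemFree_bytes:sum
        expr: sum by (cluster, site) (node_memory_MemFree_bytes)
      - record: cluster:node_cpu_seconds:rate5m
        expr: sum by (cluster, site, mode) (rate(node_cpu_seconds_total[5m]))
```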
For CiviCRM code:
PHP client library: https://github.com/Jimdo/prometheus_client_php
Metric types: https://prometheus.io/docs/concepts/metric_types/
All we're using it for is to count donations per gateway and overall at the end of each donation queue consumer run. Sounds like the 'counter' metric is just fine.
The client library says it uses redis for client-side aggregation. Any reason we can't use the same instance as we do for queues?
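For what it's worth, a minimal sketch of that counter using the client library's Redis storage adapter could look like this. The connection details and metric/label names are placeholders, not a decision on which Redis instance to use.

```php
<?php
// Sketch only: a per-gateway donation counter stored in Redis via
// prometheus_client_php, incremented at the end of a queue consumer run.
require 'vendor/autoload.php';

use Prometheus\CollectorRegistry;
use Prometheus\Storage\Redis;

// Placeholder connection details for whichever Redis instance we pick.
$adapter = new Redis(['host' => '127.0.0.1', 'port' => 6379]);
$registry = new CollectorRegistry($adapter);

$counter = $registry->getOrRegisterCounter(
    'civicrm',            // namespace (placeholder)
    'donations_total',    // metric name (placeholder)
    'Donations processed, by gateway',
    ['gateway']
);
$counter->incBy(5, ['paypal']);   // e.g. 5 donations for this gateway this run
```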
IMO if you can simply increment a counter in fr-queue redis, we can adjust the existing redis-prometheus collector to fetch this value too.
@fgiunchedi as of now we're done with ganglia! I shut down the aggregators yesterday. I'm going to leave this task open while we finish some minor cleanup, but as far as production ganglia goes we are no longer using it.