Description

In T145659: Port application-specific metrics from ganglia to prometheus I went over the recently-updated ganglia rrds to audit what's left. In P4571 there are some related to fundraising which will need porting to either graphite or prometheus. Besides the common basic machine metrics, most rrds seem to be related to ActiveMQ and various queues. cc @Jgreen
Status | Subtype | Assigned | Task
--- | --- | --- | ---
Resolved | | Jgreen | T91508 [Epic] overhaul fundraising cluster monitoring
Resolved | | cwdent | T152562 Port fundraising stats off Ganglia
Resolved | | cwdent | T175044 refactor collect_frqueue_redis_via_gmetric to prometheus
Declined | | None | T175738 Long term storage for frack prometheus data
 | | | Unknown Object (Task)
Resolved | | cwdent | T186073 Rack/setup frmon1001
Resolved | | ayounsi | T198516 NAT and DNS for fundraising monitor host
Resolved | | Jgreen | T198648 Authentication for grafana
Resolved | Spike | Jgreen | T175850 Spike: Enumerate remaining unported stats
Resolved | | Jgreen | T176319 remove fundraising firewall rules related to ganglia
Declined | | None | T176494 prometheus collector or exporter for banner log pipeline
Resolved | | cwdent | T176495 prometheus collector or exporter for postfix metrics
Event Timeline
Fundraising uses mostly stock ganglia stuff, but there are a couple of simple collectors I've written or imported from production. It shouldn't be very difficult to refactor those into Prometheus collectors; I just need some review/clarification on Prometheus's collection model.
@Jgreen I've outlined some of the deployment and architecture at https://wikitech.wikimedia.org/wiki/Prometheus plus the docs at https://prometheus.io. By default metrics are polled/collected over HTTP from each machine by the Prometheus server. It is also possible to push metrics as described in https://prometheus.io/docs/instrumenting/pushing/, but only for limited use cases.
I think we have at least two options depending on where the Prometheus server is hosted:
- Production: re-use the existing servers and collect the metrics over HTTP, likely via a proxy to reach the individual machines
- FR: one or two servers collecting from inside the FR network. Queries from the outside (e.g. grafana) go over HTTP (or HTTPS, via a TLS terminator) to the Prometheus server.
I'd recommend option 2 because it is less reliant on production/FR communication, though there might be other requirements I'm not aware of.
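For context on what the pull model looks like in practice, a node-exporter scrape job on an FR-hosted Prometheus server would be roughly along these lines. This is a minimal sketch only: the job name, interval and target hostnames are placeholders, and node-exporter's default port of 9100 is assumed.

```yaml
# prometheus.yml fragment (sketch): the FR Prometheus server polls each
# host's node-exporter over HTTP; nothing is pushed from the hosts.
scrape_configs:
  - job_name: 'node'
    scrape_interval: 60s
    static_configs:
      - targets:
          - 'frhost1001.frack.example:9100'   # placeholder hostnames
          - 'frhost1002.frack.example:9100'
```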
By way of scoping:
My understanding is that we're using prometheus-node-exporter on each host to collect local metrics and listen on a TCP port for HTTP requests from the relevant Prometheus server.
prometheus-node-exporter is available as a backport for jessie (version is 0.12.0+ds+really0.12.0-1~bpo8+1) but not available for trusty. To build a trusty package there's a formidable tree of dependencies, at least:
golang-github-julienschmidt-httprouter
golang-github-prometheus-log
golang-golang-x-crypto
golang-golang-x-net
golang-goprotobuf
golang-logrus
golang-procfs
golang-prometheus-client
golang-protobuf-extensions
golang-x-text
golang-github-bugsnag-bugsnag-go
golang-github-getsentry-raven-go
golang-github-stretchr-testify
golang-github-stvp-go-udp-testing
golang-github-tobi-airbrake-go
At least one of these is built against the GNU C library, so to do this sanely we would have to be prepared to rebuild packages whenever there's another vulnerability.
So it may be preferable to wait until all fundraising hosts have been updated to jessie, so we get to take advantage of upstream jessie-backports security maintenance.
We have six Precise boxes left to replace by the end of March; most are waiting on procurement tasks. Once these are done we'll have all metric-worthy services on Jessie.

There are also six Trusty boxes, including various log collectors, puppetmasters, and bastions. We can live without metrics for these.
We're still waiting for the hardware install to replace the last Precise box (indium, a syslog collector), and we have roughly a half-dozen Trusty boxes to reimage.
We can live without the ganglia reporting for now if it's important to shut that service down. We probably won't get to Prometheus collection until this summer, given the amount of higher-priority stuff in the pipeline.
Awesome, thanks for the update @Jgreen! I think we can live with ganglia on life support until the summer; it is unmaintained and deprecated but otherwise working.
Hi @fgiunchedi - prometheus is now running on pay-lvs*:9090
They are only watching themselves and one eqiad host, which also has the mysqld exporter.
I'm wondering what the next step is to get the prod install scraping them and making the data available to grafana. Can you point me in the right direction?
That's awesome @cwdent !
The easiest way to make the data available in grafana is to allow access from the machines running grafana server (ATM krypton.eqiad.wmnet).
To collect aggregated data into the Prometheus global instance you should allow access from prometheus100[34].eqiad.wmnet. Configuration-wise, the global instance will need to know about the pay-lvs Prometheus and what metrics to pull from there; the config is in puppet at modules/role/manifests/prometheus/global.pp.
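For illustration (not the actual output of the puppet module), a federation job on the global instance ends up looking roughly like this; the job name, match[] expression and target below are placeholders.

```yaml
# Sketch of a federation scrape job on the global Prometheus instance,
# pulling selected series from the FR Prometheus over its /federate
# endpoint. Names and targets are illustrative only.
scrape_configs:
  - job_name: 'federate-frack'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"cluster:.*"}'   # e.g. only pre-aggregated series
    static_configs:
      - targets:
          - 'pay-lvs-placeholder:9090'  # the FR Prometheus host(s)
```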
Thanks @fgiunchedi, I had assumed we'd be using the latter "federated" approach, so I did make those firewall holes already (but would still need to tweak iptables). The puppet config looks pretty simple; I will take a stab at a patch.

Do you think one way is better than the other? Any reason we'd want to feed the data directly to grafana instead?
I see them as two different use cases, mostly: accessing the FR Prometheus directly from grafana will let you access all metrics and datapoints that the Prometheus server is collecting and retaining.
Within the global instance, by contrast, we store aggregated data long-term (one year ATM) from Prometheus servers lower in the hierarchy.
One reason is query efficiency (e.g. "give me the aggregate memory from all clusters in all sites"): instead of fetching metrics for all machines, such a query only looks at data already aggregated by cluster and by site. Another reason is scaling, as we wouldn't be able to store the exact same data from all sites in one big Prometheus.
I'd suggest a combined approach to replicate what's already happening in production: namely, direct access from grafana for drilling down to the single machine, plus aggregated data from the fr cluster (e.g. aggregated metrics from all machines) federated up to the global instance. On the latter point, I think the easiest approach is to map what is happening in Ganglia now and treat fr as another production cluster (e.g. mysql, appserver, swift, etc.).
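To make the "aggregated metrics from all machines" part concrete, the FR Prometheus could precompute cluster-level series with recording rules along these lines, and only those series would be federated up. This is a sketch: the rule and metric names are illustrative (and depend on the node-exporter version), not the exact rules production uses.

```yaml
# Sketch of recording rules that aggregate per-machine node-exporter
# metrics by cluster and site, so only the aggregates are federated.
groups:
  - name: cluster_aggregates
    rules:
      - record: cluster:node_memory_MemFree_bytes:sum
        expr: sum by (cluster, site) (node_memory_MemFree_bytes)
      - record: cluster:node_cpu_seconds:rate5m
        expr: sum by (cluster, site, mode) (rate(node_cpu_seconds_total[5m]))
```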
For CiviCRM code:
PHP client library: https://github.com/Jimdo/prometheus_client_php
Metric types: https://prometheus.io/docs/concepts/metric_types/
All we're using it for is to count donations per gateway and overall at the end of each donation queue consumer run. Sounds like the 'counter' metric is just fine.
The client library says it uses redis for client-side aggregation. Any reason we can't use the same instance as we do for queues?
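For what it's worth, a minimal sketch of that counter using the client library's Redis storage adapter could look like this. The connection details and metric/label names are placeholders, not a decision on which Redis instance to use.

```php
<?php
// Sketch only: a per-gateway donation counter stored in Redis via
// prometheus_client_php, incremented at the end of a queue consumer run.
require 'vendor/autoload.php';

use Prometheus\CollectorRegistry;
use Prometheus\Storage\Redis;

// Placeholder connection details for whichever Redis instance we pick.
$adapter = new Redis(['host' => '127.0.0.1', 'port' => 6379]);
$registry = new CollectorRegistry($adapter);

$counter = $registry->getOrRegisterCounter(
    'civicrm',            // namespace (placeholder)
    'donations_total',    // metric name (placeholder)
    'Donations processed, by gateway',
    ['gateway']
);
$counter->incBy(5, ['paypal']);   // e.g. 5 donations for this gateway this run
```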
IMO if you can simply increment a counter in fr-queue redis, we can adjust the existing redis-prometheus collector to fetch this value too.
@fgiunchedi as of now we're done with ganglia! I shut down the aggregators yesterday. I'm going to leave this task open while we finish some minor cleanup, but as far as production ganglia goes we are no longer using it.