
Upgrade graphite from 0.9.x to 1.x
Closed, ResolvedPublic

Description

I believe we currently run 0.9.13, but I'm not exactly sure. We seem to have some of the 0.9.14 features in prod.

A few of the useful features that have been added since:

  • (0.9.15) Make removeAbovePercentile() work again. (Fixed index exception)
  • (1.0.0) Support for time units (sec, msec) in yUnitSystem. – https://github.com/graphite-project/graphite-web/pull/1220
  • (1.0.0) Support for globstar matching in target paths.
  • (1.0.0) Faster calculation algorithm for movingAverage().
  • (1.0.0) Improve json rendering performance. – Helps Grafana
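As a rough illustration of how a few of these features might be used together, the sketch below builds a render API URL that combines a globstar (`**`) path, `movingAverage()`, `removeAbovePercentile()`, and the `msec` unit system. The host name and metric paths are hypothetical and only stand in for real ones; the exact globstar matching behavior should be checked against the Graphite release notes.

```python
from urllib.parse import urlencode

# Hypothetical host and metric names, for illustration only.
GRAPHITE = "https://graphite.example.org"


def render_url(targets, **params):
    """Build a Graphite render API URL from targets and extra query parameters."""
    query = [("target", t) for t in targets]
    query += sorted(params.items())
    return f"{GRAPHITE}/render?{urlencode(query)}"


url = render_url(
    [
        # Globstar (**) is meant to match across several path levels (1.0.0+).
        "movingAverage(frontend.**.responseStart.p75, 10)",
        # removeAbovePercentile() is listed as fixed in 0.9.15.
        "removeAbovePercentile(frontend.navtiming.*.p75, 95)",
    ],
    # "from" is a Python keyword, so the parameters are passed as a dict.
    **{"from": "-1h", "yUnitSystem": "msec"},
)
print(url)
```

The URL is only constructed here, not fetched, so the sketch runs without access to a Graphite instance.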

Added functions between 0.9.13 and 1.0.2:

+aggregateLine
+applyByNode
 ..
 averageAbove
 averageBelow
+averageOutsidePercentile
 averageSeries
 ..
+delay
 ..
 divideSeries
+divideSeriesLists
 ..
+exponentialMovingAverage
+fallbackSeries
 ..
 group
 groupByNode
+groupByNodes
 ..
 integral
+integralByInterval
+interpolate
+invert
+linearRegression
+linearRegressionAnalysis
 ..
 movingAverage
+movingMax
 movingMedian
+movingMin
+movingSum
 multiplySeries
+multiplySeriesWithWildcards
 ..
 offset
+offsetToZero
+pow
+powSeries
+reduceSeries
 ..
 removeAbovePercentile
 removeBelowPercentile
+removeBetweenPercentile
+removeEmptySeries
 ..
 sortByMaxima
 sortByMinima
+sortByName
+sortByTotal
+squareRoot
 ..
 timeShift
+timeSlice
+verticalLine
+weightedAverage

Other notable changes that may affect our upgrading:

  • (1.0.0) [Graphite-Web]
    • Brand new clustering implementation using a pool of worker threads and persistent connections to backends
    • Python’s own log rotation can be disabled using the LOG_ROTATION setting. This is useful when running multiple WSGI workers.
    • Cluster servers can now communicate over HTTPS when INTRACLUSTER_HTTPS is enabled.
    • Readers are more resilient to the loss of a single backend.
    • Support 0.9.x backends in 1.0.0 cluster.
  • (1.0.0) [Carbon]
    • Support logging to syslog with the --syslog runtime option.

Event Timeline

Krinkle created this task. May 23 2017, 8:44 PM
Restricted Application added a subscriber: Aklapper. May 23 2017, 8:44 PM
Krinkle updated the task description. May 23 2017, 8:54 PM
Krinkle moved this task from Inbox to graphite-web on the Graphite board.
Krinkle moved this task from Inbox to Radar on the Performance-Team board.
Krinkle updated the task description.
Krinkle updated the task description. Sep 26 2017, 8:09 PM

1.1.0 has already been pulled, it looks like. 1.1.1 release notes are posted with a release date that's 2 days in the future:

@fgiunchedi Is there any sense of if/when this might happen? We have a couple of things that we're considering that would be easier if Graphite 1.0+ were available, but don't want to schedule them without a sense of timing.

On a (potentially related?) note, should we instead be looking to just move things over to Prometheus instead?

> @fgiunchedi Is there any sense of if/when this might happen? We have a couple of things that we're considering that would be easier if Graphite 1.0+ were available, but don't want to schedule them without a sense of timing.

No strict timeline, no. AFAIK what's needed is building Debian packages for the latest Graphite, testing them on jessie (e.g. in labs), and then upgrading production.

> On a (potentially related?) note, should we instead be looking to just move things over to Prometheus instead?

Personally I'm very invested in Prometheus, so I'd definitely welcome moving to it. What are the things you'd be interested in doing with Graphite? If those are easier with Prometheus, then it's definitely worth a shot.

There were two specific use cases that I/we had in mind:

  • Tagging metrics to allow for certain types of filtering. For example, we collect a bunch of performance timing metrics from users. In some cases, we may want to tag with certain info -- for example, country code.
  • For some types of information, the way that Graphite does aggregation actually breaks the metrics. For example, if we're recording the 75th percentile of a given metric and then try to aggregate it, we end up with a value that isn't representative of the underlying data.
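To make the percentile point concrete, here is a small worked example with made-up numbers: two hypothetical hosts each report their own 75th percentile, and averaging those two values gives a result far from the 75th percentile of the combined samples. The nearest-rank percentile used here is an assumption chosen for simplicity, not necessarily what statsd or Graphite compute.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Made-up timing samples (ms): a busy fast host and a quiet slow host.
host_a = [10, 12, 11, 13, 12, 11, 10, 12]
host_b = [200, 220]

p75_a = percentile(host_a, 75)  # 12
p75_b = percentile(host_b, 75)  # 220

# What aggregating the pre-computed percentiles gives.
averaged = (p75_a + p75_b) / 2

# What the 75th percentile of all samples actually is.
true_p75 = percentile(host_a + host_b, 75)

print(averaged, true_p75)  # 116.0 13
```

The averaged value ignores that one host contributed four times as many samples as the other, which is why it lands nowhere near the real distribution.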

Prometheus solves both of these, as far as I can tell, which is great! Two questions:

  • My understanding is that the retention in Prometheus is currently limited to 1 year. Is that right? (It's not necessarily a deal killer in general, but it might limit us in certain use cases.)
  • Is there documentation on how we should be getting data into Prometheus? I found the general writeup at https://wikitech.wikimedia.org/wiki/Prometheus, but it doesn't actually specify the right way to go about adding metrics. I'm very likely just looking in the wrong place, though.

Indeed Prometheus would solve both problems, with a slightly different approach to histograms than statsd to make it possible to aggregate data (essentially what's described in T175087).
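The histogram idea can be sketched roughly as follows: each host counts samples into cumulative buckets, bucket counts are summed across hosts, and the quantile is estimated from the merged counts. This is a simplified illustration with hypothetical bucket bounds and sample data; it reports only the upper bound of the matching bucket, whereas Prometheus's `histogram_quantile()` additionally interpolates within the bucket.

```python
# Hypothetical bucket upper bounds in milliseconds; the last bucket catches everything.
BOUNDS = [25, 50, 100, 250, float("inf")]


def bucketize(samples):
    """Cumulative bucket counts: counts[i] is the number of samples <= BOUNDS[i]."""
    return [sum(1 for s in samples if s <= bound) for bound in BOUNDS]


def merge(*histograms):
    """Histograms with identical bounds aggregate by summing matching buckets."""
    return [sum(counts) for counts in zip(*histograms)]


def quantile_upper_bound(counts, q):
    """Upper bound of the first bucket that holds at least q of all samples."""
    target = q * counts[-1]
    for bound, count in zip(BOUNDS, counts):
        if count >= target:
            return bound
    return BOUNDS[-1]


# Made-up timing samples (ms) from two hosts.
host_a = bucketize([10, 12, 11, 13, 12, 11, 10, 12])
host_b = bucketize([200, 220])

merged = merge(host_a, host_b)
print(merged, quantile_upper_bound(merged, 0.75))
```

Because bucket counts are plain sums, merging them loses no information about how many samples each host contributed, which is what makes the aggregated quantile meaningful.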

Re: retention, the plan is to extend the global instance's retention beyond 1 year, since disk space usage has proven to be OK so far.

Re: adding metrics, I've expanded the introduction a little at https://wikitech.wikimedia.org/wiki/Prometheus#Adding_new_metrics but I'm sure there's more to add. Let me know which parts are still obscure!

Peter added a subscriber: Peter. Aug 30 2018, 2:11 PM

I didn't realize the upgrade to Graphite 1.0 would happen as a side effect of T196484: rack/setup/install graphite1004. We put graphite1004 (and thus Graphite 1.0.2) in production just today, so please test it! I've migrated the sqlite3 database too, so user preferences should be kept as is.

colewhite moved this task from Inbox to Up next on the observability board. Nov 26 2018, 4:12 PM
Krinkle closed this task as Resolved. Dec 3 2018, 3:26 AM
Krinkle claimed this task.
Krinkle edited projects, added Performance-Team; removed Performance-Team (Radar).

Reviewed. LGTM.

  • Things still work fine (checked a few Graphite render API and PNG calls, and Grafana dashboards).
  • Old data was imported without issue.
  • A couple of bugs were fixed, although nothing we were currently making use of in our dashboards (possibly due to us having worked around it since; we tend to only document issues we can't work around).
  • We've got a few new metric functions. The names were already being auto-completed in Grafana, but didn't work until now.
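A check along these lines could be partly scripted. The sketch below assumes the render API's `format=json` output shape (a list of objects with `target` and `datapoints` as `[value, timestamp]` pairs) and counts non-null datapoints per series, which is one way to spot series that came through an import empty. The sample payload and metric names are made up, and this is not the procedure that was actually used for the review.

```python
import json


def non_null_points(render_json):
    """Map each target to its count of non-null datapoints."""
    series = json.loads(render_json)
    return {
        s["target"]: sum(1 for value, _ts in s["datapoints"] if value is not None)
        for s in series
    }


# Made-up response body in the shape returned by /render?format=json.
sample = """[
  {"target": "servers.example.load",
   "datapoints": [[0.5, 1543800000], [null, 1543800060], [0.7, 1543800120]]},
  {"target": "servers.example.empty",
   "datapoints": [[null, 1543800000]]}
]"""

counts = non_null_points(sample)
empty = [target for target, n in counts.items() if n == 0]
print(counts, empty)
```

In practice the payload would come from an HTTP request against the render endpoint for a time range that predates the migration.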