Graphite natively supports "simple" clustering:
- write path: consistent hashing in carbon-relay allows writing to multiple destinations
- read path: the web interface has a list of peers to query for metrics and uses the first peer that returns a result
The idea is to put hosts into a consistent hash ring and route metrics to that ring, effectively creating a cluster. To help manage the clusters we are going to use carbonate, which provides some plumbing tools.
Assume we have two clusters, a and s, each containing a number of machines (initially one, for example) and both receiving the same write traffic. Reads happen from only one cluster at a time, selected via e.g. a DNS record.
To expand a given cluster, e.g. a:
1. fail over read traffic to the other cluster, s
2. configure a second cluster containing the old machines plus the machines to be added (a "stage cluster", e.g. a:stage)
3. rebalance the cluster by transferring metrics from where they belong in cluster a to where they belong in a:stage
4. promote the new machines from a:stage into a
   - this means, for example, that writes now go to the new machines too, since they are effectively part of cluster a
5. backfill missing datapoints from cluster s
   - the time span to cover is from step 3 to step 4, namely from the initial rebalance of a into a:stage until the new machines are part of a
6. remove metrics that no longer belong to cluster a
7. fail back read traffic from cluster s to cluster a
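The stage cluster described above could be expressed as a carbonate configuration with one section per cluster. The snippet below is only a sketch: the host names graphite-a1 and graphite-a2, the port 2004, the instance name a, and the helper name emit_carbonate_conf are hypothetical, and it assumes carbonate's INI-style config where each section lists its DESTINATIONS as host:port:instance entries.

```shell
#!/bin/sh
# Hypothetical helper: print an example carbonate config to stdout.
# Host names, port and instance name are made up for illustration.
emit_carbonate_conf() {
    cat <<'EOF'
[a]
# current members of cluster a
DESTINATIONS = graphite-a1:2004:a
REPLICATION_FACTOR = 1

[a:stage]
# old machines plus the machine to be added
DESTINATIONS = graphite-a1:2004:a, graphite-a2:2004:a
REPLICATION_FACTOR = 1
EOF
}
```

Writing the output to a file and passing it to the tools (for example `carbon-hosts --config-file <file> --cluster a:stage`, assuming the installed carbonate accepts that option) would let the stage layout be inspected before touching the live configuration.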
So with carbonate the rebalance step looks like this:
    for host in $(carbon-hosts --cluster a); do
      ssh "$host" -- carbon-list |
        carbon-sieve --cluster a:stage --node "$HOSTNAME" |
        carbon-sync --cluster a --source-node "$host"
    done
Promoting a new machine from a:stage to a amounts to:
- change carbon-c-relay configuration to include new machines
- change graphite-web settings to include new machines
- change carbonate configuration to include new machines in cluster a
- reload all of the above
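After the reload, one possible sanity check is to compare the host lists carbonate reports for a and a:stage, which should match once the promotion is complete. This is a minimal sketch, assuming carbon-hosts prints one host per line; same_hosts is a hypothetical helper name, not part of carbonate.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) only when two carbonate clusters
# list exactly the same set of hosts, regardless of order.
same_hosts() {
    left=$(carbon-hosts --cluster "$1" | sort)
    right=$(carbon-hosts --cluster "$2" | sort)
    [ "$left" = "$right" ]
}
```

For example, `same_hosts a a:stage || echo "a and a:stage differ"` could be run on any machine that has the updated carbonate configuration.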
Backfilling missing datapoints from s to a is very similar to rebalancing:
    for host in $(carbon-hosts --cluster s); do
      ssh "$host" -- carbon-list |
        # XXX pick only metrics modified in the last n hours
        carbon-sieve --cluster a --node "$HOSTNAME" |
        carbon-sync --cluster s --source-node "$host"
    done
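The XXX above could be filled in by listing only whisper files modified recently, instead of everything carbon-list returns. The sketch below is one possible approach, not part of carbonate: recent_metrics is a hypothetical helper, it assumes the usual layout where metric a.b.c is stored as a/b/c.wsp under the storage directory, and it relies on the -mmin test, which GNU and BSD find provide but POSIX does not.

```shell
#!/bin/sh
# Hypothetical helper: print the names of metrics whose whisper file was
# modified within the last N hours, one per line (e.g. "servers.web1.load").
recent_metrics() {
    storage_dir=$1   # e.g. /opt/graphite/storage/whisper
    hours=$2         # look-back window in hours
    # cd first so find prints paths relative to the storage directory,
    # then turn "./a/b/c.wsp" into "a.b.c".
    (cd "$storage_dir" &&
        find . -type f -name '*.wsp' -mmin "-$((hours * 60))" |
        sed -e 's|^\./||' -e 's|\.wsp$||' -e 's|/|.|g')
}
```

On a source host in cluster s, the same find pipeline would take the place of carbon-list in the loop, with the window chosen to cover the time between the rebalance and the promotion.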