
graphite clustering plan
Closed, Declined (Public)

Description

graphite natively supports "simple" clustering:

  • write path: consistent-hashing in carbon-relay allows writing to multiple destinations
  • read path: the web interface has a list of peers that it asks for metrics; it uses the first peer that returns a result

The idea is to put hosts into a consistent hash ring and route metrics to that ring, effectively creating a cluster. To help manage the clusters, we are going to use carbonate, which provides some plumbing tools.
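
To make the hash ring idea concrete, here is a toy sketch in shell. It is not the hashing code used by carbon-relay or carbon-c-relay: ring positions are simply taken from the first four hex digits of an MD5 digest, and the function names, replica count and host names are made up for illustration. It assumes md5sum (GNU coreutils) and awk are available.

```shell
# Toy consistent-hash ring, for illustration only.

# Map a string to a ring position in 0..65535 (first 4 hex digits of its MD5).
ring_position() {
  hex=$(printf '%s' "$1" | md5sum | cut -c1-4)
  echo $((0x$hex))
}

# Print "position node" lines, several replicas per node, sorted by position.
# Usage: build_ring REPLICAS NODE...
build_ring() (
  replicas=$1
  shift
  for node in "$@"; do
    i=0
    while [ "$i" -lt "$replicas" ]; do
      echo "$(ring_position "$node:$i") $node"
      i=$((i + 1))
    done
  done | sort -n
)

# Pick the first ring entry at or after the metric's position,
# wrapping around to the first entry if none is found.
# Usage: node_for_metric METRIC RING
node_for_metric() (
  pos=$(ring_position "$1")
  printf '%s\n' "$2" | awk -v pos="$pos" '
    NR == 1   { first = $2 }
    $1 >= pos { print $2; found = 1; exit }
    END       { if (!found) print first }
  '
)
```

For example, with hypothetical hosts graphite-a1 and graphite-a2:

```shell
ring=$(build_ring 100 graphite-a1 graphite-a2)
node_for_metric servers.web1.cpu.user "$ring"
```

The same metric name always maps to the same host for a given ring, which is what lets the relay and the plumbing tools agree on where a metric belongs.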

Assuming we have two clusters, a and s, each containing a number of machines (initially one, for example) and both receiving the same write traffic, reads happen from only one cluster at a time, selected via e.g. a DNS record.

To expand a given cluster, e.g. a:

  1. fail over read traffic to the other cluster s
  2. configure a second cluster containing old machines plus machines to be added (a "stage cluster", e.g. a:stage)
  3. rebalance the cluster by transferring metrics from where they belong in cluster a to where they belong in a:stage
  4. promote new machines from a:stage into a
    • this means for example that writes are now going to the new machines too since they are considered effectively part of cluster a
  5. backfill missing datapoints from cluster s
    • the time span to cover is from step 3 to step 4, namely from the initial cluster rebalance of a and a:stage to the time new machines are part of a
  6. remove metrics that don't belong to the a cluster anymore.
  7. fail back from cluster s to cluster a
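
Step 6 in the list above could be sketched roughly as follows. This assumes that carbonate's carbon-sieve supports an --invert flag (select metrics that do not belong to the node) and that carbon-path --prepend turns metric names into whisper file paths under the storage directory; check both against the --help output of the installed version. The function name is made up for illustration.

```shell
# List whisper files on this node for metrics that no longer hash to it.
# Assumption: carbon-list, carbon-sieve (--invert) and carbon-path (--prepend)
# from carbonate are on PATH and behave as described in the text above.
# Usage: stale_whisper_files CLUSTER NODE
stale_whisper_files() {
  carbon-list |
    carbon-sieve --cluster "$1" --node "$2" --invert |
    carbon-path --prepend
}
```

Running `stale_whisper_files a "$HOSTNAME"` only prints candidate files. Once the list has been reviewed, it could be piped into something like `xargs -n 100 rm -f --` to delete them.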

With carbonate, the rebalance step (step 3) looks like this:

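# $HOSTNAME is the local node, so this runs on each node of a:stage that receives metrics.
for host in $(carbon-hosts --cluster a); do
  # List the metrics on $host, keep those that hash to this node under a:stage, and sync them over.
  ssh "$host" -- carbon-list |
    carbon-sieve --cluster a:stage --node "$HOSTNAME" |
    carbon-sync --cluster a --source-node "$host"
done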
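
Before running the rebalance for real, it may help to preview how many metrics each source host would hand over. The following is a sketch under the same assumptions as the loop above (carbon-list available on the remote hosts, carbon-sieve available locally); the function name and its argument order are made up for illustration.

```shell
# Print "host count": how many metrics on each source host hash to NODE
# under the STAGE cluster. Nothing is copied.
# Usage: preview_rebalance STAGE NODE HOST...
preview_rebalance() (
  stage=$1
  node=$2
  shift 2
  for host in "$@"; do
    count=$(ssh "$host" -- carbon-list |
      carbon-sieve --cluster "$stage" --node "$node" |
      wc -l | tr -d ' ')
    echo "$host $count"
  done
)
```

For example: `preview_rebalance a:stage "$HOSTNAME" $(carbon-hosts --cluster a)`.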

Promoting a new machine from a:stage to a amounts to:

  1. change carbon-c-relay configuration to include new machines
  2. change graphite-web settings to include new machines
  3. change carbonate configuration to include new machines in cluster a
  4. reload all of the above

Backfilling missing datapoints from s to a is very similar to rebalancing:

# $HOSTNAME is the local node, so this runs on each node of cluster a.
for host in $(carbon-hosts --cluster s); do
  ssh "$host" -- carbon-list |
    # XXX pick only metrics modified in the last n hours
    carbon-sieve --cluster a --node "$HOSTNAME" |
    carbon-sync --cluster s --source-node "$host"
done
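
One possible way to address the XXX above is to select whisper files by modification time on the source host. This is a sketch, not part of carbonate: the function name is made up, it assumes a find that supports -mmin (GNU and BSD find do), and it assumes the usual whisper layout where a metric a.b.c is stored as a/b/c.wsp under the storage directory.

```shell
# Print metric names whose whisper file under DIR was modified in the last HOURS hours.
# Usage: recent_metrics DIR HOURS
recent_metrics() (
  cd "$1" || exit 1
  find . -type f -name '*.wsp' -mmin "-$(($2 * 60))" |
    sed -e 's|^\./||' -e 's|\.wsp$||' -e 's|/|.|g'
)
```

Run on a source host, its output has the same shape as that of carbon-list (one metric name per line), so it could take the place of carbon-list in the backfill pipeline, with the storage directory and time window chosen to cover the gap between step 3 and step 4.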

Event Timeline

fgiunchedi claimed this task.
fgiunchedi raised the priority of this task from to Needs Triage.
fgiunchedi updated the task description.
fgiunchedi added projects: Grafana, acl*sre-team.
fgiunchedi added subscribers: Aklapper, fgiunchedi, mark.
chasemp triaged this task as Medium priority. (Jan 9 2015, 5:01 PM)
chasemp set Security to None.

Change 199636 had a related patch set uploaded (by Filippo Giunchedi):
graphite: enable locking writes

https://gerrit.wikimedia.org/r/199636

Change 199636 merged by Filippo Giunchedi:
graphite: enable locking writes

https://gerrit.wikimedia.org/r/199636

@fgiunchedi: Hi, the patch in Gerrit has been merged. Can this task be resolved (via Add Action...Change Status in the dropdown menu), or is there more to do in this task? If there is more to do, do you still plan to work on this? Asking as you are set as task assignee. Thanks in advance!

Graphite is on its way out; declining.