graphite clustering plan
Closed, Declined (Public)

Authored By

	fgiunchedi
	Jan 9 2015, 4:55 PM

Description

Graphite natively supports "simple" clustering:

- write path: consistent hashing in carbon-relay allows writing to multiple destinations
- read path: the web interface has a list of peers it asks for metrics, and it uses the first peer that returns a result

The idea is to put hosts into a consistent hash ring and route metrics to that ring, effectively creating a cluster. To help manage the clusters we are going to use carbonate, which provides some plumbing tools.
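
To make the ring idea concrete, here is a minimal sketch of consistent hashing in plain shell. It is a toy, not carbon's actual implementation: the MD5-derived 16-bit positions, the 8 ring points per node, the function names `ring_pos` and `ring_lookup`, and the `graphite1`..`graphite3` host names are all assumptions made for illustration. It also assumes `md5sum` from GNU coreutils is available.

```shell
#!/bin/sh
# Toy consistent-hash ring for illustration only; this is NOT carbon's code.
# Assumptions: MD5-derived 16-bit positions and 8 ring points per node.

# ring_pos KEY: print an integer position in [0, 65535] for KEY.
ring_pos() {
  hex=$(printf '%s' "$1" | md5sum | cut -c1-4)
  echo $((0x$hex))
}

# ring_lookup METRIC NODE...: print the node that owns METRIC.
# The owner is the first node point at or after the metric's position,
# wrapping around to the lowest point when none is found.
ring_lookup() {
  metric=$1
  shift
  target=$(ring_pos "$metric")
  for node in "$@"; do
    for i in 0 1 2 3 4 5 6 7; do
      echo "$(ring_pos "$node:$i") $node"
    done
  done | sort -k1,1n -k2,2 | awk -v t="$target" '
    NR == 1 { first = $2 }
    !found && $1 >= t { print $2; found = 1 }
    END { if (!found && NR > 0) print first }'
}

# Hypothetical host names: show each metric's owner before and after
# a third node joins the ring.
for m in servers.web1.load servers.web2.load servers.db1.iowait; do
  echo "$m: $(ring_lookup "$m" graphite1 graphite2) -> $(ring_lookup "$m" graphite1 graphite2 graphite3)"
done
```

The property that matters for the plan is that adding a node only moves metrics onto the new node; metrics never shuffle between the existing nodes, which is what keeps a rebalance bounded.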

Assume we have two clusters, a and s, each containing a number of machines (initially one, for example) and both receiving the same write traffic. Reads are served by only one cluster at a time, selected via e.g. a DNS record.

To expand a given cluster, e.g. a:

1. fail over read traffic to the other cluster s
2. configure a second cluster containing the old machines plus the machines to be added (a "stage cluster", e.g. a:stage)
3. rebalance the cluster by transferring metrics from where they belong in cluster a to where they belong in a:stage
4. promote the new machines from a:stage into a
   - this means, for example, that writes now go to the new machines too, since they are effectively part of cluster a
5. backfill missing datapoints from cluster s
   - the time span to cover is from step 3 to step 4, namely from the initial rebalance between a and a:stage to the time the new machines are part of a
6. remove metrics that no longer belong to the a cluster
7. fail back read traffic from cluster s to cluster a

So with carbonate the rebalance step looks like this:

# $HOSTNAME is the local node: fetch the metrics that belong to it in a:stage
for host in $(carbon-hosts --cluster a); do
  ssh "$host" -- carbon-list |
    carbon-sieve --cluster a:stage --node "$HOSTNAME" |
    carbon-sync --cluster a --source-node "$host"
done

Promoting a new machine from a:stage to a amounts to:

- change carbon-c-relay configuration to include new machines
- change graphite-web settings to include new machines
- change carbonate configuration to include new machines in cluster a
- reload all of the above

Backfilling missing datapoints from s to a is very similar to rebalancing:

for host in $(carbon-hosts --cluster s); do
  ssh "$host" -- carbon-list |
    # XXX pick only metrics modified in the last n hours
    carbon-sieve --cluster a --node "$HOSTNAME" |
    carbon-sync --cluster s --source-node "$host"
done
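
One possible way to address the XXX note is to select whisper files by modification time on the source host. This is only a sketch under stated assumptions: `recent_metrics` is a hypothetical helper, not part of carbonate; the storage directory shown in the usage comment is an assumed path; and `-mmin` requires GNU or BSD find. It relies on whisper files being touched on every write, so their mtime approximates the last datapoint received.

```shell
#!/bin/sh
# recent_metrics DIR HOURS (hypothetical helper, not part of carbonate)
# Print the dotted names of metrics whose .wsp file under DIR was
# modified less than HOURS hours ago, one per line, sorted.
recent_metrics() {
  ( cd "$1" && find . -type f -name '*.wsp' -mmin "-$(( $2 * 60 ))" ) |
    sed -e 's|^\./||' -e 's|\.wsp$||' -e 's|/|.|g' |
    sort
}

# Example (assumed storage path; adjust to the local whisper directory):
#   recent_metrics /var/lib/carbon/whisper 6
```

Run on the source host, its output could stand in for the full metric list in the backfill loop, so that only recently written metrics are sieved and synced.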

Details

	Subject	Repo	Branch	Lines +/-
	graphite: enable locking writes	operations/puppet	production	+3 -0

Related Objects

Status	Assigned	Task
Resolved	Eevans	T134016 RESTBase Cassandra cluster: Increase instance count to 3
Invalid	fgiunchedi	T85451 scale graphite deployment (tracking)
Declined	fgiunchedi	T86316 graphite clustering plan

Event Timeline

fgiunchedi created this task. Jan 9 2015, 4:55 PM

fgiunchedi claimed this task.

fgiunchedi raised the priority of this task to Needs Triage.

fgiunchedi updated the task description.

fgiunchedi added projects: Grafana, acl*sre-team.

fgiunchedi added subscribers: Aklapper, fgiunchedi, mark.

• chasemp triaged this task as Medium priority. Jan 9 2015, 5:01 PM

• chasemp set Security to None.

Change 199636 had a related patch set uploaded (by Filippo Giunchedi):
graphite: enable locking writes

https://gerrit.wikimedia.org/r/199636

gerritbot added a project: Patch-For-Review. Mar 25 2015, 4:58 PM

Change 199636 merged by Filippo Giunchedi:
graphite: enable locking writes

https://gerrit.wikimedia.org/r/199636

fgiunchedi mentioned this in rOPUP1e113369ea54: graphite: enable locking writes. Apr 20 2015, 1:25 PM

• Phabricator_maintenance moved this task from Backlog to Acknowledged on the SRE board. Jan 26 2019, 7:57 PM

@fgiunchedi: Hi, the patch in Gerrit has been merged. Can this task be resolved (via Add Action... → Change Status in the dropdown menu), or is there more to do in this task? If there is more to do, do you still plan to work on this? Asking as you are set as task assignee. Thanks in advance!

Graphite is on its way out, declining
