
scale graphite deployment (tracking)
Closed, Invalid · Public

Description

tracking graphite scaling work with this ticket; see also https://wikitech.wikimedia.org/wiki/Graphite/Scaling for a general overview of the options, and http://etherpad.wikimedia.org/p/graphitetodo for a scratchpad/ramblings.

currently on the plate:

future plans:

  • expand the setup beyond 1+1 machine using graphite clustering
    • route/cluster metrics with carbon-c-relay and carbon consistent hashing (see the relay config sketch after this list)
    • use carbonate to rebalance existing metric data using the same consistent hashing
  • consider hybrid caching with bcache (i.e. spinning disks + block caching on ssd)
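
A minimal carbon-c-relay sketch of what the consistent-hash routing could look like; the host names and port below are illustrative assumptions, not a production config:

# illustrative carbon-c-relay.conf fragment -- hosts/port are assumptions
cluster graphite-eqiad
    carbon_ch replication 1
        graphite1001.eqiad.wmnet:2003
        graphite1003.eqiad.wmnet:2003
    ;

match *
    send to graphite-eqiad
    stop
    ;

carbon_ch implements carbon's consistent hashing scheme, which is what would allow carbonate (e.g. carbon-sieve / carbon-sync) to rebalance existing whisper files against the same ring.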

Related Objects

Event Timeline

fgiunchedi raised the priority of this task from to Needs Triage.
fgiunchedi updated the task description.
fgiunchedi added projects: acl*sre-team, Grafana.
fgiunchedi subscribed.

Let's update this ticket with subtasks / blockers for the tasks mentioned below. Some of them are in old RT tickets of course, but let's migrate those to the newer Phab tickets...

yep, created the subtasks now to migrate to new hardware

@fgiunchedi, what is the plan for increasing SSD space? If we plan to continue using a single graphite node per DC, can we order larger SSDs and replace the current ones to have some headroom for more metrics?

Host uranium has two SSDs installed, and the R610 platform can usually support four 2.5" disks.

If you want to upgrade/add/replace the disks, please create a hardware-request task linked to this one requesting that.

Thanks.

So, I looked at uranium, not graphite1001. graphite1001 is an R420 with 4 disks already installed.

So additional SSDs actually CANNOT be added to this host without removing existing disks.

robh@graphite1001:~$ sudo lshw -class disk

*-disk                  
     description: ATA Disk
     product: INTEL SSDSC2BB60
     physical id: 0.0.0
     bus info: scsi@0:0.0.0
     logical name: /dev/sda
     version: D201
     serial: PHWL4316007M600TGN
     size: 558GiB (600GB)
     capabilities: partitioned partitioned:dos
     configuration: ansiversion=5 sectorsize=4096 signature=0004e927
*-disk
     description: ATA Disk
     product: INTEL SSDSC2BB60
     physical id: 0.0.0
     bus info: scsi@1:0.0.0
     logical name: /dev/sdb
     version: D201
     serial: PHWL43160035600TGN
     size: 558GiB (600GB)
     capabilities: partitioned partitioned:dos
     configuration: ansiversion=5 sectorsize=4096 signature=000559e0
*-disk
     description: ATA Disk
     product: INTEL SSDSC2BB60
     physical id: 0.0.0
     bus info: scsi@2:0.0.0
     logical name: /dev/sdc
     version: D201
     serial: PHWL43160086600TGN
     size: 558GiB (600GB)
     capabilities: partitioned partitioned:dos
     configuration: ansiversion=5 sectorsize=4096 signature=0005b39f
*-disk
     description: ATA Disk
     product: INTEL SSDSC2BB60
     physical id: 0.0.0
     bus info: scsi@3:0.0.0
     logical name: /dev/sdd
     version: D201
     serial: PHWL4316001S600TGN
     size: 558GiB (600GB)
     capabilities: partitioned partitioned:dos
     configuration: ansiversion=5 sectorsize=4096 signature=00060dff

So, should we

a) get a new box with more / bigger SSDs (most 2.5" cases have space for 8 SSDs), or
b) replace the existing SSDs with bigger ones?

@fgiunchedi, sorry to keep nagging about this, but I'm still wondering what the plan is for scaling graphite storage, especially with codfw and new services adding even more metrics.

FYI, there are some upcoming changes in Services that will use more disk space for metrics (a rough per-metric sizing sketch follows the list):

  • We are about to split RESTBase metrics by request type (internal, internal update, external), which will triple the number of metrics produced by RESTBase.
  • The ongoing conversion of Cassandra nodes to a multi-instance setup will increase the number of Cassandra metrics by a factor of 3-5.
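
To put that growth in perspective, whisper preallocates each metric's file at full size on creation; a rough per-metric estimate, assuming an illustrative 1min:7d, 5min:30d, 15min:1y retention (not the actual production storage-schemas.conf):

# 12 bytes per datapoint (4-byte timestamp + 8-byte value), headers ignored
echo $(( (7*24*60 + 30*24*60/5 + 365*24*60/15) * 12 ))   # 645120 bytes, ~0.6 MB per metric

So every additional metric costs that disk space up front, regardless of how often it is actually written.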

> So, should we
>
> a) get a new box with more / bigger SSDs (most 2.5" cases have space for 8 SSDs), or
> b) replace the existing SSDs with bigger ones?

I support something along these lines happening!

I also support this change; collecting stats in graphite has proven to be quite powerful, and removing limitations around stats collection should be quite helpful.

Change 277490 had a related patch set uploaded (by Filippo Giunchedi):
graphite: add 'big_users' route and cluster

https://gerrit.wikimedia.org/r/277490

Change 281631 had a related patch set uploaded (by Filippo Giunchedi):
graphite: add cluster_servers graphite-web setting

https://gerrit.wikimedia.org/r/281631

Change 281631 merged by Filippo Giunchedi:
graphite: add cluster_servers graphite-web setting

https://gerrit.wikimedia.org/r/281631

Change 277490 merged by Filippo Giunchedi:
graphite: add 'big_users' route and cluster

https://gerrit.wikimedia.org/r/277490

Change 289440 had a related patch set uploaded (by Filippo Giunchedi):
graphite: introduce local carbon-c-relay daemon

https://gerrit.wikimedia.org/r/289440

The patch at https://gerrit.wikimedia.org/r/289440 adds a local carbon-c-relay to be used for submitting graphite metrics on localhost. The local daemon provides load-balancing, failover and batching across the graphite "frontends" in eqiad (graphite1001 and graphite1003 ATM).

Going forward, the plan would be to add a codfw cluster with graphite machines (graphite2001 and graphite2002) and turn off the current eqiad -> codfw mirroring that's happening on graphite1001.
This would simplify datacenter failover for graphite (not statsd) considerably, since each machine would push metrics to both datacenters, transparently to clients, which always write to localhost:2003.
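
A minimal sketch of what the local relay config could look like, assuming one any_of cluster per site for load-balancing/failover; the port and exact member lists here are assumptions, not the proposed production config:

# illustrative local carbon-c-relay.conf -- the daemon accepts client writes on localhost:2003
cluster eqiad
    any_of
        graphite1001.eqiad.wmnet:2003
        graphite1003.eqiad.wmnet:2003
    ;

cluster codfw
    any_of
        graphite2001.codfw.wmnet:2003
        graphite2002.codfw.wmnet:2003
    ;

# every metric received locally is forwarded to both sites
match *
    send to eqiad codfw
    stop
    ;

any_of spreads metrics across the listed members and fails over when one is unreachable, so losing a single frontend stays transparent to clients writing to localhost.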

Change 289635 had a related patch set uploaded (by Filippo Giunchedi):
svc: add graphite LVS addresses

https://gerrit.wikimedia.org/r/289635

Change 289636 had a related patch set uploaded (by Filippo Giunchedi):
lvs: add graphite service

https://gerrit.wikimedia.org/r/289636

Change 289637 had a related patch set uploaded (by Filippo Giunchedi):
graphite: add realserver class

https://gerrit.wikimedia.org/r/289637

The next series of patches uses LVS instead to perform load-balancing among graphite machines; clients would need to be pointed to graphite.svc.<site>.wmnet, and each site mirrors to the other.
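
Under the LVS approach the client target simply becomes the per-site service address; a quick smoke test could look like this (the concrete site name and port are assumptions, and the command relies on Debian netcat's -q flag):

echo "test.lvs_check 1 $(date +%s)" | nc -q1 graphite.svc.eqiad.wmnet 2003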

Change 289440 abandoned by Filippo Giunchedi:
graphite: introduce local carbon-c-relay daemon

Reason:
abandoning in favor of LVS in Ibac0711dc

https://gerrit.wikimedia.org/r/289440

graphite1003 is alerting in Icinga because of disk space 99% used on /var/lib/carbon

but because that's 1.6T, 99% used still means 20G left.

/dev/mapper/graphite1003--vg-carbon 1.6T 1.6T 20G 99% /var/lib/carbon

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=graphite1003&service=Disk+space

@Dzahn indeed, and there's ~500G left on the vg still. I'll debug this with @Eevans but I suspect it is related to T163936: Latency metrics missing
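
For reference, growing the carbon LV into that remaining VG space would look roughly like the following; the size is illustrative, and the LV/VG names are read off the df output above:

# illustrative only: extend the LV and grow the filesystem in one step
sudo lvextend --resizefs --size +400G /dev/graphite1003-vg/carbon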

Change 289636 abandoned by Filippo Giunchedi:
lvs: add graphite service

Reason:
Not needed

https://gerrit.wikimedia.org/r/289636

Change 289635 abandoned by Filippo Giunchedi:
svc: add graphite LVS addresses

Reason:
Not doing this for now

https://gerrit.wikimedia.org/r/289635

Change 289637 abandoned by Filippo Giunchedi:
graphite: add realserver class

Reason:
Not doing this for now

https://gerrit.wikimedia.org/r/289637