
scale graphite deployment (tracking)
Closed, Invalid · Public

Description

tracking graphite scaling work with this ticket; see also https://wikitech.wikimedia.org/wiki/Graphite/Scaling for a general overview of the options, and http://etherpad.wikimedia.org/p/graphitetodo for a scratchpad/ramblings.

currently on the plate:

future plans:

  • expand the setup beyond 1+1 machine using graphite clustering
    • route/cluster metrics with carbon-c-relay and carbon consistent hashing (see the relay config sketch after this list)
    • use carbonate to rebalance existing metric data using the same consistent hashing
  • consider hybrid caching with bcache (i.e. spinning disks + block caching on ssd)
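
A minimal carbon-c-relay sketch of what the consistent-hash routing could look like; the host names and port below are illustrative assumptions, not a production config:

# illustrative carbon-c-relay.conf fragment -- hosts/port are assumptions
cluster graphite-eqiad
    carbon_ch replication 1
        graphite1001.eqiad.wmnet:2003
        graphite1003.eqiad.wmnet:2003
    ;

match *
    send to graphite-eqiad
    stop
    ;

carbon_ch implements carbon's consistent hashing scheme, which is what would allow carbonate (e.g. carbon-sieve / carbon-sync) to rebalance existing whisper files against the same ring.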

Related Objects

Event Timeline

fgiunchedi raised the priority of this task from to Needs Triage.
fgiunchedi updated the task description.
fgiunchedi added projects: acl*sre-team, Grafana.
fgiunchedi subscribed.

Let's update this ticket with subtasks / blockers for the tasks mentioned below. Some of them are in old RT tickets of course, but let's migrate those to the newer Phab tickets...

yep, created the subtasks now to migrate to new hardware

@fgiunchedi, what is the plan for increasing SSD space? If we plan to continue using a single graphite node per DC, can we order larger SSDs and replace the current ones to have some headroom for more metrics?

Host uranium has two SSDs installed, and the R610 platform can usually support four 2.5" disks.

If you want to upgrade/add/replace the disks, please create a hardware-request task linked to this one requesting that.

Thanks.

So, I looked at uranium, not graphite1001. graphite1001 is an R420 with 4 disks already installed.

So additional SSDs actually CANNOT be added to this host without removing existing disks.

robh@graphite1001:~$ sudo lshw -class disk

*-disk                  
     description: ATA Disk
     product: INTEL SSDSC2BB60
     physical id: 0.0.0
     bus info: scsi@0:0.0.0
     logical name: /dev/sda
     version: D201
     serial: PHWL4316007M600TGN
     size: 558GiB (600GB)
     capabilities: partitioned partitioned:dos
     configuration: ansiversion=5 sectorsize=4096 signature=0004e927
*-disk
     description: ATA Disk
     product: INTEL SSDSC2BB60
     physical id: 0.0.0
     bus info: scsi@1:0.0.0
     logical name: /dev/sdb
     version: D201
     serial: PHWL43160035600TGN
     size: 558GiB (600GB)
     capabilities: partitioned partitioned:dos
     configuration: ansiversion=5 sectorsize=4096 signature=000559e0
*-disk
     description: ATA Disk
     product: INTEL SSDSC2BB60
     physical id: 0.0.0
     bus info: scsi@2:0.0.0
     logical name: /dev/sdc
     version: D201
     serial: PHWL43160086600TGN
     size: 558GiB (600GB)
     capabilities: partitioned partitioned:dos
     configuration: ansiversion=5 sectorsize=4096 signature=0005b39f
*-disk
     description: ATA Disk
     product: INTEL SSDSC2BB60
     physical id: 0.0.0
     bus info: scsi@3:0.0.0
     logical name: /dev/sdd
     version: D201
     serial: PHWL4316001S600TGN
     size: 558GiB (600GB)
     capabilities: partitioned partitioned:dos
     configuration: ansiversion=5 sectorsize=4096 signature=00060dff

So, should we

a) get a new box with more / bigger SSDs (most 2.5" cases have space for 8 SSDs), or
b) replace the existing SSDs with bigger ones?

@fgiunchedi, sorry to keep nagging about this, but I'm still wondering what the plan is for scaling graphite storage, especially with codfw and new services adding even more metrics.

FYI, there are some upcoming changes in Services that will use more disk space for metrics (a rough per-metric sizing sketch follows the list):

  • We are about to split RESTBase metrics by request type (internal, internal update, external), which will triple the number of metrics produced by RESTBase.
  • The ongoing conversion of Cassandra nodes to a multi-instance setup will increase the number of Cassandra metrics by a factor of 3-5.
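
To put that growth in perspective, whisper preallocates each metric's file at full size on creation; a rough per-metric estimate, assuming an illustrative 1min:7d, 5min:30d, 15min:1y retention (not the actual production storage-schemas.conf):

# 12 bytes per datapoint (4-byte timestamp + 8-byte value), headers ignored
echo $(( (7*24*60 + 30*24*60/5 + 365*24*60/15) * 12 ))   # 645120 bytes, ~0.6 MB per metric

So every additional metric costs that disk space up front, regardless of how often it is actually written.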

> So, should we
>
> a) get a new box with more / bigger SSDs (most 2.5" cases have space for 8 SSDs), or
> b) replace the existing SSDs with bigger ones?

I support something along these lines happening!

I also support this change; collecting stats in graphite has proven to be quite powerful, and removing limitations around stats collection should be quite helpful.

Change 277490 had a related patch set uploaded (by Filippo Giunchedi):
graphite: add 'big_users' route and cluster

https://gerrit.wikimedia.org/r/277490

Change 281631 had a related patch set uploaded (by Filippo Giunchedi):
graphite: add cluster_servers graphite-web setting

https://gerrit.wikimedia.org/r/281631

Change 281631 merged by Filippo Giunchedi:
graphite: add cluster_servers graphite-web setting

https://gerrit.wikimedia.org/r/281631

Change 277490 merged by Filippo Giunchedi:
graphite: add 'big_users' route and cluster

https://gerrit.wikimedia.org/r/277490

Change 289440 had a related patch set uploaded (by Filippo Giunchedi):
graphite: introduce local carbon-c-relay daemon

https://gerrit.wikimedia.org/r/289440

The patch at https://gerrit.wikimedia.org/r/289440 adds a local carbon-c-relay to be used for submitting graphite metrics on localhost. The local daemon provides load-balancing, failover and batching across the graphite "frontends" in eqiad (graphite1001 and graphite1003 ATM).

Going forward, the plan would be to add a codfw cluster with graphite machines (graphite2001 and graphite2002) and turn off the current eqiad -> codfw mirroring that's happening on graphite1001.
This would simplify datacenter failover for graphite (not statsd) considerably, since each machine would push metrics to both datacenters, transparently to clients, which always write to localhost:2003.
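
A minimal sketch of what the local relay config could look like, assuming one any_of cluster per site for load-balancing/failover; the port and exact member lists here are assumptions, not the proposed production config:

# illustrative local carbon-c-relay.conf -- the daemon accepts client writes on localhost:2003
cluster eqiad
    any_of
        graphite1001.eqiad.wmnet:2003
        graphite1003.eqiad.wmnet:2003
    ;

cluster codfw
    any_of
        graphite2001.codfw.wmnet:2003
        graphite2002.codfw.wmnet:2003
    ;

# every metric received locally is forwarded to both sites
match *
    send to eqiad codfw
    stop
    ;

any_of spreads metrics across the listed members and fails over when one is unreachable, so losing a single frontend stays transparent to clients writing to localhost.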

Change 289635 had a related patch set uploaded (by Filippo Giunchedi):
svc: add graphite LVS addresses

https://gerrit.wikimedia.org/r/289635

Change 289636 had a related patch set uploaded (by Filippo Giunchedi):
lvs: add graphite service

https://gerrit.wikimedia.org/r/289636

Change 289637 had a related patch set uploaded (by Filippo Giunchedi):
graphite: add realserver class

https://gerrit.wikimedia.org/r/289637

The next series of patches uses LVS instead to perform load-balancing among graphite machines; clients would need to be pointed to graphite.svc.<site>.wmnet, and each site mirrors to the other.
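
Under the LVS approach the client target simply becomes the per-site service address; a quick smoke test could look like this (the concrete site name and port are assumptions, and the command relies on Debian netcat's -q flag):

echo "test.lvs_check 1 $(date +%s)" | nc -q1 graphite.svc.eqiad.wmnet 2003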

Change 289440 abandoned by Filippo Giunchedi:
graphite: introduce local carbon-c-relay daemon

Reason:
abandoning in favor of LVS in Ibac0711dc

https://gerrit.wikimedia.org/r/289440

graphite1003 is alerting in Icinga because of disk space 99% used on /var/lib/carbon

but because that's 1.6T, 99% used still means 20G left.

/dev/mapper/graphite1003--vg-carbon 1.6T 1.6T 20G 99% /var/lib/carbon

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=graphite1003&service=Disk+space

@Dzahn indeed, and there's ~500G left on the vg still. I'll debug this with @Eevans but I suspect it is related to T163936: Latency metrics missing
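
For reference, growing the carbon LV into that remaining VG space would look roughly like the following; the size is illustrative, and the LV/VG names are read off the df output above:

# illustrative only: extend the LV and grow the filesystem in one step
sudo lvextend --resizefs --size +400G /dev/graphite1003-vg/carbon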

Change 289636 abandoned by Filippo Giunchedi:
lvs: add graphite service

Reason:
Not needed

https://gerrit.wikimedia.org/r/289636

Change 289635 abandoned by Filippo Giunchedi:
svc: add graphite LVS addresses

Reason:
Not doing this for now

https://gerrit.wikimedia.org/r/289635

Change 289637 abandoned by Filippo Giunchedi:
graphite: add realserver class

Reason:
Not doing this for now

https://gerrit.wikimedia.org/r/289637