migrate graphite to new hardware
Closed, ResolvedPublic

Description

once provisioned, we should migrate graphite there

Event Timeline

fgiunchedi claimed this task.
fgiunchedi raised the priority of this task from to Needs Triage.
fgiunchedi updated the task description.
fgiunchedi added projects: Grafana, acl*sre-team.
fgiunchedi added subscribers: Aklapper, fgiunchedi, mark.
gerritbot subscribed.

Change 187663 had a related patch set uploaded (by Filippo Giunchedi):
introduce graphite raid10-lvm configuration

https://gerrit.wikimedia.org/r/187663

Patch-For-Review

Change 187664 had a related patch set uploaded (by Filippo Giunchedi):
provision graphite[12]001

https://gerrit.wikimedia.org/r/187664

Patch-For-Review

Change 187663 merged by Filippo Giunchedi:
introduce graphite raid10-lvm configuration

https://gerrit.wikimedia.org/r/187663

Change 187664 merged by Filippo Giunchedi:
provision graphite[12]001

https://gerrit.wikimedia.org/r/187664

Change 187683 had a related patch set uploaded (by Filippo Giunchedi):
graphite: explicit install python-twisted-core

https://gerrit.wikimedia.org/r/187683

Patch-For-Review

Change 187690 had a related patch set uploaded (by Filippo Giunchedi):
graphite: format /var/lib/carbon

https://gerrit.wikimedia.org/r/187690

Patch-For-Review

currently running rsync to transfer the metrics that changed in the last month to graphite1001. there are ~380k metrics changed in the last 30d and a parallel rsync is churning through them at ~3/s, so the ETA for the initial sync is ~1.5 days
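For reference, the selection and the ETA estimate above can be sketched roughly as follows. This is only an illustration, not the commands that were actually run: the helper names `recently_changed` and `eta_days` are hypothetical, and it assumes metrics are stored as `.wsp` files whose mtime reflects the last update.

```python
import os
import time


def recently_changed(root, days=30, now=None):
    """Yield .wsp files under root whose mtime falls within the last `days` days.

    Hypothetical helper: assumes one whisper file per metric.
    """
    now = time.time() if now is None else now
    cutoff = now - days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".wsp"):
                continue
            path = os.path.join(dirpath, name)
            if os.stat(path).st_mtime >= cutoff:
                yield path


def eta_days(n_files, files_per_second):
    """Estimated transfer time in days at a constant rate of files per second."""
    return n_files / files_per_second / 86400
```

With the figures from the comment, `eta_days(380_000, 3)` comes out just under 1.5 days, which matches the estimate.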

Change 187683 merged by Filippo Giunchedi:
graphite: explicit install python-twisted-core

https://gerrit.wikimedia.org/r/187683

Change 188035 had a related patch set (by Filippo Giunchedi) published:
graphite: move to graphite1001

https://gerrit.wikimedia.org/r/188035

Patch-For-Review

Change 188036 had a related patch set (by Filippo Giunchedi) published:
graphite: move to graphite1001

https://gerrit.wikimedia.org/r/188036

Patch-For-Review

changes 188035 and 188036 should be enough to switch traffic over from tungsten to graphite1001. the plan is to merge them, then wait for dns and puppet to propagate and for traffic to move to graphite1001

backfilling data is trickier, however: carbonate doesn't lock whisper files by default, so there's a chance of corruption if both carbonate and carbon-cache try to update the same file, see also https://github.com/jssjr/carbonate/issues/19
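To illustrate the kind of locking that would avoid the race described above, here is a minimal sketch of serializing writers with an advisory `flock`. It is not carbonate's or carbon-cache's actual code; `exclusive_lock` and `overwrite_at` are hypothetical names, and advisory locks only help if every writer takes the lock.

```python
import fcntl
from contextlib import contextmanager


@contextmanager
def exclusive_lock(path):
    """Open path read/write and hold an exclusive advisory lock while in the block."""
    with open(path, "r+b") as fh:
        # Blocks until no other open file description holds the lock.
        fcntl.flock(fh.fileno(), fcntl.LOCK_EX)
        try:
            yield fh
        finally:
            fcntl.flock(fh.fileno(), fcntl.LOCK_UN)


def overwrite_at(path, offset, payload):
    """Write payload at offset while holding the lock (hypothetical helper)."""
    with exclusive_lock(path) as fh:
        fh.seek(offset)
        fh.write(payload)
        fh.flush()  # push the data out before the lock is released
```

A second writer that opens the same file and calls `flock` with `LOCK_EX` waits until the first one is done, so the two updates cannot interleave.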

note also that gdash won't be migrated at the moment: I've run into some (I think) ruby 1.8 -> 1.9 and rubygems issues, which I don't want to get blocked by:

root@graphite1001:/var/log/upstart# tail -15 /var/log/upstart/uwsgi_app-gdash.log
ruby 1.9.3p484 (2013-11-22 revision 43786) [x86_64-linux]
your server socket listen backlog is limited to 100 connections
your mercy for graceful operations on workers is 60 seconds
mapped 161968 bytes (158 KB) for 1 cores
*** Operational MODE: single process ***
/usr/lib/ruby/1.9.1/rubygems/custom_require.rb:36:in `require': cannot load such file -- gdash (LoadError)
	from /usr/lib/ruby/1.9.1/rubygems/custom_require.rb:36:in `require'
	from /etc/gdash/config.ru:2:in `block in <main>'
	from /usr/lib/ruby/vendor_ruby/rack/builder.rb:55:in `instance_eval'
	from /usr/lib/ruby/vendor_ruby/rack/builder.rb:55:in `initialize'
	from /etc/gdash/config.ru:in `new'
	from /etc/gdash/config.ru:in `<main>'
	from /usr/lib/ruby/vendor_ruby/rack/builder.rb:49:in `eval'
	from /usr/lib/ruby/vendor_ruby/rack/builder.rb:49:in `new_from_string'
	from /usr/lib/ruby/vendor_ruby/rack/builder.rb:40:in `parse_file'

Change 188069 had a related patch set uploaded (by Filippo Giunchedi):
Make gdash's uWSGI config.ru Ruby 1.9-compatible

https://gerrit.wikimedia.org/r/188069

Patch-For-Review

Change 187690 merged by Filippo Giunchedi:
graphite: format /var/lib/carbon

https://gerrit.wikimedia.org/r/187690

Change 188035 merged by Filippo Giunchedi:
graphite: move to graphite1001

https://gerrit.wikimedia.org/r/188035

Change 188539 had a related patch set uploaded (by Filippo Giunchedi):
graphite: move to graphite1001

https://gerrit.wikimedia.org/r/188539

Patch-For-Review

Change 188539 merged by Filippo Giunchedi:
graphite: move to graphite1001

https://gerrit.wikimedia.org/r/188539

Change 188036 merged by Filippo Giunchedi:
graphite: move to graphite1001

https://gerrit.wikimedia.org/r/188036

Change 188563 had a related patch set uploaded (by Filippo Giunchedi):
gdash: move from tungsten to graphite1001

https://gerrit.wikimedia.org/r/188563

Patch-For-Review

Change 188563 merged by Filippo Giunchedi:
gdash: move from tungsten to graphite1001

https://gerrit.wikimedia.org/r/188563

Change 188567 had a related patch set uploaded (by Filippo Giunchedi):
webperf: handle missing 'duration' in schema

https://gerrit.wikimedia.org/r/188567

Patch-For-Review

Change 188567 merged by Ori.livneh:
webperf: handle missing 'duration' in schema

https://gerrit.wikimedia.org/r/188567

Change 188788 had a related patch set uploaded (by Filippo Giunchedi):
graphite: move gdash performance to graphite1001

https://gerrit.wikimedia.org/r/188788

Patch-For-Review

Change 188788 merged by Filippo Giunchedi:
graphite: move gdash performance to graphite1001

https://gerrit.wikimedia.org/r/188788

Change 188069 merged by Filippo Giunchedi:
Make gdash's uWSGI config.ru Ruby 1.9-compatible

https://gerrit.wikimedia.org/r/188069

Change 189504 had a related patch set uploaded (by Filippo Giunchedi):
gdash: fix graphite disk dashboard sda->md1

https://gerrit.wikimedia.org/r/189504

Patch-For-Review

Change 189504 merged by Filippo Giunchedi:
gdash: fix graphite disk dashboard sda->md1

https://gerrit.wikimedia.org/r/189504

fgiunchedi added subtasks: Restricted Task, Restricted Task. Feb 10 2015, 9:33 AM

graphite1001 is in service at the moment, waiting for graphite2001 to be online before resolving this

faidon closed subtask Restricted Task as Resolved. Feb 11 2015, 10:14 AM
RobH closed subtask Restricted Task as Resolved. Feb 13 2015, 11:18 PM

also pending is the backfill of metrics from tungsten via carbonate, but see https://github.com/jssjr/carbonate/issues/47 for why we can't do it straight away (or at least not without shutting carbon-cache down)

So, what's the next step here and when/how is it going to happen?

resolving this: graphite2001 has been deployed and I've moved the backfilling of metrics from tungsten to T90591, where it belongs