
Create a Graphite instance in the Analytics cluster
Closed, Declined · Public

Description

After discussion with various people, this seems like the best way forward: creating a Graphite instance on the Analytics cluster for collecting statistics rather than using the instance used by operations.

For WMDE this will primarily be used for storing daily metrics. These metrics do not yet exist, but they will be at:

  • wikidata.social.facebook.likes
  • wikidata.entityusage.enwiki.S

for example.

See the related patch and ticket trying to change the current Graphite instance to allow tracking of such data:

Ideally my time-scale for this would be before the end of this year in order to make use of it for the Wikidata dashboard.

Also, in my opinion there is no need for a statsd daemon, as we can simply talk to graphite directly!
http://graphite.readthedocs.org/en/latest/feeding-carbon.html
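
For illustration, a minimal sketch of what feeding a daily value straight to Carbon's plaintext listener could look like (the hostname is a placeholder, 2003 is Carbon's default plaintext port, and the metric name is one of the hypothetical paths above):

```
# Sketch: push one data point to Carbon over its plaintext protocol, no statsd needed.
import socket
import time

CARBON_HOST = "graphite-analytics.example.org"  # placeholder for the new instance
CARBON_PORT = 2003                               # Carbon's default plaintext port

def send_metric(path, value, timestamp=None):
    """Send one '<path> <value> <timestamp>' line to Carbon."""
    timestamp = int(timestamp if timestamp is not None else time.time())
    line = "{} {} {}\n".format(path, value, timestamp)
    with socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=10) as sock:
        sock.sendall(line.encode("ascii"))

send_metric("wikidata.social.facebook.likes", 12345)
```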

Event Timeline

Addshore raised the priority of this task from to Needs Triage.
Addshore updated the task description. (Show Details)
Addshore added subscribers: Addshore, Ottomata, Nuria and 2 others.
Addshore set Security to None.

Where is the proof of concept for Graphite exactly? I am not sure why exactly Graphite is assumed to be the best solution. Can you explain your reasoning here? Is Whisper better than a Hadoop solution like HBase?

Graphite is already puppetized on the cluster, hence creating another instance should be /trivial/.
I am not necessarily saying it is the /best/ solution.

Alternatives were briefly mentioned:

However, we want to store data soon / now.
@Ottomata talked about "we can do a whole choose a big thing later when we need to"

None of the solutions / practices that we currently have in place are as easy as graphite.

If time is the deciding issue, then the question should be how much work is it to puppetize HBase?

As indicated here by the developer of OpenTSDB, HBase can scale way beyond Whisper/Graphite. And this should be the driving criterion, since, in theory, we could load tons of stuff into this data store.

Also, since (I assume) that the design intent is to not use Graphite's built in graphing capabilities in favor of Shiny and R, the argument for keeping Graphite is not that strong. I think that HBase is definitely the way to go here. It also runs on Java, which is a huge plus.

I am not sure we are worried about scale here; the data size is small, so many tech solutions would work.

http://hstack.org/hstack-automated-deployment-using-puppet/

Puppet recipes for Hadoop, HBase and Zookeeper: https://github.com/hstack/puppet
(very old and needs a lot of work..)

It would be nice to be able to also use our metrics in the same graphs as those we already use on grafana.w.o. I don't know what others' motivations are, but if I can't use it as a data source on grafana.w.o this won't cover the use case I had in mind.

Grafana and OpenTSDB work together.
http://docs.grafana.org/datasources/opentsdb/

Yes, Grafana is a great front end, but using R for vector manipulation has definite benefits. I think both Shiny and Grafana clients can use the same HBase data sources readily.

@JanZerebecki what was the use case you had in mind?
Grafana can support multiple graphite backends.
Shiny can also draw data from literally wherever you want.

This comment was removed by Addshore.

An additional Graphite instance as an additional Grafana backend should work. It seems one can create a graph with two Metric Queries where each goes to a different Datasource, like overlaying entity usage and Scribunto CPU usage (no idea if that would actually be useful, too sleepy).

TBH I'm not sure why you'd need a separate graphite instance in the analytics cluster for this, can you elaborate?

@fgiunchedi Here are some bits of one of the discussions I had in IRC.

3:20 PM <_joe_> addshore: whatever we want to keep "forever", should be somewhere else than graphite
3:20 PM <_joe_> or in general any ops monitoring tool
3:21 PM <addshore> well _joe_ graphite does exactly what I want it to do, the only issue is the config
3:22 PM <_joe_> addshore: well, theory and practice tend to differ
3:23 PM <_joe_> addshore: my point being that any opsen if given the choice between a 10 minutes downtime of a monitoring tool and dropping old data will choose the latter
3:23 PM <_joe_> because it's a tool we use for monitoring production /now/

The main thing I drew from this is that "any opsen if given the choice between a 10 minutes downtime of a monitoring tool and dropping old data will choose the latter"
And because of this, a separate instance that won't tread on the toes of ops at all makes sense.

I see what you mean; my take on that is that graphite is meant for (operational) metrics, not data warehousing. If you can recreate the daily statistics from data you already have, then graphite is fine IMO. If you care a lot about having all data stored permanently and being able to retrieve it exactly as you've put it in, graphite isn't really suited (regardless of whether it is a separate instance or not).
Maybe you can expand a little bit more on T117735: Track all Wikidata metrics currently gathered in Graphite rather than SQL and TSVs on what are the use cases for this data to see if graphite fits?

Yes, we should not use graphite as primary storage for metrics that we cannot recreate. Using graphite only as a cache that is filled from archival storage would work for me.

Archival storage (for data that cannot be recreated) in MariaDB can work in practice for this case, even though it is not designed for that. Append-only csv/tsv files might be another option for archival, as long as those files are included in a backup.

Or phrased differently: For any solution for archiving of metrics (those we can not recreate) we need to provide an answer to the question: How are backups done?

Per http://graphite.readthedocs.org/en/latest/overview.html Graphite does exactly what we need here.
It "Stores numeric time-series data" and "Renders graphs of this data on demand" (as well as providing an API for data access.

We want to say that the value of X at point in time Y was Z, which is exactly what graphite accepts; the graphite plain text protocol is:

<metric path> <metric value> <metric timestamp>
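
For example, a daily data point for one of the metrics mentioned above could be sent as a single line like this (the value and Unix timestamp are made up):

wikidata.social.facebook.likes 12345 1447372800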

Of course if people feel this is really unlikely to happen then we can just close this and write our own small solution.

I'm not very opinionated here, but I think an analytics graphite instance will be very useful for other things than just this ticket, especially if these metrics are backed up by data in HDFS, and graphite is used as a visualization and rollup tool.

@Addshore to clarify, more than functionality I was pointing out guarantees about the data stored. if the metrics are also being archived to hdfs for example so it is possible to dump/load into graphite then IMO that's acceptable.
re: analytics graphite instance, I think there's value in a single shared instance for ease of use, even though for example grafana supports mixed dashboards so it is possible to collate multiple graphite sources

I agree with T117732#1793386. (Though I have no clear opinion on whether it should be a different instance than the one used for operations metrics.)
@Addshore Having the primary storage of these metrics in HDFS, which I assume has the backup question solved, sounds like another way to address my concern from above. Which one should we use for the Wikidata-related metrics that are not generated in Hadoop?

If you are going to use HDFS, why not just use HBase instead of Graphite?

I think there's value in a single shared instance for ease of use

Well, this was also my initial thought. Until Joe said:

any opsen if given the choice between a 10 minutes downtime of a monitoring tool and dropping old data will choose the latter

@JanZerebecki I imagine that wherever we put the data, we will back it up.
Right now they are mainly stored in SQL tables... which I back up / export.
If they were stored in graphite I would likely run a daily / weekly export too.
In fact, even if they were primarily in HDFS I would still want them backed up elsewhere.

Which one should we use for the Wikidata related metrics that are not generated in Hadoop?

Well, as it stands basically no metrics are generated in hadoop.

We simply want to store time series data / numbers and timestamps. Have them easily writeable, accessible, backed up, integrated into other solutions. Graphite + a backup script / weekly / daily export does all of this.

If, months down the line, we decide on a potentially better solution that the analytics team may also like, migrating to it should be trivial.

And again, at this early stage one of the main reasons for graphite from my point of view is to save time: we have it puppetized, and in fact we already have an instance. If we cannot use that instance, we can roll another one. Getting a new solution through security review, puppetization, deployment etc. would take more time.

If not HBase, what about Cassandra? This is already puppetized.

As far as I understand Graphite, using it as a source for the backup means that we are losing data after the retention cutoff. So the source for the backup needs to be somewhere before things are written into Graphite?

Doing HBase would be a much larger project than just spinning up a graphite instance. We may eventually evaluate larger systems like HBase and/or OpenTSDB or InfluxDB, etc., but that is out of scope for this simple use case.

Cassandra is a complicated clustered solution; graphite will just require a simple single instance running and ingesting events somewhere.

As far as I understand Graphite, using it as a source for the backup means that we are losing data after the retention cutoff. So the source for the backup needs to be somewhere before things are written into Graphite?

Yes, but we can simply set retention to 100 years and leave it for future us to worry about.
At a guess we won't still be using graphite, at least not this version, in 100 years.

I am not sure why this is considered to be "a simple use case" since as mentioned in T117735 there are at least two different requirements. Content metrics require long term (non-decaying) storage, operational metrics do not.

Whisper (Graphite's database) is not robust and has a fixed size. Even the documentation says it is not "disk space efficient". Of course, if we assume that the need is only to record a small number of data points with a low resolution, none of this matters.

The added complexity of introducing backups, HDFS, etc. to the Graphite proposition does not seem "simple". Also, the puppet module would still need to be reconfigured/modified, as @Addshore tried to do, for long-term retention, but this does not solve the archiving problem. There has to be a built-in way to preserve and "snapshot" the database, or else it could be a real pain to restore. And in the interim period from snapshot to restoration all measurements would be lost, unless it were on a cluster.

As far as I know, Cassandra can also run on a single instance, it does not need a cluster.

Content metrics require long term (non-decaying) storage, operational metrics do not.

Both cases can be covered by configuration.

Of course, if we assume that the need is only to record a small number of data points with a low resolution, none of this matters.

That is my current assumption, backed by having a limited number of things to record (an incredibly small amount compared with what is on the current graphite instance).

The added complexity of introducing backups and HDFS

Well, we need not add HDFS. Backups can simply call the API and dump a TSV, which I guess could easily be stored in HDFS, or somewhere else. Or just a cron job backing up the actual graphite database. This could even just live on labs?
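
For illustration, a rough sketch of such a backup script, assuming the standard render API with format=json and placeholder host / metric names; it dumps recent data points for a list of metrics to a TSV file:

```
# Sketch: dump recent data points to TSV via Graphite's render API.
import csv
import json
import urllib.request

GRAPHITE_URL = "https://graphite-analytics.example.org"  # placeholder
METRICS = ["wikidata.social.facebook.likes", "wikidata.entityusage.enwiki.S"]

def dump_tsv(out_path, metrics, since="-7d"):
    """Write metric, timestamp, value rows for every non-empty data point."""
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t")
        writer.writerow(["metric", "timestamp", "value"])
        for metric in metrics:
            url = "{}/render?target={}&from={}&format=json".format(
                GRAPHITE_URL, metric, since)
            with urllib.request.urlopen(url) as resp:
                series = json.loads(resp.read().decode("utf-8"))
            for target in series:
                for value, timestamp in target["datapoints"]:
                    if value is not None:  # skip empty retention slots
                        writer.writerow([target["target"], timestamp, value])

dump_tsv("graphite-backup.tsv", METRICS)
```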

I really don't know why we are all expecting graphite to unexpectedly lose our data? If you configure it to keep the data for 100 years / 25 years / whatever, it will. I see no reports of parts of graphite's databases vanishing when not already configured to do so.

As far as I understand Graphite, using it as a source for the backup means that we are losing data after the retention cutoff. So the source for the backup needs to be somewhere before things are written into Graphite?

Yes, but we can simply set retention to 100 years and leave it for future us to worry about.

And when this setting changes without us noticing until the backup is overwritten, it overwrites something that is supposed to be append only. It seems simpler to write to Graphite and to something else in addition. But if you really want to back up from Graphite, I guess you could keep infinite backup copies to prevent this. Then one also needs to make sure to run the backup each time after we write new data to Graphite.

And when this setting changes without us noticing until the backup is overwritten, it overwrites something that is supposed to be append only

Quote from http://graphite.readthedocs.org/en/latest/config-carbon.html :

This retention is set at the time the first metric is sent.

This means that if the configuration changes, the retention for existing metrics will remain the same and thus the data will remain.
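
For illustration, a retention rule for daily metrics kept (effectively) forever might look like this in Carbon's storage-schemas.conf; the section name and pattern are made up, and 1d:100y means one data point per day retained for 100 years:

```
[wikidata_daily]
pattern = ^wikidata\.
retentions = 1d:100y
```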

I really don't know why we are all expecting graphite to unexpectedly lose our data?

I really don't know why we are all expecting this winter to unexpectedly get colder than the summer?
The fact that German long-distance train operators don't expect the winter doesn't mean you should make the same error.
It would surprise me if something designed to never keep data permanently, and to always delete it after a certain time, were good for permanent long-term storage. Notice how there is no indefinite retention setting.
But see my previous post on how to do backups from Graphite in a way that sufficiently works around that.

As said above, whatever the solution, we will want to take backups in some form or another.
Having to do backups of graphite is trivial; adding data back if something goes wrong is trivial.

But see my previous post on how to do backups from Graphite in a way that sufficiently works around that.

Yes, this would work. And we would need to do a similar thing for any backup solution, no matter what we are backing up.
I mean, as a really crude second storage method we could simply write everything we send to graphite to a text file; then re-adding it in case something goes wrong would be as simple as replaying the file.
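
A crude sketch of that idea (host, port and log path are placeholders): every line sent to Carbon is also appended to a local log, and replaying is just re-sending each logged line.

```
# Sketch: write-through append-only log next to Graphite, plus a replay helper.
import socket
import time

CARBON_HOST = "graphite-analytics.example.org"  # placeholder
CARBON_PORT = 2003
LOG_PATH = "/srv/metrics/graphite-feed.log"      # placeholder append-only log

def record(path, value, timestamp=None):
    """Append the plaintext-protocol line to the log, then send it to Carbon."""
    timestamp = int(timestamp if timestamp is not None else time.time())
    line = "{} {} {}\n".format(path, value, timestamp)
    with open(LOG_PATH, "a") as log:
        log.write(line)
    with socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=10) as sock:
        sock.sendall(line.encode("ascii"))

def replay(log_path=LOG_PATH):
    """Re-send every logged line, e.g. after restoring a fresh Graphite instance."""
    with open(log_path) as log, socket.create_connection(
            (CARBON_HOST, CARBON_PORT), timeout=10) as sock:
        for line in log:
            sock.sendall(line.encode("ascii"))
```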

Addshore changed the task status from Open to Stalled. Nov 13 2015, 2:21 PM

After many discussions in many places we have decided to try and push forward with the config change and start storing data in this graphite instance on a daily basis.

Per discussions revealing that there isn't really any guarantee of persistence in that graphite instance, we will simply back up all data we add there.

We will also look at a better solution on the analytics cluster in the longer term.

Addshore claimed this task.
Addshore moved this task from Watching / External to Done on the WMDE-Analytics-Engineering board.