We need to come up with aggregated metrics that we can move to prod collectors for indefinite storage.
Description
Status | Assigned | Task
---|---|---
Resolved | Jgreen | T91508 [Epic] overhaul fundraising cluster monitoring
Resolved | cwdent | T152562 Port fundraising stats off Ganglia
Declined | None | T175738 Long term storage for frack prometheus data
Unknown Object (Task) | |
Resolved | cwdent | T186073 Rack/setup frmon1001
Resolved | ayounsi | T198516 NAT and DNS for fundraising monitor host
Resolved | Jgreen | T198648 Authentication for grafana
Event Timeline
Sounds awesome!
re: indefinite storage, the global instance of Prometheus now has 1yr retention, likely to be moved to 2yrs.
We will look into aggregated stats again later, but there were spare 1TB disks on the lvs servers, so I moved the prometheus backend there and set a 2 year retention. Our rate of collection will probably increase, but at the current rate 1TB would last about 20 years, so we should have plenty of time to figure it out.
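For reference, the arithmetic behind that estimate can be sketched roughly as below. The 1TB capacity, the 20 year figure and the 2 year retention come from the comment above; the function names, the decimal TB (1000 GB) and the 5x growth factor are assumptions made only for illustration.

```python
def implied_growth_gb_per_year(capacity_gb: float, years_to_fill: float) -> float:
    """Ingest rate implied by 'this disk would last N years'."""
    return capacity_gb / years_to_fill


def years_until_full(capacity_gb: float, growth_gb_per_year: float,
                     growth_factor: float = 1.0) -> float:
    """Years until the disk fills if the ingest rate is multiplied by growth_factor."""
    return capacity_gb / (growth_gb_per_year * growth_factor)


def steady_state_usage_gb(growth_gb_per_year: float, retention_years: float) -> float:
    """Approximate disk usage once retention starts expiring old data."""
    return growth_gb_per_year * retention_years


# Figures from the comment: 1 TB disk (taken as 1000 GB), about 20 years to fill.
rate = implied_growth_gb_per_year(1000, 20)

# Hypothetical scenario: the collection rate grows fivefold.
years_at_5x = years_until_full(1000, rate, growth_factor=5)

# With a 2 year retention, usage levels off instead of growing indefinitely.
plateau = steady_state_usage_gb(rate, 2)
```

At the stated rate this works out to about 50 GB per year, so a 2 year retention would level off at around 100 GB, leaving room for the collection rate to grow.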
re: long term storage of data in Prometheus, I wanted to expand on it, also wrt hardware requirements in {T175364}. See https://phabricator.wikimedia.org/T180105#3759016 for a longer explanation, but the tl;dr is that the limiting factor for querying metrics in the past is loading all datapoints for the query into memory. Since a single Prometheus instance doesn't downsample data, queries involving "many" metrics will have trouble looking back e.g. one year due to memory constraints.
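To give a feel for the scale involved, here is a back-of-the-envelope sketch of how many datapoints a long-range query touches without downsampling. All the numbers are assumptions for illustration, not measurements from this cluster: a 60 second scrape interval, 1000 series matched by the query, and 16 bytes per raw sample (an 8-byte timestamp plus an 8-byte float value). Actual Prometheus memory use will differ.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def samples_in_range(series: int, range_seconds: int, scrape_interval_seconds: int) -> int:
    """Number of raw samples a query must load when no downsampling is available."""
    return series * (range_seconds // scrape_interval_seconds)


def raw_bytes(samples: int, bytes_per_sample: int = 16) -> int:
    """Rough uncompressed size, assuming 8-byte timestamp + 8-byte value per sample."""
    return samples * bytes_per_sample


# Hypothetical query: 1000 series over one year at a 60 s scrape interval.
year_samples = samples_in_range(1000, SECONDS_PER_YEAR, 60)
year_bytes = raw_bytes(year_samples)

# The same query if the data were downsampled to one point per hour.
hourly_samples = samples_in_range(1000, SECONDS_PER_YEAR, 3600)

print(f"{year_samples:,} raw samples, about {year_bytes / 1e9:.1f} GB uncompressed")
print(f"{hourly_samples:,} samples at 1h resolution")
```

Under these assumptions the one-year query touches roughly half a billion samples, while hourly resolution would cut that by a factor of 60, which is the gap downsampling is meant to close.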
Closing this as wontfix because it appears to be a larger project than we want to take on, given Prometheus's design limitations: the downsampling issue fgiunchedi mentions above, and the project's lack of interest in keeping its storage format backward compatible.