Long term storage for frack prometheus data
Closed, Declined (Public)

Assigned To

None

Authored By

	• cwdent
	Sep 12 2017, 7:54 PM

Description

We need to come up with aggregated metrics that we can move to prod collectors for indefinite storage.

Related Objects

Status	Assigned	Task
Resolved	Jgreen	T91508 [Epic] overhaul fundraising cluster monitoring
Resolved	• cwdent	T152562 Port fundraising stats off Ganglia
Declined	None	T175738 Long term storage for frack prometheus data
		Unknown Object (Task)
Resolved	• cwdent	T186073 Rack/setup frmon1001
Resolved	ayounsi	T198516 NAT and DNS for fundraising monitor host
Resolved	Jgreen	T198648 Authentication for grafana

Event Timeline

• cwdent created this task. Sep 12 2017, 7:54 PM

Sounds awesome!

re: indefinite storage, the global instance of Prometheus now has 1 year of retention, likely to be extended to 2 years.

Jgreen added a subtask: Unknown Object (Task). Sep 20 2017, 2:17 PM

We will look into aggregated stats again later, but there were spare 1TB disks on the lvs servers, so I moved the prometheus backend there and set a 2 year retention. Our rate of collection will probably increase, but at the current rate 1TB would last about 20 years, so we should have plenty of time to figure it out.
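The estimate above can be sanity-checked with a little arithmetic. This is only a rough sketch: the 1TB capacity, the roughly 20 year figure, and the 2 year retention come from the comment, while the decimal terabyte, the 365-day year, and all function names are assumptions made for illustration.

```python
# Back-of-the-envelope check of the "1TB lasts about 20 years" estimate.
# Assumptions (not from the task): 1 TB = 10**12 bytes, 365-day years,
# and a constant collection rate. Function names are hypothetical.

TB = 10**12
DAYS_PER_YEAR = 365


def implied_bytes_per_day(capacity_bytes: float, years_to_fill: float) -> float:
    """Daily write rate implied by 'this disk fills in N years'."""
    return capacity_bytes / (years_to_fill * DAYS_PER_YEAR)


def retention_footprint_bytes(bytes_per_day: float, retention_days: float) -> float:
    """Steady-state disk usage once old data starts being dropped."""
    return bytes_per_day * retention_days


def years_until_full(capacity_bytes: float, bytes_per_day: float) -> float:
    """How long the disk lasts at a given daily rate with no retention limit."""
    return capacity_bytes / (bytes_per_day * DAYS_PER_YEAR)


if __name__ == "__main__":
    rate = implied_bytes_per_day(TB, 20)
    print(f"implied rate: {rate / 10**6:.0f} MB/day")
    footprint = retention_footprint_bytes(rate, 2 * DAYS_PER_YEAR)
    print(f"2 year retention footprint: {footprint / 10**9:.0f} GB")
    # Hypothetical growth scenario: collection rate quadruples.
    print(f"years until full at 4x rate: {years_until_full(TB, 4 * rate):.1f}")
```

Under these assumptions the implied rate is on the order of 137 MB per day, and a 2 year retention window would occupy roughly a tenth of the disk, so the collection rate could grow several times over before space becomes a concern.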

Joe unsubscribed. Sep 25 2017, 6:18 AM

re: long term storage of data in Prometheus, I wanted to expand on it, also wrt the hardware requirements in {T175364}. See https://phabricator.wikimedia.org/T180105#3759016 for a longer explanation, but the tl;dr is that the limiting factor for querying metrics in the past is loading all datapoints for the query into memory. Since a single Prometheus instance doesn't downsample data, queries involving "many" metrics will have trouble looking back e.g. one year due to memory constraints.
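To make the memory constraint concrete, here is a rough sketch of how the number of datapoints a query must load scales with the number of series and the lookback window. The series count, the 60 second scrape interval, and the 16 bytes per decoded sample (an 8-byte timestamp plus an 8-byte float value) are illustrative assumptions, not figures from this task; real memory use also depends on query shape and Prometheus internals.

```python
# Rough estimate of datapoints a range query has to load without downsampling.
# All inputs are illustrative assumptions; function names are hypothetical.

SECONDS_PER_DAY = 86_400
BYTES_PER_SAMPLE = 16  # assumed: 8-byte timestamp + 8-byte float value


def samples_loaded(series: int, range_seconds: int, interval_seconds: int) -> int:
    """Samples touched when every series has one point per interval."""
    return series * (range_seconds // interval_seconds)


def approx_bytes(samples: int, bytes_per_sample: int = BYTES_PER_SAMPLE) -> int:
    """Memory needed to hold the decoded samples, ignoring all overhead."""
    return samples * bytes_per_sample


if __name__ == "__main__":
    one_year = 365 * SECONDS_PER_DAY
    raw = samples_loaded(series=1000, range_seconds=one_year, interval_seconds=60)
    hourly = samples_loaded(series=1000, range_seconds=one_year, interval_seconds=3600)
    print(f"raw 60s data:   {raw:,} samples, ~{approx_bytes(raw) / 10**9:.1f} GB")
    print(f"1h downsampled: {hourly:,} samples, ~{approx_bytes(hourly) / 10**6:.0f} MB")
```

With these assumed numbers, a one year query over 1000 series at full resolution touches about 60 times as many samples as the same query over hourly aggregates, which is the gap that downsampling would close.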

reopening for visibility re: last comment, @cwdent @Jgreen

K4-713 subscribed. Nov 14 2017, 7:03 PM

RobH closed subtask Unknown Object (Task) as Resolved. May 31 2018, 4:31 PM

• Phabricator_maintenance moved this task from Backlog to Acknowledged on the SRE board. Jan 26 2019, 9:28 PM

(Resetting assignee as @cwdent has left WMF)

Jgreen moved this task from Triage to Backlog on the fundraising-tech-ops board. Feb 19 2020, 10:50 PM

Closing this as wontfix because it appears to be a larger project than we want to take on due to prometheus's design limitations: both the downsampling issue fgiunchedi mentions above and the project's lack of interest in backward storage scheme compatibility.
