Prometheus/Grafana snapshots for long term storage
Closed, Resolved (Public)

Description

Keeping very long retention in Prometheus is the wrong approach; the data is meant to be ephemeral. Let's figure out a way to save an overview, perhaps via the API, into MySQL or a similar store. If Grafana can also read the stored data, that would be a win.
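A minimal sketch of the "save an overview via the API" idea, assuming the overview is pulled from Prometheus's `/api/v1/query_range` endpoint at a coarse step and flattened into rows. The table name `metric_overview`, the helper names, and the use of `sqlite3` as a stand-in for MySQL are all assumptions made for illustration, not something this task settled on.

```python
import sqlite3
from urllib.parse import urlencode


def build_query_range_url(base_url, query, start, end, step):
    """Return a Prometheus /api/v1/query_range URL for the given window."""
    params = urlencode({"query": query, "start": start, "end": end, "step": step})
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


def matrix_to_rows(response):
    """Flatten a query_range 'matrix' response into (series, ts, value) rows."""
    rows = []
    for series in response["data"]["result"]:
        # Sort labels so the same series always maps to the same key.
        labels = ",".join(f"{k}={v}" for k, v in sorted(series["metric"].items()))
        for ts, value in series["values"]:
            # Prometheus returns sample values as strings.
            rows.append((labels, int(ts), float(value)))
    return rows


def store_rows(conn, rows):
    """Write rows into a hypothetical overview table (sqlite3 stands in for MySQL)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS metric_overview ("
        "series TEXT, ts INTEGER, value REAL, PRIMARY KEY (series, ts))"
    )
    # INSERT OR REPLACE keeps re-runs over the same window idempotent.
    conn.executemany("INSERT OR REPLACE INTO metric_overview VALUES (?, ?, ?)", rows)
    conn.commit()
```

Fetching the URL (for example with `urllib.request.urlopen`) and parsing the body with `json.loads` would produce the `response` dict that `matrix_to_rows` expects. A coarse `step` such as `1h` keeps the stored overview small compared with the raw samples.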

Event Timeline

My current approach is to export Grafana/Prometheus data as JSON and use the SimpleJSON back-end. This will also require:

I am also going to try plugging MySQL directly into Grafana, just to see what happens.
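For the JSON export route mentioned above, a rough sketch of the conversion step might look like the following. It assumes the export starts from a Prometheus `query_range` matrix response and targets the time-series shape the SimpleJSON datasource reads from its `/query` endpoint (a list of `target` / `datapoints` objects, with each datapoint as `[value, unix_ms]`). The function name is hypothetical.

```python
import json


def to_simplejson_series(response):
    """Convert a Prometheus matrix response into SimpleJSON-style series."""
    out = []
    for series in response["data"]["result"]:
        metric = series["metric"]
        name = metric.get("__name__", "")
        labels = ",".join(
            f'{k}="{v}"' for k, v in sorted(metric.items()) if k != "__name__"
        )
        # Target is rendered like a Prometheus selector, e.g. up{job="node"}.
        target = f"{name}{{{labels}}}"
        # Prometheus gives [unix_seconds, "value"]; SimpleJSON wants [value, unix_ms].
        datapoints = [[float(v), int(float(ts) * 1000)] for ts, v in series["values"]]
        out.append({"target": target, "datapoints": datapoints})
    return out


def dump_series(response, path):
    """Write the converted series to a JSON file that a small web service could serve."""
    with open(path, "w") as fh:
        json.dump(to_simplejson_series(response), fh)
```

Note the two format differences handled here: the value and timestamp swap positions, and the timestamp changes from seconds to milliseconds.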

(Resetting assignee as @cwdent has left WMF)

Added cross-colo snapshots of the Grafana DB and data in case of catastrophic machine failure.

[frack::puppet] 6cae0b58 Add grafana dir to cross host sync on monitoring role

With our current retention settings we are storing ~765 days of metrics. Data that far back exists only on frmon1001, as frmon2001 wasn't put into service until mid-2019. That level of retention takes up ~75G of disk space and, as such, is not currently mirrored or archived off the hosts.
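As a back-of-the-envelope check on those figures, the sketch below works out the average daily growth and a linear projection. It assumes "75G" means GiB and that growth is roughly uniform over the retention window; neither is confirmed here.

```python
RETENTION_DAYS = 765
DISK_GIB = 75  # assumption: "75G" is read as GiB


def mib_per_day(disk_gib, days):
    """Average on-disk growth per day of retained data, in MiB."""
    return disk_gib * 1024 / days


def projected_gib(disk_gib, days, target_days):
    """Linear projection of disk use for a different retention window."""
    return disk_gib / days * target_days


print(f"{mib_per_day(DISK_GIB, RETENTION_DAYS):.1f} MiB/day")
```

Under those assumptions the store grows by about 100 MiB per day, so doubling retention would need roughly another 75G.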

The full /srv/prometheus tree is backed up to the peer server every 4 hours, using an rsync snapshot. So frmon2001 has historical snapshots of frmon1001's Prometheus data.
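The exact snapshot mechanism is not described in this task. One common pattern, sketched here purely as an assumption, is a timestamped rsync copy that hard-links unchanged files against the previous snapshot via `--link-dest`. The destination root, the naming scheme, and the helper names are hypothetical; only `/srv/prometheus` and the host names come from the comment above.

```python
from datetime import datetime, timezone


def snapshot_name(now):
    """Name a snapshot after the start of its 4-hour slot, e.g. 2020-06-01T08."""
    slot_hour = now.hour - now.hour % 4
    return now.strftime("%Y-%m-%d") + f"T{slot_hour:02d}"


def rsync_snapshot_cmd(src, dest_host, dest_root, name, previous=None):
    """Build an rsync argv list for one snapshot run (nothing is executed here)."""
    cmd = ["rsync", "-a", "--delete"]
    if previous:
        # Hard-link files unchanged since the previous snapshot to save space.
        cmd.append(f"--link-dest={dest_root}/{previous}")
    # Trailing slash on the source copies the directory contents, not the directory.
    cmd += [f"{src.rstrip('/')}/", f"{dest_host}:{dest_root}/{name}/"]
    return cmd


if __name__ == "__main__":
    name = snapshot_name(datetime.now(timezone.utc))
    # /srv/backup/prometheus is a hypothetical destination path.
    print(" ".join(rsync_snapshot_cmd("/srv/prometheus", "frmon2001",
                                      "/srv/backup/prometheus", name)))
```

Run from a scheduler every 4 hours, the resulting list could be passed to `subprocess.run`. Because Prometheus TSDB blocks are immutable once written, most files would hard-link rather than copy between runs.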

Note: in production, Thanos is in use (T252186).

Jgreen claimed this task.
Jgreen moved this task from Backlog to Done on the fundraising-tech-ops board.

Closing as resolved because at least the long-term storage part has been done. We're probably still misusing Prometheus as a multi-year storage engine, but at our scale it's hard to justify spinning up a service such as Thanos if Prometheus alone will suffice.