
Capacity planning/estimation for Thanos
Open, Needs Triage, Public

Description

CapEx time is upon us; this task will track the following:

  • Capacity estimation/planning for Thanos needs in terms of object storage, also in light of T351927
  • Estimation of thanos compact disk space needs; compaction currently brings titan disk utilization close to maximum, so we'll likely need to add some capacity there too

Object storage requirement estimation

Thanos data is written by Prometheus to object storage in the form of blocks of raw datapoints; each block represents a few hours of data. The raw blocks are then downsampled to lower resolutions (5m, 1h) and written back to storage. All hours-long blocks (at all resolutions) are also compacted into 14-day blocks for space savings. To each resolution we then apply a retention policy that deletes older blocks.

Due to storage space pressure, in T351927 we implemented additional block cleanup logic: it takes into account the fact that Prometheus in eqiad and codfw is replicated (two Prometheus hosts per site), so we also have blocks of very similar data that can be deleted if need be. We have already performed that deletion for blocks older than 3 months, hence below I only consider blocks newer than that, so the figures aren't skewed by the extra deletion.

Default retention strategy

This is the simplest strategy and the one implemented by Thanos: we only delete blocks once they are too old (i.e. past their retention period).
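
To make the cutoff rule concrete, here is a minimal sketch of that logic. It is an illustration only, not the actual Thanos implementation; the retention values mirror the tables below.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical per-resolution retention periods, mirroring the tables below.
RETENTION = {
    "raw": timedelta(weeks=54),
    "5m": timedelta(weeks=270),
    "1h": timedelta(weeks=270),
}

def is_expired(block_max_time: datetime, resolution: str, now: datetime) -> bool:
    # A block becomes eligible for deletion once its newest sample is older
    # than the retention period configured for its resolution.
    return now - block_max_time > RETENTION[resolution]

now = datetime.now(timezone.utc)
# A raw block whose newest sample is ~60 weeks old is past the 54w retention.
print(is_expired(now - timedelta(weeks=60), "raw", now))  # True
```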

For the last ~two months we get the following usage:

# days  GB     resolution  GB/day
76      11143  0s (raw)    146
73      8433   5m          115
71      1595   1h          22
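
The GB/day column is simply the observed GB divided by the number of days covered; a quick sketch of that arithmetic using the figures from the table above:

```python
# Observed usage over the last ~two months (figures from the table above).
observed = [
    # (resolution, days, GB)
    ("0s (raw)", 76, 11143),
    ("5m", 73, 8433),
    ("1h", 71, 1595),
]

for resolution, days, gb in observed:
    print(f"{resolution}: {gb / days:.1f} GB/day")
# 0s (raw): 146.6 GB/day
# 5m: 115.5 GB/day
# 1h: 22.5 GB/day
```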

Extrapolating from that we get:

Current retention

This is the retention policy we have configured in Puppet as of today.

# weeks  GB      resolution
54       55188   0s
270      217350  5m
270      41580   1h

That yields a grand total of ~314 TB needed. Thanos storage is ~130 TB total, meaning we'd need to more than double the capacity (!), which is not a great situation.
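
For reference, the extrapolation is just weeks x 7 x GB/day per resolution; a small sketch reproducing the table and the ~314 TB total:

```python
# GB/day rates from the usage table, current retention (weeks) from Puppet.
rates = {"0s": 146, "5m": 115, "1h": 22}
retention_weeks = {"0s": 54, "5m": 270, "1h": 270}

total_gb = 0
for resolution, gb_per_day in rates.items():
    gb = retention_weeks[resolution] * 7 * gb_per_day
    total_gb += gb
    print(f"{resolution}: {gb} GB")        # 55188 / 217350 / 41580
print(f"total: {total_gb / 1000:.0f} TB")  # ~314 TB, against ~130 TB available
```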

Proposed retention and hardware needs

As a reasonable compromise I think we can do the following: keep 0s and 5m data for slightly longer than a year (so year-over-year comparisons are possible), i.e. about 60w, and keep 1h data for longer since it is significantly less expensive to store. In other words (rounding up the numbers):

# weeks  GB      resolution
60       ~62000  0s
60       ~50000  5m
280      ~43000  1h

Or ~155TB total, meaning we need to add about 30-40TB to current Thanos storage.
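
Same arithmetic for the proposed retention, using the GB/day rates from above:

```python
# GB/day rates from the usage table, proposed retention in weeks.
rates = {"0s": 146, "5m": 115, "1h": 22}
proposed_weeks = {"0s": 60, "5m": 60, "1h": 280}

total_gb = sum(proposed_weeks[r] * 7 * rates[r] for r in rates)
print(f"total: {total_gb / 1000:.0f} TB")
# ~153 TB unrounded, ~155 TB with the rounded-up figures in the table above.
```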

In terms of hardware this translates into an additional two hosts of the 24x 8TB class, which would provide plenty of headroom (an additional ~100TB usable). We could probably also get away with two hosts of the 12x 4TB class (i.e. what thanos-be is now), though that wouldn't provide much headroom.
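
A rough sanity check of the two options, assuming 3x Swift replication and ignoring filesystem and other overhead (both assumptions, so real usable space will be somewhat lower):

```python
def usable_tb(hosts: int, disks_per_host: int, disk_tb: int, replicas: int = 3) -> float:
    # Raw capacity divided by the assumed replication factor; overhead is
    # ignored, so this is an upper bound on usable space.
    return hosts * disks_per_host * disk_tb / replicas

print(usable_tb(2, 24, 8))  # 128.0 -> in the ballpark of the ~100 TB of headroom above
print(usable_tb(2, 12, 4))  # 32.0  -> covers the 30-40 TB shortfall with little margin
```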

Titan hosts storage

The titan hosts run the block compaction processes described above and require temporary space to write the compacted blocks to disk before upload. The hosts have been managing, though they occasionally get tight on disk space; for this reason we should procure additional SSDs to install in them, to get ahead of the curve.

Hardware needs

We'll need 2x SSDs per host (across 4x hosts), so a total of 8x SSDs of 500GB capacity or greater to install in already existing hosts.

Event Timeline

cc @MatthewVernon and SRE-swift-storage for your input re: capacity planning and hardware needs for thanos-be, let me know what you think!

I think the proposed table should look like this?

# weeks  GB      resolution
60       ~62000  0s
60       ~50000  5m
280      ~43000  1h

I.e. 60W (as per text, a bit over a year), not 50W as you currently have? My back-of-an-envelope calculation has the GBs figures about right, though, so I don't think it changes the thrust of your argument.

I think on your numbers two 12x4 systems would be likely insufficient (or at least cutting it fine, which I'd rather not do), but two 24x8 systems would be good. It might be worth moving them to the new-style disk usage we have for recent ms-be* nodes too? i.e. JBOD rather than a set of 1-disk RAID-0 arrays. I bring that up because it changes how DC-ops configure the nodes, so it's worth remembering when ordering hw.

Obviously if you think there's value in continuing with the current retention policy, there's no reason we couldn't do that beyond budget ( :-) ), but I get the impression you don't.

> I think the proposed table should look like this?
>
> # weeks  GB      resolution
> 60       ~62000  0s
> 60       ~50000  5m
> 280      ~43000  1h
>
> I.e. 60W (as per text, a bit over a year), not 50W as you currently have? My back-of-an-envelope calculation has the GBs figures about right, though, so I don't think it changes the thrust of your argument.

Thank you, I've fixed the table to read 60w instead.

> I think on your numbers two 12x4 systems would be likely insufficient (or at least cutting it fine, which I'd rather not do), but two 24x8 systems would be good. It might be worth moving them to the new-style disk usage we have for recent ms-be* nodes too? i.e. JBOD rather than a set of 1-disk RAID-0 arrays. I bring that up because it changes how DC-ops configure the nodes, so it's worth remembering when ordering hw.

Indeed, I'm +1 on moving to the JBOD configuration.

> Obviously if you think there's value in continuing with the current retention policy, there's no reason we couldn't do that beyond budget ( :-) ), but I get the impression you don't.

Yeah, if we can get the bigger systems then I'm definitely for extending the retention beyond 60w as far as space allows, accounting for other thanos-swift users too, of course.
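
To put a number on "as far as space allows", a small sketch using the GB/day rates from the usage table above (the space budget is a placeholder to plug real figures into):

```python
# GB/day rates from the usage table in the task description.
rates = {"0s": 146, "5m": 115, "1h": 22}

def max_weeks(budget_tb: float, resolution: str) -> float:
    # How many weeks of a single resolution fit into a given space budget.
    return budget_tb * 1000 / (rates[resolution] * 7)

# Example: ~50 TB left over for 1h data would allow roughly 325 weeks of it.
print(f"{max_weeks(50, '1h'):.0f} weeks")  # 325
```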

Moving off the Q4 board since we have the hw in the capex spreadsheet and it'll be coming next FY.