
Prometheus/Pyrra: establish backfill process for recording rules
Open, High, Public

Description

As we onboard SLOs Pyrra will automatically create a set of recording rules. Prometheus has a full history of the metrics used in our SLOs, however the recording rules used behind the scenes will only collect data from the time when the recording rule is deployed forwards. In order to see historical data via recording rules, we'll need to backfill these recording rule metrics somehow.

Upstream has an open request to make this process automatic within prometheus/pyrra (https://github.com/prometheus/prometheus/issues/10202), which describes the issue well.

In the meantime, let's identify a workable in-house process to backfill data into these recording rules so that new (or recently changed) SLOs will show a full history.

Event Timeline

Most promising option at the moment looks like https://prometheus.io/docs/prometheus/latest/command-line/promtool/#promtool-tsdb-create-blocks-from-rules

Going to give this a shot with one of our existing recording rules

Thank you for looking into this, I'm definitely +1 on trying. Please either test in Pontoon end-to-end or don't feed the resulting blocks back to Prometheus production, as I'm not sure what might happen, especially on the Thanos side (sidecar + compactor).

The initial approach I'm thinking of is running promtool tsdb create-blocks-from rules against the recording rules created by pyrra (since they are new and have no production dependencies yet) and output that into an unused scratch directory. Then transfer that over to pontoon and experiment with loading/using them, and go from there. Sound alright?

The initial approach I'm thinking of is running promtool tsdb create-blocks-from rules against the recording rules created by pyrra (since they are new and have no production dependencies yet) and output that into an unused scratch directory. Then transfer that over to pontoon and experiment with loading/using them, and go from there. Sound alright?

+1 sounds good to me!

Change 967969 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] prometheus: apt::pin prometheus package to bullseye-backports

https://gerrit.wikimedia.org/r/967969

Mentioned in SAL (#wikimedia-operations) [2023-10-23T18:31:57Z] <herron> sretest1001:~/tmp/backfill$ promtool tsdb create-blocks-from rules --start 1672531200 --end 1698080718 --url http://prometheus.svc.eqiad.wmnet/ops/ logstash-requests.yaml T349521

promtool tsdb create-blocks-from rules is looking promising so far. Here's an SLO recording rule that was deployed 2 days ago with history going back 12 weeks:

Screen Shot 2023-10-23 at 5.16.44 PM.png (1×4 px, 313 KB)

What I did was export the rules generated by pyrra to sretest (because create-blocks-from rules needs a newer Prometheus version than what's installed in production) and run the logged command above. That ran for 15-20 minutes; then I copied the generated data to a Prometheus host, adjusted permissions, and moved it into the metrics directory.
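For reference, the invocation from the SAL entry above can be sketched like this (the window start corresponds to 2023-01-01 UTC; the rule file and URL are the ones used in that test):

```shell
# Sketch of the backfill invocation; timestamps are computed rather than
# hard-coded so the end of the window is "now".
START=$(date -u -d '2023-01-01 00:00:00' +%s)
END=$(date -u +%s)
CMD="promtool tsdb create-blocks-from rules --start ${START} --end ${END} --url http://prometheus.svc.eqiad.wmnet/ops/ logstash-requests.yaml"
echo "$CMD"
```

If I recall the tool's defaults correctly, the resulting blocks land in a local data/ directory (see --output-dir), which is what then gets copied over to the target host.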

Notes:

  1. One set of pyrra recording rules created about 500 MB of data going back to the beginning of 2023
  2. Importing the blocks required enabling --storage.tsdb.allow-overlapping-blocks on the prometheus instance
  3. Prometheus threw "too many open files" errors with a 32k max-open-files limit; it worked after raising the limit to 500k
  4. Prometheus immediately deleted blocks exceeding the tsdb retention value, and the thanos sidecar threw a bunch of "no such file" errors while attempting to read older blocks. We will likely need to temporarily extend the retention on the instance being backfilled, or figure out some sort of short-lived prometheus/sidecar setup for backfilling
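The notes above map to roughly these instance settings (a sketch; the flag names follow the Prometheus 2.x CLI, and the retention value is illustrative):

```shell
--storage.tsdb.allow-overlapping-blocks     # accept backfilled blocks that overlap existing ones
--storage.tsdb.retention.time=3y            # illustrative; must cover the span being backfilled
```

On top of that, the unit's open-files limit needs raising per note 3, e.g. LimitNOFILE=500000 in a systemd override for the prometheus service.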

Change #967969 abandoned by Herron:

[operations/puppet@production] prometheus: apt::pin prometheus package to bullseye-backports

Reason:

https://gerrit.wikimedia.org/r/967969

With T352756, T359879, and https://gerrit.wikimedia.org/r/c/operations/puppet/+/1017873 in mind, I think it'd be worth spending some time here to work out a strategy for bringing backfilled metrics into production.

More specifically, with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1017873 we've changed the label set associated with existing thanos-rule recording rules, and it'd be ideal to re-generate these metrics for consistency (and to fill in some operational gaps).

The testing in T349521#9274342 was successful in creating and loading backfilled blocks, but it hasn't been brought into production yet. That testing queried production Prometheus to create the backfill data, which was then loaded into the Pontoon testing environment.

However, to map the testing approach into production we would need a prometheus instance with...

  • long retention (>= the span being backfilled)
  • high open file limit
  • allow-overlapping-blocks enabled*

Creating a backfill or utility prometheus instance with a thanos sidecar comes to mind. We could leave that instance tuned with the settings above, and write some procedures about how to generate and drop tsdb backfill files into place as needed for onboarding and changing recording rules. I'm thinking the instance would normally be in an idle state, activated only when metrics are being backfilled.

*Thinking out loud, we could also consider using a replica label to handle overlapping blocks due to backfill. Backfilled metrics could be inserted with replica=backfill (or replica=z, or something like that) to automatically handle the case where existing and backfilled metrics overlap for a portion of the time range being backfilled.
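For context on where such a label would live: each uploaded block carries its external labels in the thanos section of its meta.json, so a backfilled block would contain an excerpt roughly like this (the label names are the hypothetical ones from above):

```json
{
  "thanos": {
    "labels": {
      "replica": "backfill",
      "recorder": "backfill"
    },
    "downsample": { "resolution": 0 }
  }
}
```

Thanos treats blocks with distinct external label sets as separate streams, which is what would make the overlap benign.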

@fgiunchedi @elukey interested in what you think about this, or if a different approach comes to mind?

Following up from a chat yesterday:

The idea of creating backfilled blocks is sound, although I think we can get away with uploading said blocks straight to thanos (making sure we're using distinct labels with e.g. recorder=backfill) and they will be compacted and available as usual (to be tested!)

Very ignorant about the internals but the procedure seems sound! In the ML case, we could live without backfilling the previous quarters/SLO time series, the more pressing thing is to start from a clean state (without gaps etc..). Thanks for the work!

elukey triaged this task as High priority.Jul 1 2025, 12:28 PM
elukey moved this task from Tooling Backlog to Observability Q1 on the SRE-SLO board.

Change #1171591 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] role::titan: install promtool

https://gerrit.wikimedia.org/r/1171591

Change #1171591 merged by Herron:

[operations/puppet@production] role::titan: install promtool

https://gerrit.wikimedia.org/r/1171591

Following up from a chat yesterday:

The idea of creating backfilled blocks is sound, although I think we can get away with uploading said blocks straight to thanos (making sure we're using distinct labels with e.g. recorder=backfill) and they will be compacted and available as usual (to be tested!)

Coming back around to try this approach in T400071: Clear & Backfill Tonecheck Pyrra Metrics, and not having much luck uploading directly to Thanos. The backfill blocks upload successfully, but they don't appear in queries, and they cause the thanos compactor to halt due to overlapping blocks. I've tried uploading with the external labels recorder=backfill and replica=backfill, passed as --label to thanos upload. It seems that promtool tsdb create-blocks-from rules itself is generating overlapping blocks.

My hunch is that the previous test using --storage.tsdb.allow-overlapping-blocks was a happy accident: it worked around the overlaps by letting the local prometheus instance compact the blocks before they were uploaded onward to Thanos. I haven't yet identified a straightforward way to compact the overlapping blocks manually (or to avoid generating them in the first place), which has me thinking that a lightweight prometheus instance may not be so bad, unless there are thoughts on other things to try first.

In that case, yes, an ad-hoc prometheus instance to run compaction on the blocks seems viable, cf. https://github.com/prometheus/prometheus/issues/11341 and https://github.com/prometheus/prometheus/issues/9465
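A configuration sketch for such a throwaway instance (flag names per the Prometheus 2.x CLI; the paths, config file, and retention value are illustrative): point it at a scratch data directory containing the backfilled blocks, give it no scrape targets, and let its compactor merge the overlapping blocks before they are shipped onward.

```shell
# Illustrative only: empty.yml is an empty (valid) config with no scrape jobs.
prometheus \
  --config.file=empty.yml \
  --storage.tsdb.path=/srv/backfill-scratch \
  --storage.tsdb.allow-overlapping-blocks \
  --storage.tsdb.retention.time=5y
```

Once the blocks have been compacted locally, an ad-hoc sidecar (or a manual upload) can move them into the Thanos bucket.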

T400071#11034605 steps through a backfill process with an ad-hoc prometheus (and ad-hoc sidecar) that worked to upload backfilled blocks to Thanos.

The biggest issue I've noticed so far is that the backfilled data appears and disappears in some queries as they are refreshed, but this isn't consistent across all queries that fetch backfilled data. For example, the first query in https://w.wiki/Eqq5 intermittently returns the backfilled data from Jul 01-02, while the second query returns it consistently.

Thinking out loud: I don't think there are overlapping blocks, since the backfill was uploaded with the external labels replica=backfill and recorder=backfill and so should be unique. I wonder if it's anything to do with queries being routed to different instances via load balancing, and potentially cached results? Maybe worth flushing memcached to see if it helps?

Yes indeed, a memcached flush seems easy and worth a try.

Change #1182886 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] profile::pyrra::filesystem::slo: add new slo define

https://gerrit.wikimedia.org/r/1182886

Change #1182886 merged by Herron:

[operations/puppet@production] profile::pyrra::filesystem::slo: add new slo define

https://gerrit.wikimedia.org/r/1182886

Change #1184566 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] thanos-store: set cutoff days to 1

https://gerrit.wikimedia.org/r/1184566

Change #1184566 merged by Herron:

[operations/puppet@production] thanos-store: set cutoff days to 1

https://gerrit.wikimedia.org/r/1184566

Backfill process has been documented in https://wikitech.wikimedia.org/wiki/Thanos#Backfilling_Metrics and used successfully several times. Resolving!

tappof reopened this task as Open.EditedDec 2 2025, 2:19 PM
tappof subscribed.

Due to the issues described in T410152: Disk space saturation (/srv) on Titan hosts, reverting the patch https://gerrit.wikimedia.org/r/1184566 was necessary.

I’m reopening this task.

To work around the revert, I'd propose adding a parallel Thanos Store dedicated only to the blocks labeled recorder=backfill. This would allow us to query the backfilled blocks immediately, as soon as they are generated, (hopefully) without introducing any further issues.
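A sketch of how that store could be scoped, assuming the backfilled blocks carry a recorder=backfill external label: Thanos Store supports a block selector relabel config, so the dedicated instance could keep only the backfill blocks.

```yaml
# Hypothetical selector for the backfill-only store, passed via
# thanos store --selector.relabel-config-file=... :
# keep only blocks whose external labels include recorder="backfill".
- action: keep
  source_labels: ["recorder"]
  regex: "backfill"
```

The existing store would presumably want the inverse (action: drop on the same label) so the two instances don't serve the same blocks twice.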

image.png (488×724 px, 21 KB)