
Prometheus/Pyrra: establish backfill process for recording rules
Open, Needs Triage, Public

Description

As we onboard SLOs, Pyrra automatically creates a set of recording rules. Prometheus has a full history of the metrics used in our SLOs; however, the recording rules used behind the scenes only collect data from the time the rule is deployed onwards. In order to see historical data via these recording rules, we'll need to backfill the recording rule metrics somehow.

Upstream has an open request to make this process automatic within prometheus/pyrra, which describes the issue well: https://github.com/prometheus/prometheus/issues/10202

In the meantime, let's identify a workable in-house process to backfill data into these recording rules so that new (or recently changed) SLOs will show a full history.

Event Timeline

Most promising option at the moment looks like https://prometheus.io/docs/prometheus/latest/command-line/promtool/#promtool-tsdb-create-blocks-from-rules

Going to give this a shot with one of our existing recording rules

Thank you for looking into this, I'm definitely +1 on trying. Please either test in Pontoon end-to-end or don't feed the resulting blocks back into production Prometheus, though, as I'm not sure what might happen, especially on the Thanos side (sidecar + compactor).

The initial approach I'm thinking of is running promtool tsdb create-blocks-from rules against the recording rules created by Pyrra (since they are new and have no production dependencies yet) and writing the output into an unused scratch directory. Then transfer that over to Pontoon, experiment with loading/using the blocks, and go from there. Sound alright?
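A rough sketch of that sequence, under the assumption we'd use the promtool subcommand from the docs linked above; hostnames, the rules file name, time window, and output paths below are placeholders, not the exact values we'd use:

```
# 1. Generate TSDB blocks from the Pyrra rules by querying production Prometheus
#    for the raw series (rules file, window, and output directory illustrative).
promtool tsdb create-blocks-from rules \
  --start 1672531200 \
  --end 1698080718 \
  --url http://prometheus.svc.eqiad.wmnet/ops/ \
  --output-dir ~/tmp/backfill \
  pyrra-rules.yaml

# 2. Ship the generated blocks to the Pontoon test instance and experiment there,
#    without touching production storage.
rsync -a ~/tmp/backfill/ pontoon-prometheus.example.wmnet:/srv/backfill-scratch/
```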


+1 sounds good to me!

Change 967969 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] prometheus: apt::pin prometheus package to bullseye-backports

https://gerrit.wikimedia.org/r/967969

Mentioned in SAL (#wikimedia-operations) [2023-10-23T18:31:57Z] <herron> sretest1001:~/tmp/backfill$ promtool tsdb create-blocks-from rules --start 1672531200 --end 1698080718 --url http://prometheus.svc.eqiad.wmnet/ops/ logstash-requests.yaml T349521

promtool tsdb create-blocks-from rules is looking promising so far. Here's an SLO recording rule that was deployed 2 days ago with history going back 12 weeks:

[Screenshot: Screen Shot 2023-10-23 at 5.16.44 PM.png (313 KB)]
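For reference, the Pyrra-generated rule groups being backfilled look roughly like the sketch below; the group name, metric names, and expressions here are made up for illustration, and the real content comes from the files Pyrra writes out:

```yaml
groups:
  - name: logstash-requests-slo
    rules:
      - record: logstash_requests:burnrate5m
        expr: sum(rate(logstash_requests_errors_total[5m])) / sum(rate(logstash_requests_total[5m]))
        labels:
          slo: logstash-requests
      - record: logstash_requests:burnrate1h
        expr: sum(rate(logstash_requests_errors_total[1h])) / sum(rate(logstash_requests_total[1h]))
        labels:
          slo: logstash-requests
```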

What I did was export the rules generated by Pyrra to sretest (because create-blocks-from rules needs a newer Prometheus version than what's installed in production) and execute the logged command above. That ran for 15-20 minutes; then I copied the generated data to a Prometheus host, adjusted permissions, and copied it into the metrics location, roughly as sketched below.
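Hostnames, paths, and ownership in this sketch are illustrative rather than the exact commands run:

```
# fetch the blocks generated on sretest
rsync -a sretest1001:~/tmp/backfill/data/ ./backfill-blocks/
# match the ownership used by the prometheus instance
sudo chown -R prometheus:prometheus ./backfill-blocks
# move the ULID-named block directories into the instance's metrics path;
# prometheus should pick them up on its next block reload, as long as
# overlapping blocks are allowed (see note 2 below)
sudo mv ./backfill-blocks/01* /srv/prometheus/ops/metrics/
```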

Notes:

  1. One set of Pyrra recording rules created about 500M worth of data going back to the beginning of 2023.
  2. Importing the blocks required enabling --storage.tsdb.allow-overlapping-blocks on the Prometheus instance.
  3. Prometheus threw "too many open files" errors when configured with a 32k max open files limit; it worked fine after increasing the limit to 500k (both adjustments are sketched after this list).
  4. Prometheus immediately deleted blocks exceeding the TSDB retention value, and the Thanos sidecar threw a bunch of "no such file" errors while attempting to read the older blocks. We will likely need to temporarily extend the retention on the instance being backfilled, or figure out some sort of short-lived Prometheus/sidecar setup for backfilling.
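For notes 2 and 3, the adjustments amount to something like the following; the drop-in path and instance name are hypothetical, while the flag and directive names are the standard Prometheus/systemd ones:

```
# systemd drop-in raising the open files limit for the instance,
# e.g. /etc/systemd/system/prometheus@ops.service.d/backfill.conf
[Service]
LimitNOFILE=500000

# plus the extra flag on the Prometheus command line while importing
# the backfilled blocks
--storage.tsdb.allow-overlapping-blocks
```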

Change #967969 abandoned by Herron:

[operations/puppet@production] prometheus: apt::pin prometheus package to bullseye-backports

Reason:

https://gerrit.wikimedia.org/r/967969

With T352756, T359879, and https://gerrit.wikimedia.org/r/c/operations/puppet/+/1017873 in mind, I think it'd be worth spending some time here to work out a strategy for bringing backfilled metrics into production.

More specifically, with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1017873 we've changed the label set associated with existing thanos-rule recording rules, and it'd be ideal to re-generate these metrics for consistency (and to fill in some operational gaps).

The testing in T349521#9274342 was successful in creating and loading backfilled blocks, but it hasn't been brought into production yet. That testing queried production Prometheus to create the backfill data, which was then backfilled into the Pontoon testing environment.

However, I can say that to map the testing approach into production we would need a prometheus instance with...

  • long retention (>= the span being backfilled)
  • high open file limit
  • allow-overlapping-blocks enabled*

Creating a backfill or utility prometheus instance with a thanos sidecar comes to mind. We could leave that instance tuned with the settings above, and write some procedures about how to generate and drop tsdb backfill files into place as needed for onboarding and changing recording rules. I'm thinking the instance would normally be in an idle state, activated only when metrics are being backfilled.
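As a sketch, such a utility instance might run with something like the following; paths, retention, and the port are placeholders, while the flag names are the standard Prometheus and Thanos sidecar ones:

```
prometheus \
  --config.file=/etc/prometheus-backfill/prometheus.yml \
  --storage.tsdb.path=/srv/prometheus/backfill/metrics \
  --storage.tsdb.retention.time=5y \
  --storage.tsdb.allow-overlapping-blocks

thanos sidecar \
  --tsdb.path=/srv/prometheus/backfill/metrics \
  --prometheus.url=http://localhost:9090 \
  --objstore.config-file=/etc/thanos/objstore.yaml
```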

*Thinking out loud, we could also consider using a replica label to handle overlapping blocks due to backfill. Backfilled metrics could be inserted with replica=backfill, or replica=z, or something like that, to automatically handle the case where existing and backfilled metrics overlap for a portion of the time range being backfilled.
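One way to carry that marker would be via the backfill instance's external labels, so every block it produces is tagged; the label names and values below are placeholders for whatever we'd standardize on:

```yaml
# prometheus.yml on the hypothetical backfill instance
global:
  external_labels:
    site: eqiad
    prometheus: ops
    replica: backfill
```

Thanos query would then need that label included in its replica-label configuration so deduplication treats the backfilled series as just another replica.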

@fgiunchedi @elukey interested in what you think about this, or if a different approach comes to mind?

Following up from a chat yesterday:

The idea of creating backfilled blocks is sound, although I think we can get away with uploading said blocks straight to Thanos (making sure we're using distinct labels, e.g. recorder=backfill) and they will be compacted and available as usual (to be tested!)
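If blocks go directly into the Thanos object store, the distinct labels would live in each block's meta.json (the thanos.labels section). A trimmed-down sketch, with the other fields a real block carries (ulid, minTime/maxTime, stats, compaction) omitted and the label set purely illustrative:

```json
{
  "version": 1,
  "thanos": {
    "labels": {
      "prometheus": "ops",
      "site": "eqiad",
      "recorder": "backfill"
    },
    "downsample": {
      "resolution": 0
    }
  }
}
```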

I'm very ignorant about the internals, but the procedure seems sound! In the ML case we could live without backfilling the previous quarters' SLO time series; the more pressing thing is to start from a clean state (without gaps etc.). Thanks for the work!