
Several days of metrics not uploaded to Thanos object storage from Prometheus on PoPs
Closed, Resolved · Public

Description

While investigating something else I noticed that data post-migration (T243057) wasn't being uploaded to Thanos. We've re-enabled uploads; however, there's a gap that needs to be backfilled for each of eqsin/esams/ulsfo, as seen from https://thanos.wikimedia.org/bucket/ :

2020-10-20-155321_475x206_scrot.png (206×475 px, 8 KB)

One question is how the missing data will fit in with the rest: the existing blocks on e.g. prometheus5001 are already compacted (into 24h blocks), and maybe the Thanos compactor will DTRT if we upload the blocks and pretend they belong to prometheus5001 (i.e. upload said blocks with the exact same labels, most importantly replica: a).

Alternatively, I think a viable solution that should not bother/involve the Thanos compactor is uploading the missing data as if it were on another replica (e.g. replica: b, site: eqsin, prometheus: ops in the example above). Then at query time Thanos will DTRT and merge/deduplicate results from the different replicas.
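For illustration, with the second approach the backfilled blocks would end up in the bucket carrying external labels along these lines in the thanos section of each block's meta.json (values shown here are illustrative for eqsin):

"thanos": {
  "labels": {
    "prometheus": "ops",
    "replica": "b",
    "site": "eqsin"
  },
  "downsample": {
    "resolution": 0
  },
  "source": "sidecar"
}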

At any rate, we'll need to test any/all of those scenarios first, before actually uploading data to the production Thanos bucket. Therefore we'll need to:

  • replicate the subset of data we're interested in that's already uploaded (i.e. data from the PoPs that contains the gap) from the production bucket to a test bucket (see the sketch after this list)
  • upload the missing prometheus blocks from PoPs into the test bucket
  • run the compactor on the test bucket and see what happens
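For the first step, a possible sketch (assuming thanos tools bucket replicate is available in the version we run; the objstore config paths and the matcher syntax below are placeholders to double-check against --help):

# copy only blocks whose external labels match a given PoP from the
# production bucket into a throwaway test bucket
thanos tools bucket replicate \
  --objstore.config-file=/etc/thanos/objstore-prod.yaml \
  --objstore-to.config-file=/etc/thanos/objstore-test.yaml \
  --matcher='site="eqsin"'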

Details

Due Date
Dec 9 2020, 4:00 PM

Event Timeline

lmata set Due Date to Dec 9 2020, 4:00 PM. (Oct 26 2020, 3:33 PM)
lmata moved this task from Backlog to In progress on the observability board.

Copies of the missing blocks have been made into /root/gap_blocks on each of the prometheus PoP instances.
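As a sanity check, the time range each copied block covers can be read from its meta.json, e.g. (a rough sketch, assuming jq is available on the hosts; timestamps are milliseconds since epoch):

# print ULID, minTime and maxTime for each copied gap block
for meta in /root/gap_blocks/*/meta.json; do
  jq -r '[.ulid, .minTime, .maxTime] | @tsv' "$meta"
done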

In theory we should be able to backfill these metrics by standing up a temporary prometheus instance for each site with external_labels: replica: b, and starting a thanos sidecar for that instance with --shipper.upload-compacted.

What I'm unsure of at the moment is how best to test that before running it against production. We have the stack installed on pontoon-thanos in labs, but at the moment it seems the object backend isn't up (or I am just misunderstanding the config). Forwarding the thanos bucket web port of pontoon-thanos-01 over an ssh tunnel works, but it currently shows no blocks, with Error: Error fetching : Not Found

After some testing, I think this may be a viable approach for backfilling:

On each of the prometheus pop hosts, create a temporary 'backfill' instance on disk, along the lines of:

mkdir -p /tmp/prometheus/backfill
rsync -av --exclude metrics /srv/prometheus/ops/ /tmp/prometheus/backfill/
mkdir /tmp/prometheus/backfill/metrics
mv $gap_blocks /tmp/prometheus/backfill/metrics

Update external_labels to replica: b in /tmp/prometheus/backfill/prometheus.yml
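i.e. something along these lines in the backfill config, with the site value matching the PoP being backfilled (eqsin shown as an example; all other settings stay as copied from the ops instance):

global:
  external_labels:
    site: eqsin
    replica: b
    prometheus: ops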

Then start a 'backfill' prometheus instance:

/usr/bin/prometheus --storage.tsdb.path /tmp/prometheus/backfill/metrics --web.listen-address 127.0.0.1:9999 --web.external-url http://prometheus/backfill --storage.tsdb.retention 180d --config.file /tmp/prometheus/backfill/prometheus.yml --storage.tsdb.max-block-duration=24h --storage.tsdb.min-block-duration=2h --query.max-samples=10000000

Along with a thanos 'backfill' sidecar:

/usr/bin/thanos sidecar --http-address 0.0.0.0:19999 --grpc-address 0.0.0.0:29999 --tsdb.path /tmp/prometheus/backfill/metrics --prometheus.url http://localhost:9999/backfill --objstore.config-file /etc/thanos-sidecar@ops/objstore.yaml --min-time=-15d --shipper.ignore-unequal-block-size --shipper.upload-compacted

The sidecar comes up like this:

level=info ts=2020-12-08T18:04:55.917912524Z caller=main.go:138 msg="Tracing will be disabled"
level=info ts=2020-12-08T18:04:55.918411572Z caller=options.go:23 protocol=gRPC msg="disabled TLS, key and cert must be set to enable"
level=info ts=2020-12-08T18:04:55.919152084Z caller=factory.go:46 msg="loading bucket configuration"
level=info ts=2020-12-08T18:04:55.919854434Z caller=sidecar.go:291 msg="starting sidecar"
level=info ts=2020-12-08T18:04:55.920191273Z caller=intrumentation.go:60 msg="changing probe status" status=healthy
level=info ts=2020-12-08T18:04:55.920265099Z caller=http.go:58 service=http/server component=sidecar msg="listening for requests and metrics" address=0.0.0.0:19999
level=info ts=2020-12-08T18:04:55.921323536Z caller=reloader.go:183 component=reloader msg="nothing to be watched"
level=info ts=2020-12-08T18:04:55.921368087Z caller=intrumentation.go:48 msg="changing probe status" status=ready
level=info ts=2020-12-08T18:04:55.921415181Z caller=grpc.go:114 service=gRPC/server component=sidecar msg="listening for serving gRPC" address=0.0.0.0:29999
level=warn ts=2020-12-08T18:04:55.924060138Z caller=sidecar.go:322 msg="flag to ignore Prometheus min/max block duration flags differing is being used. If the upload of a 2h block fails and a Prometheus compaction happens that block may be missing from your Thanos bucket storage."
level=info ts=2020-12-08T18:04:55.942959627Z caller=sidecar.go:155 msg="successfully loaded prometheus external labels" external_labels="{prometheus=\"ops\", replica=\"b\", site=\"eqiad\"}"
level=info ts=2020-12-08T18:04:55.94315942Z caller=intrumentation.go:48 msg="changing probe status" status=ready
level=info ts=2020-12-08T18:04:58.011833742Z caller=shipper.go:204 msg="gathering all existing blocks from the remote bucket for check" id=01EJS604YFE38Y2Y4SZASSE22N
level=info ts=2020-12-08T18:05:10.824673168Z caller=shipper.go:333 msg="upload new block" id=01EJS604YFE38Y2Y4SZASSE22N

And as seen below, an example gap block from September (01EJS604YFE38Y2Y4SZASSE22N) was uploaded successfully:

Screen Shot 2020-12-08 at 1.18.22 PM.png (467 KB)

Of course the site label should match the pop instance being backfilled. In this case the test env runs from labs in eqiad.
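Besides the bucket web UI, another way to double-check the uploaded blocks (a hedged sketch; flags to confirm against the installed thanos version) is thanos tools bucket inspect, which lists each block together with its external labels and time range:

# verify the gap blocks show up with replica="b" and the expected site label
thanos tools bucket inspect \
  --objstore.config-file=/etc/thanos-sidecar@ops/objstore.yaml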

Before doing this in prod a sanity check would be helpful, so please let me know what you think of the approach. If it looks sound I'll move on to perform this on one of the pop instances.

Also worth thinking about is whether this is best left as a one-off, or whether a 'backfill' instance is something that would actually be useful to have pre-configured for future use.

Procedure overall LGTM! We can skip the targets directory from the backfill instance since we're not interested in scraping metrics. (Minor thing) I think if you set --storage.tsdb.max-block-duration=2h in the prometheus invocation then you can drop --shipper.ignore-unequal-block-size from the thanos sidecar, although it shouldn't matter in this case.
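In other words, something like this for the backfill prometheus invocation (only the block-duration flags change from the command above), after which --shipper.ignore-unequal-block-size can be dropped from the sidecar:

# keep min and max block duration equal so prometheus never compacts locally
--storage.tsdb.min-block-duration=2h --storage.tsdb.max-block-duration=2h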

re: backfill instance, it's probably not worth having the instance long term and/or in puppet. Something that might help in this case (and in the future) is a script to launch prometheus/thanos with minimal inputs (e.g. ports, a directory, etc.) and bring up both, although I'm not convinced it's worth it in this case since we can basically copy/paste across the three PoPs.
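If that ever seems worth it, a minimal sketch of such a helper (hypothetical script; name, ports and paths are placeholders) could be:

#!/bin/bash
# backfill-stack.sh DATADIR PROM_PORT SIDECAR_HTTP_PORT SIDECAR_GRPC_PORT
# Bring up a throwaway prometheus + thanos sidecar pair for backfilling.
set -eu
datadir="$1"; prom_port="$2"; http_port="$3"; grpc_port="$4"

/usr/bin/prometheus \
  --storage.tsdb.path "${datadir}/metrics" \
  --web.listen-address "127.0.0.1:${prom_port}" \
  --config.file "${datadir}/prometheus.yml" \
  --storage.tsdb.min-block-duration=2h \
  --storage.tsdb.max-block-duration=24h &

/usr/bin/thanos sidecar \
  --http-address "0.0.0.0:${http_port}" \
  --grpc-address "0.0.0.0:${grpc_port}" \
  --tsdb.path "${datadir}/metrics" \
  --prometheus.url "http://localhost:${prom_port}/" \
  --objstore.config-file /etc/thanos-sidecar@ops/objstore.yaml \
  --shipper.upload-compacted &

wait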

The missing cache pop metrics have been backfilled using the above method and the thanos bucket web viewer no longer shows a gap. I think we're good here!

> re: backfill instance, it's probably not worth having the instance long term and/or in puppet. Something that might help in this case (and in the future) is a script to launch prometheus/thanos with minimal inputs (e.g. ports, a directory, etc.) and bring up both, although I'm not convinced it's worth it in this case since we can basically copy/paste across the three PoPs.

That's fair, we could always refer back to the process outlined here as well.