I'm investigating the alert and it looks like thanos-compact has found overlapping blocks:
Apr 15 03:14:36 thanos-fe2001 thanos-compact[2170447]: level=error ts=2023-04-15T03:14:36.844242755Z caller=compact.go:487 msg="critical error detected; halting" err="compaction: group 0@11747970091815595079: pre compaction overlap check: overlaps found while gathering blocks. [mint: 1681515791814, maxt: 1681516800000, range: 16m48s, blocks: 2]: <ulid: 01GY16T5WPXQ5T6SH9VC0DYTQ4, mint: 1681509600106, maxt: 1681516800000, range: 1h59m59s>, <ulid: 01GY1CQ4EAKRV9BQ8D9JB1VWGJ, mint: 1681515791814, maxt: 1681516800000, range: 16m48s>"
I have dumped the thanos bucket to see what's going on with these blocks:
thanos-fe2001# thanos tools bucket inspect --objstore.config-file /etc/thanos-bucket-web/objstore.yaml | tee /root/thanos-bucket
thanos-fe2001:~# grep 01GY1CQ4EAKRV9BQ8D9JB1VWGJ /root/thanos-bucket
| 01GY1CQ4EAKRV9BQ8D9JB1VWGJ | 2023-04-14T23:43:11Z | 2023-04-15T00:00:00Z | 16m48.186s | 39h43m11.814s | 344,397 | 4,239,191 | 344,397 | 1 | false | prometheus=ops,replica=unset,site=drmrs | 0s | sidecar |
thanos-fe2001:~# grep 01GY16T5WPXQ5T6SH9VC0DYTQ4 /root/thanos-bucket
| 01GY16T5WPXQ5T6SH9VC0DYTQ4 | 2023-04-14T22:00:00Z | 2023-04-15T00:00:00Z | 1h59m59.894s | 38h0m0.106s | 349,127 | 41,488,126 | 352,085 | 1 | false | prometheus=ops,replica=unset,site=drmrs | 0s | sidecar |
Note that the time ranges do indeed overlap: both blocks end at 2023-04-15T00:00:00Z, and the former starts at 23:43:11Z, which is not on a two-hour boundary. I'm not sure yet exactly how that happened, though I'm guessing it is related to the work in T309979: Upgrade Prometheus VMs in PoPs to Bullseye.
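As a sanity check on the numbers, here is a minimal Python sketch that recomputes the overlap and the distance of each block's mint from the previous two-hour boundary. The mint/maxt values are copied from the compactor error above; the helper names are made up for illustration, and the two-hour constant assumes the default Prometheus block length.

```python
from datetime import datetime, timedelta, timezone

# Assumption: blocks are cut on the default two-hour boundaries.
TWO_HOURS_MS = 2 * 60 * 60 * 1000


def to_utc(ms):
    """Convert a millisecond Unix timestamp to an aware UTC datetime."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return epoch + timedelta(milliseconds=ms)


def overlap_ms(a, b):
    """Length in ms of the intersection of two (mint, maxt) ranges, 0 if disjoint."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def offset_from_boundary_ms(ts_ms):
    """Distance in ms from the previous two-hour boundary (Unix epoch aligned)."""
    return ts_ms % TWO_HOURS_MS


# (mint, maxt) in milliseconds, copied from the thanos-compact error message.
blocks = {
    "01GY16T5WPXQ5T6SH9VC0DYTQ4": (1681509600106, 1681516800000),
    "01GY1CQ4EAKRV9BQ8D9JB1VWGJ": (1681515791814, 1681516800000),
}

if __name__ == "__main__":
    for ulid, (mint, maxt) in blocks.items():
        offset = timedelta(milliseconds=offset_from_boundary_ms(mint))
        print(ulid, to_utc(mint).isoformat(), to_utc(maxt).isoformat(), offset)
    first, second = blocks.values()
    print("overlap:", timedelta(milliseconds=overlap_ms(first, second)))
```

The overlap works out to 1,008,186 ms (16m48.186s), which matches the range reported by the compactor. The first block starts 106 ms after a boundary, while the second starts 6,191,814 ms (about 1h43m) into the window.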
Actions/followups:
- Find and nuke the non-aligned blocks (the one above and others too)
- Make sure we always specify a replica label (i.e. make puppet fail on replica=unset)
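For the first followup, something like the sketch below could list candidate blocks from the /root/thanos-bucket dump. It is only a rough starting point: it assumes the column order seen in the grep output above (ULID, FROM, UNTIL, ..., LABELS, RESOLUTION, SOURCE), groups blocks by labels and resolution as an approximation of the compactor's grouping, and reports blocks whose time range overlaps an earlier block in the same group. The function names are hypothetical, and the result should be reviewed by hand before deleting anything.

```python
import sys
from collections import defaultdict
from datetime import datetime, timezone


def parse_ts(text):
    """Parse a timestamp such as 2023-04-15T00:00:00Z into an aware datetime."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def parse_rows(lines):
    """Yield one dict per block row; header and separator lines are skipped."""
    for line in lines:
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Assumed layout: 13 columns, ULID first (ULIDs are 26 characters).
        if len(cells) < 12 or len(cells[0]) != 26:
            continue
        try:
            start, end = parse_ts(cells[1]), parse_ts(cells[2])
        except ValueError:
            continue
        yield {"ulid": cells[0], "from": start, "until": end,
               "group": (cells[10], cells[11])}  # (labels, resolution)


def find_overlaps(rows):
    """Return (earlier_ulid, later_ulid) pairs that overlap within a group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["group"]].append(row)
    pairs = []
    for members in groups.values():
        members.sort(key=lambda r: (r["from"], r["until"]))
        widest = None  # block with the latest end time seen so far
        for row in members:
            if widest is not None and row["from"] < widest["until"]:
                pairs.append((widest["ulid"], row["ulid"]))
            if widest is None or row["until"] > widest["until"]:
                widest = row
    return pairs


if __name__ == "__main__":
    # Usage (assumed): python3 find_overlaps.py /root/thanos-bucket
    with open(sys.argv[1]) as dump:
        for earlier, later in find_overlaps(parse_rows(dump)):
            print(earlier, later)
```

Note that the dump only has second resolution for FROM and UNTIL, so blocks that overlap by less than a second would not be caught here.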