ThanosCompactHalted error on overlapping blocks
Closed, ResolvedPublic

Description

I'm investigating the alert and it looks like thanos-compact has found overlapping blocks:

Apr 15 03:14:36 thanos-fe2001 thanos-compact[2170447]: level=error ts=2023-04-15T03:14:36.844242755Z caller=compact.go:487 msg="critical error detected; halting" err="compaction: group 0@11747970091815595079: pre compaction overlap check: overlaps found while gathering blocks. [mint: 1681515791814, maxt: 1681516800000, range: 16m48s, blocks: 2]: <ulid: 01GY16T5WPXQ5T6SH9VC0DYTQ4, mint: 1681509600106, maxt: 1681516800000, range: 1h59m59s>, <ulid: 01GY1CQ4EAKRV9BQ8D9JB1VWGJ, mint: 1681515791814, maxt: 1681516800000, range: 16m48s>"

I have dumped the thanos bucket to see what's going on with these blocks:

thanos-fe2001# thanos tools bucket inspect --objstore.config-file /etc/thanos-bucket-web/objstore.yaml | tee /root/thanos-bucket
thanos-fe2001:~# grep 01GY1CQ4EAKRV9BQ8D9JB1VWGJ /root/thanos-bucket
| 01GY1CQ4EAKRV9BQ8D9JB1VWGJ | 2023-04-14T23:43:11Z | 2023-04-15T00:00:00Z | 16m48.186s     | 39h43m11.814s   | 344,397    | 4,239,191       | 344,397       | 1          | false       | prometheus=ops,replica=unset,site=drmrs       | 0s         | sidecar   |
thanos-fe2001:~# grep 01GY16T5WPXQ5T6SH9VC0DYTQ4 /root/thanos-bucket
| 01GY16T5WPXQ5T6SH9VC0DYTQ4 | 2023-04-14T22:00:00Z | 2023-04-15T00:00:00Z | 1h59m59.894s   | 38h0m0.106s     | 349,127    | 41,488,126      | 352,085       | 1          | false       | prometheus=ops,replica=unset,site=drmrs       | 0s         | sidecar   |

Note that the time ranges do indeed overlap, and that the former block does not start on a two-hour boundary. I'm not sure yet exactly how that happened, though I'm guessing it is related to the work in T309979: Upgrade Prometheus VMs in PoPs to Bullseye.
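
The overlap and the misalignment can be double-checked from the mint/maxt values in the error message alone. What follows is a minimal Python sketch, not part of the Thanos tooling: the helper names and the one-second tolerance are assumptions made for illustration, and the two-hour range is taken from the boundary mentioned above.

```python
from datetime import datetime, timedelta, timezone

TWO_HOURS_MS = 2 * 60 * 60 * 1000

# mint/maxt in milliseconds since the Unix epoch, copied from the error message.
BLOCKS = {
    "01GY16T5WPXQ5T6SH9VC0DYTQ4": (1681509600106, 1681516800000),
    "01GY1CQ4EAKRV9BQ8D9JB1VWGJ": (1681515791814, 1681516800000),
}


def to_utc(ms):
    """Convert a millisecond Unix timestamp to an ISO 8601 UTC string."""
    # Integer arithmetic avoids float rounding on the millisecond part.
    moment = datetime.fromtimestamp(ms // 1000, tz=timezone.utc)
    moment += timedelta(milliseconds=ms % 1000)
    return moment.isoformat(timespec="milliseconds")


def overlap_ms(a, b):
    """Length in ms of the intersection of two (mint, maxt) ranges, 0 if disjoint."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def is_aligned(mint, tolerance_ms=1000):
    """True if mint falls within tolerance_ms after a two-hour boundary (assumed tolerance)."""
    return mint % TWO_HOURS_MS <= tolerance_ms


if __name__ == "__main__":
    for ulid, (mint, maxt) in BLOCKS.items():
        print(ulid, to_utc(mint), to_utc(maxt), "aligned" if is_aligned(mint) else "NOT aligned")
    first, second = BLOCKS.values()
    print("overlap (ms):", overlap_ms(first, second))
```

The overlap computed this way is the whole range of the shorter block, which matches the 16m48s range reported in the error.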

Actions/followups:

  • Find and nuke the non-aligned blocks (the one above and others too)
  • Make sure we always specify a replica label (i.e. make puppet fail on replica=unset)
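
One possible way to approach the first followup is to scan the saved `thanos tools bucket inspect` output for rows that start off the two-hour boundary or carry replica=unset. The Python sketch below assumes the table layout matches the two rows shown above (ULID and start time in the first two columns, labels in the eleventh); the function name and the one-second tolerance are hypothetical.

```python
from datetime import datetime, timezone

TWO_HOURS_S = 2 * 60 * 60

# The two rows from the inspect output quoted in the description.
SAMPLE = [
    "| 01GY1CQ4EAKRV9BQ8D9JB1VWGJ | 2023-04-14T23:43:11Z | 2023-04-15T00:00:00Z | 16m48.186s     | 39h43m11.814s   | 344,397    | 4,239,191       | 344,397       | 1          | false       | prometheus=ops,replica=unset,site=drmrs       | 0s         | sidecar   |",
    "| 01GY16T5WPXQ5T6SH9VC0DYTQ4 | 2023-04-14T22:00:00Z | 2023-04-15T00:00:00Z | 1h59m59.894s   | 38h0m0.106s     | 349,127    | 41,488,126      | 352,085       | 1          | false       | prometheus=ops,replica=unset,site=drmrs       | 0s         | sidecar   |",
]


def flag_blocks(lines, tolerance_s=1):
    """Yield (ulid, unaligned, replica_unset) for rows where either flag is set."""
    for line in lines:
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) < 11:
            continue  # table border or blank line, not a data row
        ulid, start, labels = cells[0], cells[1], cells[10]
        try:
            parsed = datetime.strptime(start, "%Y-%m-%dT%H:%M:%SZ")
        except ValueError:
            continue  # header row or a format this sketch does not expect
        epoch_s = int(parsed.replace(tzinfo=timezone.utc).timestamp())
        unaligned = epoch_s % TWO_HOURS_S > tolerance_s
        replica_unset = "replica=unset" in labels.split(",")
        if unaligned or replica_unset:
            yield ulid, unaligned, replica_unset


if __name__ == "__main__":
    # In practice the lines would come from the saved file, e.g. open("/root/thanos-bucket").
    for ulid, unaligned, replica_unset in flag_blocks(SAMPLE):
        print(ulid, "unaligned" if unaligned else "aligned", "replica=unset" if replica_unset else "")
```

Reporting the two flags separately is deliberate: a block with replica=unset is not necessarily a stray one, so the list is meant for review rather than as direct input to a deletion.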

Event Timeline

Change 912381 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] prometheus: Add label to prometheus3002 data blocks to prevent data duplication

https://gerrit.wikimedia.org/r/912381

Change 912383 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] prometheus: Add label to prometheus4002 data blocks to prevent data duplication

https://gerrit.wikimedia.org/r/912383

Change 912385 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] prometheus: Add label to prometheus5002 data blocks to prevent data duplication

https://gerrit.wikimedia.org/r/912385

Change 912407 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] prometheus: Add label to prometheus6001 data blocks to prevent data duplication

https://gerrit.wikimedia.org/r/912407

Change 912409 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] prometheus: Add label to prometheus6002 data blocks to prevent data duplication

https://gerrit.wikimedia.org/r/912409

Mentioned in SAL (#wikimedia-operations) [2023-04-27T09:00:23Z] <godog> delete overlapping block 01GY1CQ4EAKRV9BQ8D9JB1VWGJ from thanos - T335406

I've deleted one of the offending blocks with the following:

# mark the block for deletion
thanos-fe2001:~# thanos tools bucket mark --id 01GY1CQ4EAKRV9BQ8D9JB1VWGJ --details T335406 --marker deletion-mark.json --objstore.config-file /etc/thanos-bucket-web/objstore.yaml
# force an immediate deletion of all blocks marked for deletion
thanos-fe2001:~# thanos tools bucket cleanup --delete-delay=0s --objstore.config-file /etc/thanos-bucket-web/objstore.yaml
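
When several blocks need the same treatment, the mark command can be generated per ULID instead of typed by hand. A small Python sketch that only builds the command lines for review and runs nothing; it reuses exactly the flags from the invocation above, and the function name is hypothetical.

```python
import shlex

OBJSTORE_CONFIG = "/etc/thanos-bucket-web/objstore.yaml"


def mark_commands(ulids, details, config=OBJSTORE_CONFIG):
    """Return one 'thanos tools bucket mark' command line per block ULID."""
    return [
        shlex.join([
            "thanos", "tools", "bucket", "mark",
            "--id", ulid,
            "--details", details,
            "--marker", "deletion-mark.json",
            "--objstore.config-file", config,
        ])
        for ulid in ulids
    ]


if __name__ == "__main__":
    for command in mark_commands(["01GY1CQ4EAKRV9BQ8D9JB1VWGJ"], "T335406"):
        print(command)
```

Printing rather than executing keeps a review step between selecting the blocks and marking them.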

Mentioned in SAL (#wikimedia-operations) [2023-04-27T09:09:04Z] <godog> restart thanos-compact on thanos-fe2001 - T335406

Sadly thanos compact found more overlapping blocks:

Apr 27 09:29:27 thanos-fe2001 thanos-compact[852784]: level=error ts=2023-04-27T09:29:27.982548445Z caller=compact.go:487 msg="critical error detected; halting" err="compaction: group 0@11747970091815595079: pre compaction overlap check: overlaps found while gathering blocks. [mint: 1682546400057, maxt: 1682553600000, range: 1h59m59s, blocks: 2]: <ulid: 01GZ03JSRERTESRWYNJAEKAZ8Y, mint: 1682546400057, maxt: 1682553600000, range: 1h59m59s>, <ulid: 01GZ03JSQVG6VKAK3D91Y1ZH86, mint: 1682546400057, maxt: 1682553600000, range: 1h59m59s>\n[mint: 1681653600090, maxt: 1681660800000, range: 1h59m59s, blocks: 2]: <ulid: 01GY5G4PREZ1W9JBR2SPQAAZGP, mint: 1681653600015, maxt: 1681660800000, range: 1h59m59s>, <ulid: 01GY5G4PQX14KV87ZSMJA4J64K,
...

I'm going to remove all replica=unset blocks from the new instances

Mentioned in SAL (#wikimedia-operations) [2023-04-27T09:39:56Z] <godog> delete all 2023 replica=unset blocks from thanos - T335406

This seems to have done it: the compactor is running and hasn't had problems yet. I'll leave the task open and revisit next week.

Yesterday I found that prometheus6001 didn't have a replica label added.
I think the best approach would be to either add the label to the data in Thanos or to reindex from Prometheus using a label.

Great find @andrea.denisse! I didn't realize that was the case, and too eagerly deleted the replica=unset data for April 2023, which unfortunately means the actual drmrs data for April also got deleted from Thanos :| This is not an issue for now because the data is still in Prometheus drmrs and Thanos queries Prometheus as well as its own bucket, so no gap is visible. Once the data falls off Prometheus drmrs retention (~three months), there will be a gap in Thanos metrics too.

Change 912407 merged by Andrea Denisse:

[operations/puppet@production] prometheus: Add label to prometheus6001 data blocks to prevent data duplication

https://gerrit.wikimedia.org/r/912407

Change 912381 merged by Andrea Denisse:

[operations/puppet@production] prometheus: Add label to prometheus3002 data blocks to prevent data duplication

https://gerrit.wikimedia.org/r/912381

Change 912383 merged by Andrea Denisse:

[operations/puppet@production] prometheus: Add label to prometheus4002 data blocks to prevent data duplication

https://gerrit.wikimedia.org/r/912383

Change 912385 merged by Andrea Denisse:

[operations/puppet@production] prometheus: Add label to prometheus5002 data blocks to prevent data duplication

https://gerrit.wikimedia.org/r/912385

Change 912409 merged by Andrea Denisse:

[operations/puppet@production] prometheus: Add label to prometheus6002 data blocks to prevent data duplication

https://gerrit.wikimedia.org/r/912409