
Audit stored Thanos data
Closed, Resolved, Public

Description

While investigating Thanos space usage in T336234, I noticed that although retention for raw data is set to 54w, storage still holds data older than that. For example, this block from Aug 2021:

| 01FDE9NEEEBRHB306ZG2Z991TJ | 2021-08-05T00:00:00Z | 2021-08-19T00:00:00Z | 336h0m0s       | -296h0m0s       | 6,200,883  | 96,715,299,071  | 824,376,596   | 4          | false       | prometheus=ops,replica=b,site=codfw           | 0s         | compactor |
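As a rough cross-check of the claim above, the sketch below parses the first cells of that row and compares the block's end time against the 54w raw retention. It assumes the second and third cells are the block's start and end times (consistent with the 336h range shown), and that a block counts as expired once its end time plus the retention period lies in the past. The `as_of` date is an arbitrary day in 2023 chosen for illustration, and the function names are made up for this example.

```python
from datetime import datetime, timedelta, timezone

# Raw-data retention quoted in the description (54 weeks).
RETENTION_RAW = timedelta(weeks=54)

# First cells of the row quoted above; the remaining cells are not needed here.
ROW = "| 01FDE9NEEEBRHB306ZG2Z991TJ | 2021-08-05T00:00:00Z | 2021-08-19T00:00:00Z | 336h0m0s |"


def parse_block_row(row):
    """Return (ulid, start, end) from a pipe-separated block listing row.

    Assumption: cell 1 is the block ID, cells 2 and 3 are UTC start/end times.
    """
    cells = [c.strip() for c in row.strip().strip("|").split("|")]
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    start = datetime.strptime(cells[1], fmt).replace(tzinfo=timezone.utc)
    end = datetime.strptime(cells[2], fmt).replace(tzinfo=timezone.utc)
    return cells[0], start, end


def past_retention(end, now, retention=RETENTION_RAW):
    """True if the block's end time is older than `now - retention`."""
    return now > end + retention


ulid, start, end = parse_block_row(ROW)
as_of = datetime(2023, 4, 10, tzinfo=timezone.utc)  # illustrative date only
print(ulid, "kept until", (end + RETENTION_RAW).date(), "expired:", past_retention(end, as_of, RETENTION_RAW))
```

Under these assumptions the block should have aged out in early September 2022, well before this task was filed.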

I also noticed that blocks from about Sept 2022 onwards are failing to get downsampled because of an "unexpected EOF" error while downloading them, for example:

Apr 10 18:09:56 thanos-fe2001 thanos-compact[1135931]: level=error ts=2023-04-10T18:09:56.703155229Z caller=main.go:161 err="downsampling to 5 min: download block 01GT7S08FT1702C2AA67VD85ER: copy object to file: unexpected EOF\nfirst pass of downsampling failed\nmain.runCompact.func7\n\t/build/thanos-0.30.1/cmd/thanos/compact.go:441\nmain.runCompact.func8.1\n\t/build/thanos-0.30.1/cmd/thanos/compact.go:477\ngithub.com/thanos-io/thanos/pkg/runutil.Repeat\n\t/build/thanos-0.30.1/pkg/runutil/runutil.go:74\nmain.runCompact.func8\n\t/build/thanos-0.30.1/cmd/thanos/compact.go:476\ngithub.com/oklog/run.(*Group).Run.func1\n\t/tmp/thanos-build/go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:38\nruntime.goexit\n\t/usr/lib/go-1.19/src/runtime/asm_amd64.s:1594\nerror executing compaction\nmain.runCompact.func8.1\n\t/build/thanos-0.30.1/cmd/thanos/compact.go:504\ngithub.com/thanos-io/thanos/pkg/runutil.Repeat\n\t/build/thanos-0.30.1/pkg/runutil/runutil.go:74\nmain.runCompact.func8\n\t/build/thanos-0.30.1/cmd/thanos/compact.go:476\ngithub.com/oklog/run.(*Group).Run.func1\n\t/tmp/thanos-build/go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:38\nruntime.goexit\n\t/usr/lib/go-1.19/src/runtime/asm_amd64.s:1594\ncompact command failed\nmain.main\n\t/build/thanos-0.30.1/cmd/thanos/main.go:161\nruntime.main\n\t/usr/lib/go-1.19/src/runtime/proc.go:250\nruntime.goexit\n\t/usr/lib/go-1.19/src/runtime/asm_amd64.s:1594"

Event Timeline

To force a run of the retention policy, I'm planning to do the following on thanos-fe2001:

  1. Stop thanos-compact and disable puppet
  2. Run thanos tools bucket retention --objstore.config-file /etc/thanos-store/objstore.yaml --retention.resolution-raw=54w
  3. Verify that old blocks (such as 01FDE9NEEEBRHB306ZG2Z991TJ) are marked for deletion (the compactor will actually delete them 48h later)
  4. Enable and run puppet, which will also start thanos-compact
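For step 3, a small sketch of how the 48h wait could be worked out from a block's deletion marker. This assumes the marker is a JSON object with `id` and `deletion_time` (Unix seconds) fields, which is my understanding of the `deletion-mark.json` file Thanos writes into a marked block's directory; check it against the real file before relying on it. The sample marker is hypothetical, with its timestamp borrowed from the SAL entry for illustration, and fetching the file from the object store is left out.

```python
import json
from datetime import datetime, timedelta, timezone

# Delay between marking and actual deletion, as stated in step 3.
DELETE_DELAY = timedelta(hours=48)


def eligible_for_deletion_at(mark_json, delay=DELETE_DELAY):
    """Return (block_id, earliest_deletion_time) for a deletion marker.

    Assumption: the marker has "id" and "deletion_time" (Unix seconds) fields.
    """
    mark = json.loads(mark_json)
    marked_at = datetime.fromtimestamp(mark["deletion_time"], tz=timezone.utc)
    return mark["id"], marked_at + delay


# Hypothetical marker contents, not copied from the real bucket.
marked = datetime(2023, 5, 29, 8, 45, 36, tzinfo=timezone.utc)
sample = json.dumps({
    "id": "01FDE9NEEEBRHB306ZG2Z991TJ",
    "deletion_time": int(marked.timestamp()),
    "version": 1,
})

block_id, when = eligible_for_deletion_at(sample)
print(block_id, "can be deleted from", when.isoformat())
```

With a marker written on the morning of May 29, the compactor would not remove the block before the morning of May 31, which matches checking back in two days.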

Mentioned in SAL (#wikimedia-operations) [2023-05-29T08:45:36Z] <godog> delete old raw blocks from thanos - T337236

fgiunchedi changed the task status from Open to Stalled. May 29 2023, 9:00 AM

Blocks have been marked for deletion; will check back in two days.

fgiunchedi claimed this task.

This is done: fs utilization is now at 72% and the old raw-data blocks have been removed.