I noticed disk utilization warnings for thanos-be hosts hovering at around 94%; it is time to decide on and tweak Thanos retention. At the moment we have the following:
aggregation | retention time |
---|---|
raw | 54w |
5m | 270w |
1h | 270w |
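These values map onto the Thanos compactor's retention flags; a sketch of the relevant part of the invocation (the objstore config path is illustrative, and other flags are omitted):

```
# Retention as configured above; objstore config path is illustrative.
thanos compact \
  --objstore.config-file=/etc/thanos/objstore.yml \
  --retention.resolution-raw=54w \
  --retention.resolution-5m=270w \
  --retention.resolution-1h=270w
```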
Pending actions:
- Analyze the data from thanos tools bucket inspect to get a size breakdown (per instance, per aggregation, etc.)
- Come to a consensus on what to drop for now. An easy candidate is replicated blocks (replica a/b), since most of the time we have both Prometheus replicas up and running.
- After some analysis as of 2024-01-23, the cleanable replicated blocks for our biggest instance, ops (older than 3 months), are as follows; note that 0s blocks were already cleaned up in the past, whereas 5m and 1h blocks haven't been cleaned up beyond the regular retention outlined above (a query sketch for this breakdown follows the list):
resolution | size in GB |
---|---|
0s | 1533 |
5m0s | 20856 |
1h0m0s | 2483 |
- Do capacity planning based on the retention we'd like
- As it stands we're averaging 400 GB/day of new data, plus temporary spikes of ~2 TB twice a week when the compactor runs; that space gets freed again afterwards.
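For context, the breakdown above comes from loading block metadata (e.g. from thanos tools bucket inspect) into sqlite. A sketch of the aggregation, assuming the same replicated_blocks and blocks_sizes tables used by the query at the end of this section (the labels LIKE filter is illustrative):

```sql
-- Cleanable replicated data for the ops instance, per resolution,
-- counting only blocks older than three months.
SELECT resolution,
       ROUND(SUM(size_mb / 1024)) AS size_gb
FROM replicated_blocks
JOIN blocks_sizes ON replicated_blocks.ulid = blocks_sizes.ulid
WHERE until_timestamp < datetime('now', '-3 months')
  AND labels LIKE '%prometheus=ops%'
GROUP BY resolution
ORDER BY resolution;
```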
Block cleanup strategies
In the short term there is low-hanging fruit: replicated data we can quickly delete to shed some storage load/pressure.
Currently in eqiad and codfw we have two identical Prometheus hosts per site, configured the same and doing the same work. They periodically upload their data blocks to Thanos for long-term storage. Each uploaded block is labeled with prometheus=<instance>, site=<site>, and replica=[ab], so we can identify the block's source later when reading data and de-duplicating it as needed (thanos-query does this job). Note that the replica label is intentionally abstracted from the hostname that uploaded the data, and can be (as of Jan 2024) either a or b.
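For context, these labels typically originate as Prometheus external labels, which the Thanos sidecar then attaches to every block it uploads; a minimal sketch with illustrative values:

```yaml
# prometheus.yml fragment; label values are illustrative.
global:
  external_labels:
    prometheus: ops
    site: eqiad
    replica: a
```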
So far we have kept replicated data blocks until Thanos' retention time expires. This is helpful because when data is missing from replica=a (e.g. during maintenance, reboots, etc.) it is possible to read the missing data from replica=b, and vice versa.
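This fallback works because thanos-query treats the replica label as a replica indicator and de-duplicates across it at read time; a sketch of the relevant flag (the rest of the invocation is omitted):

```
# thanos-query de-duplicates series across blocks that differ
# only by the listed replica label.
thanos query \
  --query.replica-label=replica
```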
While this strategy works well, it also means we end up with a lot of mostly-duplicated data. Thanos supports so-called vertical compaction to de-duplicate blocks and merge them into one; however, that is a more invasive change and comes with its own caveats and risks. For the scope of this task the focus will be on deleting duplicated blocks.
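For reference only (out of scope for this task): vertical compaction with replica de-duplication is enabled on the compactor along these lines, and would need its own evaluation before rollout:

```
# Reference only, not enabled by this task: penalty-based
# de-duplication across the replica label at compaction time.
thanos compact \
  --deduplication.replica-label=replica \
  --deduplication.func=penalty
```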
The strategy implemented so far is the following (a query sketch follows the list):
- Select all blocks older than three months, with site=codfw or site=eqiad and resolution 0s (i.e. raw data)
- Group blocks by their prometheus instance and start/end time
- From each group, pick the block with fewer samples (i.e. less data) and mark it for deletion
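A minimal sqlite sketch of this selection, reusing the replicated_blocks and blocks_sizes tables from the query at the end of this section; the instance, from_timestamp, and num_samples columns (and the '0s' resolution encoding) are assumptions about that schema:

```sql
-- Within each (instance, time range) group of raw replicated blocks
-- older than three months, rank replicas by sample count and pick
-- the one with fewer samples as the deletion candidate.
-- (The site filter from step 1 is elided for brevity.)
SELECT ulid
FROM (
    SELECT replicated_blocks.ulid AS ulid,
           ROW_NUMBER() OVER (
               PARTITION BY instance, from_timestamp, until_timestamp
               ORDER BY num_samples ASC
           ) AS rank_by_samples
    FROM replicated_blocks
    JOIN blocks_sizes ON replicated_blocks.ulid = blocks_sizes.ulid
    WHERE until_timestamp < datetime('now', '-3 months')
      AND resolution = '0s'
) AS ranked
WHERE rank_by_samples = 1;
```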
Cleanup size
The following table shows the reclaimable space (only groups larger than 50 GB) for duplicated blocks older than three months, grouped by resolution first and ordered by size.
resolution | size in GB | labels |
---|---|---|
0s | 51.0 | prometheus=analytics,site=eqiad |
0s | 52.7 | prometheus=k8s-staging,site=codfw |
0s | 53.6 | prometheus=k8s-mlstaging,site=codfw |
0s | 65.7 | prometheus=k8s-staging,site=eqiad |
0s | 93.0 | prometheus=ext,site=eqiad |
0s | 149.8 | prometheus=k8s-mlserve,site=codfw |
0s | 176.8 | prometheus=k8s-mlserve,site=eqiad |
0s | 212.2 | prometheus=services,site=codfw |
0s | 226.5 | prometheus=services,site=eqiad |
0s | 641.9 | prometheus=ops,site=codfw |
0s | 784.6 | prometheus=k8s,site=codfw |
0s | 815.2 | prometheus=k8s,site=eqiad |
0s | 891.4 | prometheus=ops,site=eqiad |
1h0m0s | 83.0 | prometheus=services,site=codfw |
1h0m0s | 88.1 | prometheus=services,site=eqiad |
1h0m0s | 386.1 | prometheus=k8s,site=codfw |
1h0m0s | 389.4 | prometheus=k8s,site=eqiad |
1h0m0s | 1038.4 | prometheus=ops,site=codfw |
1h0m0s | 1444.9 | prometheus=ops,site=eqiad |
5m0s | 64.8 | prometheus=k8s-mlstaging,site=codfw |
5m0s | 97.2 | prometheus=ext,site=eqiad |
5m0s | 106.8 | prometheus=k8s-staging,site=codfw |
5m0s | 138.7 | prometheus=k8s-staging,site=eqiad |
5m0s | 174.2 | prometheus=analytics,site=eqiad |
5m0s | 233.3 | prometheus=k8s-mlserve,site=codfw |
5m0s | 246.8 | prometheus=k8s-mlserve,site=eqiad |
5m0s | 700.0 | prometheus=services,site=codfw |
5m0s | 729.2 | prometheus=services,site=eqiad |
5m0s | 1728.5 | prometheus=k8s,site=codfw |
5m0s | 1754.4 | prometheus=k8s,site=eqiad |
5m0s | 8692.1 | prometheus=ops,site=codfw |
5m0s | 12164.1 | prometheus=ops,site=eqiad |
The equivalent sqlite query being:

```sql
SELECT resolution,
       ROUND(SUM(size_mb / 1024), 1) AS size_gb,
       REPLACE(REPLACE(labels, 'replica=a,', ''), 'replica=b,', '') AS instances
FROM replicated_blocks
JOIN blocks_sizes ON replicated_blocks.ulid = blocks_sizes.ulid
WHERE until_timestamp < datetime('now', '-3 months')
GROUP BY resolution, instances
HAVING size_gb > 50
ORDER BY resolution, size_gb;
```
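Once candidate blocks are identified, they can be marked for deletion with the Thanos CLI, after which the compactor deletes them once the deletion delay has passed; a sketch with an illustrative config path and a placeholder ULID:

```
# Mark one block for deletion; the compactor removes it after the
# configured deletion delay. Config path and ULID are placeholders.
thanos tools bucket mark \
  --objstore.config-file=/etc/thanos/objstore.yml \
  --marker=deletion-mark.json \
  --id=<block ulid> \
  --details='duplicated replica block cleanup'
```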