
ThanosCompactHasNotRun: Thanos Compact has not uploaded anything for last 24 hours.
Open, Needs Triage, Public

Assigned To
None
Authored By
tappof
Oct 6 2025, 2:36 PM
Referenced Files
F66734474: Screenshot 2025-10-06 at 3.42.14 PM.png
Oct 6 2025, 7:55 PM
F66733841: image.png
Oct 6 2025, 2:36 PM
F66733835: image.png
Oct 6 2025, 2:36 PM
F66733822: image.png
Oct 6 2025, 2:36 PM

Description

The alert mentioned in the subject was received on 2025-10-06 at 10:11:27 (CEST).

Extract from the Thanos Compactor logs on titan2001:

Oct 05 08:10:44 titan2001 thanos-compact[4096003]: ts=2025-10-05T08:10:44.453697499Z caller=compact.go:665 level=error msg="retriable error" err="filter metas: filter blocks marked for no compaction: get file: 01H758JR8AF56JXKDRYWV8SJ1D/no-compact-mark.json: Get \"https://thanos-swift.discovery.wmnet/thanos/01H758JR8AF56JXKDRYWV8SJ1D/no-compact-mark.json\": context deadline exceeded"
Oct 05 08:10:44 titan2001 thanos-compact[4096003]: ts=2025-10-05T08:10:44.453879971Z caller=compact.go:636 level=error msg="retriable error" err="syncing metas: filter metas: filter blocks marked for no compaction: get file: 01H758JR8AF56JXKDRYWV8SJ1D/no-compact-mark.json: Get \"https://thanos-swift.discovery.wmnet/thanos/01H758JR8AF56JXKDRYWV8SJ1D/no-compact-mark.json\": context deadline exceeded"
Oct 05 08:10:44 titan2001 thanos-compact[4096003]: ts=2025-10-05T08:10:44.454132613Z caller=compact.go:570 level=error msg="retriable error" err="compaction: sync: filter metas: filter blocks marked for no compaction: get file: 01H758JR8AF56JXKDRYWV8SJ1D/no-compact-mark.json: Get \"https://thanos-swift.discovery.wmnet/thanos/01H758JR8AF56JXKDRYWV8SJ1D/no-compact-mark.json\": context deadline exceeded"
Oct 05 08:15:44 titan2001 thanos-compact[4096003]: ts=2025-10-05T08:15:44.454287315Z caller=compact.go:665 level=error msg="retriable error" err="filter metas: filter blocks marked for no downsample: context deadline exceeded"
Oct 05 08:15:44 titan2001 thanos-compact[4096003]: ts=2025-10-05T08:15:44.454305097Z caller=compact.go:636 level=error msg="retriable error" err="syncing metas: filter metas: filter blocks marked for no downsample: context deadline exceeded"
...
Oct 06 14:17:50 titan2001 thanos-compact[332360]: ts=2025-10-06T14:17:50.13876872Z caller=compact.go:1519 level=info msg="start sync of metas"
Oct 06 14:22:50 titan2001 thanos-compact[332360]: ts=2025-10-06T14:22:50.139623655Z caller=compact.go:665 level=error msg="retriable error" err="filter metas: filter blocks marked for no downsample: context deadline exceeded"
Oct 06 14:22:50 titan2001 thanos-compact[332360]: ts=2025-10-06T14:22:50.139670241Z caller=compact.go:636 level=error msg="retriable error" err="syncing metas: filter metas: filter blocks marked for no downsample: context deadline exceeded"
Oct 06 14:22:50 titan2001 thanos-compact[332360]: ts=2025-10-06T14:22:50.139705525Z caller=compact.go:570 level=error msg="retriable error" err="compaction: sync: filter metas: filter blocks marked for no downsample: context deadline exceeded"
...
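The repeated `context deadline exceeded` causes can be bucketed to see which sync stage is failing most often. A minimal sketch (not WMF tooling) that tallies the `err=` field of `thanos-compact` journal lines like the extract above; the regex accounts for the escaped quotes (`\"`) that appear inside `err="..."` values:

```python
import re
from collections import Counter

# Match the err="..." field, allowing backslash-escaped quotes inside it.
ERR_RE = re.compile(r'err="((?:[^"\\]|\\.)*)"')

def tally_errors(lines):
    """Count error lines, bucketed by the first component of the error chain."""
    counts = Counter()
    for line in lines:
        m = ERR_RE.search(line)
        if not m:
            continue
        # e.g. "filter metas: ...: context deadline exceeded" -> "filter metas"
        counts[m.group(1).split(":", 1)[0]] += 1
    return counts

sample = [
    'ts=... level=error msg="retriable error" err="filter metas: filter blocks marked for no downsample: context deadline exceeded"',
    'ts=... level=error msg="retriable error" err="syncing metas: filter metas: context deadline exceeded"',
]
print(tally_errors(sample))  # Counter({'filter metas': 1, 'syncing metas': 1})
```

Fed the full journal (`journalctl -u thanos-compact`), this would show whether the timeouts are confined to meta sync or spread across stages.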

It seems that the Thanos Compactor is unable to fetch all the block metadata from object storage before its sync timeout expires (the requests fail with "context deadline exceeded").

This dashboard https://grafana.wikimedia.org/goto/UMnUYe3Hg?orgId=1 highlights some errors starting from 2025-10-02 12:06:00.

image.png (610×1 px, 107 KB)

The duration of requests has also changed accordingly:

image.png (871×1 px, 69 KB)

Taking a look at the thanos swift dashboard https://grafana.wikimedia.org/goto/p0nvEe3NR?orgId=1, the system stats graphs show noticeable pattern variations compared to the last 30 days.

image.png (568×1 px, 163 KB)

Is it possible that some other accounts are currently putting heavy load on the cluster?

Event Timeline

@elukey would you be able to rule out whether this is related to tegola? I see a sharp rise in thanos swift-proxy utilization on Oct 2 that seems to correlate with IRC discussion about tegola maintenance, and today I'm seeing a lot of errors in the swift-proxy logs following the pattern below, alongside thanos traffic.

Screenshot 2025-10-06 at 3.42.14 PM.png (2×3 px, 603 KB)

Oct  6 19:30:02 thanos-fe1005 proxy-server: ERROR with Object server 10.64.156.15:6020/objects18 re: Trying to GET /v1/AUTH_tegola/tegola-swift-eqiad-v002/tegola-cache/osm/10/178/359: ConnectionTimeout (0.5s) (txn: ) (client_ip: 10.67.154.196)
Oct  6 19:30:12 thanos-fe1005 proxy-server: ERROR with Object server 10.64.158.5:6026/objects23 re: Trying to GET /v1/AUTH_tegola/tegola-swift-eqiad-v002/tegola-cache/osm/13/4796/2378: ConnectionTimeout (0.5s) (txn: ) (client_ip: 10.67.131.67)
Oct  6 19:31:53 thanos-fe1005 proxy-server: ERROR with Object server 10.64.164.15:6017/objects15 re: Trying to GET /v1/AUTH_tegola/tegola-swift-eqiad-v002/tegola-cache/osm/11/1023/683: ConnectionTimeout (0.5s) (txn: ) (client_ip: 10.67.131.67)
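To quantify which Swift account is behind the timeouts, the proxy-server errors can be tallied per account. A minimal sketch (not WMF tooling), assuming the log format shown above:

```python
import re
from collections import Counter

# Extract the object server and the Swift account (AUTH_*) from proxy errors.
PAT = re.compile(r"ERROR with Object server (\S+) re: Trying to \w+ /v1/(AUTH_\w+)/")

def timeouts_by_account(lines):
    """Count ConnectionTimeout proxy-server errors per Swift account."""
    counts = Counter()
    for line in lines:
        m = PAT.search(line)
        if m and "ConnectionTimeout" in line:
            counts[m.group(2)] += 1
    return counts

sample = [
    "Oct  6 19:30:02 thanos-fe1005 proxy-server: ERROR with Object server "
    "10.64.156.15:6020/objects18 re: Trying to GET "
    "/v1/AUTH_tegola/tegola-swift-eqiad-v002/tegola-cache/osm/10/178/359: "
    "ConnectionTimeout (0.5s) (txn: ) (client_ip: 10.67.154.196)",
]
print(timeouts_by_account(sample))  # Counter({'AUTH_tegola': 1})
```

Run over a day of logs, a heavy skew toward one account (here `AUTH_tegola`) would confirm the correlation.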

@herron it was definitely Tegola, I was doing a cache refresh before the codfw cluster is repooled (it is a one-off that we do in these situations), which meant the re-creation of 90M tiles :(

I see that the metrics are better now, but we are going to repool the codfw cluster soon (so it will serve live traffic etc..). Lemme know if it is a concern, and/or if the metrics are good now. We can probably try to be more gentle with the cache refresh, it is a k8s cron that takes a long time to run but that can be parallelized easily (we run it from multiple pods).
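For reference, a Kubernetes CronJob's fan-out is throttled through the Job template's `parallelism` field, so halving it halves the number of pods hitting swift concurrently. This is only an illustrative sketch — the name, schedule, image, and values below are assumptions, not the actual tegola chart:

```yaml
# Hypothetical CronJob fragment; the real tegola chart structure and values
# are not shown in this task.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: tegola-cache-refresh    # assumed name
spec:
  schedule: "0 3 * * *"         # assumed schedule
  jobTemplate:
    spec:
      parallelism: 2            # pods running at once; lower = gentler on swift
      completions: 4            # total pods to run for one refresh (assumed)
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: cache-refresh
              image: tegola:latest   # assumed image
```

The trade-off is linear: fewer concurrent pods means a proportionally longer refresh.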


Thanks! That makes sense. Yes, if there isn't a major downside to the re-creation taking longer, I'd be in favor of reducing concurrency. The load towards swift-proxy still looks quite a bit higher than normal today (looking at https://grafana.wikimedia.org/goto/IJA35m6Hg?orgId=1).

Change #1194612 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] services: reduce tegola's cronjob parallelism

https://gerrit.wikimedia.org/r/1194612

Change #1194612 merged by Elukey:

[operations/deployment-charts@master] services: reduce tegola's cronjob parallelism

https://gerrit.wikimedia.org/r/1194612

Cut the parallelism in half; let's see if things improve.

I am trying to recap the problem, since in the above dashboards I see a mixture of eqiad and codfw. @herron, was the problem present in both?

I am saying that because we are operating on the codfw cluster at the moment, not the eqiad one (that still runs the old software stack etc..). The only big change that we did was on Oct 1st, when the k8s cluster upgrade happened: we depooled eqiad and pooled in codfw, but we restored the old status at the end of the day (see https://sal.toolforge.org/production?p=0&q=kartotherian&d=).


Thanks! That makes sense. Yes, if there isn't a major downside to the re-creation taking longer, I'd be in favor of reducing concurrency. The load towards swift-proxy still looks quite a bit higher than normal today (looking at https://grafana.wikimedia.org/goto/IJA35m6Hg?orgId=1).

It doesn't seem different from earlier (and this is eqiad): https://grafana.wikimedia.org/d/lxIVOKq4k/units-resource-usage-overview?orgId=1&from=now-30d&to=now&timezone=utc&var-site=eqiad&var-cluster=thanos

It looks like something got depooled on Sep 23rd and repooled on Oct 2nd, which doesn't match Tegola's changes.

@herron forgot to mention that I am currently warming up the maps eqiad tegola cache, with reduced workers, lemme know how Thanos is doing!