User Details
- User Since
- Jul 23 2024, 9:16 AM (71 w, 4 d)
- Availability
- Available
- IRC Nick
- tappof
- LDAP User
- Tiziano Fogli
- MediaWiki User
- Tiziano Fogli [ Global Accounts ]
Yesterday
Thu, Dec 4
The gap is related to the revert of the patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/1184566
and is temporary: the data is present in the TSDB blocks but is not currently being served by the Thanos Querier.
The trend changed after the revert.
Just adding a note about the start and end dates of the gap.
Tue, Dec 2
Due to the issues described in T410152: Disk space saturation (/srv) on Titan hosts, reverting the patch https://gerrit.wikimedia.org/r/1184566
was necessary.
Mon, Dec 1
Another side effect:
Fri, Nov 28
To avoid a revert on Friday and to be in the driver’s seat during the weekend, 100 GB were added to the VGs on titan1001, titan1002, and titan2002.
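For reference, extending the volumes would look roughly like the following. This is a hedged sketch only: the VG/LV names (`vgdata`, `srv`) are hypothetical, as the actual LVM layout on the titan hosts is not shown here.

```shell
# Hypothetical VG/LV names; the real layout on titan1001/1002/2002 is not shown here.
# Extend the LV backing /srv by 100 GB and grow the filesystem in the same step.
sudo lvextend --resizefs --size +100G /dev/vgdata/srv

# Verify the new size afterwards.
df -h /srv
```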
Thanos ruler points to query-frontend as its Thanos querier:
/usr/bin/thanos rule ... --query http://localhost:16902 ...
/usr/bin/thanos query-frontend ... --http-address 0.0.0.0:16902 ...
Thu, Nov 27
Wed, Nov 26
Just updated the dashboard: https://grafana.wikimedia.org/goto/PzmXbiWvg?orgId=1
Fri, Nov 21
Found the issue: the rules configured in modules/profile/manifests/microsites/monitoring.pp:67 generate a regex that also matches the rules generated by modules/profile/manifests/query_service/monitor/ldf.pp.
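To illustrate the kind of overlap described above: an insufficiently anchored rule-name pattern matches names generated elsewhere. The rule names and patterns below are hypothetical, for illustration only; the actual generated regex is not shown here.

```python
import re

# Hypothetical rule names; the real names generated by the puppet manifests differ.
monitoring_rules = ["microsites_monitoring_up", "microsites_monitoring_latency"]
ldf_rules = ["query_service_ldf_monitoring_up"]

# An unanchored pattern matches any name *containing* "monitoring",
# including the LDF-generated ones.
broad = re.compile("monitoring")
assert all(broad.search(name) for name in monitoring_rules + ldf_rules)

# Anchoring the pattern restricts it to the intended rule set.
narrow = re.compile(r"^microsites_monitoring_")
matched = [n for n in monitoring_rules + ldf_rules if narrow.match(n)]
print(matched)  # only the two microsites rules
```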
Thu, Nov 20
I’d suggest reverting the patch, as the compactor is currently unable to do its job. This could lead to a thrashing situation that would be harder to recover from than the one we’re experiencing now. Once we’ve confirmed we’re no longer in troubled waters, we can investigate why backfilling metrics was difficult without such a cutoff and eventually evaluate alternatives.
I’m not entirely sure this is the root cause, but I prefer to give it a try before changing other configurations that could impact performance.
It is now affecting the compactor as well.
Tue, Nov 18
Yes, thank you.
Sure, I think we can explore any route that will fix our scenario. That said, we’re quite happy with the current Thanos/Prometheus performance, so I’d like to better understand the real needs behind having such a short cutoff window of just one day.
Mon, Nov 17
- We changed the --max-time parameter of Thanos Store from -15d to -1d.
- This effectively caused a 5x increase in the amount of data transferred from the object store.
- One compactor cycle takes roughly 2 weeks.
- Considering the first point alone, we are potentially increasing the amount of data that can reside under /srv/thanos-store.
- Every day, the compactor creates fresh blocks. Blocks are considered for downsampling only once they are older than roughly 2 days ("All raw resolution metrics that are older than 40 hours are downsampled at a 5m resolution").
- Over time, with a cutoff of -1d, Thanos Store will constantly cache the new blocks created by the compactor (compacted and/or downsampled).
- In the short term, however, the blocks already present in the store have not yet been marked as deletable by the compactor, so they are effectively still valid. The store keeps using them until they are no longer valid (i.e., removable), after which it starts requesting new blocks — but this time much more frequently: previously a block remained valid for about 2 weeks (with --max-time -15d and a compaction cycle of ~14d, blocks were replaced roughly all at once), whereas today a block remains valid for about one day before it is compacted and/or downsampled (and therefore effectively becomes a new block, with its data merged into other blocks).
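The points above can be sketched with some back-of-the-envelope arithmetic. The validity durations are taken from the estimates in the list (a ~14-day compaction cycle, daily block replacement); this is a toy model, not a measurement, and it does not reproduce the observed ~5x transfer increase exactly.

```python
# Rough model: a cached block stays usable until the compactor replaces it.
# With --max-time -15d and a ~14d compaction cycle, a block stayed valid ~14 days;
# with -1d, blocks are effectively replaced daily (compacted and/or downsampled).
validity_before_days = 14  # assumption, from the ~2-week compactor cycle above
validity_after_days = 1    # assumption: blocks replaced roughly daily

# Relative refetch frequency: how much more often the store must pull fresh
# blocks from the object store to cover the same query window.
refetch_increase = validity_before_days / validity_after_days
print(refetch_increase)  # 14.0: up to ~14x more frequent block downloads
```

The observed increase (~5x, per the list above) is lower than this upper bound, since not every query touches every block on every cycle.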
/srv was also moved to the VG on titan2002.
/srv was also moved to the VG on titan1002.
Fri, Nov 14
Thu, Nov 13
Wed, Nov 12
I was running some tests related to the spike we saw here: https://w.wiki/_mzMp .
Nov 5 2025
Please hold off on working on this task until further notice.
We may have found a way to handle the Icinga meta-monitoring reliably using the same approach as the Prometheus/Thanos meta-monitor.
Nov 3 2025
It seems that some of the eventgate pods were restarted between 16:00 and 17:00 (Just a quick check by looking at the metrics — I didn’t dig into the logs or anything else).
Oct 31 2025
Oct 30 2025
Oct 28 2025
I saw the alerts on the ALERTS metric: https://w.wiki/FqSi .
I think there was a silence rule in place, so you didn't get any notifications.
Oct 25 2025
Oct 21 2025
Oct 20 2025
I found that the certificates used by Prometheus to authenticate against Kubernetes are being renewed every hour. I believe the root cause lies in modules/profile/manifests/prometheus/k8s.pp:22, where renew_seconds is set to 365d and 23h.
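A minimal sketch of why such a renew_seconds value would cause constant renewals, assuming the certificates are issued for one year and that renewal triggers once remaining validity drops below the threshold (both assumptions; the actual lifetime and renew logic are not shown here):

```python
from datetime import timedelta

cert_lifetime = timedelta(days=365)              # assumption: certs issued for one year
renew_threshold = timedelta(days=365, hours=23)  # value reported from k8s.pp above

def needs_renewal(remaining: timedelta) -> bool:
    # Typical renew logic: renew once remaining validity is below the threshold.
    return remaining < renew_threshold

# Even a freshly issued cert has less remaining lifetime than the threshold,
# so every check (e.g. an hourly Puppet run) triggers a renewal.
print(needs_renewal(cert_lifetime))  # True
```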