
Disk space saturation (/srv) on Titan hosts
Open, Needs Triage, Public

Assigned To
None
Authored By
tappof
Fri, Nov 14, 4:21 PM

Description

We received alerts related to /srv on hosts titan1001, titan1002, and titan2002 (titan2001 has a larger RAID).

Disk usage started increasing around October 3rd.

image.png (797×2 px, 75 KB)

A couple of weeks earlier, we had also noticed an increase in the bytes transferred from Swift, most likely due to: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1184566

image.png (762×2 px, 94 KB)

https://w.wiki/G5qB

We tried deleting older files on /srv/thanos-store (after depooling host titan1001 and stopping thanos-store), but the cache was immediately repopulated to exactly the same size as before, in just a few minutes…

image.png (787×2 px, 66 KB)
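The pruning attempt above can be sketched roughly as follows. This is a hedged reconstruction, not the exact commands used: the directory layout, the `prune_old_blocks` helper name, and the 15-day threshold are all assumptions. It was run with titan1001 depooled and thanos-store stopped, and the cache refilled within minutes regardless.

```shell
# Hypothetical sketch of the cleanup step (host depooled, thanos-store
# stopped beforehand). Thanos block directories live as top-level dirs
# under the data dir.
prune_old_blocks() {
    # Delete top-level block directories under $1 not modified in the
    # last $2 days.
    find "$1" -mindepth 1 -maxdepth 1 -type d -mtime +"$2" -exec rm -rf {} +
}

# e.g.: prune_old_blocks /srv/thanos-store 15
```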

Moreover, we noticed some inconsistencies in the storage configuration between the titan hosts:

  • titan1001 and titan1002 have spare disks already present that are missing on titan2002;
  • titan2001 is striping multiple partitions on the same disk.

Event Timeline

I've set up the spare disks already present in titan1001 as an 800G LVM volume to host /srv. I just kicked off an initial sync, and after that's complete I'll depool titan1001 to stop services for a final sync and remount. After that we can add the backing devices of the previous /srv filesystem (/dev/md2) into this LVM volume group as well and effectively double our capacity.
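As a non-runnable runbook sketch, the migration looks roughly like this. Device names (/dev/sdc, /dev/sdd) and the ext4 filesystem are assumptions; only the VG/LV names and the 800G size come from this task:

```shell
# Rough sketch of the /srv migration to LVM (do not run as-is; device
# names assumed). Spare disks become a new VG/LV, data is synced twice,
# then /srv is remounted from the LV.
pvcreate /dev/sdc /dev/sdd            # spare disks (assumed names)
vgcreate vg0 /dev/sdc /dev/sdd
lvcreate -L 800G -n srv vg0
mkfs.ext4 /dev/vg0/srv                # filesystem type assumed
mount /dev/vg0/srv /mnt
rsync -aHAX /srv/ /mnt/               # initial sync, services still running
# ...depool host, stop services...
rsync -aHAX --delete /srv/ /mnt/      # final sync
umount /mnt && umount /srv
mount /dev/vg0/srv /srv               # plus the matching fstab update
```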

Mentioned in SAL (#wikimedia-operations) [2025-11-14T18:35:29Z] <herron> titan1001: switch /srv mount from /dev/md2 to /dev/vg0/srv T410152

This has been done. titan1001's /srv is now at just under 70% used, with a healthy reserve in the LVM VG for a rainy day.

titan1001:~# df -h /srv
Filesystem           Size  Used Avail Use% Mounted on
/dev/mapper/vg0-srv 1007G  682G  326G  68% /srv

titan1001:~# vgs
  VG  #PV #LV #SN Attr   VSize  VFree
  vg0   4   1   0 wz--n- <1.60t <613.04g

Mentioned in SAL (#wikimedia-operations) [2025-11-17T08:32:27Z] <tappof> titan1002: switch /srv mount from /dev/md2 to /dev/vg0/srv T410152

/srv was also moved to the VG on titan1002.

Mentioned in SAL (#wikimedia-operations) [2025-11-17T13:53:10Z] <tappof> titan2002: switch /srv mount from /dev/md2 to /dev/vg0/srv T410152

/srv was also moved to the VG on titan2002.

Just a heads-up: on hosts titan1002 and titan2002, the disks previously associated with the software RAID0 have not yet been added to the new VG. With the current settings, we’ve gained roughly 10 days to further explore the issue.
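Folding the old RAID devices into the VG later would look roughly like the sketch below. The partition names are assumptions; only /dev/md2 and vg0 appear in this task:

```shell
# Sketch of reclaiming the former /srv software RAID for the VG
# (partition names assumed; do not run as-is).
mdadm --stop /dev/md2                         # the old /srv array
mdadm --zero-superblock /dev/sda3 /dev/sdb3   # wipe RAID metadata
pvcreate /dev/sda3 /dev/sdb3
vgextend vg0 /dev/sda3 /dev/sdb3
# the LV can then be grown later with lvextend + an online fs resize
```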

How changing the cutoff has affected disk utilization on /srv/thanos-store (tentative analysis)

  1. We changed the --max-time parameter of Thanos Store from -15d to -1d.
  2. This effectively caused a 5x increase in the amount of data transferred from the object store.
  3. One compactor cycle takes roughly 2 weeks.
  4. Just considering point 1, we are potentially increasing the amount of data that can reside under /srv/thanos-store.
  5. Every day, the compactor creates new fresh blocks. Blocks are considered for downsampling only when they are older than 2 days ("All raw resolution metrics that are older than 40 hours are downsampled at a 5m resolution").
  6. Over time, with a cutoff of -1d, Thanos Store will constantly cache the new blocks created by the compactor (compacted and/or downsampled).
  7. In the short term, however, the blocks already present in the store had not yet been marked deletable by the compactor, so they were effectively still valid. The store keeps serving them until they are no longer valid (i.e., removable), after which it starts requesting new blocks — but this time much more frequently. Previously, with --max-time -15d and a compaction cycle of ~14d, a block remained valid for about 2 weeks, so blocks were replaced almost simultaneously. Today a block remains valid for about one day, because it is then compacted and/or downsampled (and therefore effectively becomes a new block, with its data merged with other blocks).
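For reference, the cutoff discussed above maps onto the store gateway's time-range flag. A minimal sketch of the relevant invocation (paths and the objstore config location are assumptions; the flag values are from this task):

```shell
# Relevant thanos store flags; only --max-time changed in the patch.
# --max-time accepts a relative duration, interpreted against now.
thanos store \
  --data-dir=/srv/thanos-store \
  --objstore.config-file=/etc/thanos/objstore.yaml \
  --max-time=-1d    # patched cutoff; previously -15d (since reverted)
```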

I’d like to revert the patch to see if restoring the old cutoff value will reverse the trend in disk utilization. If I’m correct, it will take some time to return to the old pattern.

We also need to align the storage configurations between hosts, taking into account T396862: Improve titan hosts stateless-ness.

Thanks for the summary. Before we pull the trigger on the revert, could we try a couple of alternatives?

a) Try setting the download strategy to lazy, and pair this with a frequent cleanup job on the data-dir. In theory this will prevent the data-dir from immediately refilling after pruning old/large blocks, keeping only actively used index headers on disk.

--store.index-header-lazy-download-strategy=eager
                                 Strategy of how to download index headers lazily. Supported values: eager, lazy. If eager, always download index header during initial load. If lazy, download index header during query time.
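A sketch of what option (a) could look like. The lazy-download flag exists in Thanos (quoted above), but the companion cleanup job is entirely hypothetical — its path, schedule wiring, and age threshold are assumptions:

```shell
# Option (a) sketch: lazy index-header downloads plus a periodic cleanup.
thanos store \
  --store.index-header-lazy-download-strategy=lazy \
  --data-dir=/srv/thanos-store

# Hypothetical companion cleanup (e.g. from a systemd timer): drop index
# headers not accessed recently, so only actively used ones stay on disk.
# Assumes atime is usable on the filesystem hosting the data-dir.
find /srv/thanos-store -name 'index-header' -atime +1 -delete
```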

b) Test the impact of disabling the index-header cache and relying solely on memcached. This may be too slow, but if performance is OK it would remove the on-disk header cache requirement entirely.

--cache-index-header       Cache TSDB index-headers on disk to reduce startup time. When set to true, Thanos Store will download index headers from remote object storage on startup and create a header file on disk. Use --data-dir to set the directory in which index headers will be downloaded.

Option (b) should be fairly straightforward to test: we could set the header cache to false on a single host while watching overall Thanos performance, and depool + revert at the first sign of trouble.

Sure, I think we can explore any route that will fix our scenario. That said, we’re quite happy with the current Thanos/Prometheus performance, so I’d like to better understand the real needs behind having such a short cutoff window of just one day.

I was looking at the documentation and noticed that for backfilling metrics we request data from the Thanos Querier. Isn’t it able to request data from the sidecar as well when the query comes from promtool?

It is now affecting the compactor as well.

11:23:34 jinxer-wm │ FIRING: DiskSpace: Disk space titan2001:9100:/srv 0.3483% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=titan2001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
11:38:34 jinxer-wm │ RESOLVED: DiskSpace: Disk space titan2001:9100:/srv 0.004537% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=titan2001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
12:25:12 jinxer-wm │ FIRING: ThanosCompactHalted: Thanos Compact has failed to run and is now halted. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactHalted

image.png (1×927 px, 113 KB)

I’d suggest reverting the patch, as the compactor is currently unable to do its job. This could lead to a thrashing situation that would be harder to recover from than the one we’re experiencing now. Once we’ve confirmed we’re no longer in troubled waters, we can investigate why backfilling metrics was difficult without such a cutoff and, if needed, evaluate alternatives.
I’m not entirely sure this is the root cause, but I prefer to give it a try before changing other configurations that could impact performance.

Mentioned in SAL (#wikimedia-operations) [2025-11-24T14:22:19Z] <tappof> Remove unused md2 and add its devices to vg0 on titan1002 T410152

Mentioned in SAL (#wikimedia-operations) [2025-11-24T14:42:31Z] <tappof> Remove unused md2 and add its devices to vg0 on titan2002 T410152

To avoid a revert on Friday and to be in the driver’s seat during the weekend, 100 GB were added to the VGs on titan1001, titan1002, and titan2002.

image.png (328×1 px, 53 KB)
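The 100 GB extension on each host amounts to a two-step grow. A sketch, assuming ext4 on the LV (with XFS it would be xfs_growfs instead of resize2fs):

```shell
# Sketch of the per-host 100 GB extension (filesystem type assumed).
lvextend -L +100G /dev/vg0/srv
resize2fs /dev/mapper/vg0-srv     # grow the filesystem online
# (equivalently, lvextend -r does both steps in one command)
```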

Mentioned in SAL (#wikimedia-operations) [2025-11-28T16:17:01Z] <tappof> Added 100 GB to /srv LV on titan1001/1002/2002 (T410152)

Mentioned in SAL (#wikimedia-operations) [2025-12-01T15:56:40Z] <tappof> "thanos-store: set cutoff days to 1" reverted on titan1001 (1/4) T410152

Mentioned in SAL (#wikimedia-operations) [2025-12-01T16:28:35Z] <tappof> "thanos-store: set cutoff days to 1" reverted on titan1002 (2/4) T410152

Mentioned in SAL (#wikimedia-operations) [2025-12-01T17:17:16Z] <tappof> "thanos-store: set cutoff days to 1" reverted on titan2002 (3/4) T410152

Mentioned in SAL (#wikimedia-operations) [2025-12-01T17:39:16Z] <tappof> "thanos-store: set cutoff days to 1" reverted on titan2001 (4/4) T410152

The trend has changed after the revert.

image.png (1×1 px, 53 KB)