
Not enough space on titan2001 for thanos-compact
Open, Stalled, Needs Triage, Public

Description

We're at the point where disk space on the titan hosts is no longer enough for certain kinds of thanos-compact operations, i.e. the compactor runs out of space:

Mar 02 04:50:45 titan2001 thanos-compact[641856]: level=error ts=2024-03-02T04:50:45.868960443Z caller=compact.go:487 msg="critical error detected; halting" err="compaction: group 0@10531109435386935375: compact blocks [/srv/thanos-compact/compact/0@10531109435386935375/01HQ15VXWX3H8ZN0CDZKHJ6QMA /srv/thanos-compact/compact/0@10531109435386935375/01HQ25TA9WT5N78X476Y9SS4KG /srv/thanos-compact/compact/0@10531109435386935375/01HQ78BNP20PGX5SGDJEGJ914A /srv/thanos-compact/compact/0@10531109435386935375/01HQCHHWRCDXYQ5VQFN9GN7X45 /srv/thanos-compact/compact/0@10531109435386935375/01HQF980AEFA46FRSWA286ZH7Z /srv/thanos-compact/compact/0@10531109435386935375/01HQMR4F0Q73R9T9N6V7TCS4QC /srv/thanos-compact/compact/0@10531109435386935375/01HQSZFWAN1A8Q01CERPR0ACK5]: 2 errors: populate block: add series: write series data: write /srv/thanos-compact/compact/0@10531109435386935375/01HQYNSETD2HYVH41WE8EPE1NB.tmp-for-creation/index: no space left on device; write /srv/thanos-compact/compact/0@10531109435386935375/01HQYNSETD2HYVH41WE8EPE1NB.tmp-for-creation/index: no space left on device"

We have requested additional SSDs for all titan hosts as part of next year's capex, though it looks like we need to speed that up. I'll ask dcops in codfw whether they have a couple of big SSDs we can install temporarily.

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2024-03-05T08:47:25Z] <godog> add new disk to titan2001 /srv - T359068

With the new 1.6TB disk in place we have ~2.2TB of raid0, which is great. This is fine for the short/medium term but not for the long term, because it means thanos-compact can now complete a cycle only on titan2001. We'll bring the other hosts in line in terms of space soon, though (whether next FY or this FY is TBD).

As for the details: I didn't want to introduce LVM just to extend the raid0 with a bigger drive. Instead, the 1.6TB drive is split into 4x 400GB partitions, which joined the existing ~400GB raid0 partitions on the existing/standard drives.

md2 : active raid0 sdc4[5] sdc3[4] sdc2[3] sdc1[2] sdb4[1] sda4[0]
      2340883456 blocks super 1.2 512k chunks

Not the best in theory, because sdc receives 4x the load of each other drive in the raid0, but 100% Good Enough™ for this situation.
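
To make the size and load figures above concrete, here is a rough Python sketch that derives them from the mdstat output. It assumes all six member partitions are the same size, that I/O is spread evenly across stripes, and that the "blocks" figure in /proc/mdstat is in 1 KiB units. The function names are made up for illustration.

```python
import re
from collections import Counter

# mdstat excerpt copied from the comment above
MDSTAT = """\
md2 : active raid0 sdc4[5] sdc3[4] sdc2[3] sdc1[2] sdb4[1] sda4[0]
      2340883456 blocks super 1.2 512k chunks
"""


def members_per_disk(mdstat: str) -> Counter:
    """Count how many raid members (partitions) live on each physical disk."""
    disks = re.findall(r"\b(sd[a-z]+)\d+\[\d+\]", mdstat)
    return Counter(disks)


def array_size_tib(mdstat: str) -> float:
    """Array size in TiB, assuming mdstat 'blocks' are 1 KiB each."""
    blocks = int(re.search(r"(\d+) blocks", mdstat).group(1))
    return blocks * 1024 / 2**40


def load_share(mdstat: str) -> dict:
    """Fraction of striped I/O per disk, assuming equal-size members."""
    counts = members_per_disk(mdstat)
    total = sum(counts.values())
    return {disk: n / total for disk, n in counts.items()}


if __name__ == "__main__":
    print(f"array size: {array_size_tib(MDSTAT):.2f} TiB")
    for disk, share in sorted(load_share(MDSTAT).items()):
        print(f"{disk}: {share:.0%} of stripes")
```

Under those assumptions sdc holds four of the six members, so it serves four times the stripes of sda or sdb, and the array size comes out at roughly the ~2.2TB quoted above.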

fgiunchedi renamed this task from Not enough space on titan hosts for thanos-compact to Not enough space on titan2001 for thanos-compact. Mar 5 2024, 10:23 AM
fgiunchedi changed the task status from Open to Stalled.
fgiunchedi added a project: User-fgiunchedi.

Stalling until thanos-compact finishes its cycle, so that we can also assess how much space is used.