
Thanos compaction stopped due to local filesystem space shortage
Closed, Resolved · Public

Description

The thanos compact process downloads blocks from swift locally, runs the compaction algorithm, then uploads the resulting block back to swift. Today the process stopped because local disk space was exhausted:

Apr 15 14:03:39 thanos-fe2001 thanos-compact[22463]: level=error ts=2021-04-15T14:03:39.705066914Z caller=compact.go:375 msg="critical error detected; halting" err="compaction: group 0@10531109435386935375: compact blocks [/srv/thanos-compact/compact/0@10531109435386935375/01F2B1RV4HZKGCQ8HZSGRSJZBM /srv/thanos-compact/compact/0@10531109435386935375/01F2G5MFTTTSX6YRPZW62XDE1B /srv/thanos-compact/compact/0@10531109435386935375/01F2NA8AVHNGP8XY79V09TY1QT /srv/thanos-compact/compact/0@10531109435386935375/01F2TEYJ8WVSWGGW1BFNCE1PGG /srv/thanos-compact/compact/0@10531109435386935375/01F2ZMB66N988DKQ20MB3W5HZR /srv/thanos-compact/compact/0@10531109435386935375/01F34SV4YPEQY4KDPTRZ24A2KF /srv/thanos-compact/compact/0@10531109435386935375/01F39WS9M0KYSFXG0HRSZW14NS]: 2 errors: populate block: write chunks: preallocate: no space left on device; sync /srv/thanos-compact/compact/0@10531109435386935375/01F3AVXDA8YZ6CG7N60XRG4AFH.tmp-for-creation/chunks/000313: file already closed"
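For context, thanos-compact needs a local scratch directory both for the blocks it downloads and for the new block it builds before upload. A minimal sketch of the kind of invocation involved (the flags are standard thanos compact flags; the exact paths, port and objstore config file name are assumptions, not the actual unit definition on thanos-fe2001):

# Sketch only: shows where the local scratch space on /srv comes in.
/usr/bin/thanos compact \
  --wait \
  --http-address 0.0.0.0:10902 \
  --data-dir /srv/thanos-compact \
  --objstore.config-file /etc/thanos-compact/objstore.yaml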

The local space is pretty much all allocated already:

root@thanos-fe2001:~# vgs
  VG  #PV #LV #SN Attr   VSize   VFree 
  vg0   1   3   0 wz--n- 446.72g <9.35g
root@thanos-fe2001:~# df -h
Filesystem            Size  Used Avail Use% Mounted on
udev                   32G     0   32G   0% /dev
tmpfs                 6.3G  666M  5.7G  11% /run
/dev/mapper/vg0-root   73G  4.1G   66G   6% /
tmpfs                  32G  4.0K   32G   1% /dev/shm
tmpfs                 5.0M     0  5.0M   0% /run/lock
tmpfs                  32G     0   32G   0% /sys/fs/cgroup
/dev/mapper/vg0-srv   356G  191G  165G  54% /srv
tmpfs                 6.3G     0  6.3G   0% /run/user/4849

Possible solutions:

  1. Move thanos-compact to another host with more space (e.g. a backend)
  2. Switch thanos-fe to raid0 for /srv (see the sketch after this list)
  3. Store the blocks being compacted in object storage itself. Thanos doesn't support this yet, although it should work by mounting object storage locally
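To illustrate option 2: with two equally sized disks, raid1 gives the capacity of one disk while raid0 gives the sum of both, roughly doubling the space available to the VG. A hedged sketch of the effect (the real change goes through the install_server partman recipe at reimage time, not by hand; device and array names are assumptions):

# Illustration only: raid0 across two devices, then hand it to LVM.
mdadm --create /dev/md1 --level=0 --raid-devices=2 /dev/sda2 /dev/sdb2
pvcreate /dev/md1
vgcreate vg0 /dev/md1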

Event Timeline

Since the frontends are meant to be stateless, I think I prefer #2 over #1 to avoid special-casing a data partition on one of the backends. Doubling the space available for compaction should buy us quite some time.

Change 680257 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] install_server: move thanos-fe to raid0 for /srv

https://gerrit.wikimedia.org/r/680257

Change 680257 merged by Filippo Giunchedi:

[operations/puppet@production] install_server: move thanos-fe to raid0 for /srv

https://gerrit.wikimedia.org/r/680257

Script wmf-auto-reimage was launched by filippo on cumin1001.eqiad.wmnet for hosts:

thanos-fe2001.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202104160941_filippo_25233_thanos-fe2001_codfw_wmnet.log.

Completed auto-reimage of hosts:

['thanos-fe2001.codfw.wmnet']

and were ALL successful.
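After the reimage, a quick way to confirm /srv now sits on a raid0 array with roughly double the previous capacity (hostname taken from the log above; no output shown here since it wasn't captured in the task):

ssh thanos-fe2001.codfw.wmnet cat /proc/mdstat
ssh thanos-fe2001.codfw.wmnet df -h /srv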

Change 680309 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] swift: hide diffs for files with sensitive data

https://gerrit.wikimedia.org/r/680309

The issue has been mitigated by reimaging thanos-fe2001 (the host that runs thanos-compact) with a raid0 /srv; I'll reimage the other frontends next week.
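Since thanos compact halts on critical errors like the one above, it's worth checking that the compactor came back and is no longer halted after the reimage. A hedged sketch (the unit name matches the journal lines in the description; the metrics port, 10902 by default in Thanos, and the thanos_compact_halted metric name are assumptions):

# Confirm the compactor is running and the halted gauge is back to 0.
systemctl status thanos-compact
curl -s localhost:10902/metrics | grep thanos_compact_halted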

Change 680309 merged by Filippo Giunchedi:

[operations/puppet@production] swift: hide diffs for files with sensitive data

https://gerrit.wikimedia.org/r/680309

Script wmf-auto-reimage was launched by filippo on cumin1001.eqiad.wmnet for hosts:

thanos-fe2002.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202104190930_filippo_2010_thanos-fe2002_codfw_wmnet.log.

Completed auto-reimage of hosts:

['thanos-fe2002.codfw.wmnet']

and were ALL successful.

Change 681010 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] swift: force creation of /var/log/swift symlink

https://gerrit.wikimedia.org/r/681010

Change 681010 merged by Filippo Giunchedi:

[operations/puppet@production] swift: force creation of /var/log/swift symlink

https://gerrit.wikimedia.org/r/681010

Script wmf-auto-reimage was launched by filippo on cumin1001.eqiad.wmnet for hosts:

thanos-fe2003.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202104200710_filippo_5012_thanos-fe2003_codfw_wmnet.log.

Completed auto-reimage of hosts:

['thanos-fe2003.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by filippo on cumin1001.eqiad.wmnet for hosts:

thanos-fe1002.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202104200748_filippo_13723_thanos-fe1002_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['thanos-fe1002.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by filippo on cumin1001.eqiad.wmnet for hosts:

thanos-fe1003.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202104200834_filippo_24442_thanos-fe1003_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['thanos-fe1003.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by filippo on cumin1001.eqiad.wmnet for hosts:

thanos-fe1001.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202104260734_filippo_7598_thanos-fe1001_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['thanos-fe1001.eqiad.wmnet']

and were ALL successful.

fgiunchedi claimed this task.

All thanos-fe hosts reimaged, resolving