
/srv on an-web1001 is low on disk space
Closed, Resolved (Public)

Assigned To / Authored By: BTullis
Oct 27 2023, 9:58 AM

Description

We have received a warning that the /srv partition on an-web1001 is low on disk space.

btullis@an-web1001:~$ df -h /srv
Filesystem           Size  Used Avail Use% Mounted on
/dev/mapper/vg0-srv  1.4T  1.3T   63G  96% /srv

It has been sitting at over 93% for some time, but it has just started trending upwards. We should check it out and see whether there is anything we need to do.

image.png (screenshot)

https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=an-web1001&var-datasource=thanos&var-cluster=analytics&viewPanel=12

Event Timeline

BTullis triaged this task as High priority. Oct 27 2023, 9:58 AM
BTullis moved this task from Incoming to In Progress on the Data-Platform-SRE board.

Ah good, it's levelled out at 97% and does not seem to be increasing.

image.png (screenshot)

Not good. It is tracking up again.

image.png (screenshot)

I started a conversation about it on Slack here: https://wikimedia.slack.com/archives/CSV483812/p1698400981595519

I'm running commands like this to find out what has been modified today and whether there is anything obvious that we can do to mitigate the situation.

btullis@an-web1001:/srv/published-rsynced$ sudo find . -daystart -mtime 0 -print
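
A complementary check (not part of the session above; a sketch only) is to rank the largest directories so that the biggest consumers of space stand out:

# Hypothetical follow-up command, not run verbatim in this session:
sudo du -h --max-depth=3 /srv | sort -rh | head -n 20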

This file jumped out at me:

btullis@an-web1001:/srv/published-rsynced$ ls -lah ./stat1007/datasets/one-off/paragon/
total 76G
drwxrwxr-x  2 stats wikidev 4.0K Oct 27 10:08 .
drwxrwxr-x 13 stats wikidev 4.0K Oct 26 15:42 ..
-rw-rw-r--  2 stats wikidev  41G Oct 27 09:45 features_all_2022.parquet.zip
-rw-------  2 stats wikidev  36G Oct 27 10:27 .features_all_2022.parquet.zip.PZ2xgU

This correlates with:

btullis@an-web1001:/srv/published-rsynced$ sudo lsof |grep published-rsynced
bash       4635                          btullis  cwd       DIR              253,2        4096   76808193 /srv/published-rsynced
rsync      7763                            stats  cwd       DIR              253,2        4096   18087947 /srv/published-rsynced/stat1007
rsync      7763                            stats  rtd       DIR              253,2        4096   76808193 /srv/published-rsynced
rsync      7765                            stats  cwd       DIR              253,2        4096   18087947 /srv/published-rsynced/stat1007
rsync      7765                            stats  rtd       DIR              253,2        4096   76808193 /srv/published-rsynced
rsync      7765                            stats    4r      REG              253,2 43437522944   19238066 /srv/published-rsynced/stat1007/datasets/one-off/paragon/features_all_2022.parquet.zip
rsync      7765                            stats    7u      REG              253,2 39787954176   19238212 /srv/published-rsynced/stat1007/datasets/one-off/paragon/.features_all_2022.parquet.zip.PZ2xgU
sudo      26149                             root  cwd       DIR              253,2        4096   76808193 /srv/published-rsynced
grep      26150                          btullis  cwd       DIR              253,2        4096   76808193 /srv/published-rsynced
lsof      26151                             root  cwd       DIR              253,2        4096   76808193 /srv/published-rsynced
lsof      26152                             root  cwd       DIR              253,2        4096   76808193 /srv/published-rsynced

So perhaps the file /srv/published-rsynced/stat1007/datasets/one-off/paragon/features_all_2022.parquet.zip has been updated upstream, and rsync is writing a temporary copy during the transfer; once the sync completes, the temporary file will replace the existing 41 GB file, which will then be deleted.
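
For background: by default rsync writes incoming data to a hidden temporary file (the .features_all_2022.parquet.zip.PZ2xgU entry above) and renames it over the destination once the transfer completes, which is why both copies occupy space at the same time. If that peak usage ever needed to be avoided, one option would be in-place updates; the sketch below is illustrative only, and the source module name is an assumption rather than the real sync job:

# Illustrative sketch, not the actual sync job; the source module name is assumed.
# --inplace writes directly to the destination file, avoiding the hidden
# temporary copy, at the cost of a partially-written file if interrupted.
rsync -av --inplace stat1007.eqiad.wmnet::publish/ /srv/published-rsynced/stat1007/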

BTullis moved this task from In Progress to Done on the Data-Platform-SRE board.
BTullis added subscribers: JAllemandou, Ladsgroup.

This incident is now resolved.

We removed some datasets from /srv/published-datasets/one-off/ladsgroup which allowed the new and updated datasets to continue syncing.
(Thanks @Ladsgroup for assisting us with this.)

@JAllemandou has started a conversation about whether a maximum size policy and/or quota mechanism might be required.
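
One lightweight form such a policy could take (purely a sketch; no such check exists yet, and the 20G threshold is an arbitrary example) is a periodic scan that flags oversized published files:

# Hypothetical size-cap check; the threshold is illustrative only.
find /srv/published-rsynced -type f -size +20G -printf '%s\t%p\n' | sort -rn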

BTullis moved this task from Done to In Progress on the Data-Platform-SRE board.

Reopening, since we are seeing this alert again at 95% of capacity.

image.png (screenshot)

Looking at this, we do have some spare capacity in the LVM volume group.

btullis@an-web1001:~$ df -h /srv
Filesystem           Size  Used Avail Use% Mounted on
/dev/mapper/vg0-srv  1.4T  1.3T   75G  95% /srv

btullis@an-web1001:~$ sudo vgs
  VG  #PV #LV #SN Attr   VSize  VFree   
  vg0   1   3   0 wz--n- <1.82t <372.53g

btullis@an-web1001:~$ sudo lvs
  LV   VG  Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  root vg0 -wi-ao----  74.50g                                                    
  srv  vg0 -wi-ao----   1.38t                                                    
  swap vg0 -wi-ao---- 976.00m

We have 372 GB free in the volume group, so I propose to add 350 GB to the vg0-srv logical volume.
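That would leave roughly 22 GB unallocated in the volume group (372.53 GB free minus the 350 GB extension) as a small reserve.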

I have increased the size, and the /srv filesystem is now at 76% of capacity.

btullis@an-web1001:~$ sudo lvresize -L +350G vg0/srv
  Size of logical volume vg0/srv changed from 1.38 TiB (362144 extents) to 1.72 TiB (451744 extents).
  Logical volume vg0/srv successfully resized.
btullis@an-web1001:~$ sudo resize2fs /dev/vg0/srv 
resize2fs 1.44.5 (15-Dec-2018)
Filesystem at /dev/vg0/srv is mounted on /srv; on-line resizing required
old_desc_blocks = 177, new_desc_blocks = 221
The filesystem on /dev/vg0/srv is now 462585856 (4k) blocks long.

btullis@an-web1001:~$ df -h /srv
Filesystem           Size  Used Avail Use% Mounted on
/dev/mapper/vg0-srv  1.7T  1.3T  406G  76% /srv
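
(Sanity check: 462585856 blocks × 4 KiB per block ≈ 1.72 TiB, consistent with both the lvresize output and the 1.7T size now reported by df.)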

Mentioned in SAL (#wikimedia-analytics) [2023-11-29T14:01:33Z] <btullis> increased the size of the vg0/srv logical volume on an-web1001 by 350 GB for T349889

BTullis moved this task from In Progress to Done on the Data-Platform-SRE board.