
zoomviewer uses an unreasonable amount of disk space
Open, Needs Triage, Public

Description

```
1.2T    /srv/tools/project/zoomviewer
```

zoomviewer is currently using a double-digit percentage of the storage space allocated to all ~3,500 tools on Toolforge. Please reduce that to a more reasonable amount of space.

Previous work:
- T285018: zoomviewer taking up a lot of NFS space -- please clean up
- T248188: Zoomviewer has ~450,000 files in NFS home directory

Event Timeline

I'll get on it. On a philosophical level I wonder what "reasonable" is. I agree that 1.2T is a lot, but it is by nature a storage hungry tool that processes many images on Commons, and I would argue that many of the 3,500 tools have very different storage requirements. Anyhow. I can check on my cron job (that I thought I had set up) to delete files based on atime...

Does that NFS mount support atime?

> Does that NFS mount support atime?

Our NFS volumes are mounted with 'noatime' so it seems we do not. I don't have the history of that decision close at hand but I'm reluctant to rock the boat when it comes to NFS.

I reduced the expiry time to 30 days.

Also, I fixed a bug causing originals to be deleted less than 1 day after download.

Previously, pyramids were 221GB and originals were 17GB. Now pyramids are 107GB but we can expect originals to grow to a multiple of that figure, due to the bug fix. If that's a problem, the expiry time can be reduced further.

Pyramids are now 294GB and originals are 971GB.

I confirmed that the cron job is working correctly. That is 30 days worth of files.
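For reference, a minimal sketch of what an mtime-based purge like this could look like (the cache path, function name, and schedule are assumptions for illustration, not the tool's actual script):

```shell
#!/bin/sh
# Sketch of an mtime-based cache purge. The path and retention window
# shown below are illustrative assumptions, not the tool's actual
# configuration.
purge_cache() {
    dir=$1
    days=$2
    # Delete regular files not modified in the last $days days...
    find "$dir" -type f -mtime +"$days" -delete
    # ...then remove any directories left empty by the purge.
    find "$dir" -mindepth 1 -type d -empty -delete
}

# Hypothetical crontab entry running the purge nightly:
# 0 3 * * * purge_cache /data/project/zoomviewer/public_html/cache 30
```

Since the NFS volume is mounted noatime, mtime is the only timestamp such a script can rely on, which is why files get re-downloaded rather than having their expiry refreshed on access.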

I reduced the cache expiry time to 15 days (commit).

> On a philosophical level I wonder what "reasonable" is. I agree that 1.2T is a lot, but it is by nature a storage hungry tool that processes many images on Commons, and I would argue that many of the 3,500 tools have very different storage requirements.

It sounds like Taavi would like it to be a single-digit percentage. Current disk usage across all tools is 7.7TB so usage of 749GB would round down to 9%. A 15 day expiry time should get us under that for a while.

Alternatively, we could increase the storage usage of all the other tools, to make zoomviewer seem less bad.

Just joking. But the tool is presumably several times cheaper than productionization (T77151). Swift storage requires a lot of replication and thumbnails traditionally have no expiry.

> Does that NFS mount support atime?

Mounting the whole volume with atime would probably not be practical.

It's now 1.3TB. And most of that is from the last 7 days:

```
tools.zoomviewer@tools-bastion-13:~/public_html/cache$ for d in $(seq 15); do echo -n "$d " ; find -mtime "$d" -print0 | du -csh --files0-from - | grep total ; done
1 120G  total
2 128G  total
3 108G  total
4 130G  total
5 151G  total
6 142G  total
7 105G  total
8 20G   total
9 21G   total
10 25G  total
11 24G  total
12 19G  total
13 20G  total
14 25G  total
15 20G  total
```

Maybe we've got an abuse problem.

Hm, only the pyramids should need to be retained. This amount of growth indicates new images being accessed. A look at the logs might be in order. Is this coming from a small number of IPs?

> Hm, only the pyramids should need to be retained.

Right now, it depends on originals being retained.

> This amount of growth indicates new images being accessed. A look at the logs might be in order. Is this coming from a small number of IPs?

I'm just looking at Toolforge webservice logs, which don't have IP addresses. But the AI-related crawler traffic we're seeing on other sites generally comes from many IP addresses: User-Agent strings are randomly selected from a list, and the crawler blindly follows links, ignoring robots.txt and the meta robots tag.

That command only gives 3 hours of logs, and those log entries seem legitimate right now.

Here's a better one-liner for disk space usage broken down by date:

```
$ find -printf "%TF %s\n" | awk '{sizes[$1] += $2;total += $2} END {for (d in sizes) {print d,sizes[d]/2^30 | "sort"} print "total: ", total/2^30}'
total:  1308
2025-05-10 14.5708
2025-05-11 21.3571
2025-05-12 17.4406
2025-05-13 22.4436
2025-05-14 21.7098
2025-05-15 22.3606
2025-05-16 24.7735
2025-05-17 18.8676
2025-05-18 54.5284
2025-05-19 152.689
2025-05-20 152.109
2025-05-21 139.529
2025-05-22 96.5159
2025-05-23 144.987
2025-05-24 80.3807
2025-05-25 187.587
2025-05-26 136.155
```

The trend continued in the last 3 days, adding 414GB and purging 56GB for a net increase of 353GB.

I ran a one-off purge of files older than 7 days, reducing the current size to 1048GB.

With the current level of traffic, the expiry time would need to be 5 days to maintain disk usage under 749GB.

Removing originals would help a lot, but the code depends on the originals just for their timestamps which are used to determine whether the pyramid files are valid.

Storing the entire original file just for a timestamp is pretty wasteful. I'm sure we can come up with a better solution...

Yeah, I don't see why we wouldn't be able to take the modification date of the pyramid instead.
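If the check were switched over, it might look something like this sketch, which treats a pyramid as valid when its own mtime is at least as new as the upstream file's upload timestamp (the function name and the source of that timestamp are assumptions, not the tool's actual code):

```shell
#!/bin/sh
# Sketch: decide pyramid validity from the pyramid's own mtime instead
# of a retained original. $2 would come from somewhere like the Commons
# API (the file's upload time as a Unix epoch); that source is an
# assumption for illustration.
pyramid_is_fresh() {
    pyramid=$1
    source_epoch=$2
    [ -e "$pyramid" ] || return 1
    [ "$(stat -c %Y "$pyramid")" -ge "$source_epoch" ]
}
```

With a check along these lines, the original could be deleted as soon as the pyramid is built, since nothing else would need its timestamp.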

The /data/project/zoomviewer directory has grown by about a terabyte in the two weeks since I filed this task. Is the cleanup script working as expected? Can I help with this somehow?

> The /data/project/zoomviewer directory has grown by about a terabyte in the two weeks since I filed this task. Is the cleanup script working as expected?

Yes, it's working the same as always. We're still getting 6x as much traffic as usual so we will need a 6x shorter expiry time. I reduced the expiry time to 5 days and ran the script, so it's now down to 793 GB.

> Can I help with this somehow?

Is it possible to identify and block crawlers that are hitting this tool?

> Yes, it's working the same as always. We're still getting 6x as much traffic as usual so we will need a 6x shorter expiry time. I reduced the expiry time to 5 days and ran the script, so it's now down to 793 GB.

How hard would it be to change the expiration script to set a cap on storage usage instead of a hard-coded retention time?

Also, related to your earlier comment about using Swift: the storage costs of our object storage service would generally be comparable to, or even a bit lower than, the current NFS volume (both are backed by the same Ceph cluster already). Migrating the zoomviewer storage there would mean we could give it a separate quota and not have to worry about it consuming the storage that's shared with all other Toolforge tools.
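One possible shape for the cap idea: delete files oldest-first until the directory fits a byte budget. This is only a sketch (GNU find/du assumed; the function name, cap value, and paths are illustrative):

```shell
#!/bin/sh
# Sketch of a size-capped purge: delete files oldest-first until the
# directory's apparent size drops under a byte budget. GNU find and du
# are assumed; names and values are illustrative.
purge_to_cap() {
    dir=$1
    cap_bytes=$2
    used=$(du -sb "$dir" | cut -f1)
    tab=$(printf '\t')
    # Emit "mtime<TAB>size<TAB>path", sort oldest first, and delete
    # until the running total is under the cap.
    find "$dir" -type f -printf '%T@\t%s\t%p\n' | sort -n |
    while IFS=$tab read -r mtime size path; do
        [ "$used" -le "$cap_bytes" ] && break
        rm -f -- "$path"
        used=$((used - size))
    done
}

# Hypothetical crontab entry capping the cache at roughly 700 GB:
# 0 * * * * purge_to_cap /data/project/zoomviewer/public_html/cache 700000000000
```

Compared to a fixed retention time, this degrades gracefully under traffic spikes: heavy crawling shortens the effective retention instead of blowing the disk budget.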

> Can I help with this somehow?
>
> Is it possible to identify and block crawlers that are hitting this tool?

There were a couple of clients doing 10x the traffic of the next-biggest clients that I could easily block, so I did that.