
Limit paws storage
Closed, Resolved · Public

Description

Using paws-nfs-1 as a sample, it is my understanding that this is a snapshot of PAWS NFS usage from March 2022. There are a total of 4227 home directories. The following table describes their usage:

data usage | > 1M | > 10M | > 100M | > 1G | > 10G
# of users | 874  | 532   | 206    | 78   | 12
% of users | 20.7 | 12.6  | 4.9    | 1.8  | 0.3
% of data  | 99.9 | 99.7  | 97.8   | 89.3 | 56.0

Of the largest directories, 1/3 do not appear to have been used in over a year prior to this snapshot (the newest file was more than a year old).

Considering that PAWS does not give a user very much compute, it is questionable how much value is derived from being able to keep a large amount of local data, while the cost is increased storage usage and reduced portability of PAWS itself. Considering that only 1.8% of users exceed 1G of storage, and the large amount of storage that could be recovered (we wouldn't get 89.3% back, since people could still use up to 1G, but we would probably get around 80% back), it seems reasonable to investigate a quota along these lines and refer people who need more to Toolforge.
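
For illustration, a distribution like the table above could be produced with a short scan over the home directories, roughly along these lines. This is only a sketch; the mount point /srv/paws/homes is a placeholder, not the actual path on the NFS server.

```python
# Rough, illustrative sketch of how the usage table above could be gathered.
import os

def dir_size(path):
    """Total size in bytes of the regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path, onerror=lambda e: None):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                continue
    return total

HOMES_ROOT = "/srv/paws/homes"  # placeholder; not the real mount point
sizes = [dir_size(os.path.join(HOMES_ROOT, d)) for d in os.listdir(HOMES_ROOT)]

thresholds = {"1M": 2**20, "10M": 10 * 2**20, "100M": 100 * 2**20,
              "1G": 2**30, "10G": 10 * 2**30}
total_users, total_data = len(sizes), sum(sizes)
for label, limit in thresholds.items():
    over = [s for s in sizes if s > limit]
    print(f"> {label}: {len(over)} users, "
          f"{100 * len(over) / total_users:.1f}% of users, "
          f"{100 * sum(over) / total_data:.1f}% of data")
```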

Event Timeline

rook updated the task description.

I agree that it would be useful to define limits on PAWS and clarify when to use PAWS versus migrating into Toolforge. In general, I believe NFS is the only remaining storage without quotas. This would help.

rook renamed this task from "Limit paws storage?" to "Limit paws storage". Mar 13 2023, 1:40 PM
rook changed the task status from Open to In Progress. May 2 2023, 11:36 AM
rook claimed this task.

Hi @rook, re

For home directories whose capacity goes above 1 gigabyte, the largest files will be removed to bring the total usage below 1 gigabyte.

Will the affected users be informed and/or will there be a file left in place (e.g. <original_filename>.REMOVED with a message linking to this issue) to indicate that this has happened?

For home directories whose capacity goes above 1 gigabyte, the largest files will be removed to bring the total usage below 1 gigabyte.

The largest file in my PAWS account is ./.ipython/profile_default/history.sqlite at ~170M. This is a meta file that contains a history of commands, and I would really appreciate it if such meta files were excluded from deletion, as this is really helpful in recovery situations.

(I just tidied my own PAWS home folder to <1G, so this would likely not affect me this time, but I think it is worth considering such situations when deleting files.)

Will the affected users be informed and/or will there be a file left in place (e.g. <original_filename>.REMOVED with a message linking to this issue) to indicate that this has happened?

I think we can add that.

The largest file in my PAWS account is ./.ipython/profile_default/history.sqlite at ~170M. This is a meta file that contains a history of commands, and I would really appreciate it if such meta files were excluded from deletion, as this is really helpful in recovery situations.

The underlying intention is to help define what is PAWS and what is Toolforge. Which is to say, PAWS is constrained to 1 CPU and 2G of RAM, such that when one starts exceeding those limits one should move to Toolforge for a more complete experience. I'm reluctant to exclude particular types of files, as it increases complexity and opens the door to abuse. Moreover, I would point out that none of PAWS is backed up: if there is anything anyone would be worried to lose, such files should be backed up to the user's own space.

This makes sense to me overall, with a few thoughts about how to reduce frustration on the user side. I took a look at mine and I was at ~5GB (sorry), so I removed a few larger data files that could easily be re-downloaded if needed again, but found that almost 4GB of this was actually pip cache that I wasn't even aware of. Assuming this is not just me, a few thoughts:

  • The pip cache is pretty hidden -- I first did $ du -hs * and was confused because nothing stood out as being particularly large. I then looked at the cache only because I'd been doing some machine learning work and knew that the model files were stored there, so I assumed that was the issue (which luckily overlapped with the pip cache issue). All to say, you have to look for it explicitly to find it, so folks who do a lot of Python work might find themselves inadvertently filling up their quota without understanding why.
  • Regarding my second guess about the issue -- HuggingFace machine learning models -- it's true that PAWS has limited compute, but it does have enough to make it a really nice place for showcasing how to use ML models (current example) for Wikimedia content. Might it make more sense to set the storage limit to at least match available RAM?
  • Personally, part of the challenge for me is that I'm a long-time user of PAWS, so I have built up a number of notebooks with small data components that together end up taking up a fair bit of space, even if each one is well within expected PAWS usage.
  • Potential compromises:
    • Is it possible to move some of this pip cache off the individual hosts, or is that a headache / not a good security idea for some reason? If not, a warning message pointing folks to some documentation/tips -- things like using $ du -hs ~/* , doing $ pip cache purge , and checking the size of hidden folders -- would be a useful pointer (see the sketch after this list).
    • Will there be a way to request extra storage (as with Toolforge / Cloud VPS)? That would honestly solve most of my personal concerns because it sounds like from the statistics (thank you), there aren't too many of us who will be impacted.
    • Could there be a temporary space within a working session perhaps that has a much higher limit but is automatically deleted at the end of the session? That way it could be used for larger files such as huggingface models or pip libraries that are necessary within a session and fit normal use expectations but can be safely deleted and downloaded fresh in a future session -- perhaps this is by default the ~/.cache folder?
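
For what it's worth, a rough per-folder breakdown of one's own home directory (including hidden folders like ~/.cache, where the pip cache lives) can be printed from a PAWS notebook or terminal with something like the following sketch; it is only illustrative:

```python
# Illustrative only: list top-level entries in the home directory by size,
# so hidden space hogs like ~/.cache stand out.
from pathlib import Path

def entry_size(path: Path) -> int:
    """Size of a file, or total size of the files under a directory."""
    if path.is_file():
        return path.stat().st_size
    return sum(p.stat().st_size for p in path.rglob("*") if p.is_file())

home = Path.home()
for size, name in sorted(((entry_size(p), p.name) for p in home.iterdir()),
                         reverse=True):
    print(f"{size / 2**20:9.1f} MiB  {name}")
```

The pip cache itself can then be cleared with $ pip cache purge.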

@Isaac Tracking RAM makes sense to me. How about we set the limit to about 5GB? A little more than twice the RAM limit. That should get most of the storage cleared up, and the impact should be on very few people at that point.

I agree that in many cases the pip cache would be implicated in what is consuming space. I would sidestep the issue in that I'm not expecting anyone to actively clean up their space. I don't believe there would be harm in a script removing their pip cache if it did set them over the limit, as they should be able to pull things back down if that is the case. In general the assumption is that PAWS will clean itself up; there is no shame in filling up a home dir, it will fix itself.

Speaking to whether there can be temporary over-allocation: yes, there can be. It won't be a hard quota, rather a script that goes and cleans up every day. As such any user could go over the limit, and it would be cleaned up on a schedule. So one could pull down more than the limit and work with it, though it would vanish. I would contend that if it is taking more than maybe ten minutes or so to pull down the data needed for a project, it is probably time to look at Toolforge rather than PAWS.
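
To make the mechanism concrete, the daily sweep could look roughly like the sketch below. This is only illustrative, not the actual cleanup script: the mount point is a placeholder, the limit uses the ~5GB figure proposed above, and the .REMOVED marker is the behaviour suggested earlier in this task.

```python
# Illustrative sketch of a daily soft-quota sweep; the real PAWS script may differ.
from pathlib import Path

LIMIT = 5 * 2**30                # ~5GB soft limit discussed above
HOMES = Path("/srv/paws/homes")  # placeholder; not the actual mount point
TASK = "https://phabricator.wikimedia.org/T327936"

def clean_home(home: Path) -> None:
    files = [f for f in home.rglob("*") if f.is_file()]
    usage = sum(f.stat().st_size for f in files)
    if usage <= LIMIT:
        return
    # Remove the largest files first until the directory is back under the limit,
    # leaving a small .REMOVED marker (as suggested earlier in this task).
    for f in sorted(files, key=lambda p: p.stat().st_size, reverse=True):
        size = f.stat().st_size
        f.unlink()
        f.with_name(f.name + ".REMOVED").write_text(
            f"Removed to bring this home directory under its quota; see {TASK}\n")
        usage -= size
        if usage <= LIMIT:
            break

for home in sorted(HOMES.iterdir()):
    if home.is_dir():
        clean_home(home)
```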

How about we set the limit to about 5GB? A little more than twice the RAM limit.

@rook If that would still solve your issue, that sounds great to me and unlikely to cause new problems! thanks!

I don't believe there would be harm in a script removing their pip cache if it did set them over the limit

Yeah, that makes a lot of sense to me. As a long-time user, I already expect the virtual environments to need a fresh install each new session, so in some ways, it's more confusing that the cache is permanent :)

Speaking to whether there can be temporary over-allocation: yes, there can be. It won't be a hard quota, rather a script that goes and cleans up every day. As such any user could go over the limit, and it would be cleaned up on a schedule. So one could pull down more than the limit and work with it, though it would vanish. I would contend that if it is taking more than maybe ten minutes or so to pull down the data needed for a project, it is probably time to look at Toolforge rather than PAWS.

Yes, this would be perfect! I generally just need enough time to pull down the data etc. and load it into RAM, and I agree that it's not a bad idea to have gentle nudges like this to move to Toolforge as the processing gets heavier. Because I have access to our own internal clusters, I'm rarely doing real compute jobs on PAWS but use it much more as a tutorial space -- an easy way to share workable examples of how to use a certain dataset etc. that can be easily re-run (as opposed to, e.g., a gist on GitHub or a code repo on Toolforge).

Mentioned in SAL (#wikimedia-cloud) [2023-05-15T12:00:46Z] <Rook> implemented limits to storage. Cleared about half the storage used T327936

Saw some discussion on IRC about there being no announcement. In addition to this ticket, a notice was sent out through cloud-announce: https://lists.wikimedia.org/hyperkitty/list/cloud-announce@lists.wikimedia.org/thread/3GKGX3JQXCAB6BNC26C2UP72FTPR6PLT/