
Move PAWS nfs onto its own share
Closed, Resolved · Public

Description

So people can't accidentally fill up tools by filling up PAWS

Event Timeline

Chicocvenancio changed the task status from Open to Stalled. (Feb 25 2018, 9:23 PM)
Chicocvenancio triaged this task as Medium priority.

I don't think this is resolved unless we also move PAWS out of tools; as it stands, the separate NFS share is not cross-mounted into the tools PAWS cluster.

Apr 5 19:12:11 labstore1005 nfs-exportd[7797]: exportfs: Failed to stat /exp/project/paws: No such file or directory

Just a note that the mountpoint isn't on the standby yet, until I get around to creating it :)
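The "Failed to stat" from exportfs only means the export path doesn't exist on the standby yet. A minimal sketch of the failure and fix, using a temp dir as a stand-in for /exp (on labstore1005 the mkdir and an `exportfs -ra` would run as root):

```shell
ROOT=$(mktemp -d)                  # stand-in for /exp on the standby
stat "$ROOT/project/paws" 2>&1 | grep -c 'No such file'   # reproduces the nfs-exportd error: path missing
mkdir -p "$ROOT/project/paws"      # on the real host: mkdir -p /exp/project/paws
stat -c '%F' "$ROOT/project/paws"  # now a directory, so exportfs can stat and export it
```

After creating the directory, `exportfs -ra` re-reads the export table and the error goes away.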

> Apr 5 19:12:11 labstore1005 nfs-exportd[7797]: exportfs: Failed to stat /exp/project/paws: No such file or directory
>
> Just a note that the mountpoint isn't on the standby yet, until I get around to creating it :)

This is now fixed by virtue of the bindmounts no longer existing since https://gerrit.wikimedia.org/r/c/operations/puppet/+/571821

We need to sync the paws-user-homes and related materials to the paws project when things are ready to cut over.

Beyond the user homes, it should be noted that we also have:

paws/           paws-beta/      paws-dev/       paws-public/    paws-published/ paws-stats/     paws-status/    paws-support/

User homes are at labstore100[45]:/srv/tools/shared/tools/project/paws/userhomes/
The failsafe jupyterhub sqlite file is in labstore100[45]:/srv/tools/shared/tools/project/paws/db/ and needs a persistent volume created in the cluster for it.
paws-beta isn't used apparently. paws-dev appears to be historical. paws-published seems aspirational. paws-public is the homedir of the paws-public tool. This mostly holds the required yaml manifest and readme to launch paws-public. That needs to go in the GitHub repo.

paws-status seems to have been a planned app that isn't there. On the other hand paws-support actually does host something: https://tools.wmflabs.org/paws-support/ There is nothing about that which needs to move with paws from what I can tell. It's just to get a package online.

> The failsafe jupyterhub sqlite file is in labstore100[45]:/srv/tools/shared/tools/project/paws/db/ and needs a persistent volume created in the cluster for it.

This doesn't seem to actually be used in the cluster. Maybe it should be!
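If it were wired in, the sqlite file would need a small claim in the cluster. A hypothetical sketch only (the name and namespace are illustrative, not the actual manifest):

```yaml
# Hypothetical sketch: a small PVC the failsafe jupyterhub sqlite file
# could live on if it were used in the cluster. All names illustrative.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: hub-db-dir
  namespace: prod
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
```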

So right now, paws/userhomes is 218G.

/srv/misc has 996G available (good), but that means it is at 80% use (not so good). Copying them over shouldn't move that needle much, but I'm now wondering if there is anything we can clean up first.

Change 605705 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] nfs monitoring: fix the broken paths for the directory size monitor

https://gerrit.wikimedia.org/r/605705

> So right now, paws/userhomes is 218G.
>
> /srv/misc has 996G available (good), but that means it is at 80% use (not so good). Copying them over shouldn't move that needle much, but I'm now wondering if there is anything we can clean up first.

Sadly I think this is the same classic problem as the Toolforge NFS share. No hard quotas and no automatic purge means that files "leak" and are never recovered. There are currently 2823 distinct user directories and within them 1.3M files with an mtime of 365 days or more. For directory size, Matias zapata seems to be the 'winner' with 58G in their home which all seems to be due to a copy of the 2020-02-20 enwiki article dump stored as both a 16G compressed file and a 42G expansion of that same file.
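Both measurements above (stale-file count and per-user directory size) come down to find and du. A sketch on a throwaway tree standing in for the userhomes path (the user names and file are made up):

```shell
HOMES=$(mktemp -d)    # stand-in for .../project/paws/userhomes
mkdir -p "$HOMES/user-a" "$HOMES/user-b"
touch "$HOMES/user-a/recent.ipynb"
touch -d '400 days ago' "$HOMES/user-b/old-dump.xml"   # GNU touch relative date
# count files not modified in a year -- the "leaked" files:
find "$HOMES" -type f -mtime +365 | wc -l
# per-user directory sizes, largest last -- how a 58G home stands out:
du -s "$HOMES"/* | sort -n
```

Pairing the two (find the stale files, then du the directories that hold them) shows how much of the 218G would be reclaimed by a purge policy.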

Change 605705 merged by Bstorm:
[operations/puppet@production] nfs monitoring: fix the broken paths for the directory size monitor

https://gerrit.wikimedia.org/r/605705

After fixing and running the prometheus monitor, the top projects are:
node_directory_size_bytes{directory="/srv/misc/shared/video/project",name="misc_project"} 5671084032
node_directory_size_bytes{directory="/srv/misc/shared/deployment-prep/project",name="misc_project"} 6316769280
node_directory_size_bytes{directory="/srv/misc/shared/editor-engagement/project",name="misc_project"} 7366733824
node_directory_size_bytes{directory="/srv/misc/shared/wikidata-query/project",name="misc_project"} 7635476480
node_directory_size_bytes{directory="/srv/misc/shared/wikidata-dev/project",name="misc_project"} 11617595392
node_directory_size_bytes{directory="/srv/misc/shared/testlabs/project",name="misc_project"} 13154086912
node_directory_size_bytes{directory="/srv/misc/shared/huggle/project",name="misc_project"} 28119285760
node_directory_size_bytes{directory="/srv/misc/shared/analytics/project",name="misc_project"} 45218971648
node_directory_size_bytes{directory="/srv/misc/shared/math/project",name="misc_project"} 80661999616
node_directory_size_bytes{directory="/srv/misc/shared/bots/project",name="misc_project"} 180541407232
node_directory_size_bytes{directory="/srv/misc/shared/quarry/project",name="misc_project"} 231540453376
node_directory_size_bytes{directory="/srv/misc/shared/wikidumpparse/project",name="misc_project"} 254470127616
node_directory_size_bytes{directory="/srv/misc/shared/dumps/project",name="misc_project"} 2561104519168
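For scale, the raw byte values convert to GiB with a one-liner; two of the samples above fed in via heredoc:

```shell
# Convert node_directory_size_bytes samples to GiB for readability.
awk '{ printf "%8.1f GiB  %s\n", $NF / (1024 * 1024 * 1024), $1 }' <<'EOF'
node_directory_size_bytes{directory="/srv/misc/shared/quarry/project",name="misc_project"} 231540453376
node_directory_size_bytes{directory="/srv/misc/shared/dumps/project",name="misc_project"} 2561104519168
EOF
```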

dumps project has ~2.3 TiB (2561104519168 bytes).
That is wildly more than any other project on misc. I seem to recall that project has some kind of cached files that need regular cleanup.

Mentioned in SAL (#wikimedia-cloud) [2020-06-24T21:45:23Z] <bstorm> doing an initial rsync of the paws userhomes to the new project T160113

Final rsync is done and the old cluster is shut down.