
Move PAWS nfs onto its own share
Closed, ResolvedPublic

Description

So people can't accidentally fill up tools by filling up PAWS

Event Timeline

bd808 moved this task from Triage to Storage on the Cloud-Services board.Mar 26 2017, 9:02 PM
Chicocvenancio changed the task status from Open to Stalled.Feb 25 2018, 9:23 PM
Chicocvenancio triaged this task as Medium priority.
GTirloni changed the task status from Stalled to Open.Mar 23 2019, 9:41 PM
GTirloni removed a subscriber: GTirloni.
GTirloni closed this task as Resolved.Mar 25 2019, 7:18 PM
Bstorm added a subscriber: Bstorm.Mar 25 2019, 7:20 PM

I don't think this is resolved unless we also move PAWS out of the tools project. As it stands, the separate NFS share is not cross-mounted into the PAWS cluster that runs inside tools.

GTirloni reopened this task as Open.Mar 25 2019, 7:27 PM
Bstorm added a comment.Apr 5 2019, 7:17 PM

Apr 5 19:12:11 labstore1005 nfs-exportd[7797]: exportfs: Failed to stat /exp/project/paws: No such file or directory

Just a note that the mountpoint isn't on the standby yet, until I get around to creating it :)

bd808 edited projects, added Data-Services; removed Cloud-Services.Jan 2 2020, 9:21 PM
bd808 moved this task from Backlog to Shared Storage on the Data-Services board.
Bstorm claimed this task.Apr 28 2020, 4:44 PM

> Apr 5 19:12:11 labstore1005 nfs-exportd[7797]: exportfs: Failed to stat /exp/project/paws: No such file or directory
>
> Just a note that the mountpoint isn't on the standby yet, until I get around to creating it :)

This is now fixed by virtue of the bindmounts no longer existing since https://gerrit.wikimedia.org/r/c/operations/puppet/+/571821

We need to sync the paws-user-homes and related materials to the paws project when things are ready to cut over.

Beyond the user homes, it should also be noted that we have:

paws/           paws-beta/      paws-dev/       paws-public/    paws-published/ paws-stats/     paws-status/    paws-support/

User homes are at labstore100[45]:/srv/tools/shared/tools/project/paws/userhomes/
The failsafe jupyterhub sqlite file is in labstore100[45]:/srv/tools/shared/tools/project/paws/db/ and needs a persistent volume created in the cluster for it.
paws-beta isn't used, apparently. paws-dev appears to be historical. paws-published seems aspirational. paws-public is the homedir of the paws-public tool; it mostly holds the YAML manifest and README required to launch paws-public, which need to go into the GitHub repo.

paws-status seems to have been a planned app that was never deployed. On the other hand, paws-support actually does host something: https://tools.wmflabs.org/paws-support/ Nothing about it needs to move with PAWS as far as I can tell; it's just there to get a package online.

> The failsafe jupyterhub sqlite file is in labstore100[45]:/srv/tools/shared/tools/project/paws/db/ and needs a persistent volume created in the cluster for it.

This doesn't seem to actually be used in the cluster. Maybe it should be!

So right now, paws/userhomes is 218G.

/srv/misc has 996G available (good), but that means it is at 80% use (not so good). Copying them over shouldn't move that needle much, but I'm now wondering if there is anything we can clean up first.
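To put a number on "shouldn't move that needle much", here is a back-of-the-envelope calculation. The 996G-free and 80%-used figures come from the comment above; treating 996G as the remaining 20% of the volume (an assumption about how those two figures relate) gives:

```python
# Back-of-the-envelope: if 996G free corresponds to the unused 20% of the
# volume, derive the total and see where copying 218G of user homes lands us.
free_g = 996
used_fraction = 0.80
total_g = free_g / (1 - used_fraction)   # ~4980G total (derived, not measured)
used_g = total_g * used_fraction         # ~3984G currently used

paws_g = 218                             # paws/userhomes size from above
new_fraction = (used_g + paws_g) / total_g
print(f"total ≈ {total_g:.0f}G, use after copy ≈ {new_fraction:.1%}")
# → total ≈ 4980G, use after copy ≈ 84.4%
```

So the copy pushes usage from roughly 80% to roughly 84%, which matches the "shouldn't move that needle much" judgment.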

Change 605705 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] nfs monitoring: fix the broken paths for the directory size monitor

https://gerrit.wikimedia.org/r/605705

bd808 added a subscriber: bd808.Jun 16 2020, 8:34 PM

> So right now, paws/userhomes is 218G.
>
> /srv/misc has 996G available (good), but that means it is at 80% use (not so good). Copying them over shouldn't move that needle much, but I'm now wondering if there is anything we can clean up first.

Sadly I think this is the same classic problem as the Toolforge NFS share. No hard quotas and no automatic purge means that files "leak" and are never recovered. There are currently 2823 distinct user directories and within them 1.3M files with an mtime of 365 days or more. For directory size, Matias zapata seems to be the 'winner' with 58G in their home which all seems to be due to a copy of the 2020-02-20 enwiki article dump stored as both a 16G compressed file and a 42G expansion of that same file.
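Figures like "1.3M files with an mtime of 365 days or more" and "58G in their home" presumably came from a find/du-style scan of the userhomes tree. A minimal local sketch of that kind of scan (the function names and the temp-directory usage are illustrative, not the actual commands run on labstore):

```python
import os
import time
from pathlib import Path

def stale_file_count(root: Path, days: int = 365) -> int:
    """Count regular files under root whose mtime is older than `days` days."""
    cutoff = time.time() - days * 86400
    count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            p = Path(dirpath) / name
            try:
                if p.stat().st_mtime < cutoff:
                    count += 1
            except OSError:
                continue  # file vanished or unreadable mid-scan; skip it
    return count

def sizes_by_user(root: Path) -> list[tuple[int, str]]:
    """Total bytes per top-level (user) directory, largest first."""
    totals = []
    for user_dir in root.iterdir():
        if not user_dir.is_dir():
            continue
        total = sum(f.stat().st_size for f in user_dir.rglob("*") if f.is_file())
        totals.append((total, user_dir.name))
    return sorted(totals, reverse=True)
```

On the real system this would be pointed at labstore100[45]:/srv/tools/shared/tools/project/paws/userhomes/ to reproduce the per-user "winner" ranking described above.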

Change 605705 merged by Bstorm:
[operations/puppet@production] nfs monitoring: fix the broken paths for the directory size monitor

https://gerrit.wikimedia.org/r/605705

After fixing and running the prometheus monitor, the top projects are:
node_directory_size_bytes{directory="/srv/misc/shared/video/project",name="misc_project"} 5671084032
node_directory_size_bytes{directory="/srv/misc/shared/deployment-prep/project",name="misc_project"} 6316769280
node_directory_size_bytes{directory="/srv/misc/shared/editor-engagement/project",name="misc_project"} 7366733824
node_directory_size_bytes{directory="/srv/misc/shared/wikidata-query/project",name="misc_project"} 7635476480
node_directory_size_bytes{directory="/srv/misc/shared/wikidata-dev/project",name="misc_project"} 11617595392
node_directory_size_bytes{directory="/srv/misc/shared/testlabs/project",name="misc_project"} 13154086912
node_directory_size_bytes{directory="/srv/misc/shared/huggle/project",name="misc_project"} 28119285760
node_directory_size_bytes{directory="/srv/misc/shared/analytics/project",name="misc_project"} 45218971648
node_directory_size_bytes{directory="/srv/misc/shared/math/project",name="misc_project"} 80661999616
node_directory_size_bytes{directory="/srv/misc/shared/bots/project",name="misc_project"} 180541407232
node_directory_size_bytes{directory="/srv/misc/shared/quarry/project",name="misc_project"} 231540453376
node_directory_size_bytes{directory="/srv/misc/shared/wikidumpparse/project",name="misc_project"} 254470127616
node_directory_size_bytes{directory="/srv/misc/shared/dumps/project",name="misc_project"} 2561104519168

The dumps project has 2.4 TB. That is wildly more than any other project on misc. I seem to recall that project has some kind of cached files that need regular cleanup.
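The raw byte counts in the metric samples are hard to eyeball. A small sketch that parses one sample line in the Prometheus text exposition format and converts the value to a human-readable size (the sample line is the dumps entry from the list above; `parse_metric` and `human` are illustrative helpers, not part of any monitoring tooling):

```python
import re

def parse_metric(line: str) -> tuple[str, int]:
    """Extract the directory label and byte count from a
    node_directory_size_bytes sample line."""
    m = re.search(r'directory="([^"]+)".*}\s+(\d+)', line)
    if not m:
        raise ValueError(f"unparseable metric line: {line!r}")
    return m.group(1), int(m.group(2))

def human(n: float) -> str:
    """Render a byte count in binary units (KiB, MiB, ...)."""
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if n < 1024:
            return f"{n:.1f} {unit}"
        n /= 1024
    return f"{n:.1f} PiB"

line = ('node_directory_size_bytes{directory="/srv/misc/shared/dumps/project",'
        'name="misc_project"} 2561104519168')
path, size = parse_metric(line)
print(path, human(size))
# → /srv/misc/shared/dumps/project 2.3 TiB
```

(2561104519168 bytes is about 2.3 TiB in binary units, or about 2.56 TB decimal, which is where the "2.4 TB" ballpark sits.)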

Mentioned in SAL (#wikimedia-cloud) [2020-06-24T21:45:23Z] <bstorm> doing an initial rsync of the paws userhomes to the new project T160113

Bstorm closed this task as Resolved.Aug 7 2020, 6:23 PM

Final rsync is done and the old cluster is shut down.