
Jupyter and Analytics Client Enhancements Phase 1: enable shared home directories on the stat servers umbrella task
Open, In Progress, Medium, Public

Description

As part of our ongoing efforts to improve the Jupyter and Analytics Client user experience, we will enable shared home directories on the stat servers.

Creating this ticket to track all sub-tasks for Phase 1.

The following is a list of anticipated steps required.

  • Create a ceph file system named home with data stored on HDD and metadata stored on SSD
  • Create a suitable cephx client user, with rights appropriate to access this file system
  • Determine what quota settings will be appropriate for this file system
  • Distribute the cephx key to a test server
  • Mount the home volume to /home on ml-lab1002
  • Add puppet management of the /home mount point and cephx key
  • Verify that puppet can create and populate the users' home directories correctly
  • Ensure that we have adequate alerting and performance analysis available via dashboards
  • Perform some performance and resilience testing of the home directories
  • Mount /home on a single stat server - informing users that their existing homes are still available in /srv/home
  • Assess user experience and solicit feedback on roll-forward roll-back options
  • Assuming feedback is positive, continue to mount /home on the remaining stat servers and ml-lab1001
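
The mount and quota steps in the list above could be sketched roughly as follows. This is only an illustration, not the procedure that was used: the secret file path, the example user directory and the 100 GiB quota value are placeholders, and the mount relies on a ceph.conf on the client to locate the monitors. The script defaults to a dry run, so it only prints the commands. Note that setting CephFS quotas requires the 'p' flag in the client's MDS caps, which an `rw` authorization does not include.

```shell
#!/bin/sh
# Dry run by default: commands are printed rather than executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# "home", "ml_lab" and /home come from the task; everything else is a placeholder.
FS_NAME="home"
CLIENT_NAME="ml_lab"
MOUNT_POINT="/home"
SECRET_FILE="/etc/ceph/ceph.client.${CLIENT_NAME}.secret"   # hypothetical path
QUOTA_BYTES=$((100 * 1024 * 1024 * 1024))                    # placeholder: 100 GiB

mount_home() {
    # fs= selects a non-default file system; monitors are read from ceph.conf.
    run sudo mount -t ceph :/ "$MOUNT_POINT" \
        -o "name=${CLIENT_NAME},secretfile=${SECRET_FILE},fs=${FS_NAME}"
}

set_quota() {
    # CephFS quotas are extended attributes set on a directory.
    run sudo setfattr -n ceph.quota.max_bytes -v "$QUOTA_BYTES" "$1"
}

mount_home
set_quota "${MOUNT_POINT}/exampleuser"   # hypothetical user directory
```

Running with `DRY_RUN=0` would execute the commands instead of printing them.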

Event Timeline

bking changed the task status from Open to In Progress. Oct 31 2024, 3:54 PM
bking claimed this task.
bking triaged this task as Medium priority.
bking renamed this task from Jupyter and Analytics Client Enhancements Phase 1: enable shared home directories on the stat servers to Jupyter and Analytics Client Enhancements Phase 1: enable shared home directories on the stat servers umbrella task. Oct 31 2024, 8:07 PM
BTullis mentioned this in Unknown Object (Task). Nov 14 2024, 2:55 PM

As per T380279: Move Lab machines into analytics net for DL access and switch to homedirs on Ceph - we have decided to extend the scope of this from just the stat servers to the 'ml-lab' servers as well.

In addition, we will use ml-lab1002 as the first test server, where /home will be mounted from cephfs.

We need to wait until ml-lab1002 has been reimaged and moved to the analytics VLAN, but we can get on with making the file system and the cephx user in the meantime.

We have created a new file system for the dumps project, so there are some relevant guides here: T352650#10344155

Note that this will be our third Ceph file system. We have a total of five MDS services running at the moment (one per cephosd100* server).
With three file systems running, this means that we will have 3 active MDS daemons and 2 standby. This will be OK, but it is at the limit of what we can do, unless we start running more MDS daemons.
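
The MDS arithmetic above can be written down as a small sketch. It assumes each file system runs a single active MDS (max_mds=1), which matches the 3 active / 2 standby figures quoted here; the function name is made up for illustration.

```shell
#!/bin/sh
# standby_mds TOTAL_DAEMONS FILE_SYSTEMS [MAX_MDS_PER_FS]
# Prints how many MDS daemons are left over as standbys.
standby_mds() {
    total="$1"
    fs_count="$2"
    max_mds="${3:-1}"   # assumption: one active MDS per file system
    echo $(( total - fs_count * max_mds ))
}

standby_mds 5 3   # five daemons, three file systems: prints 2
standby_mds 5 4   # a hypothetical fourth file system: prints 1
```

On the cluster itself, `sudo ceph fs status` shows the actual active and standby daemons.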

I've created the cephs.home.meta and cephs.home.data pools as required by step 1 (ref https://wikimedia.slack.com/archives/C055QGPTC69/p1732203662967269 )
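
For reference, placing the metadata pool on SSD and the data pool on HDD and then creating the file system might look something like the sketch below. It is not a record of the commands that were run: it assumes both pools are replicated, that the CRUSH root is `default` with a `host` failure domain, and the rule names are hypothetical. It only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Dry run by default: commands are printed rather than executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

FS_NAME="home"
META_POOL="cephs.home.meta"   # pool names as given in the comment above
DATA_POOL="cephs.home.data"
SSD_RULE="replicated_ssd"     # hypothetical rule name
HDD_RULE="replicated_hdd"     # hypothetical rule name

create_rules() {
    # One replicated CRUSH rule per device class.
    run sudo ceph osd crush rule create-replicated "$SSD_RULE" default host ssd
    run sudo ceph osd crush rule create-replicated "$HDD_RULE" default host hdd
}

assign_rules() {
    # Metadata on SSD, data on HDD.
    run sudo ceph osd pool set "$META_POOL" crush_rule "$SSD_RULE"
    run sudo ceph osd pool set "$DATA_POOL" crush_rule "$HDD_RULE"
}

create_fs() {
    # Argument order is: file system name, metadata pool, data pool.
    run sudo ceph fs new "$FS_NAME" "$META_POOL" "$DATA_POOL"
}

create_rules
assign_rules
create_fs
```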

I have created two new cephx keys with:

btullis@cephosd1001:~$ sudo ceph fs authorize home client.ml_lab / rw
btullis@cephosd1001:~$ sudo ceph fs authorize home client.stat / rw

These keys are visible in the output from:

btullis@cephosd1001:~$ sudo ceph auth list

Each of these users has read/write access to the root directory.
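
To look at just these two clients rather than the full `ceph auth list` output, something like the following could be used. It is a minimal sketch: the helper name is made up, and it prints the commands by default rather than running them.

```shell
#!/bin/sh
# Dry run by default: commands are printed rather than executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

show_client() {
    # Prints the key and caps for a single cephx client.
    run sudo ceph auth get "client.$1"
}

for client in ml_lab stat; do
    show_client "$client"
done
```

With `DRY_RUN=0` on a host with admin access, each call should show the mds, mon and osd caps that `ceph fs authorize` generated for that client.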

I will follow up with a puppet patch containing the caps and keydata that were generated. Unfortunately, our current puppet mechanism is a bit awkward when it comes to getting the keydata into the private repo first.

Change #1104603 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Move some cephosd hieradata into profile default files

https://gerrit.wikimedia.org/r/1104603

Change #1104603 merged by Btullis:

[operations/puppet@production] Move some cephosd hieradata into profile default files

https://gerrit.wikimedia.org/r/1104603

Change #1104616 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] cephosd: enable the deployment of client cephx keys and minimal ceph.conf

https://gerrit.wikimedia.org/r/1104616

Change #1104620 had a related patch set uploaded (by Btullis; author: Btullis):

[labs/private@master] cephosd: move the auth keydata into profile default

https://gerrit.wikimedia.org/r/1104620

Change #1104620 merged by Btullis:

[labs/private@master] cephosd: move the auth keydata into profile default

https://gerrit.wikimedia.org/r/1104620

Change #1104621 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] ml-lab: Add a cephx key and minimal ceph.conf to ml-lab servers

https://gerrit.wikimedia.org/r/1104621

Change #1104616 merged by Btullis:

[operations/puppet@production] cephosd: enable the deployment of client cephx keys and minimal ceph.conf

https://gerrit.wikimedia.org/r/1104616

Change #1104621 merged by Btullis:

[operations/puppet@production] ml-lab: Add a cephx key and minimal ceph.conf to ml-lab servers

https://gerrit.wikimedia.org/r/1104621

Change #1104650 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] cephosd: Open the ceph daemon ports to the ANALYTICS_NETWORKS

https://gerrit.wikimedia.org/r/1104650

Change #1104650 merged by Btullis:

[operations/puppet@production] cephosd: Open the ceph daemon ports to the ANALYTICS_NETWORKS

https://gerrit.wikimedia.org/r/1104650

Gehel removed bking as the assignee of this task. Apr 1 2025, 3:46 PM
Gehel moved this task from Incoming to Misc on the Data-Platform-SRE board.