
Consider moving PAWS to its own Cloud VPS project, rather than using instances inside Toolforge
Open, Normal, Public

Description

I'm considering the possibility of setting up a separate project for PAWS, with its own k8s cluster.

The tools k8s cluster is custom and specialized for making a specific use case easy. It comes with lots of custom restrictions and whatnot that make the lives of the tools admins more of a hell than it already is. Such restrictions include uid restrictions, lack of root, restrictions on where docker images can be pulled from, which hostpaths can be mounted, which k8s objects can be created, etc. These customizations + the requirement to not break everyone also mean the k8s version lags behind a bit, and upgrading isn't super easy.

For PAWS, we could possibly switch to just setting up our own project, and setting up a k8s cluster there. The advantages of this would be:

  1. Easy to set up and upgrade, since there's only one user of the cluster (PAWS) and we can use standard upstream k8s installation methods (kubeadm, specifically)
  2. We can just easily reuse https://z2jh.jupyter.org, which is now an upstream project that has advanced far beyond PAWS (although it was originally based on PAWS, we never upgraded PAWS because it required new kubernetes version features + other blockers caused by the tools cluster's customizations). This gets PAWS users new features quickly
  3. PAWS going crazy does not affect other tools users, and vice versa.
  4. It becomes a lot less maintenance for me, since it much more closely resembles 'just another k8s cluster'.
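For context, the "standard upstream installation" in point 1 could look roughly like this (a sketch only; versions, flags, and the config file name are illustrative, not what was actually run):

```
# On the master: bootstrap a control plane with kubeadm
sudo kubeadm init --pod-network-cidr=192.168.0.0/16

# Configure kubectl for the admin user
mkdir -p $HOME/.kube
sudo cp /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

# On each worker: join using the token printed by `kubeadm init`
# (token and hash below are placeholders)
sudo kubeadm join <master-ip>:6443 --token <token> \
    --discovery-token-ca-cert-hash sha256:<hash>

# Then deploy JupyterHub from the z2jh helm chart (helm 2 era syntax)
helm repo add jupyterhub https://jupyterhub.github.io/helm-chart/
helm repo update
helm install jupyterhub/jupyterhub --name paws --namespace paws -f config.yaml
```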

The disadvantages would be:

  1. Less efficient resource usage. I'm not entirely sure about this, since we're on OpenStack anyway.
  2. A new project would need to also have NFS. I am actually open to running my own NFS setup with a couple of VMs (FAMOUS LAST WORDS?!) if this is a big problem.
  3. Less pooled resources put into the tools k8s cluster, since I'd be working mostly on this other cluster. Compare with #4 above tho.
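For point 2 above, the "couple of VMs" NFS setup could be as simple as the following sketch (hostname, export path, and CIDR are all hypothetical):

```
# On the NFS server VM (Debian-family): install the server and export a share
sudo apt-get install nfs-kernel-server
echo '/srv/paws 172.16.0.0/21(rw,sync,no_subtree_check)' | sudo tee -a /etc/exports
sudo exportfs -ra

# On each k8s node: mount the share where PAWS expects user storage
sudo apt-get install nfs-common
sudo mount -t nfs nfs-server.paws.eqiad.wmflabs:/srv/paws /data/project
```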

Event Timeline

Restricted Application added a project: Cloud-Services. Jun 6 2017, 2:20 AM
yuvipanda renamed this task from Consider moving PAWS to its own k8s cluster, rather than re-using Tools to Consider moving PAWS to its own k8s cluster, rather than using Tools' k8s cluster. Jun 6 2017, 6:18 AM

I would also like to use T167084 for this cluster, so knowing if that'll work with Neutron will be useful.

Note that as is with non-tools labs projects, I'll be responsible for this k8s cluster - not the labs admin team :) I'm sure they'll help when they can, ofc.

bd808 added a subscriber: bd808. Jun 6 2017, 5:29 PM

The concept of keeping PAWS in sync with upstream practices is sound. I want to add support of PAWS to the charter of the cloud-services-team as soon as we have the capacity to actually do it well. Even when that happens the product will be better served by matching standard practices in the JupyterHub community.

Can we get some estimates of the initial quota this project would need?

Andrew added a subscriber: Andrew. Jun 6 2017, 5:32 PM

I don't have a strong opinion about this. Running on tools = eating our own dogfood, which is often useful... but having a more canonical k8s cluster to bang on seems also useful.

And of course the idea of this being Yuvi's problem and not ours is nice, but it also supposes eternal Yuvi availability, which would be quite a commitment.

@Andrew right. However, it's already the case tho - PAWS right now is still mostly reliant on me, for mostly resourcing reasons. The way I'd do this is to make it quite easy for people to just follow kubeadm upstream tutorials on setting up on labs (maybe even make it into a wikitech page) so other people who might want to use it can. I also believe that kubeadm is the correct long term solution for both tools and prod, so more people playing with that doesn't sound bad...

@bd808 re: quota, I'd think if we move this we can kill some of the tools k8s workers. There are currently 185 pods running in PAWS, and 235 pods running on 'everything else in tools'. tools currently has 27 worker nodes (2 nodes were DOA when we made them when I was still there, we should delete those!). I'd think we can reduce tools' node count by like... 10 nodes and have those on the PAWS cluster? How does that sound? We'll ofc begin with a smaller quota while I make sure I can get a kubeadm-set-up cluster going quickly, and then when we're ready to flip the switch add more nodes.
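As a back-of-envelope check on those quota numbers (a sketch using only the pod and node counts quoted in this comment):

```python
# Pod counts quoted above
paws_pods = 185
other_pods = 235
tools_nodes = 27  # current worker count, before deleting the 2 DOA nodes

# Average pods per node across the tools cluster today
pods_per_node = (paws_pods + other_pods) / tools_nodes

# Nodes' worth of capacity that PAWS currently consumes
paws_node_equivalent = paws_pods / pods_per_node

print(round(pods_per_node, 1), round(paws_node_equivalent, 1))  # prints: 15.6 11.9
```

So PAWS accounts for roughly 12 nodes' worth of pods, which makes moving ~10 nodes a plausible starting point.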

@madhuvishy if this happens, I'll also need to transfer the entire contents of the paws tool dir on tools NFS to this new project's NFS. And you (I think?) need to be ok with the paws project getting NFS enabled :)

@yuvipanda Yeah that's fine by me. PAWS currently uses 51GB in the tools share, and we've talked about moving it to its own share previously. I'm leaning towards doing that, and having the share be available for the paws project instead of tools - which would be ideal. If we decide to use the misc labs NFS storage instead, we have about 3T free there and I don't foresee a space constraint, at least in the near future.

@madhuvishy +1, I'd love it to be a separate share!

It looks like everyone's onboard with this plan, so I'll start poking at it in a week or so.

Small thought to which I'm not terribly attached but it is worth mentioning:

Tools as an ecosystem is more than a particular k8s cluster; it involves aptly, monitoring, orchestration, awareness and security particulars, etc.: a whole list of managed environment things, like noticing when it's broken (hopefully). Is it worth setting up this new cluster within the Tools project and just having it be detached, control-plane wise, from the general use Tools k8s cluster? I've been thinking about this all morning and it may be a viable path.

One of the things I'd like to do is to use an nginx ingress directly for getting traffic into the cluster, instead of using the tools-proxy machinery. My plan is to not use puppet at all and try a coreos-type setup - you set up an image once with a cloud-init type thing, and then there are no changes to it ever (except base puppet in our case, which won't have any roles related to k8s). You just make new instances for upgrades, and run everything in containers.
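A sketch of what that immutable, cloud-init-driven node setup could look like (package names and the join command are illustrative placeholders, not a tested config):

```
#cloud-config
# Runs once at first boot; the instance is never reconfigured afterwards.
# Upgrades happen by replacing the instance, not mutating it.
package_update: true
packages:
  - docker.io
  - kubelet
  - kubeadm
runcmd:
  # Join the cluster; token/hash are placeholders injected at instance creation
  - kubeadm join paws-master-01:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>
```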

I don't mind doing this inside tools, but I very much don't want to use any of the tools infrastructure we have :D I would just like it to be a bog standard kubeadm based k8s cluster, and integrating with any of the current tools setups makes that hard. If you'd still like me to put this in tools I am happy to, but because it should be totally separate I think it might be conceptually easier to put it in the paws project.

I was thinking a separate nginx ingress entirely, and ignoring tools-proxy-xx here for sure. This is a consideration of Tools the environment being bigger than any one implementation inside of it, etc. If you don't want to go with Puppet then it's a non-issue. We aren't ready to be that cool atm. I was thinking more detached than it came across, but if you go far enough adrift, what's the point, of course :) It's your call. Maybe at some point it will be a PITA either way, and changing it up will make sense. I don't feel too strongly about it as it's outlined.

On chatting more, if I have to use puppet then using the tools puppetmaster will make my life easier. So I'm going to prototype this on the paws project and move it to tools if using the puppetmaster will make my life easier in any way.

https://phabricator.wikimedia.org/T168039 for an IP quota increase :)

I now have a fully functional (including ingress and automatic let's encrypt) kubernetes cluster on the paws project. Master and Ingress node (that'll get the public IP) is paws-master-01, and paws-node-1001 is a worker node.
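For reference, an ingress with automatic Let's Encrypt certificates in a setup like this typically looks something like the sketch below (the annotations follow the kube-lego-era conventions of the time; the hostname is hypothetical, and `proxy-public` is the z2jh chart's proxy service name):

```
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: jupyterhub
  annotations:
    kubernetes.io/ingress.class: "nginx"
    kubernetes.io/tls-acme: "true"   # asks the ACME controller to provision a cert
spec:
  tls:
    - hosts:
        - paws.example.org   # hypothetical hostname
      secretName: jupyterhub-tls
  rules:
    - host: paws.example.org
      http:
        paths:
          - path: /
            backend:
              serviceName: proxy-public
              servicePort: 80
```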

I've a working cluster on the PAWS project now! https://paws.deis.youarenotevena.wiki/hub/login :D

I think next steps are:

  1. Figure out if I wanna keep this in the PAWS project or do this inside tools. I'll think about this for a while. We can probably re-use the clush setup from inside tools, so we could keep this in tools. Just need to figure out if there'll be any downsides.
  2. Add proper MediaWiki auth support to the jupyterhub helm chart
  3. ???

Going to keep it inside tools!

Change 360397 had a related patch set uploaded (by Yuvipanda; owner: Yuvipanda):
[operations/puppet@production] tools: Add paws nodes to clush

https://gerrit.wikimedia.org/r/360397

bd808 edited projects, added Kubernetes; removed Toolforge. Jul 28 2017, 11:01 PM
yuvipanda closed this task as Resolved. Aug 1 2017, 7:04 AM
yuvipanda claimed this task.

Yup, new cluster!

Change 360397 merged by Dzahn:
[operations/puppet@production] tools: Add paws nodes to clush

https://gerrit.wikimedia.org/r/360397

Chicocvenancio changed the status of subtask T160113: Move PAWS nfs onto its own share from Open to Stalled. Feb 25 2018, 9:23 PM
GTirloni changed the status of subtask T160113: Move PAWS nfs onto its own share from Stalled to Open. Mar 23 2019, 9:41 PM
GTirloni reopened this task as Open. Mar 25 2019, 7:27 PM
GTirloni triaged this task as Normal priority.
GTirloni moved this task from Inbox to Needs discussion on the cloud-services-team (Kanban) board.

Re-opening since the work that was being carried out to enable this (in T211096) needs discussion.


As I understand it, the topic to be discussed is the reversal of this decision from late 2017:

Going to keep it inside tools!

More specifically, there are pros and cons to project isolation. Sadly discussion of those pros/cons in 2017 was not captured here for easy review. But hey no problem, we just need to have that discussion from first principles. Here are some things to get that discussion started:

Pros

  • Follows model of Quarry/Toolforge separation
  • Better resource isolation between Toolforge and PAWS (for example NFS share separation)
  • Security realm isolation between Toolforge and PAWS which could allow "adminship" of PAWS without also requiring full-root in Toolforge
  • Current entanglement of PAWS and Toolforge is confusing to Toolforge admins

Cons

  • Lack of clearly defined monitoring/support process in isolation from Toolforge
  • Potentially signals to end-user community that Cloud Services is "giving up" on PAWS
bd808 renamed this task from Consider moving PAWS to its own k8s cluster, rather than using Tools' k8s cluster to Consider moving PAWS to its own Cloud VPS project, rather than using instances inside Toolforge. Mar 25 2019, 11:43 PM
bd808 removed a project: Tools-Kubernetes.
bd808 removed a subscriber: madhuvishy.
GTirloni removed yuvipanda as the assignee of this task. Mar 26 2019, 4:43 PM

Discussed at 2019-03-26 WMCS team meeting. Next step decision was for @GTirloni to discuss pros/cons with @Chicocvenancio and report back to the WMCS group.

I discussed a bit with @GTirloni on Tuesday. In a nutshell, I approve the move and think it will be a net improvement for a few reasons, even though I do think allowing PAWS to use Toolforge k8s is a worthwhile goal for the future (and this moves us further away from that).

In my view NFS separation is a big plus; I do think Toolforge NFS is something that should not be as exposed as PAWS storage is. This could be done with a separate share (with potentially different limits and the possibility to unmount it in an emergency) on the same project, but a different project achieves the same result.

One thing that worries me a bit is dumps access in a sane way, but I think we can figure it out.

bd808 added a comment. Mar 28 2019, 1:41 AM

One thing that worries me a bit is dumps access in a sane way, but I think we can figure it out.

We can expose dumps to the PAWS project in the same way that it is exposed to Toolforge and other Cloud VPS projects. There is nothing special about how Toolforge mounts the NFS share from the dumps NFS server.

One thing that worries me a bit is dumps access in a sane way, but I think we can figure it out.

We can expose dumps to the PAWS project in the same way that it is exposed to Toolforge and other Cloud VPS projects. There is nothing special about how Toolforge mounts the NFS share from the dumps NFS server.

How would they be exposed exactly? The same as Toolforge (two NFS shares mounted, with one of them pointed to by a symbolic link)? I.e., will the hack defined in T192214 still work?

Bstorm added a subscriber: Bstorm. Apr 1 2019, 8:06 PM

How would they be exposed exactly? The same as Toolforge (two NFS shares mounted, with one of them pointed to by a symbolic link)? I.e., will the hack defined in T192214 still work?

That's how we expose dumps across VPS projects. It would have the exact same symlinks. Dumps is a very popular mount point.
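The symlink arrangement described here looks roughly like this on a client instance (server name and paths below are illustrative, not the actual production values):

```
# Mount the dumps share read-only from the dumps NFS server
sudo mount -t nfs -o ro dumps-server.example:/dumps /mnt/nfs/dumps

# Provide the stable path that tools/notebooks expect via a symlink
sudo ln -s /mnt/nfs/dumps /public/dumps
ls -l /public/dumps   # resolves through the symlink to the NFS mount
```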

Cool. Seems like a net positive to move.

GTirloni removed a subscriber: GTirloni. Apr 3 2019, 10:20 AM

@Bstorm do we have a timeline on the NFS side of things? I would want to have an updated version of PAWS (hopefully capacity tested) before the Wikimedia-Hackathon-2019 next month. If it is not viable to do this move before then I would do the upgrades in place (though I'll then need root time to update k8s in tools).

The NFS bit is partly about moving to a new project (as it is staged now), which is blocked by a still-open cert-manager bug and a lot of puppetization work. So even if we complete the NFS part, it won't show up in PAWS for a while anyway. The NFS piece is the easy bit. At this time, I don't think there is any way I'll have time to do the rest of it by then. Anything can happen, but I don't think we should plan on that with a hard timeline.

It'd be easier for me to help upgrade in place, which will also keep reminding me to check the bug and squeeze in work on getting the project set up right. Let me know what you need, make me tickets and bump me on IRC! I'll try to get you set for the hackathon.

That said, I can get the NFS up soon within the project (maybe even next week, since I am off tomorrow).