Users do not access Tools (exec or web) hosts directly; they must use a bastion host to control their jobs on the grid.
Historical reference on issues in Tools: https://etherpad.wikimedia.org/p/T100160
We currently have several bastions:
> tools-bastion-05 - general login and SGE interfacing
> tools-bastion-02 - general development, testing and SGE interfacing (Trusty)
> tools-precise-dev - general development, testing and SGE interfacing (Precise)
> tools-bastion-03 - new xlarge instance that yuvi created to relieve pressure on general login and SGE interfacing
> tools-bastion-10 - testing for bastion setup
> tools-bastion-11 - testing for bastion setup
> tools-bastion-01 - needs to be deleted post security investigation
Bad outcomes from our approach:
* tools-bastion-05 is often used for development work or resource-intensive jobs, which crowds out users who only need to interface with SGE. They cannot launch or monitor their jobs because another user is doing something intensive.
* tools-bastion-02 and tools-precise-dev are underutilized most of the time because they are poorly advertised, and are occasionally rendered unusable by very expensive user load.
* tools-bastion-03 isn't well known yet, and while the extra headroom is good, our usage patterns can still easily overwhelm it and render the extra capacity moot.
We have public static URLs for general SGE access and development:
tools-login.wmflabs.org - directs to tools-bastion-05
tools-dev.wmflabs.org - directs to tools-bastion-02
Resource contention exists across a wide array of concerns, but primarily:
* NFS capacity
* NFS usage (FD allocation etc)
* Local storage capacity
* Local storage usage (FD allocation etc)
This contention for resources exists in two capacities: users who are acting as themselves, and users who are acting on behalf of a tool.
Because of how NFS is mounted, there is no separation in either logging or quota allotment between user home directories and tool data directories (`/project/tools/home` and `/project/tools/project`). So if I want to break down where our NFS usage currently goes, it is not easy, because both are part of the same export.
The reason /home and /data/project are the same isn't happenstance; the underlying mechanism is the same. This raises the other problem that comes from this: user home data has no separation at the storage layer from operational data for tools. If a user were to create a large enough file it would crash every running tool, and if a tool were to create a large enough file it would cause issues for users.
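To illustrate, a single hypothetical export covering both trees would look something like the line below (the path, client pattern, and options are invented for illustration and are not our actual configuration):
```
# /etc/exports (hypothetical): one export serves both trees, so quota and
# accounting cannot distinguish /project/tools/home from /project/tools/project.
/srv/tools  *.eqiad.wmflabs(rw,sync,no_subtree_check)
```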
Mechanisms for ensuring resource allocation stays sane (a limits.conf sketch follows this list):
* limits.conf (pam_limits.so)
* restricted shell (rbash or lshell)
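As a minimal sketch of the limits.conf approach (all domains and values below are illustrative placeholders, not vetted production settings):
```
# /etc/security/limits.conf — format: <domain> <type> <item> <value>
# Cap file descriptors and process counts for everyone by default.
*          hard    nofile    2048
*          hard    nproc     256
# Cap per-process address space (value is in KB, so this is ~4 GB).
*          hard    as        4194304
# Hypothetical operator group exempted with a higher FD ceiling.
@ops       hard    nofile    65536
```
For the restricted shell, rbash is the simplest option (set the login shell to /bin/rbash and pin PATH to a directory of approved commands); lshell would additionally allow whitelisting specific commands such as qstat, qsub, and qdel.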
We have been using tc for a few months now to ensure that a single host cannot crash our entire NFS setup (as had been happening for a long time). This moves the issue closer to its source (the host in question) and helps prevent the cascading catastrophic failures we have often seen up to this point. It does mean tools on a host share a common quota, but that is already true of the other finite resources on the `Resource contention` list. At the moment, host-level allotments are our smallest granularity of resource pooling. Hopefully this becomes more sane with k8s. As part of using tc to prevent NFS reads from overwhelming the server, we use the [[ http://www.linuxfoundation.org/collaborate/workgroups/networking/ifb | IFB ]] kernel module to redirect inbound traffic for shaping. Because NFS connections are long lived and respond well to shaping, this is a viable approach. This is how we do bidirectional restriction for NFS traffic currently.
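As a minimal sketch of the IFB redirection described above (interface names, rates, and class layout are assumptions for illustration, not our production values):
```
# Load the IFB module and bring up a virtual device for ingress shaping.
modprobe ifb numifbs=1
ip link set dev ifb0 up

# Redirect all ingress traffic on eth0 to ifb0, where normal egress
# qdiscs can be applied to what is really inbound (NFS read) traffic.
tc qdisc add dev eth0 handle ffff: ingress
tc filter add dev eth0 parent ffff: protocol ip u32 match u32 0 0 \
  action mirred egress redirect dev ifb0

# Shape the redirected (inbound) traffic on ifb0.
tc qdisc add dev ifb0 root handle 1: htb default 10
tc class add dev ifb0 parent 1: classid 1:10 htb rate 80mbit ceil 100mbit

# Shape outbound (NFS write) traffic directly on eth0 for the other direction.
tc qdisc add dev eth0 root handle 2: htb default 10
tc class add dev eth0 parent 2: classid 2:10 htb rate 80mbit ceil 100mbit
```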
Going forward, I am proposing:
* A bastion solely for the purpose of SGE interaction for users. This grants users a restricted shell, both as themselves and as their tool users, allowing inspection of and interaction with the SGE grid. This is achieved through per-process memory restriction (cgroups), weighted CPU scheduling fairness (cgroups), user limits such as ulimit and FD allocation limits (limits.conf), NFS quota allocation and capping (tc and cgroups), and resource usage tracking via cgroups, iptables, and tc (see the cgroups sketch after this list).
* A bastion solely for dev work that is Trusty (tools-dev.wmflabs.org and tools-dev-trusty.wmflabs.org). This host has similar tc-enforced NFS allocation and CPU scheduling fairness, but much higher limits for any resource restrictions and //no restricted shell//.
* A bastion solely for dev work that is Precise (tools-dev-precise.wmflabs.org). This host has similar tc-enforced NFS allocation and CPU scheduling fairness, but much higher limits for any resource restrictions and //no restricted shell//.
** It's possible we should use a group of limited bastions behind an haproxy host that allocates new users to the most load-appropriate host, but I believe at our current levels of usage (usually around 20-30 concurrent users) we should be able to get by with a single xlarge instance.
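As a rough sketch of the cgroups piece on Trusty (cgroups v1; the group name and values are illustrative, and a real deployment would place sessions automatically via a PAM hook or cgrulesengd rather than by hand):
```
# Create a memory cgroup for bastion users and cap it at ~2 GB.
mkdir -p /sys/fs/cgroup/memory/bastion-users
echo $((2 * 1024 * 1024 * 1024)) > /sys/fs/cgroup/memory/bastion-users/memory.limit_in_bytes

# Create a cpu cgroup with a low share for weighted scheduling fairness
# (256 vs. the default 1024 means roughly quarter weight under contention).
mkdir -p /sys/fs/cgroup/cpu/bastion-users
echo 256 > /sys/fs/cgroup/cpu/bastion-users/cpu.shares

# Move the current shell (and its future children) into both groups.
echo $$ > /sys/fs/cgroup/memory/bastion-users/tasks
echo $$ > /sys/fs/cgroup/cpu/bastion-users/tasks
```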
The overarching idea is that users need to accomplish resource-intensive tasks, but this should be contained in such a way that other users can still perform the functions we consider necessities for managing their tools.
Considerations that affect whether our mechanisms for ensuring sane resource allocation can function, listed by method (to be made into links to a comment on where things stand for each). We are limited to Trusty or Precise here, as those are the releases with packages for SGE. Trusty is the main bastion; in the future, when we move off of SGE, Debian (Jessie) will take this role. This has some implications because systemd is not a first-class citizen on Trusty, and the cgroup integration is a little haphazard (I think).
* [[ https://phabricator.wikimedia.org/T131541#2170625 | cgroups ]]
* [[ https://phabricator.wikimedia.org/T131541#2170660 | limits.conf ]] (pam_limits.so)
* [[ https://phabricator.wikimedia.org/T131541#2170762 | restricted shell ]] (rbash or lshell)
* [[ https://phabricator.wikimedia.org/T131541#2170836 | tc ]]
* (user-space oriented commands) trickle, cpulimit, etc. (a usage sketch follows this list)
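For completeness, the user-space tools look roughly like this (the rates, percentage, URL, and PID are placeholders):
```
# trickle: run a command with its bandwidth capped (KB/s down / up).
trickle -d 1024 -u 256 wget https://example.com/dump.gz

# cpulimit: throttle an already-running process (by PID) to ~50% of one CPU.
cpulimit -l 50 -p 12345
```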