Tools bastions are often unreliable
Closed, Resolved (Public)

Description

Users do not access tools (exec or web) hosts directly, and must use a bastion host to control their jobs on the grid.

Historical reference on issues in Tools: https://etherpad.wikimedia.org/p/T100160

We currently have several bastions:

tools-bastion-05 - general login and SGE interfacing
tools-bastion-02 - general development, testing and SGE interfacing (Trusty)
tools-precise-dev - general development, testing and SGE interfacing (Precise)

tools-bastion-03 - new xlarge instance yuvi created to alleviate load for general login and SGE interfacing

tools-bastion-10 - testing for bastion setup
tools-bastion-11 - testing for bastion setup

tools-bastion-01 - needs to be deleted post security investigation


Bad outcomes from our approach:

  • tools-bastion-05 is often used for development work or resource intensive jobs, which blocks users who only need to interface with SGE. They cannot launch or monitor their jobs because another user is doing something intensive.
  • tools-bastion-02 and tools-precise-dev are underutilized most of the time as they are poorly advertised, and they are occasionally unusable because of really expensive user load.
  • tools-bastion-03 isn't well known yet, and while the extra headroom is good, it's still easy for our use patterns to overwhelm it and render that moot.

We have public static urls for accessing general SGE and development:

tools-login.wmflabs.org - directs to tools-bastion-03
tools-dev.wmflabs.org - directs to tools-bastion-02

Resource contention exists for a large array of concerns but primarily:

  • NFS capacity
  • NFS usage (FD allocation etc)
  • CPU
  • Memory
  • Local storage capacity
  • Local storage usage (FD allocation etc)

This contention for resources exists in two capacities: users who are acting as themselves, and users who are acting in the stead of a tool.

Because of how NFS is mounted there is no separation in either logging or quota allotment between user home directories and tool data directories (/project/tools/project and /project/tools/home). So if I want to break down where our NFS usage currently goes, it is not easy, as these are part of the same export.

See: http://graphite.wmflabs.org/render/?width=1104&height=515&_salt=1459535539.979&target=tools.tools-bastion-05.nfsiostat.mounts.*.write.kilobytes&from=-7d&graphType=pie

The reason /home and /data/project are the same isn't happenstance; the underlying mechanism is the same. This raises the other problem: user home data has no separation at the storage layer from operational data for tools. If a user were to create a large enough file it would crash every running tool, and if a tool were to create a large enough file it would cause issues for users.

Mechanisms for ensuring resource allocation is sane:

  • cgroups
  • limits.conf (pam_limits.so)
  • restricted shell (rbash or lshell)
  • tc
  • iptables

We have been using tc for a few months now to ensure that a single host cannot crash our entire NFS setup (as has been the case for a long time). This moves the issue closer to the source (the host in question) and helps prevent the cascading catastrophic failures we have seen often to this point. This does mean tools on a host share a common quota, but this is already true of the other finite resources on the resource contention list. At the moment host level allotments are our smallest granularity of resource pooling. Hopefully, this becomes more sane with k8s. As part of using tc to prevent NFS reads from overwhelming the server, we use the IFB kernel module to redirect inbound traffic for shaping. Because NFS connections are long lived and respond well to shaping, it's a viable approach. This is how we do bidirectional restriction for NFS traffic currently.
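
As a rough illustration of the IFB redirect described above, the following Python sketch only builds the command strings and does not run anything. The interface names (eth0, ifb0), the rate, and the server address 10.0.0.10 are placeholders chosen for the example, not the values used on Tools hosts.

```python
def ifb_ingress_commands(dev="eth0", ifb="ifb0", nfs_ip="10.0.0.10", rate="100mbit"):
    """Return shell commands that shape inbound traffic from nfs_ip via IFB.

    Every default here is a placeholder for illustration only.
    """
    return [
        # Load the IFB module and bring the pseudo-interface up.
        "modprobe ifb numifbs=1",
        f"ip link set dev {ifb} up",
        # Attach an ingress qdisc to the real interface.
        f"tc qdisc add dev {dev} handle ffff: ingress",
        # Redirect packets arriving from the NFS server to the IFB device.
        f"tc filter add dev {dev} parent ffff: protocol ip u32 "
        f"match ip src {nfs_ip}/32 flowid 1:1 "
        f"action mirred egress redirect dev {ifb}",
        # On the IFB device the traffic is egress, so a normal HTB class can shape it.
        f"tc qdisc add dev {ifb} root handle 1: htb default 10",
        f"tc class add dev {ifb} parent 1: classid 1:10 htb rate {rate}",
    ]
```

Printing the returned list gives a script that could be reviewed by hand before being applied as root on a test host.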


Going forward I am proposing:

  • A bastion solely for the purpose of SGE interaction for users. This grants users a restricted shell, both as themselves and as their tool users, allowing inspection of and interaction with the SGE grid. This is achieved through memory restriction per process (cgroups), weighted CPU scheduling fairness (cgroups), user limits like ulimit and FD allocation limits (limits.conf), NFS quota allocation and capping (tc and cgroups), and resource usage tracking via cgroups, iptables, and tc.
  • A bastion solely for dev work that is Trusty (tools-dev.wmflabs.org and tools-dev-trusty.wmflabs.org). This host has similar tc-enforced NFS allocation and CPU scheduling fairness, but much higher limits for any resource restrictions and no restricted shell.
  • A bastion solely for dev work that is Precise (tools-dev-precise.wmflabs.org). This host has similar tc-enforced NFS allocation and CPU scheduling fairness, but much higher limits for any resource restrictions and no restricted shell.
  • It's possible we should use a group of limited bastions behind an haproxy host that allocates new users to the most load-appropriate host, but I believe at our current levels of usage (approx 20 - 30 concurrent users usually) we should be able to get by with a single xlarge instance.

The overarching idea is that users need to accomplish resource intensive tasks, but this should be contained in such a way that other users can still perform functions we consider necessities regarding management of their tools.


Considerations that affect how our mechanisms for ensuring sane resource allocation can function, by method (to be made links to a comment on where things stand for each). We are limited to Trusty or Precise here, as those have packages for SGE. Trusty is the main bastion; in the future, when we move off of SGE, Debian (Jessie) will take this role. This has some implications because systemd is not a first class citizen on Trusty, and the cgroup integration is a little haphazard (I think).

http://www.linuxfoundation.org/collaborate/workgroups/networking/ifb

chasemp created this task. Apr 1 2016, 6:59 PM
Restricted Application added a subscriber: Aklapper. Apr 1 2016, 6:59 PM
chasemp triaged this task as Normal priority. Apr 1 2016, 6:59 PM
tom29739 added a subscriber: tom29739.
chasemp updated the task description. Apr 1 2016, 9:04 PM
chasemp updated the task description. Apr 1 2016, 9:06 PM
tom29739 updated the task description. Apr 1 2016, 9:13 PM
Luke081515 added a subscriber: Luke081515.
chasemp added a comment. Apr 1 2016, 9:49 PM

-- cgroups (trusty) --

This is primarily managed with the cgroup-bin package that includes cgroup-lite. cgroup-lite is a shim of an upstart service that (without being configurable) associates cgroup subsystems with /sys/fs/cgroup/<subsystem>. cgroup-bin is a collection of utilities for interacting with and controlling cgroups.

cgconfig.conf can be configured for managing cgroup subsystem mount association (which at a basic level cgroup-lite is already doing)

cgrules.conf can be configured with "rules" that associate processes with a particular cgroup. These rules can use variables like %u for user, but this is not dynamic. A rule that references a non-existent cgroup does not work, so it is difficult to use this per user: some other method is required to create the appropriate groups before the rules are applied.

libpam-cgroup can parse cgrules and apply groups at login time for users (but since cgrules is not dynamic it leaves a hole where a real flexible /$group/$user association process would exist)
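
To make the "groups must exist before rules apply" point concrete, here is a minimal Python sketch that renders cgrules.conf lines and reports which destination cgroups are still missing. The rule entries (the @tools-users group and the shell and throttle destinations) are hypothetical examples, and the /sys/fs/cgroup/<subsystem> layout is the one cgroup-lite sets up as described above.

```python
import os

# (user or @group, controllers, destination cgroup): illustrative values only.
RULES = [
    ("@tools-users", ("cpu", "memory"), "shell"),
    ("*", ("cpu", "memory"), "throttle"),
]


def render_cgrules(rules):
    """Render rules in the cgrules.conf column order: who, controllers, destination."""
    lines = [f"{who}\t{','.join(ctrls)}\t{dest}" for who, ctrls, dest in rules]
    return "\n".join(lines) + "\n"


def missing_groups(rules, root="/sys/fs/cgroup"):
    """List cgroup directories the rules refer to that do not exist yet."""
    missing = []
    for _who, ctrls, dest in rules:
        for controller in ctrls:
            path = os.path.join(root, controller, dest)
            if not os.path.isdir(path):
                missing.append(path)
    return missing
```

Anything returned by missing_groups() would have to be created (for example via cgconfig.conf) before the corresponding rule can take effect.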

Even though Trusty is not booting with systemd there is shim integration and systemd is tightly coupled with cgroups. So libpam-systemd comes with the system and does maintain a hierarchy of per user and per session cgroups within the systemd subsystem. These groups are either immutable or not related to any of the other subsystems by default.

/sys/fs/cgroup/systemd/user/4610.user/5.session/

Part of this shim integration is logind (/etc/init/systemd-logind.conf) that creates this integration dynamically. So for instance on a Trusty system we can query logind for active sessions:

root@tools-bastion-10:~# loginctl list-sessions
  SESSION        UID USER             SEAT
        2          0 root
        5       4610 rush
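
For scripting against this, a small Python sketch can parse the listing and map each session to the systemd cgroup path pattern shown earlier. The function names are made up for the example, and the sample text is the output above.

```python
def parse_sessions(output):
    """Parse `loginctl list-sessions` text into a list of dicts."""
    sessions = []
    for line in output.splitlines():
        fields = line.split()
        # Skip the header, blank lines, and any trailing summary line.
        if len(fields) < 3 or fields[0] == "SESSION" or not fields[1].isdigit():
            continue
        sessions.append({"session": fields[0], "uid": int(fields[1]), "user": fields[2]})
    return sessions


def session_cgroup(uid, session, root="/sys/fs/cgroup/systemd/user"):
    """Build the per-session path following the <uid>.user/<n>.session pattern."""
    return f"{root}/{uid}.user/{session}.session"


SAMPLE = """\
  SESSION        UID USER             SEAT
        2          0 root
        5       4610 rush
"""
```

On a live host the text would come from running loginctl itself, for instance through subprocess.check_output.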

logind is configurable here: /etc/systemd/logind.conf

One of the options is to manage the association for more than the native systemd controller. If I set:

Controllers=blkio cpu cpuacct cpuset devices freezer hugetlb memory perf_event net_cls net_prio

Now I get a hierarchy of /users/id/session across subsystems in real time; however, this does not clean up after itself well. If I log in as a user and then exit that session the scaffolding cgroup persists, and it all gets really messy with a use case of many random users occupying multiple unique sessions over short periods of time.
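
One way to see how much scaffolding is left behind is to look for leaf cgroups with no attached tasks. The Python sketch below assumes the cgroup v1 layout, where every group directory has a tasks file; build_demo_tree() only fabricates a throwaway directory tree so the function can be tried without touching /sys/fs/cgroup.

```python
import os
import tempfile


def empty_cgroups(root):
    """Return leaf cgroup directories under root whose tasks file is empty."""
    leftovers = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Only leaf groups can be removed, and only groups expose a tasks file.
        if dirnames or "tasks" not in filenames:
            continue
        with open(os.path.join(dirpath, "tasks")) as handle:
            if not handle.read().strip():
                leftovers.append(dirpath)
    return sorted(leftovers)


def build_demo_tree():
    """Create a temporary tree that mimics <uid>.user/<n>.session directories."""
    root = tempfile.mkdtemp()
    demo = (("4610.user/5.session", "1234\n"), ("4610.user/6.session", ""))
    for relative, pids in demo:
        path = os.path.join(root, relative)
        os.makedirs(path)
        with open(os.path.join(path, "tasks"), "w") as handle:
            handle.write(pids)
    return root
```

Pointed at a real subsystem mount, the result is a list of candidates for cleanup; whether removing them behind logind's back is safe is a separate question.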

So grouping users, and groups of users, dynamically is somewhat complicated. Even though we can piggyback on systemd for per user/session cgroup allocations, we still need to somehow set appropriate parameters inside of them, and that mechanism does not exist yet. There is also some general wisdom that tinkering from the outside within cgroups systemd has allocated for its own purposes is prone to complication and failure. The language I have seen is that in the future systemd will be the one true arbiter for cgroups, and so making changes without its knowledge within groups it manages is bad practice.

In general with cgroups I want to achieve:

Limited shell bastions:

  • Assign users acting as themselves to a lower share cpu grouping (this is relative to other cgroup share allocations). Users should be acting as their tools whenever possible.
  • Assign memory limits to both users and tools that are relatively low but facilitate all direct SGE interactions
  • Allow audit via cpuacct and memory group

Dev bastions:

  • Assign all users and tools to the same cpu.shares value
  • Assign all users and tools a reasonable but contained memory allocation
  • Create cgroups with a lesser cpu and memory to allow moving long running and intensive operations to them as triage in case of issue. This means the jobs run but they now get less priority.
  • Allow audit via cpuacct and memory group
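
Since CPU shares are relative, the effect of a "lower share" grouping is easiest to see as arithmetic. This small Python sketch computes the fraction of CPU time each sibling group would get when all of them are busy; the group names and share values are invented for the example.

```python
def cpu_fractions(shares):
    """Map each cgroup name to its fraction of CPU time under full contention."""
    total = sum(shares.values())
    return {name: value / total for name, value in shares.items()}


# Hypothetical values: tools at the default weight, users at a quarter of it.
EXAMPLE_SHARES = {"tools": 1024, "users": 256}
```

With these values tools would get four fifths of the CPU and users one fifth, but only while both groups are competing; an idle group's share is available to the others.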

Things we cannot do here:

  • We cannot classify return traffic for NFS policing (return traffic, i.e. reads, has no classification), and reads are as abusive as writes, or more so, in our setup, given the multi-gig text files users often try to open
  • We cannot easily associate moving targets like users and sessions dynamically (there is a binary, cgrulesengd, that can do this in real time, but it has some issues and still does not manage the existence of dynamically referenced groups)
  • We cannot use controls meant for kernels compiled with the real-time scheduler for CPU limiting: https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu.html

chasemp updated the task description. Apr 1 2016, 9:50 PM

-- limits.conf (pam_limits.so) --

pam_limits.so is a PAM module that establishes resource limits and associations. So adding this to a PAM service configuration:

session required pam_limits.so

makes /etc/security/limits.conf settings active

There are a few limits I would like, a few I have tried, and at least one that is not useful:

*               hard    maxsyslogins    250
*               hard    nice            0
*               -       priority        10
*               -       cpu             36
*               hard    nofile          4096
*               soft    nofile          1024
*               hard    nproc           10
*               soft    nproc           10

These can be applied per user, per group, or to all users with '*'.
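
To check from inside a login session whether such settings actually took effect, the limits can be read back with Python's resource module. This is only a verification sketch; the dictionary keys mirror the limits.conf item names, and note that limits.conf expresses cpu in minutes while RLIMIT_CPU is reported in seconds.

```python
import resource


def current_limits():
    """Return (soft, hard) pairs for a few limits that limits.conf can set."""
    return {
        "nofile": resource.getrlimit(resource.RLIMIT_NOFILE),
        "nproc": resource.getrlimit(resource.RLIMIT_NPROC),
        "cpu": resource.getrlimit(resource.RLIMIT_CPU),      # seconds
        "fsize": resource.getrlimit(resource.RLIMIT_FSIZE),  # bytes
    }


def describe(value):
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)
```

Looping over current_limits().items() and printing describe(soft) and describe(hard) gives a quick comparison against the table above.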

Limited bastions:

  • set low fd allocations (nofile)
  • set low nproc (prevents fork bombs or just bad ideas)
  • priority and nice settings
  • maxsyslogins (very high) to prevent some runaway user process from login ddos'ing

Dev bastions:

  • relatively normal fd allocation
  • mid way nproc (100?)
  • no niceness or priority other than default

TBD: I haven't ever tested max file size but it's worth considering to prevent really large file operations on the limited shell bastions, and concurrent login limits per user seem to not work reliably. I have found some chatter about upstart and lack of ability to enforce both for actively concurrent logins in progress and for login count in general.

chasemp updated the task description. Apr 1 2016, 10:02 PM

-- restricted shell (lshell) --

lshell is a pretty straightforward and simple "shell" wrapper that uses the Python cmd module (cmd.Cmd) for interactive CLI interfaces. I don't think this is a security mechanism so much as it is a self-protection mechanism for the users and the general ecosystem on a server. I know there are ways to break out of this (anything we allow that shells out would do it), but this is as far down the road of technical solutions to a largely social problem as seems reasonable. Circumvention has to be an intentional act here, and since we do not have private data we are concerned with, that is a fine point of demarcation to me.

Limited bastions:

  • Users get a limited shell on login
  • Users get a limited shell as a tool that can do all SGE and normal things
  • This prevents work that should be done on dev hosts and keeps the environment simple

Dev bastions:

  • None

Why not rbash or $solutions? rbash is restrictive and permissive in the wrong ways. lshell is simple, written in Python, and allows us to create our own user experience and levels of interface. On the lookout for alternate solutions.
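
To show the general shape of a cmd-based wrapper, here is a minimal Python sketch. It is not lshell's code or configuration: the class name, the prompt, and the allowed command list (a few SGE client commands) are assumptions made for the example, and like the real thing it is a guard rail rather than a security boundary.

```python
import cmd
import shlex
import subprocess

ALLOWED = ("qstat", "qsub", "qdel")  # hypothetical allow list


class LimitedShell(cmd.Cmd):
    prompt = "limited> "

    def __init__(self, runner=subprocess.call, **kwargs):
        super().__init__(**kwargs)
        self.runner = runner  # injectable so the shell can be exercised without SGE

    def default(self, line):
        """Run the line only if its first word is on the allow list."""
        try:
            args = shlex.split(line)
        except ValueError:
            args = []
        if args and args[0] in ALLOWED:
            self.runner(args)
        else:
            self.stdout.write("command not allowed: %s\n" % line)

    def emptyline(self):
        # cmd.Cmd repeats the previous command on an empty line; do nothing instead.
        pass

    def do_exit(self, line):
        """Leave the shell."""
        return True

    do_EOF = do_exit
```

Calling LimitedShell().cmdloop() starts the interactive loop; anything not on the list is refused with a message instead of being passed to a real shell.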

reading:

http://www.doknowevil.net/tag/lshell/
https://github.com/ghantoos/lshell

chasemp updated the task description. Apr 1 2016, 10:14 PM
chasemp updated the task description. Apr 1 2016, 10:19 PM

-- tc --

tc provides userspace controls for the Linux traffic QoS mechanisms. It is somewhat opaque and often complicated, and the best documentation seems to be http://www.lartc.org/lartc.html#LARTC.QDISC

We already use this on all NFS enabled Tools hosts to protect NFS, and I want to continue to do so but with some refinement.

tc works by first defining and classifying traffic with a filter and then establishing a traffic class to associate it with. One of the facilities for this (u32) is able to parse out fields within the IP header, and we currently do this by IP for the NFS server. So technically we don't shape traffic to NFS, we shape traffic to the NFS server. It is possible to filter by port but it was unnecessary.

tc is able to tap into both iptables and cgroups as filter-enhancing mechanisms. Using net_cls we can define a filter match that funnels traffic from processes within a particular cgroup to a traffic class. Using iptables we can apply a fw mark and match on that in a filter.

iptables -A OUTPUT -p tcp --dport 2049 -m state --state NEW,ESTABLISHED -j MARK --set-mark 0x1

can match up to:

tc filter add dev eth0 parent 1: protocol ip prio 0 handle 1 fw classid 1:2
tc class add dev eth0 parent 1: classid 1:2 htb rate 10000kbps
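
A note on the rate unit, since it is easy to misread: going by the units section of the tc(8) man page, the bps suffixes count bytes per second and the bit suffixes count bits per second. The Python sketch below converts a rate string to bits per second under that reading; the table covers only the decimal suffixes and the function name is invented for the example.

```python
import re

# Bits per second for each suffix, as read from the tc(8) units section.
TC_RATE_BITS = {
    "bit": 1, "kbit": 1000, "mbit": 1000 ** 2, "gbit": 1000 ** 3,
    "bps": 8, "kbps": 8 * 1000, "mbps": 8 * 1000 ** 2, "gbps": 8 * 1000 ** 3,
}


def tc_rate_to_bits(rate):
    """Convert a tc rate such as '10000kbps' to bits per second."""
    match = re.fullmatch(r"(\d+)([a-z]+)", rate.strip().lower())
    if match is None or match.group(2) not in TC_RATE_BITS:
        raise ValueError("unrecognised tc rate: %r" % rate)
    return int(match.group(1)) * TC_RATE_BITS[match.group(2)]
```

Under that reading the class above allows 10000 kilobytes per second, i.e. 80 Mbit/s, whereas 10000kbit would be 10 Mbit/s.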

However, the current way to do ingress queuing is to redirect traffic via an IFB interface and establish a queue there to apply logic. This works reasonably well in some cases, and NFS is one of them. But IFB is applied before netfilter, which means I cannot classify return traffic through iptables when using IFB. I would like to be able to apply per user classes using the owner module and contain each user's NFS traffic independently, but I cannot do that for ingress/read. The older mechanism, IMQ, had this ability but would require compiling it into the Trusty kernel etc. I'm not sure about pursuing this yet due to the complexity of maintenance. Super frustrating.

cgroup net_cls is obviously a local distinction and I'm still sorting out if there is a way to leverage this for return traffic as well.
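
For the egress side, the net_cls association works by writing a class identifier into the cgroup's net_cls.classid file and adding a cgroup filter on the qdisc. The kernel expects the identifier packed as 0xAAAABBBB (major in the upper 16 bits, minor in the lower 16). A small Python sketch of that conversion follows; the function name is made up, and the handle reuses the 1:2 class from the example above.

```python
def net_cls_classid(handle):
    """Convert a tc class handle such as '1:2' to the net_cls.classid integer."""
    major, minor = handle.split(":")
    # tc handles are hexadecimal; the kernel packs them as 0xAAAABBBB.
    return (int(major, 16) << 16) | int(minor, 16)


def classid_hex(handle):
    """Return the value formatted the way it is usually echoed into net_cls.classid."""
    return "0x%x" % net_cls_classid(handle)
```

For class 1:2 this yields 0x10002, which would be written to the group's net_cls.classid and matched with a filter of the form "tc filter add dev eth0 parent 1: protocol ip prio 10 handle 1: cgroup".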

So with NFS at the moment I am limited to the relatively well tested mechanism of using fields within the IP header for classification in order to shape both egress and ingress.

To this end I would like to (on both Dev and limited bastions):

  • Move /home for tools to another IP to identify separately from /data/project
  • Classify both /data/project and /home traffic with individual quotas

      • This prevents /home traffic from overwhelming project traffic and vice versa
      • This allows tracking and monitoring distinctly as well
      • (Longer term, move /home to another NFS share that is separated from /data/project)

chasemp updated the task description. Apr 1 2016, 10:40 PM

-- iptables --

Limited bastions:

  • Because tc currently lacks the ability to do per user receive quotas, I am looking at using iptables for auditing purposes, which I believe I can do per user bidirectionally.

Dev bastions:

  • None

Luke081515 moved this task from Triage to In Progress on the Cloud-Services board. Apr 5 2016, 4:29 PM
-jem- added a subscriber: -jem-. Apr 5 2016, 11:01 PM

Change 282000 had a related patch set uploaded (by Rush):
labstore svc addresses to separate mounts

https://gerrit.wikimedia.org/r/282000

Change 282000 merged by Rush:
labstore svc addresses to separate mounts

https://gerrit.wikimedia.org/r/282000

Change 282060 had a related patch set uploaded (by Rush):
nslcd specifying shell override

https://gerrit.wikimedia.org/r/282060

Change 282072 had a related patch set uploaded (by Rush):
toollabs bastions install cgroup-bin

https://gerrit.wikimedia.org/r/282072

Change 282060 merged by Rush:
nslcd specifying shell override

https://gerrit.wikimedia.org/r/282060

Change 282165 had a related patch set uploaded (by Rush):
nscld changes need to restart nscd as well

https://gerrit.wikimedia.org/r/282165

Change 282165 merged by Rush:
nscld changes need to restart nscd as well

https://gerrit.wikimedia.org/r/282165

Change 282072 merged by Rush:
toollabs bastions install cgroup-bin

https://gerrit.wikimedia.org/r/282072

Change 282759 had a related patch set uploaded (by Rush):
pam_limits tools bastion parameters

https://gerrit.wikimedia.org/r/282759

Change 282759 merged by Rush:
pam_limits tools bastion parameters

https://gerrit.wikimedia.org/r/282759

Change 284906 had a related patch set uploaded (by Rush):
labs: setting up to use cgrules engine

https://gerrit.wikimedia.org/r/284906

Change 284906 merged by Rush:
labs: setting up to use cgrules engine

https://gerrit.wikimedia.org/r/284906

Change 284909 had a related patch set uploaded (by Rush):
toollabs: bastion setup for cgred::group scripts

https://gerrit.wikimedia.org/r/284909

Change 284909 merged by Rush:
toollabs: bastion setup for cgred::group scripts

https://gerrit.wikimedia.org/r/284909

Change 284923 had a related patch set uploaded (by Rush):
remove duplicate package def for cgroup-bin

https://gerrit.wikimedia.org/r/284923

Change 284923 merged by Rush:
remove duplicate package def for cgroup-bin

https://gerrit.wikimedia.org/r/284923

Luke081515 updated the task description. Apr 22 2016, 5:18 PM

Change 284924 had a related patch set uploaded (by Rush):
toollabs: bastion fixup perms for cgred

https://gerrit.wikimedia.org/r/284924

Change 284924 merged by Rush:
toollabs: bastion fixup perms for cgred

https://gerrit.wikimedia.org/r/284924

Change 284925 had a related patch set uploaded (by Rush):
toollabs: bastion setup for cgred::group utilities

https://gerrit.wikimedia.org/r/284925

Change 284925 merged by Rush:
toollabs: bastion setup for cgred::group utilities

https://gerrit.wikimedia.org/r/284925

Change 284926 had a related patch set uploaded (by Rush):
toollabs: bastion setup for cgred::group user daemons

https://gerrit.wikimedia.org/r/284926

Change 284926 merged by Rush:
toollabs: bastion setup for cgred::group user daemons

https://gerrit.wikimedia.org/r/284926

Change 284927 had a related patch set uploaded (by Rush):
toollabs: bastion setup for cgred::group shell

https://gerrit.wikimedia.org/r/284927

Change 284927 merged by Rush:
toollabs: bastion setup for cgred::group shell

https://gerrit.wikimedia.org/r/284927

Change 284928 had a related patch set uploaded (by Rush):
toollabs: bastion setup for cgred::group throttle

https://gerrit.wikimedia.org/r/284928

Change 284928 merged by Rush:
toollabs: bastion setup for cgred::group throttle

https://gerrit.wikimedia.org/r/284928

Change 284978 had a related patch set uploaded (by Rush):
cgred changes for toollabs bastions use case

https://gerrit.wikimedia.org/r/284978

Change 284978 merged by Rush:
cgred changes for toollabs bastions use case

https://gerrit.wikimedia.org/r/284978

Change 285645 had a related patch set uploaded (by Rush):
toollabs limits.conf as a template

https://gerrit.wikimedia.org/r/285645

Change 285645 merged by Rush:
toollabs: Use a template for limits.conf

https://gerrit.wikimedia.org/r/285645

chasemp updated the task description. May 13 2016, 3:06 PM

Change 288622 had a related patch set uploaded (by Rush):
tool labs bastions tcl cgroup

https://gerrit.wikimedia.org/r/288622

Change 288622 merged by Rush:
tool labs bastions tcl cgroup

https://gerrit.wikimedia.org/r/288622

chasemp closed this task as Resolved. Jul 15 2016, 5:48 PM

Tentatively I'm closing this.

There are more things I was planning on doing if needed but we have been more or less satisfied with the cgroup-ification of these nodes combined with the xlarge instance. Further into this we start to create more tradeoffs that seem unnecessary unless we are solving a hard and fast issue.