We currently have a wide range of Puppet roles for Analytics clients:
- stat1004 (role::statistics::explorer) - generic Hadoop client node, terabytes of disk space for users
- stat1005 (role::statistics::explorer::gpu) - generic Hadoop client node, terabytes of disk space for users, GPU (card + drivers + tools, etc.)
- stat1006 (role::statistics::cruncher) - generic data crunching node, no Hadoop client config deployed, terabytes of disk space for users, access to EventLogging data, runs report updater jobs via systemd timers
- stat1007 (role::statistics::private) - generic Hadoop client node, terabytes of disk space for users, runs report updater jobs via systemd timers, plus a geoip backup systemd timer
- notebook100[3,4] (role::swap) - Jupyter Notebook hosts, limited disk space for users; originally meant as an alternative way to access Hadoop/HDFS without storing any data on the host itself.
After the introduction of Kerberos, the differences between stat100[4,5,6,7] are minimal, so we could consider refactoring all of these roles into a single one. Open questions:
- where do we put the report updater jobs, since other users need to access their output? Should we deploy them only on selected hosts, controlled via Puppet or Hiera?
- where do we run analytics-only timers/jobs?
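One way to answer both questions is to keep a single role/profile everywhere and gate the host-specific jobs behind a Hiera flag. The sketch below is only illustrative — the class names, the Hiera key, and the included profile are hypothetical placeholders, not the actual classes in our Puppet tree:

```puppet
# Hypothetical sketch: one unified client profile applied to every stat host,
# with report updater jobs enabled per-host via Hiera. All names here are
# made up for illustration.
class profile::analytics::client (
    Boolean $deploy_reportupdater = lookup('profile::analytics::client::deploy_reportupdater', { 'default_value' => false }),
) {
    if $deploy_reportupdater {
        # Report updater jobs and their systemd timers land only on hosts
        # that opt in via Hiera (e.g. the current stat1006/stat1007 duties).
        include ::profile::reportupdater::jobs
    }
}
```

A host that should run the jobs would then just set the flag in its Hiera data (e.g. in a hypothetical hieradata/hosts/stat1007.yaml: `profile::analytics::client::deploy_reportupdater: true`), while every other host gets the same role with the jobs disabled.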
Moreover, more people are using notebooks, and they have asked for more disk space on those hosts so they can also run local computations (not only act as Hadoop clients). We could consider unifying the stat-related roles with role::swap, so that every stat box would also run a Jupyter server, and drop support for the dedicated notebook hosts.
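Structurally, the unified role could just compose the pieces that today are split across the stat and notebook roles. Again, this is only a sketch under assumed names — the profile classes included here are hypothetical stand-ins for whatever the real Hadoop client, storage, and JupyterHub profiles are called:

```puppet
# Hypothetical sketch of a single role replacing role::statistics::* and
# role::swap. Profile names are illustrative placeholders.
class role::statistics::unified {
    include ::profile::analytics::cluster::client   # Hadoop/HDFS client config
    include ::profile::statistics::base             # user accounts, large local storage
    include ::profile::swap::jupyterhub             # Jupyter server, as on notebook100[3,4]
}
```

With this shape, the notebook hosts could be decommissioned, and the "low disk space" complaint goes away because every Jupyter server would sit on a box that already has terabytes of user storage.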