Inspired by T96706, it would be interesting to slim down our resource use to what is really needed. Only infrastructure instances require significant local disk space; exec/webgrid nodes mostly access data on NFS. In addition, it may be interesting to evaluate, with regard to overhead and granularity, whether instances with few or with many virtual CPUs are more useful.
I think we should decline this. Resources aren't consumed unless they're actually in use, and maintaining separate Tools-specific images is overhead: every time we update the base image for something, we have to update these too.
@Andrew, looking at your considerations regarding disk space on virtual nodes, is the required space on the virtual node defined by the possible sizes of the hosted instances or is it over-provisioned?
For example, tools-webgrid-lighttpd-1208 is an m1.large instance with 80 GByte "Allocated Storage", allegedly 0 GByte "Filled Storage", and about 60 GByte free space in the LVM volume group:
```
scfc@tools-webgrid-lighttpd-1208:~$ sudo vgdisplay
  --- Volume group ---
  VG Name               vd
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  1
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                0
  Open LV               0
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               61,40 GiB
  PE Size               4,00 MiB
  Total PE              15719
  Alloc PE / Size       0 / 0
  Free PE / Size        15719 / 61,40 GiB
  VG UUID               2d0JzR-9r9H-aDJn-mb5f-BVuA-tKL6-Gtt29U
scfc@tools-webgrid-lighttpd-1208:~$
```
Does the virtual node running this instance reserve ~20 GByte of disk space for it, or 80 GByte? I.e., if we used an instance with the same number of CPUs but only 20 GByte of allocated storage, would that free 60 GByte in your planning? (Or roughly 50 times that for all of Tools.)
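For illustration (my own sketch, not anything from the instance above): a copy-on-write or sparse-backed disk image behaves like a sparse file, so its apparent size can be far larger than what the host filesystem actually backs with blocks. A minimal way to see the "80 GByte allocated, ~0 filled" effect:

```python
import os
import tempfile

# Create a sparse file with a large apparent size but (almost) no
# allocated blocks, analogous to a freshly created COW instance disk.
with tempfile.NamedTemporaryFile(delete=False) as f:
    path = f.name
    f.truncate(80 * 1024**3)  # "allocate" 80 GiB without writing any data

st = os.stat(path)
apparent = st.st_size        # the 80 GiB the guest (and flavor) sees
actual = st.st_blocks * 512  # bytes the host filesystem actually backs

print(f"apparent: {apparent / 1024**3:.0f} GiB, "
      f"actual: {actual / 1024**2:.2f} MiB")
os.unlink(path)
```

Whether the scheduler reserves the apparent or the actual size is exactly the question above.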
Best I can tell, the nova scheduler isn't very smart about this (partly because COW is a hack that we're using but that isn't really understood upstream).
The effect, as I understand it, is that the scheduler compares the physically available space on the host node to the theoretical size of the requested instance. So when scheduling a 20 GB instance, the scheduler essentially runs 'df', and if there's 20 GB free, says 'go for it!' It doesn't take into account that existing instances might grow in the future.
I recently modified the scheduler to stop at 90% utilization to leave room for future VM growth, but that threshold is just a rough approximation.
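The check described above can be sketched roughly like this (an assumption about the behavior, not the actual nova scheduler code; the function name and 90% headroom constant are illustrative):

```python
HEADROOM = 0.90  # stop scheduling once the host would exceed 90% of its disk

def can_schedule(host_total_gb, host_used_gb, requested_gb, headroom=HEADROOM):
    """Return True if the host appears to have room for the instance.

    Mirrors a 'df'-style check: only *current* usage counts, so the
    potential growth of already-running COW-backed instances is
    invisible to the filter after the moment of creation.
    """
    projected = host_used_gb + requested_gb
    return projected <= host_total_gb * headroom

# A host with 1000 GB of disk, 800 GB currently in use:
print(can_schedule(1000, 800, 80))   # 880 <= 900 -> True
print(can_schedule(1000, 800, 150))  # 950 >  900 -> False
```

The second call shows the 90% cap rejecting a request that plain free space would have allowed through.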
Upshot: Yuvi's argument is right -- the potential space usage of an existing Tools instance has no effect on our scheduler except at the moment of instance creation.