Etcd nodes simply do not seem to operate well on ceph as we have it laid out. We've made ceph fast and reliable for general workloads, but etcd has peculiar I/O sensitivities (it fdatasyncs its write-ahead log on every commit), so nodes that should be blazing fast behave terribly.
We have been trying to move everything to a one-size-fits-all model where instance storage lives in ceph, period, with cinder for attachable volumes. The only thing that won't quite work right under that model seems to be etcd. Interestingly, it behaves badly even in PAWS, which is a very quiet cluster compared to tools. Toolsbeta is no better, with persistent iowait in the single to double digits (percent), timeouts, failures, etc. We cannot be all-in on kubernetes and have the backing datastore constantly sucking.
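One way to put a number on this, rather than just "iowait is high": etcd's upstream guidance is to benchmark the disk with fio using small sequential writes and an fdatasync after every write, which approximates the WAL pattern etcd actually cares about. The directory path below is just an illustrative mount point on the disk under test; the fio flags themselves are standard.

```shell
# Approximate etcd's WAL write pattern: small sequential writes with
# an fdatasync after each one (--fdatasync=1). Sizes follow upstream
# etcd guidance. /var/lib/etcd-bench is a hypothetical test directory.
mkdir -p /var/lib/etcd-bench
fio --name=etcd-wal-test \
    --directory=/var/lib/etcd-bench \
    --rw=write \
    --ioengine=sync \
    --fdatasync=1 \
    --size=22m \
    --bs=2300
# In the output, check the fsync/fdatasync latency percentiles: etcd
# wants the 99th percentile comfortably under 10ms. On our ceph-backed
# instances I'd expect this to blow well past that.
```

Running this on a ceph-backed VM versus a local-disk cloudvirt would make the gap concrete.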
I suggest we move clouddb1003/4 to cinder/ceph-backed systems with appropriately downsized storage, and use cloudvirt1019, cloudvirt1020, and maybe one more cloudvirt for etcd on local storage to make reboots easier. I figure the etcd instances need non-ceph flavors and probably something else? If there are three cloudvirts and etcd servers come in sets of three with hard anti-affinity, you can always reboot one cloudvirt without evacuating (once toolsdb is out of the picture, anyway).
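For the hard anti-affinity piece, OpenStack server groups should cover it: with the `anti-affinity` policy, the scheduler refuses to place two group members on the same hypervisor, so three members across three cloudvirts lands one per host. A sketch with the standard CLI (the group, flavor, image, and instance names here are hypothetical, and the group UUID is a placeholder):

```shell
# Create a server group with a hard anti-affinity policy. Members of
# this group can never be scheduled onto the same hypervisor.
openstack server group create --policy anti-affinity tools-etcd

# Boot each etcd member into the group via a scheduler hint, using a
# non-ceph (local storage) flavor. Repeat for members 2 and 3.
openstack server create \
    --flavor g3.etcd.localdisk \
    --image debian-11 \
    --hint group=<server-group-uuid> \
    tools-k8s-etcd-1
```

With hard anti-affinity, a boot will fail outright if no compliant host is available, which is exactly what we want here: it guarantees that draining one cloudvirt for a reboot only ever takes down one etcd member.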