Back in year zero of Wikimedia Labs, shockingly many services were confined to a single box. A server named 'virt0' hosted the Wikitech website, Keystone, Glance, LDAP, and RabbitMQ, ran a puppetmaster, and did a bunch of other things.
Even after the move from the Tampa data center to Ashburn, the model remained much the same, with a whole lot of different services crowded onto a single, overworked box. Since then we've been gradually splitting out important services onto their own systems -- it takes up a bit more rack space but has made debugging and management much more straightforward.
Today I've put the finishing touches on one of the biggest break-away services to date: the puppetmaster that manages most cloud instances is no longer running on 'labcontrol1001'; instead, the puppetmaster has its own two-server cluster which does puppet and nothing else. VMs have been using the new puppetmasters for a few weeks, but I've only now shut down the old service on labcontrol1001 and cleaned things up.
With luck, this new setup will gain us some or all of the following advantages:
- fewer bad interactions between puppet and other cloud services: In particular, RabbitMQ (which manages most communication between OpenStack services) runs on labcontrol1001 and is very hungry for resources -- we're hoping it will be happier not competing with the puppetmaster for RAM.
- improved puppetmaster scalability: The new puppetmaster has a simple load-balancer that allows puppet compilations to be farmed out to additional backends when needed.
- less custom code: The new puppetmasters are managed with the same puppet classes that are used elsewhere in Wikimedia production.
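To make the scalability point a bit more concrete, here is a minimal sketch of the general idea of farming requests out to a pool of backends that can grow over time. It is not the actual load-balancer configuration on the new puppetmasters; the round-robin policy, the class name, and the backend hostnames are all assumptions made up for illustration.

```python
class RoundRobinBalancer:
    """Hand each compilation request to the next backend in turn.

    Illustrative only: the real puppetmaster frontend may use a
    different balancing policy entirely.
    """

    def __init__(self, backends):
        if not backends:
            raise ValueError("at least one backend is required")
        self._backends = list(backends)
        self._next = 0  # running count of requests handed out

    def add_backend(self, name):
        # New backends join the rotation on subsequent picks.
        self._backends.append(name)

    def pick(self):
        # Wrap around the pool using the running request count.
        backend = self._backends[self._next % len(self._backends)]
        self._next += 1
        return backend


# Hypothetical hostnames, not real Wikimedia servers.
balancer = RoundRobinBalancer(["backend-a.example", "backend-b.example"])
first_four = [balancer.pick() for _ in range(4)]

# When compilations pile up, another backend can be added to the pool.
balancer.add_backend("backend-c.example")
after_growth = [balancer.pick() for _ in range(3)]
```

The point of the sketch is only that adding capacity means adding an entry to the pool, rather than making a single server bigger.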
Of course, many instances weren't using the puppetmaster on labcontrol1001 anyway; they use separate custom puppetmasters that run directly on cloud instances. In many ways this is better -- certainly the security model is simpler. It's likely that at some point we'll move ALL puppet hosting off metal servers and into the cloud, at which point there will be yet another giant puppet migration. This latest one went pretty well, though, so I'm much less worried about that move than I was before; and in the meantime we have a nice stable setup to keep things going.