Garbage collect unmaintained/unused instances
Closed, Resolved, Public
Assigned To

Authored By

	faidon
	Jun 14 2015, 2:31 PM

Description

Over the weekend, we got flooded with ~6500 cronspam emails to root@wikimedia.org: diamond/sudo errors that had to do with the resolver configuration and the instances' inability to resolve their own hostnames.

These were coming to production's root@wikimedia.org, which means that they were coming from instances that haven't been running puppet since before 0295b935034c685936f5a79c273696df6bb4521c got merged (back in mid-March).

They were all coming from four instances: mwreview-merl, towtruck, language-replag-wiki, sensu-01. The first three are unable to run puppet; their last puppet runs were 198299, 218308 and 198297 minutes ago, respectively (roughly 138, 152 and 138 days). sensu-01 had a stale puppet lock, which I removed; its puppet run now fails due to an unrelated failure.

I've fixed their email configuration manually but I believe we should do something about instances like the above:

An instance that hasn't run puppet for 3+ months is likely unmaintained in all sorts of other ways as well (e.g. running insecure software).
An unmaintained instance is likely to be unused as well, which is a waste of resources.
Those instances do not get important fixes that we deploy via puppet (e.g. mail configuration, like above)
Our puppet tree often has patterns where we fix something but leave backwards-compatible code (e.g. an ensure => absent), then remove it after some time passes. It's very rare that such code stays for 3 months.
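
As a rough sketch of how such instances could be flagged, the snippet below takes the minutes-since-last-run figures quoted above and reports the ones past a cutoff. The 90-day threshold (standing in for "3+ months"), the function name and the dictionary are all assumptions made for illustration; in practice the ages would have to come from wherever the last puppet run is actually recorded.

```python
from datetime import timedelta

# Minutes since the last puppet run, as quoted in this task, paired with
# the instance names in the order they are listed (an assumption).
MINUTES_SINCE_LAST_RUN = {
    "mwreview-merl": 198299,
    "towtruck": 218308,
    "language-replag-wiki": 198297,
}

# Hypothetical cutoff: "3+ months" approximated as 90 days.
STALE_AFTER = timedelta(days=90)


def stale_instances(minutes_by_instance, threshold=STALE_AFTER):
    """Return (name, whole days since last run) for instances older than threshold."""
    stale = []
    for name, minutes in sorted(minutes_by_instance.items()):
        age = timedelta(minutes=minutes)
        if age > threshold:
            # timedelta.days truncates to whole days.
            stale.append((name, age.days))
    return stale


for name, days in stale_instances(MINUTES_SINCE_LAST_RUN):
    print(f"{name}: last puppet run {days} days ago")
```

All three instances exceed the cutoff here. sensu-01 is left out because no age is given for it above.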