Over the weekend, we were flooded with ~6500 cronspam emails to root@wikimedia.org: diamond/sudo errors related to the resolver configuration and the senders' inability to resolve their own hostnames.
These were reaching production's root@wikimedia.org, which means they came from instances that haven't been running puppet since before 0295b935034c685936f5a79c273696df6bb4521c was merged (back in mid-March).
They all came from four instances: mwreview-merl, towtruck, language-replag-wiki and sensu-01. The first three are unable to run puppet; their last puppet runs were 198299, 218308 and 198297 minutes ago, respectively. sensu-01 had a stale puppet lock, which I removed; its puppet run now fails due to an unrelated error.
I've fixed their email configuration manually, but I believe we should do something about instances like these:
- An instance that hasn't run puppet for 3+ months is likely unmaintained and could be unmaintained in all sorts of other ways (e.g. running insecure software).
- An unmaintained instance is likely to be unused as well, which is a waste of resources.
- Those instances do not get important fixes that we deploy via puppet (e.g. mail configuration, like above).
- Our puppet tree often has patterns where we fix something but leave backwards-compatible code (e.g. an ensure => absent), then remove it after some time passes. It's very rare that such code stays for 3 months.
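
To make the "3+ months" criterion concrete, here is a minimal sketch of how such instances could be flagged from the minutes-since-last-run figures quoted above. The 90-day threshold, the `stale_instances` helper and the pairing of each figure with an instance name (taken in the order listed) are assumptions for illustration, not an existing tool or policy.

```python
from datetime import timedelta

# Minutes since the last puppet run, as reported above.
# Assumption: the three figures are listed in the same order as the instances.
MINUTES_SINCE_LAST_RUN = {
    "mwreview-merl": 198299,
    "towtruck": 218308,
    "language-replag-wiki": 198297,
}

# Hypothetical threshold approximating "3+ months".
STALE_AFTER = timedelta(days=90)


def stale_instances(minutes_by_instance, threshold=STALE_AFTER):
    """Return {instance: whole days since last run} for instances past the threshold."""
    stale = {}
    for name, minutes in minutes_by_instance.items():
        age = timedelta(minutes=minutes)
        if age > threshold:
            stale[name] = age.days
    return stale


if __name__ == "__main__":
    for name, days in sorted(stale_instances(MINUTES_SINCE_LAST_RUN).items()):
        print(f"{name}: last puppet run {days} days ago")
```

With the figures above, all three instances exceed the threshold (about 137, 151 and 137 days). What to do with flagged instances (notify the owners, shut them down, delete them) is a separate decision.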