Garbage collect unmaintained/unused instances
Closed, ResolvedPublic

Description

Over the weekend, we were flooded with ~6500 cronspam emails to root@wikimedia.org, some of them diamond/sudo errors related to the resolver configuration and the instances' inability to resolve their own hostnames.

These were coming to production's root@wikimedia.org, which means that they were coming from instances that haven't been running puppet since before 0295b935034c685936f5a79c273696df6bb4521c got merged (back in mid-March).

It was all coming from four instances: mwreview-merl, towtruck, language-replag-wiki, sensu-01. The first three are unable to run puppet; their last puppet runs were 198299, 218308 and 198297 minutes ago. sensu-01 had a stale puppet lock, which I removed, and now the puppet run fails due to some unrelated failure.

I've fixed their email configuration manually but I believe we should do something about instances like the above:

  • An instance that hasn't run puppet for 3+ months is likely unmaintained, and is probably neglected in all sorts of other ways too (e.g. running insecure software).
  • An unmaintained instance is likely to be unused as well, which is a waste of resources.
  • Those instances do not get important fixes that we deploy via puppet (e.g. mail configuration, like above).
  • Our puppet tree often has patterns where we fix something but leave backwards-compatible code (e.g. an ensure => absent), then remove it after some time passes. It's very rare that such code stays for 3 months.
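To put the numbers above in perspective, here is a minimal sketch of the kind of staleness check this task is asking for. It is not an existing tool: the function name `stale_instances`, the 90-day cutoff (standing in for "3+ months"), and the pairing of each reported age with an instance in listed order are all assumptions made for illustration.

```python
from datetime import timedelta

# Assumption: 90 days approximates the "3+ months" cutoff discussed above.
STALE_AFTER = timedelta(days=90)


def stale_instances(minutes_since_last_run, threshold=STALE_AFTER):
    """Return {instance: age_in_days} for instances whose last puppet run
    is strictly older than the threshold (hypothetical helper)."""
    stale = {}
    for name, minutes in minutes_since_last_run.items():
        age = timedelta(minutes=minutes)
        if age > threshold:
            # Dividing two timedeltas yields a float number of days.
            stale[name] = round(age / timedelta(days=1), 1)
    return stale


# Ages (in minutes) as reported in the description. Assumption: the three
# values correspond to the first three instances in the order listed.
reported = {
    "mwreview-merl": 198299,
    "towtruck": 218308,
    "language-replag-wiki": 198297,
}

if __name__ == "__main__":
    for name, days in sorted(stale_instances(reported).items()):
        print(f"{name}: last puppet run about {days} days ago")
```

Under these assumptions all three instances come out well past the cutoff (roughly 138 to 152 days), which is consistent with them predating the mid-March change.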

Event Timeline

faidon raised the priority of this task to Needs Triage.
faidon updated the task description.
faidon added a project: Cloud-Services.
faidon subscribed.

The towtruck instance runs the togetherjs hub used for my VE collaboration demos.

cscott, were you unaware that puppet was broken on towtruck? Would some sort of email alert system have encouraged you to maintain it properly?

chasemp claimed this task.

This happens periodically, and has happened a few times since the last update.