
labservices1001 crashed and sent no pages
Closed, Resolved · Public

Description

labservices1001 died (T152340) and I found out about it from the toolschecker. If the toolschecker had failed over properly between DNS hosts, we would not have known about this outage AT ALL until new VM creation started failing.

That's crazy. Why didn't a down host cause icinga to light up like a Christmas tree? Or did it, and I'm just not subscribed to those pages somehow?

Event Timeline

Andrew created this task. Dec 5 2016, 5:12 AM
Restricted Application added a project: Cloud-Services. Dec 5 2016, 5:12 AM
Restricted Application added a subscriber: Aklapper.
Paladox added a subscriber: Paladox. Dec 5 2016, 7:34 AM
Krenair added a subscriber: Krenair. Dec 5 2016, 7:51 PM

Change 325358 had a related patch set uploaded (by Andrew Bogott):
Page if a labs dns server stops responding

https://gerrit.wikimedia.org/r/325358

Change 325359 had a related patch set uploaded (by Andrew Bogott):
Designate: page if services go down.

https://gerrit.wikimedia.org/r/325359

Change 325358 abandoned by Andrew Bogott:
Page if a labs dns server stops responding

https://gerrit.wikimedia.org/r/325358

Change 325359 merged by Andrew Bogott:
Designate: page if services go down.

https://gerrit.wikimedia.org/r/325359

Andrew closed this task as Resolved. Dec 8 2016, 8:33 PM