Monitor all management interfaces
Closed, Resolved (Public)

Description

Yesterday, during the ganeti outage (https://wikitech.wikimedia.org/wiki/Incident_documentation/20170629-ganeti), it was discovered that many management interfaces were not reachable. There are several possible levels of monitoring, in increasing order of implementation difficulty:

  1. ICMP ping
  2. Port 22 reachable
  3. SSH working (without actually authenticating)
  4. Remote console working (i.e. issue console com2 or vsp)

The easiest is probably 3, via an icinga check (e.g. every half an hour) that the SSH handshake happens and authentication would be possible. 4 would be very desirable but is definitely more difficult, as it needs either the management password or public key authentication to be implemented (T113557).
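As a rough illustration of levels 2 and 3 above, here is a minimal Python sketch that connects to port 22 and checks that the remote end sends an SSH identification string, without authenticating. It is not the check that ended up in puppet; the function names `looks_like_ssh_banner` and `check_mgmt_ssh` are hypothetical, and it only reads the banner rather than completing a full key exchange. The return codes follow the usual Nagios/Icinga plugin convention (0 = OK, 2 = CRITICAL).

```python
import socket
from typing import Tuple


def looks_like_ssh_banner(banner: bytes) -> bool:
    """Return True if the bytes start with an SSH identification string.

    Simplification: RFC 4253 allows a server to send other lines before
    the "SSH-" line; this sketch assumes the BMC sends it first.
    """
    return banner.startswith(b"SSH-")


def check_mgmt_ssh(host: str, port: int = 22, timeout: float = 10.0) -> Tuple[int, str]:
    """Hypothetical check: return (exit_code, message) for one mgmt interface."""
    try:
        # Level 2: can we open a TCP connection to the SSH port at all?
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            # Level 3 (partial): does the peer identify itself as SSH?
            banner = sock.recv(256)
    except OSError as exc:  # covers timeouts and refused connections
        return 2, f"CRITICAL: {host}:{port} unreachable ({exc})"

    if looks_like_ssh_banner(banner):
        text = banner.decode("ascii", "replace").strip()
        return 0, f"OK: {host}:{port} answered with {text}"
    return 2, f"CRITICAL: {host}:{port} sent unexpected banner {banner!r}"
```

A wrapper script would print the message and exit with the returned code so icinga can consume it; level 4 (an actual console session) is out of scope here because it needs credentials.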

Event Timeline

akosiaris subscribed.

As far as yesterday's outage is concerned, even 1 would have prevented it. But indeed if we can get 1 we can easily get 3 as well.

Well, it depends on how we would monitor them. Yesterday's issue wasn't caused by management interfaces being unreachable, but by their DNS pointing to the wrong address. I just pushed a couple of changesets to get the BMC's perspective on what its IP address is. We can very easily add check_ping or TCP 22 checks against that address, but they wouldn't have caught yesterday's issue.

We'd need a DNS consistency check to catch something like yesterday's outage, at least until we get to a place where we autogenerate these kinds of things.
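A DNS consistency check along those lines could compare what DNS returns for a management hostname with the address the BMC itself reports. The sketch below is only an illustration under assumptions: `dns_matches_bmc` is a hypothetical name, the BMC-reported address is assumed to be available already and is simply passed in, and the example hostname is made up.

```python
import ipaddress
import socket
from typing import Callable, Tuple


def dns_matches_bmc(
    mgmt_fqdn: str,
    bmc_reported_ip: str,
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> Tuple[int, str]:
    """Hypothetical check: does DNS for the mgmt name agree with the BMC?

    Returns (exit_code, message) using the Nagios/Icinga convention
    (0 = OK, 2 = CRITICAL). `resolve` is injectable so the logic can be
    exercised without touching real DNS.
    """
    try:
        resolved = resolve(mgmt_fqdn)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return 2, f"CRITICAL: cannot resolve {mgmt_fqdn} ({exc})"

    # Compare parsed addresses rather than raw strings.
    if ipaddress.ip_address(resolved) == ipaddress.ip_address(bmc_reported_ip):
        return 0, f"OK: {mgmt_fqdn} resolves to {resolved}, matching the BMC"
    return 2, (
        f"CRITICAL: {mgmt_fqdn} resolves to {resolved} "
        f"but the BMC reports {bmc_reported_ip}"
    )
```

For example, `dns_matches_bmc("host1001.mgmt.example.org", "10.0.0.5")` would go CRITICAL if the DNS record pointed anywhere other than 10.0.0.5, which is the class of mismatch that a ping or TCP 22 check against the BMC-reported address would miss.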

Change 363295 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] monitoring::host: Monitor IPMI as well if applicable

https://gerrit.wikimedia.org/r/363295

Change 364402 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] monitoring::host: Rename the mgmt hash key

https://gerrit.wikimedia.org/r/364402

Change 364402 merged by Alexandros Kosiaris:
[operations/puppet@production] monitoring::host: Rename the mgmt hash key

https://gerrit.wikimedia.org/r/364402

Change 364415 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] monitoring: Check mgmt SSH availability as well

https://gerrit.wikimedia.org/r/364415

Change 364415 merged by Alexandros Kosiaris:
[operations/puppet@production] monitoring: Check mgmt SSH availability as well

https://gerrit.wikimedia.org/r/364415

And with the above merged, I think we can resolve this. Of course, it leaves us with a fair number of actionables, e.g. https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?servicegroup=mgmt&style=detail&servicestatustypes=29

Of the hosts above, half are tracked in T150160 and the rest in T169360. I'll resolve this.