Monitor all management interfaces
Closed, Resolved; Public

Assigned To

Authored By

	fgiunchedi
	Jun 30 2017, 11:32 AM

Description

Yesterday, during the ganeti outage (https://wikitech.wikimedia.org/wiki/Incident_documentation/20170629-ganeti), it was discovered that many management interfaces were not reachable. There are several possible levels of monitoring, in increasing order of implementation difficulty:

1. ICMP ping
2. Port 22 reachable
3. SSH working (without actually authenticating)
4. Remote console working (i.e. issue console com2 or vsp)

The easiest is probably option 3 (SSH working), via an Icinga check (e.g. every half hour) that the SSH handshake happens and authentication would be possible. Option 4 (remote console working) would be very desirable but is definitely more difficult, as either the management password is needed or public key authentication needs to be implemented (T113557).
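To make the difference between "port 22 reachable" and "SSH working (without actually authenticating)" concrete, here is a minimal Python sketch. It is an illustration only, not the Icinga check that was eventually deployed; the function names and the default timeout are assumptions made up for this example.

```python
import socket


def port_open(host, port=22, timeout=5.0):
    """'Port 22 reachable': True if a TCP connection can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def parse_ssh_banner(data):
    """Return the SSH identification string from raw bytes, or None.

    Simplification: only the first line is inspected, although a server
    is allowed to send other lines before its 'SSH-' version string.
    """
    line = data.split(b"\n", 1)[0].strip().decode("ascii", "replace")
    return line if line.startswith("SSH-") else None


def ssh_banner(host, port=22, timeout=5.0):
    """'SSH working': read the server banner without authenticating."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            data = sock.recv(256)
    except OSError:
        return None
    return parse_ssh_banner(data)
```

A management interface whose TCP port accepts connections but whose SSH daemon is hung would pass `port_open` yet make `ssh_banner` return `None`, which is the case the SSH-level check is meant to catch.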

Details

	Subject	Repo	Branch	Lines +/-
	monitoring: Check mgmt SSH availability as well	operations/puppet	production	+9 -1
	monitoring::host: Rename the mgmt hash key	operations/puppet	production	+1 -1


Related Objects

		Status	Subtype	Assigned	Task
		Resolved		Volans	T150160 Remote IPMI doesn't work for ~2% of the fleet
		Resolved		akosiaris	T169321 Monitor all management interfaces

Event Timeline

fgiunchedi created this task. Jun 30 2017, 11:32 AM

As far as yesterday's outage is concerned, even 1 (ICMP ping) would have prevented it. But indeed, if we can get 1, we can easily get 3 (SSH working) as well.

fgiunchedi updated the task description. Jun 30 2017, 11:50 AM

Well, it depends on how we would monitor them. Yesterday's issue wasn't caused by management interfaces being unreachable, but by their DNS pointing to the wrong address. I just pushed a couple of changesets to get the BMC's perspective on what its IP address is, and we can very easily add check_ping or TCP 22 checks for that address, but those wouldn't have caught yesterday's issue.

We'd need a DNS consistency check to catch the cause of yesterday's outage, at least until we get to a place where we autogenerate these kinds of things.
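A DNS consistency check of this kind could look roughly like the Python sketch below: compare what DNS returns for a management hostname with the address the BMC itself reports. This is only an illustration under stated assumptions; the function names are hypothetical, and how the BMC-reported address is obtained is left outside the sketch.

```python
import socket


def resolve_ipv4(name):
    """Return the set of IPv4 addresses DNS gives for name (empty on failure)."""
    try:
        infos = socket.getaddrinfo(
            name, None, family=socket.AF_INET, type=socket.SOCK_STREAM
        )
    except socket.gaierror:
        return set()
    # Each entry is (family, type, proto, canonname, sockaddr); for
    # AF_INET, sockaddr is (address, port).
    return {info[4][0] for info in infos}


def dns_consistent(name, bmc_reported_ip, resolve=resolve_ipv4):
    """True only if DNS resolves name to exactly the address the BMC reports.

    `resolve` is injectable so the comparison can be exercised without
    real DNS lookups.
    """
    return resolve(name) == {bmc_reported_ip}
```

For example, `dns_consistent("host1001.mgmt.example.org", "10.0.0.5")` (hostname and address invented for illustration) would return False both when the record points elsewhere and when it does not resolve at all, so a ping or TCP 22 check against the BMC-reported address can be complemented by this comparison.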

jcrespo awarded a token. Jun 30 2017, 1:15 PM

faidon mentioned this in T169360: Unresponsive/misconfigured iDRACs over the host-BMC interface. Jun 30 2017, 6:11 PM

Peachey88 subscribed. Jul 1 2017, 11:17 AM

faidon moved this task from Inbox to In progress on the observability board. Jul 10 2017, 12:37 PM

Change 363295 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] monitoring::host: Monitor IPMI as well if applicable

https://gerrit.wikimedia.org/r/363295

gerritbot added a project: Patch-For-Review. Jul 10 2017, 3:14 PM

akosiaris claimed this task. Jul 10 2017, 3:19 PM

akosiaris merged a task: T85143: Monitor all mgmt hosts. Jul 10 2017, 3:27 PM

akosiaris added a subscriber: Matanya.

Change 364402 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] monitoring::host: Rename the mgmt hash key

https://gerrit.wikimedia.org/r/364402

Change 364402 merged by Alexandros Kosiaris:
[operations/puppet@production] monitoring::host: Rename the mgmt hash key

https://gerrit.wikimedia.org/r/364402

Change 364415 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] monitoring: Check mgmt SSH availability as well

https://gerrit.wikimedia.org/r/364415

Change 364415 merged by Alexandros Kosiaris:
[operations/puppet@production] monitoring: Check mgmt SSH availability as well

https://gerrit.wikimedia.org/r/364415

And with the above merged, I think we can resolve this. Of course, we have a nice number of actionables from this, e.g. https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?servicegroup=mgmt&style=detail&servicestatustypes=29

Of the hosts above, half are tracked in T150160 and the rest in T169360. I'll resolve this.
