Improve our monitoring to more rely on probes
Closed, Declined · Public

Assigned To

None

Authored By

	• nfraison
	Feb 27 2023, 1:29 PM

Description

Our monitoring (mainly based on process availability and some internal metrics) can miss some issues.
For example, the test Presto cluster had all of its worker and coordinator processes running, but the service was not working because the worker could not connect to the coordinator.
We could add an alert if the number of workers is below a threshold, but that is quite a static alert, and there can be other issues that this kind of alert does not catch (for example, slowness or timeouts on queries, like the one that failed with Kerberos).

We should have a probe for each of our services so we can measure latency and detect problems accessing the service.
Then we will be able to add some alerts on top of those metrics.

The probe should simulate a client action and report ok/ko for the action, plus its duration.
We must define, per service, what an appropriate action is (size of the action, periodicity...).
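
To make the idea concrete, here is a minimal sketch of such a probe, assuming a plain HTTP GET is an acceptable client action for a web-based service. The names `ProbeResult`, `run_probe` and `http_get_action` are hypothetical and only illustrate the ok/ko + duration contract; they are not an existing tool or an agreed design.

```python
import time
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProbeResult:
    ok: bool                 # True for "ok", False for "ko"
    duration_seconds: float  # wall-clock time spent in the action
    error: Optional[str] = None


def run_probe(action: Callable[[], None]) -> ProbeResult:
    """Run one client-like action and report ok/ko plus its duration."""
    start = time.monotonic()
    try:
        action()
    except Exception as exc:  # any failure of the action counts as "ko"
        return ProbeResult(False, time.monotonic() - start, repr(exc))
    return ProbeResult(True, time.monotonic() - start)


def http_get_action(url: str, timeout_seconds: float = 5.0) -> Callable[[], None]:
    """Build an action that fetches `url` (assumed HTTP GET probe)."""
    def action() -> None:
        # urlopen raises on connection errors, timeouts and HTTP error statuses.
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            response.read()
    return action
```

A scheduler (cron, a systemd timer, or a small loop) would call something like `run_probe(http_get_action(url))` at the chosen periodicity and export `ok` and `duration_seconds` as metrics for alerting. The action itself, for example a small query against Presto instead of an HTTP GET, would still have to be defined per service.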

Related Objects

		Status	Subtype	Assigned	Task
		Declined		None	T330657 Improve our monitoring to more rely on probes
		Declined		None	T332038 Study blackbox exporter to see if it can be used to probe our web based service

Event Timeline

• nfraison created this task. Feb 27 2023, 1:29 PM

Restricted Application added a subscriber: Aklapper. Feb 27 2023, 1:29 PM

• nfraison moved this task from Backlog to Epics on the Shared-Data-Infrastructure board. Feb 27 2023, 1:30 PM

• nfraison added a project: Epic.

@BTullis @Stevemunene here is the epic we just discussed IRL.
If you are fine with it, I'd like us to start looking at this in this sprint.
For example, adding one ticket to add probing on one of our web-based services: DataHub, Superset, or Turnilo?

Stevemunene moved this task from Epics to 2022-23 Q4 Wrap up on the Shared-Data-Infrastructure board. Mar 30 2023, 2:12 AM

Stevemunene edited projects, added Shared-Data-Infrastructure (2022-23 Q4 Wrap up); removed Shared-Data-Infrastructure.

JArguello-WMF edited projects, added Shared-Data-Infrastructure; removed Shared-Data-Infrastructure (2022-23 Q4 Wrap up). Mar 30 2023, 12:46 PM

Declining this ticket. It would be good to have in principle, but it's not really a priority right now.
