
Improve our monitoring to rely more on probes
Closed, Declined
Public

Description

Our monitoring (mainly based on process availability and some internal metrics) can miss some issues.
For example, the test Presto cluster had all its worker and coordinator processes running well, but the service was not working because the worker was not able to connect to the coordinator.
We could add an alert if the number of workers is below a threshold, but that is quite a static alert, and we can also have other issues not seen by this kind of alert (for example, slowness or timeouts on queries, like the one that failed with Kerberos).

We should have a probe for each of our services so we can measure latencies and detect issues accessing the service.
Then we will be able to add some alerts on top of those metrics.

The probe should simulate a client action and report ok/ko for the action, plus its duration.
We must define, per service, what an appropriate action is (size of the action, periodicity...).
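
As an illustration only, the probe described above could be sketched as follows. The names `ProbeResult` and `run_probe` and the `max_duration_seconds` parameter are hypothetical, not part of any existing tooling. The client action is passed in as a plain callable, so each service can supply its own (an HTTP request, a small query, and so on).

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProbeResult:
    """Outcome of one probe run: ok/ko for the action, plus its duration."""
    service: str
    ok: bool
    duration_seconds: float
    error: Optional[str] = None


def run_probe(
    service: str,
    action: Callable[[], object],
    max_duration_seconds: float,
) -> ProbeResult:
    """Run one simulated client action and report ok/ko and duration.

    Hypothetical helper: `action` is whatever client call is appropriate
    for the service. Note that this does not interrupt a slow action; it
    only marks the result as ko once the action has returned too late.
    """
    start = time.monotonic()
    try:
        action()
    except Exception as exc:
        # Any failure of the client action counts as ko.
        return ProbeResult(service, False, time.monotonic() - start, repr(exc))
    duration = time.monotonic() - start
    if duration > max_duration_seconds:
        # The action succeeded but was too slow: also report ko.
        return ProbeResult(service, False, duration, "action exceeded max duration")
    return ProbeResult(service, True, duration)
```

The results could then be exported as metrics (one ok/ko value and one duration per service) so that alerts can be defined on top of them. How often each probe runs and which action it performs would still need to be decided per service.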

Event Timeline

@BTullis @Stevemunene here is the epic we just discussed IRL.
If you are fine with it, I'd like us to start looking at this in this sprint.
For example, by adding one ticket to add probing on one of our web-based services: DataHub, Superset or Turnilo?

Declining this ticket. It would be good to have in principle, but it's not really a priority right now.