Observability for function-* services
Closed, Resolved (Public)

Description

Per the Wikimedia Services Policy, in order to launch the function orchestrator and evaluator services, these services need to provide operational metrics and logging, "according to the current WMF standards specified in the implementation guidelines."

Because we're using ServiceTemplateNode, we get some basic metrics and logging by default. We'll also get some additional metrics 'for free' by virtue of running behind the Envoy reverse-proxy middleware that SRE has set up.

TODOs:

  • Determine what metrics are required to satisfy the “WMF standards” for metrics and logging mentioned in the Service Policy document.
    • After chatting with SREs, it seems that the full list of requirements is still to come. In general, using ServiceTemplateNode should take care of about 90% of the requirements.
  • List all the metrics provided by ServiceTemplateNode and Envoy.
  • Determine which additional metrics, beyond those provided by default, we want to report, and what else we want to log.
  • Determine whether ServiceTemplateNode provides APIs for custom, application-specific logging and monitoring.
    • Yes.
  • Write the code for collecting and reporting additional metrics and logging additional events.
  • Write a page on Wikitech explaining how it all works.

This task has a dependency on T307722 (Define SLIs and SLOs for function-* services) but some of the work can happen in parallel.

Event Timeline

ori updated the task description.

As part of the observability goal, we are looking to implement periodic health checks to monitor the uptime of the function orchestrator and evaluator.

@JMeybohm do you have recommendations for frameworks we could use to monitor whether a service is down? Thanks!

We'd also like to invite @JMeybohm and @akosiaris to review the wikifunctions observability doc. We'd really appreciate your input and feedback :) Thanks!

We usually do this on the highest level possible via checks to the user facing services, see "probes:" in https://wikitech.wikimedia.org/wiki/Kubernetes/Ingress#Create_an_entry_in_the_service::catalog. The probes stanza there configures instances of the prometheus blackbox_exporter to run the actual checks (https://wikitech.wikimedia.org/wiki/Network_monitoring#Blackbox_Probes_%28Prometheus%29).
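For illustration, the kind of check such a probe performs can be sketched in a few lines of Python. This is not the blackbox_exporter or its configuration, only a minimal sketch of the idea: `probe_http`, `ProbeResult`, and their parameters are hypothetical names, and the fetch function is injectable so the logic can be exercised without network access.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProbeResult:
    success: bool            # True if the endpoint answered with an accepted status
    status: Optional[int]    # HTTP status code, or None if no response was received
    duration_seconds: float  # wall-clock time spent on the request


def probe_http(
    url: str,
    timeout: float = 5.0,
    valid_statuses: frozenset = frozenset({200}),
    fetch: Callable = urllib.request.urlopen,  # hypothetical hook, for testing
) -> ProbeResult:
    """Issue one GET request and report whether the service looked healthy."""
    start = time.monotonic()
    status = None
    try:
        with fetch(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (4xx/5xx).
        status = err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout: no response at all.
        status = None
    duration = time.monotonic() - start
    return ProbeResult(status in valid_statuses, status, duration)
```

A caller would run `probe_http` on a schedule against the user-facing URL and treat `success` being false as "unavailable"; the URL and the accepted status codes are choices for whoever owns the service.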

While I do think this is the right approach for the orchestrator, I'm not sure about the evaluator(s), as I guess they might not have a dedicated entry in the service::catalog (I might be wrong). If it's really just about uptime, rather than availability, the "up" metric in Prometheus might be enough. It will be 0 if the Prometheus endpoint of your service could not be scraped, but of course that does not tell you whether the service was actually functioning.
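As a rough sketch of how the "up" metric could be consumed, the Python below parses the JSON body returned by the Prometheus HTTP API's instant-query endpoint (`/api/v1/query`) for a query such as `up{job="function-orchestrator"}` and lists the targets whose last scrape failed. The job label and the `down_targets` helper are assumptions for illustration, not names taken from the actual deployment.

```python
import json
from typing import List


def down_targets(response_body: str) -> List[str]:
    """Return the `instance` labels whose `up` sample is 0.

    `response_body` is the JSON text of an instant-query response
    with resultType "vector".
    """
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query failed: %s" % payload.get("error"))

    down = []
    for sample in payload["data"]["result"]:
        # Each sample's value is a [timestamp, "string value"] pair.
        _timestamp, value = sample["value"]
        if float(value) == 0.0:
            down.append(sample["metric"].get("instance", "<unknown>"))
    return down
```

As noted above, an empty list only means the metrics endpoints were scrapeable, not that the services were answering requests correctly.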

Left a couple of comments as well in the doc, but most of it LGTM already. Thanks for being proactive in this!

We'd also like to set up IRC alerting for the team (on #wikipedia-abstract-tech) if the Beta Cluster instances are down. Since the Beta Cluster isn't running Kubernetes, is there an alternative we can use?

The Beta Cluster is by now a completely different environment from the production infrastructure, so reusing anything from production is probably either a straight no-go or would need extensive restructuring. As serviceops, I fear we don't have much insight to offer on this.