Observability for function-* services
Closed, Resolved, Public

Authored By

	ori
	May 5 2022, 3:28 PM

Description

Per the Wikimedia Services Policy, in order to launch the function orchestrator and evaluator services, these services need to provide operational metrics and logging, "according to the current WMF standards specified in the implementation guidelines."

Because we're using ServiceTemplateNode, we get some basic metrics and logging by default. We'll also get some additional metrics 'for free' by virtue of running behind the Envoy reverse-proxy middleware that SRE has set up.

TODOs:

Determine what metrics are required to satisfy the “WMF standards” for metrics and logging mentioned in the Service Policy document.
- After chatting with SREs, it seems that the full list of requirements is still to come. In general, using ServiceTemplateNode should take care of about 90% of the requirements.
List all the metrics provided by ServiceTemplateNode and Envoy.
- See metrics doc
Determine which additional metrics we want to report which are not provided by default, and what else we want to log.
- See metrics doc
Determine whether ServiceTemplateNode provides APIs for custom, application-specific logging and monitoring.
- Yes.
Write the code for collecting and reporting additional metrics and logging additional events.
Write a page on Wikitech explaining how it all works.

This task has a dependency on T307722 (Define SLIs and SLOs for function-* services) but some of the work can happen in parallel.

Related Objects

Mentioned In: T307722: Define SLIs and SLOs for function-* services
Mentioned Here: T307722: Define SLIs and SLOs for function-* services

Event Timeline

ori created this task. May 5 2022, 3:28 PM

ori updated the task description.

ori mentioned this in T307722: Define SLIs and SLOs for function-* services. May 5 2022, 4:17 PM

ori updated the task description. May 5 2022, 4:19 PM

ori added projects: function-orchestrator, function-evaluator. May 5 2022, 4:22 PM

maryyang updated the task description. May 5 2022, 8:01 PM

ori added a project: 2022 Wikimedia Google.org Fellowship. May 6 2022, 12:41 AM

Jdforrester-WMF moved this task from To triage to Phase θ – Throttling on the Abstract Wikipedia team board. May 17 2022, 6:37 PM

Jdforrester-WMF edited projects, added Abstract Wikipedia team (Phase θ – Throttling); removed Abstract Wikipedia team.

Jdforrester-WMF moved this task from Incoming to Ready: G2. Correct & efficient on the Abstract Wikipedia team (Phase θ – Throttling) board.

As part of the observability goal, we are looking to implement periodic health checks to monitor the uptime of the function orchestrator and evaluator.

@JMeybohm do you have recommendations for frameworks we could use to monitor whether a service is down? Thanks!

maryyang updated the task description. Jun 10 2022, 7:32 PM

We'd also like to invite @JMeybohm and @akosiaris to review the wikifunctions observability doc. We'd really appreciate your input and feedback:) Thanks!

In T307700#7995520, @maryyang wrote:

As part of the observability goal, we are looking to implement periodic health checks to monitor the uptime of the function orchestrator and evaluator.

@JMeybohm do you have recommendations for frameworks we could use to monitor whether a service is down? Thanks!

We usually do this on the highest level possible via checks to the user facing services, see "probes:" in https://wikitech.wikimedia.org/wiki/Kubernetes/Ingress#Create_an_entry_in_the_service::catalog. The probes stanza there configures instances of the prometheus blackbox_exporter to run the actual checks (https://wikitech.wikimedia.org/wiki/Network_monitoring#Blackbox_Probes_%28Prometheus%29).

While I do think this is the right approach for the orchestrator, I'm not sure about the evaluator(s), as I guess they might not have a dedicated entry in the service::catalog (I might be wrong). If it's really just about uptime, rather than availability, the "up" metric in Prometheus might be enough. It will be 0 if the Prometheus endpoint of your service could not be scraped, but of course that does not tell you whether the service was actually functioning.
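To illustrate the difference between the two signals: "up" only records whether the metrics endpoint could be scraped, while a blackbox-style probe sends a real request and inspects the response. The following is a minimal Python sketch of such a probe. It is not the blackbox_exporter implementation, and `probe_http`, `ProbeResult`, and the `opener` parameter are hypothetical names introduced here for illustration.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeResult:
    """Outcome of one probe (hypothetical type, for illustration only)."""
    success: bool
    status: Optional[int]      # HTTP status code, or None if no response arrived
    duration_seconds: float


def probe_http(url, timeout=5.0, opener=urllib.request.urlopen):
    """Send one GET request and report whether a 2xx response came back.

    `opener` is injectable so the probe can be exercised without a network.
    """
    start = time.monotonic()
    status = None
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (4xx or 5xx).
        status = err.code
    except OSError:
        # Covers URLError, refused connections, DNS failures, and timeouts.
        status = None
    duration = time.monotonic() - start
    success = status is not None and 200 <= status < 300
    return ProbeResult(success, status, duration)
```

A service can be scrapeable (so "up" is 1) while a probe like this still fails, for example when the request path returns 503. That is the gap between uptime and availability described above.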

In T307700#8000033, @maryyang wrote:

We'd also like to invite @JMeybohm and @akosiaris to review the wikifunctions observability doc. We'd really appreciate your input and feedback:) Thanks!

Left a couple of comments as well in the doc, but most of it LGTM already. Thanks for being proactive in this!

In T307700#8001625, @JMeybohm wrote:

We usually do this on the highest level possible via checks to the user facing services, see "probes:" in https://wikitech.wikimedia.org/wiki/Kubernetes/Ingress#Create_an_entry_in_the_service::catalog. The probes stanza there configures instances of the prometheus blackbox_exporter to run the actual checks (https://wikitech.wikimedia.org/wiki/Network_monitoring#Blackbox_Probes_%28Prometheus%29).

We'd also like to set up IRC alerting for the team (on #wikipedia-abstract-tech) if the Beta Cluster instances are down. Since the Beta Cluster isn't running Kubernetes, is there an alternative we can use?
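The thread does not settle on a tool for this. Independent of the tool chosen, the alerting logic itself can be sketched generically: notify only when the state changes, and only after several consecutive failed checks, so that a single flaky check does not spam the channel. The Python sketch below assumes a hypothetical `AlertState` class and a threshold of 3. How a message would reach IRC is deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertState:
    """Tracks consecutive check failures (hypothetical helper, for illustration)."""
    threshold: int = 3             # assumed value: failures needed before alerting
    consecutive_failures: int = 0
    alerting: bool = False

    def observe(self, check_ok: bool) -> Optional[str]:
        """Record one check result; return a message only on a state change."""
        if check_ok:
            self.consecutive_failures = 0
            if self.alerting:
                self.alerting = False
                return "RECOVERED"
            return None
        self.consecutive_failures += 1
        if not self.alerting and self.consecutive_failures >= self.threshold:
            self.alerting = True
            return "DOWN"
        return None
```

Each periodic check result is fed to `observe`, and whatever it returns (if anything) is what would be posted to the channel.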

In T307700#8005699, @ori wrote:

In T307700#8001625, @JMeybohm wrote:

We usually do this on the highest level possible via checks to the user facing services, see "probes:" in https://wikitech.wikimedia.org/wiki/Kubernetes/Ingress#Create_an_entry_in_the_service::catalog. The probes stanza there configures instances of the prometheus blackbox_exporter to run the actual checks (https://wikitech.wikimedia.org/wiki/Network_monitoring#Blackbox_Probes_%28Prometheus%29).

We'd also like to set up IRC alerting for the team (on #wikipedia-abstract-tech) if the Beta Cluster instances are down. Since the Beta Cluster isn't running Kubernetes, is there an alternative we can use?

The Beta Cluster is by now a completely different environment from the production infrastructure, so reusing anything from production is probably either a straight no-go or would need extensive restructuring. As serviceops, I fear we don't have much insight to offer on this.

ori closed this task as Resolved. Oct 12 2022, 2:33 PM
