
Set up health checks for function-* services on Beta Cluster
Closed, ResolvedPublic

Description

Set up periodic health checks to make sure the function-orchestrator and function-evaluator services running on Beta Cluster are up and functioning on basic requests. We should be able to ping https://wikifunctions.beta.wmflabs.org/w/api.php with a simple JSON payload and then verify that the returned JSON contains the correct result (e.g. requesting 8+5 gets 13 back).
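As a rough sketch only, such a check could be expressed as a Prometheus blackbox_exporter HTTP module along these lines; the module name and JSON payload below are placeholders, since the exact WikiLambda API request and response format still need to be pinned down:

modules:
  wikifunctions_health:          # placeholder module name
    prober: http
    timeout: 10s
    http:
      method: POST
      headers:
        Content-Type: application/json
      # Placeholder payload: stands in for whatever API request evaluates 8+5
      body: '{"placeholder": "function call for 8+5"}'
      # Fail the probe if the expected result is missing from the response body
      fail_if_body_not_matches_regexp:
        - '13'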

Monitoring for their prod instances is out of scope for this ticket, as it requires completely different infrastructure.

Event Timeline

Hi @maryyang, I'm part of SRE Observability and we're responsible for the production bits of monitoring. I'd like to understand better what the monitoring would look like in production; while it is explicitly out of scope for this task I think the production monitoring can help inform what to do in Beta (and ideally they'd work the same, even though deployed on two different infrastructures!)

For more context: I'm also asking because presumably the services will be part of service::catalog in puppet (this file: https://github.com/wikimedia/puppet/blob/production/hieradata/common/service.yaml). For such services, monitoring is achieved via the probes section, e.g.

probes:
  - type: http  # there's no HTTPS, TLS usage is governed by 'encryption: true'
    path: /healthz

While generally we're more oriented towards checking HTTP status codes to determine health, there's some support for POSTing JSON and checking the response's body.
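Off the top of my head that would look something along these lines; the field names for the POST/body bits here are illustrative guesses rather than the actual schema:

probes:
  - type: http
    path: /w/api.php
    method: POST               # illustrative field name only
    body: '{"placeholder": "simple function call"}'   # illustrative field name only
    expect_body_regex: '13'    # illustrative field name only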

Said probes section at the moment works in production only, though I don't know the Beta status.

Hope that helps!

Change 810146 had a related patch set uploaded (by Mary Yang; author: Mary Yang):

[operations/puppet@production] DO-NOT-SUBMIT(Under local test, not yet ready for review) Add puppet profile and role files for wikifunctions.

https://gerrit.wikimedia.org/r/810146

Hi @fgiunchedi, thank you for the helpful input! The goal for health monitoring in prod should be similar to that in Beta: we want to make sure the services are up and returning correct responses for basic requests. Since the prod instance will be on Kubernetes, the infrastructure we need to achieve the same goal will be different (or so we were told). This is why we are treating Beta monitoring as a separate effort, since we likely cannot reuse the same setup. I hope that answers your questions!

To give some context, we are interested in monitoring the Beta services' health for two main reasons:

  1. the Beta endpoint is technically public. We expect a small group of users to try it out in the near future.
  2. Our current E2E test suite targets the instances on Beta Cluster, and having these health checks would help us complete the loop with respect to testing.

Currently for Beta Cluster, we are looking into leveraging the prometheus blackbox monitoring module with puppet. There are some unknowns, since our service on Beta Cluster has never been puppetized, but I am trying it out with @Dzahn's help.

Thank you @maryyang for the context, yes that is helpful to know and does answer my questions. I don't know if the prometheus blackbox monitoring will work out of the box in Beta (e.g. whether there's a Prometheus and an Alertmanager in Beta), but it is a good start for sure.

There's a Prometheus instance on beta, but it's running an outdated debian/prometheus version and isn't hooked to any alerting system.

@taavi when you say "beta prometheus isn't hooked to any alerting system", does it mean it's impossible to set up alerting on a beta cluster host, or it just has not been done before?

@Dzahn on the change we are working on (https://gerrit.wikimedia.org/r/c/operations/puppet/+/810146), I don't know if we'll need to specify a "host" for the prometheus blackbox monitoring to run on. If so, the Abstract Wikipedia team may need to provision a new host for this purpose, probably on Beta Cluster. If there is no alerting outlet on a Beta Cluster host, this could be a blocker.

I will add @taavi and @ori as reviewers on the change as well, so everyone can see what we are trying to do. Thanks everyone!

Change 811790 had a related patch set uploaded (by Mary Yang; author: Mary Yang):

[operations/puppet@production] Add alert manager alert receivers for the Abstract Wikipedia team.

https://gerrit.wikimedia.org/r/811790

@taavi when you say "beta prometheus isn't hooked to any alerting system", does it mean it's impossible to set up alerting on a beta cluster host, or it just has not been done before?

You would need to create an alertmanager host on beta with some sensible configuration and configure the prometheus instance to send alerts there. It's doable but requires some work.
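In rough terms (the hostname below is a made-up placeholder), the beta Prometheus instance would then need an alerting stanza in its prometheus.yml pointing at that new host, plus some alert rules for it to evaluate:

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'alertmanager.mybetaproject.eqiad1.wikimedia.cloud:9093'   # placeholder host
rule_files:
  - '/srv/alerts/*.yml'   # placeholder path for the alert rules it would evaluate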

Hi all, I have been trying to help Mary with this.

We have talked about puppetization and what it means compared to a service being on kubernetes and the different approaches to monitoring there.

Then I recommended creating a placeholder puppet role/profile which can be applied in cloud and, for starters, does nothing but that monitoring check.

Further, I recommended the new prometheus::blackbox monitoring instead of going the traditional route of creating an Icinga check.

But it turns out that for this we would have to apply that _somewhere_ in production OR do all of the things you are describing (own alertmanager in Cloud VPS, etc.), which I think is asking a bit much of a user who just wants a single check.

In Icinga I would have solved this by adding a virtual host, which is just the public endpoint that is supposed to be checked, and then applying the check on that, regardless of actual instances / hostnames. In puppet code I would have put it in a generic place like the icinga module itself, rather than in a service-specific role.

Since we are in the middle of moving from Icinga to alertmanager, we can still do it the old-school way as well. It shouldn't be hard and would give us a quick win, namely "that service in beta is monitored by something" (modulo having to set up notifications for the right people).

I have already talked with Mary a bit about the longer-term plans, and I understand that this service is going to move to production; it's just that until that is actually the case, monitoring of the status quo is desired as well. This makes me think we can maybe separate this into a "quick fix" and a "longer-term fix".

Can we also create a generic virtual host in alertmanager? Is it OK if we just do it _for now_ with Icinga? Then it either gets migrated to alertmanager later, or it all becomes moot because once this moves into production it will just use the production monitoring like everything else.

Change 811790 merged by Filippo Giunchedi:

[operations/puppet@production] Add alert manager alert receivers for the Abstract Wikipedia team.

https://gerrit.wikimedia.org/r/811790

Thank you @Dzahn for the extended explanation and context, makes sense to me!

Agreed, if the service is reachable from production on a public DNS name then yes, let's go with blackbox::http::check on that hostname. The check can live on the alert hosts for now, and feel free to send the review my way. When we move to production, the service will be in service::catalog and we can reuse probes like everything else.

From the earlier comments (T311457#8053967), it seems like the intention is to keep monitoring the Beta Cluster services even after the service is running in production? I'm not personally a fan of using production infrastructure to monitor services in WMCS, excluding the cloud infra itself :/

The check can live on the alert hosts for now, and feel free to send the review my way.

Thank you @fgiunchedi! In that case we don't need to create a placeholder role/profile for the service. Where should we put it? I would say it then belongs in the class profile::alertmanager (custom config should not be inside a module). It's just that this would be the first to be added there. Does that seem like what you had in mind?

Yes, a separate class included by profile::alertmanager should work, I think. From said class you can use prometheus::blackbox::check::http to actually perform the check.

Change 810146 had a related patch set uploaded (by Mary Yang; author: Mary Yang):

[operations/puppet@production] DO-NOT-SUBMIT(Under review and discussion): Add puppet profile and role files for wikifunctions.

https://gerrit.wikimedia.org/r/810146

Change 820788 had a related patch set uploaded (by Mary Yang; author: Mary Yang):

[mediawiki/extensions/WikiLambda@master] Add ApiHealthCheck in WikiLambda APIs.

https://gerrit.wikimedia.org/r/820788

Change 820788 merged by jenkins-bot:

[mediawiki/extensions/WikiLambda@master] Add ApiHealthCheck in WikiLambda APIs.

https://gerrit.wikimedia.org/r/820788

Change 820814 had a related patch set uploaded (by Mary Yang; author: Mary Yang):

[mediawiki/extensions/WikiLambda@master] Correct capitalization on summary for the WikiLambda health check.

https://gerrit.wikimedia.org/r/820814

Change 820814 merged by jenkins-bot:

[mediawiki/extensions/WikiLambda@master] Correct capitalization on summary for the WikiLambda health check.

https://gerrit.wikimedia.org/r/820814

Change 810146 merged by Ori:

[operations/puppet@production] Add puppet profile and role files for WikiFunctions.

https://gerrit.wikimedia.org/r/810146

Change 821256 had a related patch set uploaded (by Ori; author: Ori):

[operations/puppet@production] alertmanager: route abstract-wikipedia-critical alert e-mails to Slack

https://gerrit.wikimedia.org/r/821256

Change 821256 merged by Ori:

[operations/puppet@production] alertmanager: route abstract-wikipedia-critical alert e-mails to Slack

https://gerrit.wikimedia.org/r/821256

Change 821291 had a related patch set uploaded (by Ori; author: Ori):

[operations/puppet@production] alertmanager: route abstract-wikipedia-warning alert e-mails to Slack

https://gerrit.wikimedia.org/r/821291

Change 821291 merged by Ori:

[operations/puppet@production] alertmanager: fix abstract-wikipedia IRC channel name; route warnings to Slack

https://gerrit.wikimedia.org/r/821291

Change 821292 had a related patch set uploaded (by Mary Yang; author: Mary Yang):

[mediawiki/extensions/WikiLambda@master] Have WikiLambda health check catch all exceptions

https://gerrit.wikimedia.org/r/821292

Change 821294 had a related patch set uploaded (by Ori; author: Ori):

[operations/puppet@production] abstract-wikipedia alert: increase timeout; correct team name

https://gerrit.wikimedia.org/r/821294

Change 821294 merged by Ori:

[operations/puppet@production] abstract-wikipedia alert: increase timeout; correct team name

https://gerrit.wikimedia.org/r/821294

Update: we are running into some issues with the prometheus blackbox checks where the probes are timing out (10-second timeout). The request should take ~3 seconds to resolve.

Looking at the error message in the logs, it says the request is not over SSL when it should be. This is also confirmed when I try to access the address via curl/browser. Perhaps it indicates a bigger issue, where the IP address -> URL reverse proxy is not set up (I vaguely recall this is the case?). This does not explain the timeout, though; when I make the request, it's denied immediately.

If this is the case, we have two options: 1. set up the routing from the IP to WikiLambda, or 2. configure the blackbox checks to use the URL instead of the IP.

What do people think?

@fgiunchedi I see some existing support for monitoring URLs that are hosted off-prod -- e.g. store.wikimedia.org:

https://github.com/wikimedia/puppet/blob/production/hieradata/common/profile/prometheus/ops.yaml#L25-L29

Is this something we can use?

Not out of the box, in the sense that "pingthing" (the name for the system we use that replaced watchmouse) is really tailored for SRE (for example, there's no support for matching the response body or changing the team for alerts), whereas the current prometheus::blackbox::check::http is meant/implemented to cater to multiple teams and various degrees of configuration/customisation. HTH!

@fgiunchedi thanks for the clarification! What would be the recommended course of action, e.g. perhaps proxying the probes some other way?

You're welcome! Essentially yes, adding proxy support to prometheus::blackbox::check::http is, I think, the next easiest thing to do. In practical terms this is another parameter for the blackbox-exporter module configuration (proxy_url, in $http_module_params within modules/prometheus/manifests/blackbox/check/http.pp). Hope that helps!
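For reference, proxy_url is a standard field of the blackbox_exporter HTTP prober, so the rendered module config would end up looking roughly like this (the module name and proxy address are placeholders):

modules:
  http_wikifunctions_beta:       # placeholder module name
    prober: http
    timeout: 10s
    http:
      method: POST
      # New field rendered from the proposed puppet parameter; placeholder proxy address
      proxy_url: 'http://proxy.example.wmnet:8080'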

Change 822179 had a related patch set uploaded (by Mary Yang; author: Mary Yang):

[operations/puppet@production] Add proxy_url to prometheus::blackbox::check:http as a parameter.

https://gerrit.wikimedia.org/r/822179

Change 822181 had a related patch set uploaded (by Mary Yang; author: Mary Yang):

[operations/puppet@production] Use proxy for wikifunctions beta blackbox probe.

https://gerrit.wikimedia.org/r/822181

Change 822179 merged by Filippo Giunchedi:

[operations/puppet@production] Add proxy_url to prometheus::blackbox::check:http as a parameter.

https://gerrit.wikimedia.org/r/822179

Change 822181 merged by Filippo Giunchedi:

[operations/puppet@production] Use proxy for wikifunctions beta blackbox probe.

https://gerrit.wikimedia.org/r/822181

17:19 <+jinxer-wm> (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - 
                   https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443  - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
21:09 <+jinxer-wm> (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - 
                   https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443  - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
21:14 <+jinxer-wm> (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - 
                   https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443  - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired

the good: shows monitoring works and catches expiring certs

the bad: the cert is about to expire (maybe it needs to use certbot, or set up a local acme_chief and automation for renewal)

maybe it needs to use certbot, or set up a local acme_chief and automation for renewal

There's T293585 for that, which has been open a while.

Oh, I forgot this is inside "beta" and not its own project. Well, then, as you say, the ticket is open and High prio but has not had any responses.