
Add alerting to Uptime Monitoring of public wikibase.cloud services
Closed, Resolved | Public | 3 Estimated Story Points

Description

If we wish to be able to act to ensure that our service is up, it is valuable both to monitor what fraction of the time it is down and to alert us when that happens.

These should alert to the existing monitoring email.

Some wbstack.com alerts that were set up in Terraform can be found at https://github.com/wbstack/deploy/blob/main/tf/monitoring_alert_policy.tf

Useful links:

A/C:

  • Monitoring exists to check the uptime of all our public facing services
  • Alerting exists to notify developers in real time at wb-cloud-monitoring@wikimedia.de if one of these public facing services has been unavailable for 10 mins.
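
As a rough illustration of the "unavailable for 10 mins" criterion, the rule can be sketched in Python. This is only a sketch of the logic, not how the alert policy is actually configured; the function name `should_alert`, the tuple layout of `results`, and the constant `UNAVAILABLE_THRESHOLD` are assumptions made for the example.

```python
from datetime import datetime, timedelta

# Assumed threshold, taken from the acceptance criteria (10 minutes).
UNAVAILABLE_THRESHOLD = timedelta(minutes=10)


def should_alert(results, now, threshold=UNAVAILABLE_THRESHOLD):
    """Return True if the service has been failing continuously for at
    least `threshold`, measured from the first failure after the most
    recent successful check.

    `results` is a list of (timestamp, passed) tuples (hypothetical shape).
    """
    # Timestamp of the most recent successful check, if there was one.
    last_success = max((t for t, passed in results if passed), default=None)

    # Failures that belong to the current outage, i.e. after the last success.
    current_failures = [
        t for t, passed in results
        if not passed and (last_success is None or t > last_success)
    ]
    if not current_failures:
        return False  # the latest known state is "up"

    outage_start = min(current_failures)
    return now - outage_start >= threshold
```

A single failed check therefore does not notify anyone; only an outage that has lasted the full threshold does.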

Event Timeline

I added basic public facing uptime checks for most services in https://github.com/wmde/wbaas-deploy/pull/136
The API is an exception here, and we probably need to add a route that can be used as a health check.

Also we still want to add alerting for checks etc, so leaving this in ready to pick up!

Addshore renamed this task from Add Monitoring to Wikibase.dev to Add Uptime Monitoring to public Wikibase.dev services. Jan 7 2022, 7:23 PM
Addshore updated the task description.

This first PR, https://github.com/wmde/wbaas-deploy/pull/136, has been deployed now as a starting point (as it was approved and in the repo review queue)

These initial checks can be found under uptime https://console.cloud.google.com/monitoring/uptime?referrer=search&project=wikibase-cloud

The checks are still rolling out, but they each have at least 1 success now


Tarrow renamed this task from Add Uptime Monitoring to public Wikibase.dev services to Add Uptime Monitoring and Alerting to public wikibase.cloud services. Feb 23 2022, 8:51 PM
Tarrow updated the task description.
Tarrow moved this task from Backlog (incoming) to Ready to Pick Up on the Wikibase Cloud board.
Evelien_WMDE renamed this task from Add Uptime Monitoring and Alerting to public wikibase.cloud services to Add alerting to Uptime Monitoring of public wikibase.cloud services. Jul 18 2022, 1:55 PM
Evelien_WMDE updated the task description.
Evelien_WMDE set the point value for this task to 5.
Evelien_WMDE subscribed.

Note: We have not yet decided whether we want monitoring on a failure of a single geographic check, a combination of them, or all of them.
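
To make the three options concrete, here is a minimal Python sketch of how each would treat the same set of per-region results. It only illustrates the decision and is not a Google Cloud Monitoring configuration; the function name, the policy names `any`, `quorum` and `all`, the `quorum` parameter, and the region names in the usage example are all hypothetical.

```python
def regions_trigger_alert(region_passed, policy="any", quorum=2):
    """Decide whether per-region check results should count as a failure.

    `region_passed` maps a region name to True (check passed) or False.
    `policy` is one of:
      "any"    - a single failing region is enough
      "quorum" - at least `quorum` regions must fail (a combination)
      "all"    - every region must fail
    """
    failed = sum(1 for passed in region_passed.values() if not passed)

    if policy == "any":
        return failed >= 1
    if policy == "quorum":
        return failed >= quorum
    if policy == "all":
        # With no regions configured there is nothing to alert on.
        return len(region_passed) > 0 and failed == len(region_passed)
    raise ValueError(f"unknown policy: {policy!r}")


# Example: one of three (hypothetical) regions is failing.
results = {"europe": False, "usa": True, "asia": True}
for name in ("any", "quorum", "all"):
    print(name, regions_trigger_alert(results, policy=name))
```

The trade-off is sensitivity against noise: `any` reacts to a problem seen from one location, including problems local to that checker, while `all` only fires when the service is unreachable from everywhere.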

Rosalie_WMDE subscribed.

Putting this in review to have another pair of eyes on it before we try it on staging.

toan subscribed.
Evelien_WMDE claimed this task.