Implement alerting for Growth-consumed or Growth-managed services/pipelines
Open, Needs Triage (Public)

Description

The Growth team's features (especially Structured tasks) depend on several services and pipelines that need to operate properly. If any of them breaks, the user experience for newcomers worsens significantly. Because Growth's features are intended for newcomers, user reports are fairly rare: only [some] experienced users know how to report bugs in Phabricator, but experienced users do not depend on or use Growth features and thus do not notice problems, while newcomers usually do not manage to file a Phabricator report. This means automated monitoring is increasingly important.

This task tracks the initial implementation of Growth-related alerting, which can live in Alertmanager. The preferred method of pinging Growth-Team engineers is the #growth-engine-room Slack channel (this can be done by having Alertmanager send an email to the Slack channel's intake address).

As part of this task, we should implement alerting for the following error cases:

  • the linkrecommendation service is unavailable or has more errors than allowed (T341710)
  • the number of recommendations in the task pool drops significantly
  • ...
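
To make the first two error cases more concrete, here is a minimal Python sketch of the checks such alerts might encode. It is an illustration only, not the actual Alertmanager or Prometheus configuration. The function names, alert names, and thresholds (a 5% error ratio, a 50% drop relative to a baseline) are hypothetical, since this task does not define what "more errors than allowed" or "drops significantly" means.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:
    name: str
    summary: str


def error_ratio_alert(errors: int, requests: int,
                      max_ratio: float = 0.05) -> Optional[Alert]:
    """Return an alert when the error ratio exceeds max_ratio.

    max_ratio is a hypothetical threshold, not one taken from the task.
    """
    if requests <= 0:
        # No data in this window. This sketch stays silent; a real setup
        # would probably want a separate "no data" or "unavailable" alert.
        return None
    ratio = errors / requests
    if ratio > max_ratio:
        return Alert(
            name="LinkRecommendationErrors",  # hypothetical alert name
            summary=f"error ratio {ratio:.1%} exceeds {max_ratio:.1%}",
        )
    return None


def task_pool_drop_alert(current: int, baseline: int,
                         max_drop: float = 0.5) -> Optional[Alert]:
    """Return an alert when the task pool shrank by more than max_drop.

    baseline is assumed to be an earlier pool size (for example, the
    value from a day ago); max_drop is a hypothetical threshold.
    """
    if baseline <= 0:
        # Without a meaningful baseline a relative drop is undefined.
        return None
    drop = (baseline - current) / baseline
    if drop > max_drop:
        return Alert(
            name="TaskPoolDrop",  # hypothetical alert name
            summary=f"task pool dropped {drop:.0%} ({baseline} -> {current})",
        )
    return None
```

For example, under these assumed thresholds `task_pool_drop_alert(400, 1000)` returns an alert, while `task_pool_drop_alert(900, 1000)` returns `None`. In practice the same conditions would more likely be written as alerting rules evaluated by the monitoring stack and routed through Alertmanager.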

Event Timeline

Change 953347 had a related patch set uploaded (by Urbanecm; author: Urbanecm):

[operations/puppet@production] alertmanager: route Growth team alerts

https://gerrit.wikimedia.org/r/953347

Change 953347 merged by Filippo Giunchedi:

[operations/puppet@production] alertmanager: route Growth team alerts

https://gerrit.wikimedia.org/r/953347