The Growth team's features (especially Structured tasks) depend on several services and pipelines operating properly. If any of them breaks, the user experience for newcomers worsens significantly. Because Growth's features are intended for newcomers, user reports are fairly rare: only [some] experienced users know how to report bugs in Phabricator, but they neither depend on nor use Growth features and so do not notice problems, while newcomers usually do not manage to file a Phabricator report. This makes automated monitoring especially important.
This task tracks the initial implementation of Growth-related alerting, which can live in Alertmanager. The preferred way to ping Growth-Team engineers is the #growth-engine-room Slack channel (Alertmanager can do this by sending an email to the channel's intake address).
As part of this task, we should implement alerting for the following error cases:
- linkrecommendation service is unavailable or has more errors than allowed (T341710)
- number of recommendations in the task pool drops significantly
- ...
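
To make the first two cases concrete, here is a minimal Python sketch of the threshold logic such alerts might encode. It is only an illustration, not the actual Alertmanager rules: the function names, the 5% error ratio, and the 50% drop ratio are hypothetical placeholders, and the real thresholds would need to be agreed on in the relevant subtasks.

```python
from dataclasses import dataclass


@dataclass
class AlertDecision:
    firing: bool
    reason: str


def error_rate_exceeded(errors: int, requests: int,
                        max_error_ratio: float = 0.05) -> AlertDecision:
    """Decide whether a service has more errors than allowed.

    max_error_ratio is a hypothetical threshold, not an agreed value.
    """
    if requests <= 0:
        # Assumption: with no traffic there is no ratio to judge;
        # a separate availability probe would cover a dead service.
        return AlertDecision(False, "no requests observed")
    ratio = errors / requests
    if ratio > max_error_ratio:
        return AlertDecision(True, f"error ratio {ratio:.2%} above limit")
    return AlertDecision(False, f"error ratio {ratio:.2%} within limit")


def task_pool_dropped(current: int, baseline: int,
                      max_drop_ratio: float = 0.5) -> AlertDecision:
    """Decide whether the task pool shrank significantly against a baseline.

    max_drop_ratio is a hypothetical definition of "significantly".
    """
    if baseline <= 0:
        return AlertDecision(False, "no baseline available")
    drop = (baseline - current) / baseline
    if drop >= max_drop_ratio:
        return AlertDecision(True, f"task pool dropped by {drop:.0%}")
    return AlertDecision(False, f"task pool change {drop:.0%} within limit")


if __name__ == "__main__":
    print(error_rate_exceeded(errors=10, requests=100))
    print(task_pool_dropped(current=400, baseline=1000))
```

In practice the same comparisons would most likely be written as alerting rule expressions over existing metrics, with Alertmanager handling only the routing to the Slack intake address.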