| hypothesis |
|---|
| If we implement an initial set of automated supervision strategies and enable automated deployment to the Pretrain environment for the routing test, we will develop confidence that automated supervision is ready to serve as our primary risk mitigation for the full Pretrain MVP (e.g., false positives are understood, and we have identified a path to mitigate them). |
We will be working to add new system health metrics signals feeding into scap during a deployment action and configurable thresholds for those signals to trigger a halt and rollback response when the thresholds are broken. At some level what we are hoping to do is establish the tools needed to apply service level indicator (SLI) monitoring to service-level objective (SLO) agreements about runtime error tolerance during a deployment. Ideally the SLIs will measure output during a period where critical user journey activities for MediaWiki are exercised. Once proven to be useful during active deployments, future work would be likely to extend this software monitoring and response loop to be a continuous activity rather than a discrete action during deployments.
See also: