To ensure timely and responsible handling of API issues, we will define clear ownership, establish SLAs, and implement a scalable, configurable alarm system. This will support both proactive detection and reactive incident response.
Subtasks:
- Update the API maintainers list for REST APIs
  - Document ownership for:
    - MediaWiki REST
    - Wikimedia REST
    - Extension APIs
    - API Portal
    - Other (e.g., AQS)
    - [stretch] Action API modules
  - Output: A centralized, maintained ownership list
- Research robust monitoring approaches
  - Interview teams with mature monitoring setups
  - Capture patterns, tooling, and lessons learned
- Recommend a notification strategy
  - Evaluate Slack, email, dashboards, and integrations
  - Propose alert routing strategies to relevant teams
- Implement PoC alarms for MediaWiki REST APIs
  - Trigger alarms based on latency, 5xx rates, etc.
  - Ensure they are configurable and actionable
- Define MWI SLA for API issue resolution
  - Define triage, escalation, and resolution timelines in collaboration with the MWI team
Acceptance Criteria:
- Ownership list is published and linked in internal documentation
- Notification channels are documented and aligned with MWI team preferences
- PoC alarms are implemented and validated for MediaWiki REST
- SLA is defined and reviewed by the MWI team
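
Illustrative sketch for the PoC alarm subtask: one way a configurable, actionable alarm on latency and 5xx rate could be expressed. This is only a starting point for discussion, not a proposed implementation. The names `AlarmConfig`, `WindowStats`, and `evaluate`, the threshold values, and the owner label are all hypothetical; real thresholds would come from the SLA work, and the metrics from whichever monitoring stack the research subtask recommends.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlarmConfig:
    """Per-API thresholds (hypothetical; values would be set per team)."""
    api: str
    owner: str                 # team to notify, taken from the ownership list
    max_p95_latency_ms: float  # alarm when p95 latency exceeds this
    max_5xx_rate: float        # alarm when the 5xx fraction (0.0 to 1.0) exceeds this


@dataclass(frozen=True)
class WindowStats:
    """Aggregated metrics for one evaluation window (assumed input shape)."""
    total_requests: int
    server_errors: int         # responses with a 5xx status
    p95_latency_ms: float


def evaluate(config: AlarmConfig, stats: WindowStats) -> list[str]:
    """Return one alert message per breached threshold, addressed to the owner."""
    alerts: list[str] = []
    if stats.total_requests == 0:
        # No traffic in the window: nothing to measure, so raise no alarm here.
        return alerts

    error_rate = stats.server_errors / stats.total_requests
    if error_rate > config.max_5xx_rate:
        alerts.append(
            f"[{config.owner}] {config.api}: 5xx rate {error_rate:.1%} "
            f"exceeds threshold {config.max_5xx_rate:.1%}"
        )
    if stats.p95_latency_ms > config.max_p95_latency_ms:
        alerts.append(
            f"[{config.owner}] {config.api}: p95 latency "
            f"{stats.p95_latency_ms:.0f} ms exceeds threshold "
            f"{config.max_p95_latency_ms:.0f} ms"
        )
    return alerts


# Example with made-up numbers: 50 errors in 1000 requests breaches a 1% limit.
config = AlarmConfig(
    api="MediaWiki REST",
    owner="example-team",
    max_p95_latency_ms=500.0,
    max_5xx_rate=0.01,
)
stats = WindowStats(total_requests=1000, server_errors=50, p95_latency_ms=300.0)
for message in evaluate(config, stats):
    print(message)
```

Keeping thresholds and the owner in a per-API config object is what makes the alarm configurable, and including the owner and the breached value in each message is one way to keep it actionable. Delivery (Slack, email, dashboards) is deliberately left out, since that depends on the notification strategy subtask.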