Background/Goal
To understand how our APIs are performing, we need to establish a set of monitoring instruments that track our Service Level Indicators (SLIs) against emerging Service Level Objectives (SLOs), as well as overall Product Health Indicators (PHIs).
Use Cases
To ensure equitable access to our services and open APIs, we need to understand how they are being used, so that real user monitoring can inform decisions about how to improve them.
Additional use case: when we migrate users from AQS to AQS 2.0 services, we need to ensure that we support current usage patterns without introducing breaking changes, unnecessary latency, or toil.
User Story
As an API platform team member, I need visibility into the user experience of our services and APIs, so I can ensure continuity and improvement of services and APIs.
Personas
See T333852: Define user entities for API products for the API persona work in progress.
Outcome
100% of decisions related to APIs and services will be data-informed, drawing on insights from real user behavior, traffic, and performance.
Objective & Key Result(s)
O: Users can discover and consume reliable APIs that are easy to use.
KR: Users can see and understand real-time usage, issues, and analytics for 6 APIs
KR: Users understand all throttling and rate limits so they can use the APIs with confidence for their projects [measured by a decrease in 429 errors]
KR: Consumers see a continuous stream of XX improvements and fixes per month that meet new and existing use case requirements
Acceptance Criteria
We will need:
- Real-time dashboard(s)
- Ability to create weekly, monthly, quarterly, and annual reports that are shared on wiki
- Durability: collected data will be stored for X amount of time (TBD)
Reports and dashboards will track service level objectives (SLOs) & service level indicators (SLIs):
- API endpoint requests per minute & total
- Error/success responses per minute & total
- Uptime averages
- Average & max latency
- CPU usage
- Memory usage
As well as product health indicators (PHIs):
- API endpoint usage growth
- Unique API consumers
- Top consumers by API usage
- API retention/churn
Ideally these metrics are available in both dashboards and aggregated reports and cover the items listed in the table below:
Service Level Objectives | Target definition |
---|---|
Availability/Uptime | % of the time averaged over X time frame |
Request latency/Response time (time to serve a request, by percentile) | HTTP calls are completed in under XX ms |
Round-trip time (RTT) | |
Time to first byte (TTFB) | |
Request throughput (number of service requests) | when they are processed in X amount of data exchanges per X second, which is |
Maximum system bandwidth | X% of the X total amount able to be transferred, when |
Network latency/Traffic volume | X is the maximum amount of traffic. (Not including circumstances outside our control: geographical distance, weather, service provider capacity, and caller side misconfigurations or malfunctions) |
Availability/Yield | (fraction of the time that a service is usable/the fraction of well-formed requests that succeed) |
Error rates | (per error code, fraction of all requests) |
Data rate limits | X amount of data per X call. X amount of data per X timeframe |
Call rate limits | X number of calls per X time limit |
Quality of service (QoS) | "equitable" to be defined collectively |
Error budget | % of variation acceptable |
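Several rows of the table reduce to simple ratios once request outcomes are counted. The following is a minimal Python sketch rather than a prescribed implementation. It assumes that 4xx responses are excluded from yield as not well-formed, and that the error budget is `1 - SLO target` (the common SRE convention, which may be refined once the "% of variation acceptable" is agreed). The function names, the 99% target, and the sample status codes are all hypothetical.

```python
from collections import Counter


def yield_fraction(status_codes):
    """Fraction of well-formed requests that succeeded.

    Assumption: 4xx responses are treated as caller-side problems and
    excluded; 5xx responses count as failures.
    """
    well_formed = [s for s in status_codes if not 400 <= s < 500]
    if not well_formed:
        return 1.0
    succeeded = sum(1 for s in well_formed if s < 500)
    return succeeded / len(well_formed)


def error_rates(status_codes):
    """Per error code, the fraction of all requests that returned it."""
    total = len(status_codes)
    counts = Counter(s for s in status_codes if s >= 400)
    return {code: n / total for code, n in sorted(counts.items())}


def error_budget_remaining(slo_target, observed_yield):
    """Share of the error budget still unspent (negative if overspent)."""
    budget = 1.0 - slo_target
    spent = 1.0 - observed_yield
    return (budget - spent) / budget


# Hypothetical sample: 1000 requests in the reporting window.
sample = [200] * 994 + [503] * 4 + [429] * 2
observed = yield_fraction(sample)

print("yield:", round(observed, 4))
print("error rates:", error_rates(sample))
print("budget remaining:", round(error_budget_remaining(0.99, observed), 4))
```

A negative value from `error_budget_remaining` would indicate that the window has already exceeded the acceptable failure rate for the chosen target.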
Service level indicator reporting should include data that answers the following questions based on *actual value/range* as learned through real user monitoring:
- Could we respond to the requests?
- How long did it take to respond?
- How many requests could be handled?
- How long does it take to read or write data?
- Can we access the data on demand?
- Is the data still there when we need it?
- How much data is being processed?
- How long does it take the data to progress from ingestion to completion?
- Was the right data retrieved?
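The first three questions (could we respond, how long did it take, how many requests were handled) can be answered from per-request records grouped by minute. The sketch below is illustrative only: the `Request` record, its field names, and the nearest-rank percentile method are assumptions, not an existing schema or an agreed measurement method.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical per-request record; field names are assumptions."""
    ts: float          # Unix timestamp in seconds
    status: int        # HTTP status code
    latency_ms: float  # time taken to respond


def percentile(values, pct):
    """Nearest-rank percentile for pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize_by_minute(requests):
    """Group requests into one-minute buckets and compute basic SLIs."""
    buckets = defaultdict(list)
    for r in requests:
        buckets[int(r.ts // 60)].append(r)

    summary = {}
    for minute, reqs in sorted(buckets.items()):
        latencies = [r.latency_ms for r in reqs]
        errors = sum(1 for r in reqs if r.status >= 400)
        summary[minute] = {
            "requests": len(reqs),
            "errors": errors,
            "successes": len(reqs) - errors,
            "avg_latency_ms": sum(latencies) / len(latencies),
            "max_latency_ms": max(latencies),
            "p95_latency_ms": percentile(latencies, 95),
        }
    return summary


# Hypothetical sample: two requests in minute 0, one in minute 1.
log = [Request(0, 200, 10.0), Request(30, 500, 30.0), Request(61, 200, 20.0)]
for minute, stats in summarize_by_minute(log).items():
    print(minute, stats)
```

Reporting a percentile alongside the average and maximum keeps a single slow request from dominating the picture while still exposing tail latency.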
Additionally, we will need to be able to distill *Product Health Indicators*:
- API usage growth over time overall and by endpoint
- Unique API consumers overall and by endpoint
- Top consumers by overall API & endpoint usage
- API retention/churn
- Time to first hello world (TTFHW)
- API calls per workflow/business transaction
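Several of these indicators can be derived from a list of (consumer, endpoint) call records per reporting period. The sketch below is a rough illustration under stated assumptions: how a "consumer" is identified (for example by API key or user agent) is still to be defined, and the consumer IDs, endpoint paths, and function names are hypothetical.

```python
from collections import Counter, defaultdict


def unique_consumers(calls):
    """Number of distinct consumers across all endpoints."""
    return len({consumer for consumer, _ in calls})


def unique_consumers_by_endpoint(calls):
    """Number of distinct consumers per endpoint."""
    seen = defaultdict(set)
    for consumer, endpoint in calls:
        seen[endpoint].add(consumer)
    return {endpoint: len(ids) for endpoint, ids in seen.items()}


def top_consumers(calls, n=3):
    """The n consumers with the most calls, as (consumer, count) pairs."""
    return Counter(consumer for consumer, _ in calls).most_common(n)


def usage_growth(previous_calls, current_calls):
    """Relative change in call volume between two periods."""
    if not previous_calls:
        return None  # growth is undefined without a baseline
    return (len(current_calls) - len(previous_calls)) / len(previous_calls)


def retention_and_churn(previous_calls, current_calls):
    """Share of the previous period's consumers who returned, and who did not."""
    previous = {consumer for consumer, _ in previous_calls}
    current = {consumer for consumer, _ in current_calls}
    if not previous:
        return None
    retention = len(previous & current) / len(previous)
    return retention, 1.0 - retention


# Hypothetical call records for two consecutive reporting periods.
march = [("a", "/pageviews"), ("a", "/pageviews"), ("b", "/edits"),
         ("c", "/pageviews"), ("d", "/edits")]
april = [("a", "/pageviews"), ("a", "/edits"), ("b", "/edits"),
         ("e", "/pageviews"), ("e", "/pageviews"), ("e", "/edits")]

print("unique consumers:", unique_consumers(april))
print("by endpoint:", unique_consumers_by_endpoint(april))
print("top consumer:", top_consumers(april, n=1))
print("growth:", usage_growth(march, april))
print("retention, churn:", retention_and_churn(march, april))
```

Time to first hello world and API calls per workflow would need additional signals (account creation time, a workflow or session identifier) that this record shape does not carry.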
Key Tasks & Dependencies
- Project kickoff
- Scope of work TBD
Reference
- https://wikitech.wikimedia.org/wiki/SLO/template_instructions
- https://wikitech.wikimedia.org/wiki/SLO/template
- SLO Workshop - Maps