Service & API product monitoring instrumentation
Open, Needs Triage, Public

Description

Background/Goal

To understand how our APIs are performing, we need to establish a set of monitoring instruments that track our Service Level Indicators (SLIs) against emergent Service Level Objectives (SLOs), as well as overall Product Health Indicators (PHIs).

Use Cases

To ensure equitable access to our services and open APIs, we need to understand how they are being used, so that real user monitoring can inform decisions about how to improve them.

Additional use case: When we migrate users from AQS to AQS 2.0 services, we need to ensure that we support current user behavior without introducing breaking changes, unnecessary latency, or toil.

User Story

As an API platform team member, I need visibility into the user experience of our services and APIs, so I can ensure continuity and improvement of services and APIs.

⭐️ This will also allow a lower barrier to entry and transparency into service usage for movement communities.

Personas

See T333852: Define user entities for API products for the API persona work in progress.

Outcome

100% of decisions related to APIs & services will be data informed with insights from real user behavior, traffic, and performance.

Objective & Key Result(s)

O: Users can discover and consume reliable APIs that are easy to use.
KR: Users can see and understand real-time usage, issues, and analytics for 6 APIs
KR: Users understand all throttling and rate limits so they can use them with confidence for their projects [decrease in 429 errors]
KR: Consumers see a continuous stream of XX improvements & fixes per month that meet new and existing use case requirements

Acceptance Criteria

We will need:

  • Realtime dashboard(s)
  • Ability to create weekly, monthly, quarterly and annual reports shared on wiki
  • Durability: collected data will be stored for X amount of time. (TBD)

Reports and dashboards will track service level objectives (SLOs) & service level indicators (SLIs):

  • API endpoint requests per minute & total
  • Error/success responses per minute & total
  • Uptime averages
  • Average & max latency
  • CPU usage
  • Memory usage

As well as product health indicators (PHIs):

  • API endpoint usage growth
  • Unique API consumers
  • Top consumers by API usage
  • API retention/churn
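
To make the service level indicators listed above concrete, here is a minimal sketch of how requests, error/success responses, and average and max latency could be aggregated per minute and per endpoint. The `RequestRecord` shape, the endpoint names, and the choice to count any status of 400 or above as an error are assumptions for illustration only; they do not describe an existing log schema or an agreed definition.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RequestRecord:
    """One API request as it might appear in an access log (hypothetical shape)."""
    timestamp: float    # Unix time in seconds
    endpoint: str       # request path; the values used here are placeholders
    status: int         # HTTP status code
    latency_ms: float   # time taken to serve the request


def summarize_per_minute(records):
    """Group records by (minute, endpoint) and compute basic SLI values."""
    buckets = defaultdict(list)
    for r in records:
        minute = int(r.timestamp // 60) * 60  # start of the minute, Unix seconds
        buckets[(minute, r.endpoint)].append(r)

    summary = {}
    for key, group in buckets.items():
        # Assumption: any 4xx or 5xx response counts as an error.
        errors = sum(1 for r in group if r.status >= 400)
        latencies = [r.latency_ms for r in group]
        summary[key] = {
            "requests": len(group),
            "errors": errors,
            "successes": len(group) - errors,
            "avg_latency_ms": sum(latencies) / len(latencies),
            "max_latency_ms": max(latencies),
        }
    return summary
```

Totals for a report period can be derived by summing the per-minute values, so the same aggregation could feed both a dashboard and the weekly, monthly, quarterly, and annual reports.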

Ideally these metrics are available in both dashboards and aggregated reports and cover the items listed in the table below:

Service Level Objectives

| Availability/Uptime | % of the time averaged over X time frame |
| Request latency/Response time (requests per second served in percentile) | HTTP calls are completed in under XX ms; round trip time (RTT); time to first byte (TTFB) |
| Request throughput (number of service requests) | when they are processed in X amount of data exchanges per X second, which is |
| Maximum system bandwidth | X% of the X total amount able to be transferred, when |
| Network latency/Traffic volume | X is the maximum amount of traffic. (Not including circumstances outside our control: geographical distance, weather, service provider capacity, and caller side misconfigurations or malfunctions) |
| Availability/Yield | (fraction of the time that a service is usable/the fraction of well-formed requests that succeed) |
| Error rates | (per error code, fraction of all requests) |
| Data rate limits | X amount of data per X call. X amount of data per X timeframe |
| Call rate limits | X number of calls per X time limit |
| Quality of service (QoS) | "equitable" to be defined collectively |
| Error budget | % of variation acceptable |
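
As a worked illustration of three rows in the table above (yield, error budget, and latency percentiles), the following sketch shows one common way to compute them. The function names are hypothetical, the nearest-rank method is only one of several percentile definitions, and any target passed in (for example 99%) is a placeholder, since the actual values are still marked X in this task.

```python
def request_yield(successful, well_formed):
    """Fraction of well-formed requests that succeeded."""
    if well_formed == 0:
        return 1.0  # assumption: a period with no traffic counts as fully available
    return successful / well_formed


def error_budget_remaining(slo_target, successful, well_formed):
    """Share of the error budget still unspent.

    slo_target is the required yield, e.g. 0.99 (placeholder value).
    The result is 1.0 when no failures occurred, 0.0 when the budget is
    exactly used up, and negative when the budget is overspent.
    """
    allowed_failures = (1.0 - slo_target) * well_formed
    actual_failures = well_formed - successful
    if allowed_failures == 0:
        return 1.0 if actual_failures == 0 else 0.0
    return 1.0 - actual_failures / allowed_failures


def latency_percentile(latencies_ms, percentile):
    """Nearest-rank percentile; percentile is an integer from 1 to 100."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_ms)
    rank = (percentile * len(ordered) + 99) // 100  # integer ceiling of p*n/100
    return ordered[rank - 1]
```

For example, with 1,000 well-formed requests of which 995 succeed, the yield is 0.995; against a placeholder target of 99%, the 5 failures use about half of the 10 failures the budget allows.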

Service level indicator reporting should include data that answers the following questions based on *actual value/range* as learned through real user monitoring:

  • Could we respond to the requests?
  • How long did it take to respond?
  • How many requests could be handled?
  • How long does it take to read or write data?
  • Can we access the data on demand?
  • Is the data still there when we need it?
  • How much data is being processed?
  • How long does it take the data to progress from ingestion to completion?
  • Was the right data retrieved?

Additionally, we will need to be able to distill *Product Health Indicators*:

  • API usage growth over time overall and by endpoint
  • Unique API consumers overall and by endpoint
  • Top consumers by overall API & endpoint usage
  • API retention/churn
  • Time to first hello world (TTFHW)
  • API calls per workflow/business transaction
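
Several of the product health indicators above can be derived from a simple list of (consumer, endpoint) pairs per reporting period. The sketch below assumes that each call can be attributed to a stable consumer identifier, which depends on the user entity work in T333852; the function names and the definition of churn as one minus retention are illustrative assumptions.

```python
from collections import Counter


def unique_consumers(calls):
    """Set of distinct consumer identifiers in a list of (consumer, endpoint) calls."""
    return {consumer for consumer, _endpoint in calls}


def top_consumers(calls, n=3):
    """The n consumers with the most calls, as (consumer, call_count) pairs."""
    return Counter(consumer for consumer, _endpoint in calls).most_common(n)


def retention_and_churn(previous, current):
    """Retention and churn between two periods, given sets of consumer identifiers.

    Retention is the share of the previous period's consumers who are still
    active in the current period; churn is the share who are not.
    """
    if not previous:
        return 0.0, 0.0  # assumption: nothing to retain or churn
    retention = len(previous & current) / len(previous)
    return retention, 1.0 - retention
```

Filtering the call list to a single endpoint before passing it in gives the "by endpoint" variants of the same indicators.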

Key Tasks & Dependencies

  • Project kickoff
  • Scope of work TBD

Reference

Phab tasks related to API & service usage data

Event Timeline

VirginiaPoundstone renamed this task from "Service & API Product monitoring instrumentation" to "Service & API product monitoring instrumentation". May 1 2023, 7:05 PM