Description
Because we are working with Enterprise, we have higher expectations for performance and response rates, so we need the ability to quickly identify and resolve bottlenecks within the overall calculation flow and stay within Enterprise's defined SLOs. Enterprise currently commits to delivering a response to its customers within 500ms. We must stay comfortably below that limit so that Attribution API response values can be pulled and included in the Enterprise structured responses. This monitoring will also help quantify general risk to Wikimedia projects, as we will know where bottlenecks may affect production performance.
Conditions of acceptance
- On the median response latency graph, add a 400ms threshold indicator so that we can see whether the median is trending towards the 500ms hard limit.
- Track the percentage of requests that exceed the 400ms limit in Grafana
- Log the specific requests that exceed the 400ms limit, so that we have details to dig into when this occurs
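
As a rough illustration of the three conditions above, the following Python sketch computes the median latency, the percentage of requests over the 400ms threshold, and logs each slow request. It is only a sketch under stated assumptions: `RequestSample`, `summarize`, the logger name, and the sample fields are hypothetical and do not reflect how the Attribution API or the Grafana dashboard actually collects or stores latency data.

```python
import logging
import statistics
from dataclasses import dataclass

WARN_THRESHOLD_MS = 400  # warning threshold from this ticket
HARD_LIMIT_MS = 500      # Enterprise commitment from the description

# Hypothetical logger name, used for illustration only.
logger = logging.getLogger("attribution_api.latency")


@dataclass
class RequestSample:
    """One observed request (hypothetical shape)."""
    request_id: str
    path: str
    latency_ms: float


def summarize(samples):
    """Return the median latency and the share of requests over the threshold."""
    if not samples:
        return {"median_ms": None, "pct_over_threshold": 0.0, "slow": []}

    slow = [s for s in samples if s.latency_ms > WARN_THRESHOLD_MS]
    for s in slow:
        # Log enough detail to investigate the individual request later.
        logger.warning(
            "slow request id=%s path=%s latency_ms=%.1f",
            s.request_id, s.path, s.latency_ms,
        )

    return {
        "median_ms": statistics.median(s.latency_ms for s in samples),
        "pct_over_threshold": 100.0 * len(slow) / len(samples),
        "slow": slow,
    }
```

In practice the percentage would more likely be derived from the metrics backend behind Grafana than computed in application code; the sketch only makes the intended calculation explicit.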
Implementation details
This assumes that we will create a dedicated Grafana dashboard for Attribution API monitoring. This ticket includes creating the initial dashboard if this is the first monitoring task implemented.
Latency data is already collected by default; the scope of this ticket is predominantly adding thresholds.
We may need additional handling to measure response time for requests with parameters; the total latency should reflect the full round trip, from receiving the request to returning the response to callers.
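
If the default latency data turns out not to cover the full handling time, one possible approach is to time the whole handler explicitly. The Python sketch below assumes a plain function-style handler and a caller-supplied `record` callback; `timed` and `handle_attribution` are hypothetical names, and the measurement is server-side only, so it excludes network time between the service and the caller.

```python
import time
from functools import wraps


def timed(record):
    """Decorator factory: report each call's elapsed time in ms to `record`."""
    def decorator(handler):
        @wraps(handler)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return handler(*args, **kwargs)
            finally:
                # Runs on success and on failure, so failed requests are timed too.
                record((time.perf_counter() - start) * 1000.0)
        return wrapper
    return decorator


latencies_ms = []


@timed(latencies_ms.append)
def handle_attribution(params):
    """Hypothetical handler standing in for the real calculation flow."""
    return {"params": params}
```

Timing in a `finally` block means slow requests that end in an error still show up in the latency data, which matters when investigating requests near the limit.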