Many WDQS incidents are triggered by individual actors putting too much strain on our system. To assist with triaging these issues, we need to create a dashboard that monitors the top user agents by query volume. Beyond incident response, this dashboard will be used to better understand our users and identify candidates for alternative endpoints.
There is no requirement on which tooling this view should be built with (e.g. Grafana, Superset, etc.).
Incident management
The 90-day retention policy won't be a problem and isn't worth escalating to get around. 90 days of trailing data provides sufficient perspective on which UAs are cropping up on a given day or week, which is the level of depth we need to understand spikes in lag and incidents with our system. We want trailing data so we can identify when a change in volume from a given user might be contributing to logjams in our infra: for example, catching a user who has been submitting ~1k requests a day for the last 80 days, but ~10k a day for the last 10.
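The spike pattern described above (steady baseline, sudden jump in the trailing window) can be sketched with a simple ratio check. This is a hedged illustration only: the function name, window size, and threshold factor are assumptions, and the real input would come from the aggregated query logs rather than a plain list.

```python
from statistics import mean

def volume_spike(daily_counts, recent_days=10, factor=5.0):
    """Flag a UA whose recent daily query volume far exceeds its baseline.

    daily_counts: per-day request counts for one UA, oldest first
    (hypothetical shape; the real source is the WDQS query logs).
    recent_days / factor: illustrative tuning knobs, not agreed values.
    """
    baseline = mean(daily_counts[:-recent_days])
    recent = mean(daily_counts[-recent_days:])
    return recent > factor * baseline

# The example from the text: ~1k/day for 80 days, then ~10k/day for 10 days.
counts = [1_000] * 80 + [10_000] * 10
```

A dashboard wouldn't run this literally, but an alert rule comparing a short trailing average against a longer baseline is the same idea.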
Supporting workflows efficiently
If a UA has had consistent behavior for 90+ days, I think we can safely conclude it's not the sole cause of a sudden escalation. 90 days is also enough, in my opinion, to give us a sense of what a persona is using the system for. Sustained high-volume or high-complexity querying over that period would be a signal for us to evaluate whether they should be redirected to Wikimedia Enterprise (WME), data dumps, or GraphQL.
AC:
- Results are ordered by unique UA string
- Results include total query volume by UA string
- Results update on a regular cadence (at least every 10 minutes)
- Aggregated Response Size is provided for each unique UA from a set time period (see example here)
- (stretch) Average Query Latency is provided for each unique UA from a set time period
- (stretch) Query Success Rate is provided for each unique UA from a set time period
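The per-UA metrics in the acceptance criteria reduce to a single group-by over the query log. As a rough sketch of the shape of that aggregation: the field names (`ua`, `response_bytes`, `latency_ms`, `status`) are placeholders, not the real log schema, and the actual implementation would live in whatever tooling is chosen (e.g. a Superset SQL query), not application code.

```python
from collections import defaultdict

def ua_summary(records):
    """Aggregate per-UA metrics matching the acceptance criteria.

    records: iterable of per-query log entries as dicts with
    hypothetical fields 'ua', 'response_bytes', 'latency_ms', 'status'.
    """
    agg = defaultdict(lambda: {"queries": 0, "bytes": 0, "latency_ms": 0, "ok": 0})
    for r in records:
        row = agg[r["ua"]]
        row["queries"] += 1                          # total query volume
        row["bytes"] += r["response_bytes"]          # aggregated response size
        row["latency_ms"] += r["latency_ms"]         # for average latency
        row["ok"] += 1 if r["status"] < 400 else 0   # for success rate
    return {
        ua: {
            "total_queries": v["queries"],
            "total_response_bytes": v["bytes"],
            "avg_latency_ms": v["latency_ms"] / v["queries"],
            "success_rate": v["ok"] / v["queries"],
        }
        for ua, v in agg.items()
    }
```

Sorting the resulting dict by `total_queries` descending gives the "top UAs by query volume" view; the latency and success-rate fields cover the two stretch criteria.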