We currently have some metrics for device-analytics. Find out whether they are fit for purpose and whether we need additional ones, then build useful dashboards from the information we have available.
Description
Details
Status | Subtype | Assigned | Task
---|---|---|---
Stalled | | None | T324931 Clean up open RESTBase related tickets
In Progress | | None | T262315 <CORE TECHNOLOGY> API Migration & RESTBase Sunset
In Progress | | None | T263489 AQS 2.0
Resolved | | SGupta-WMF | T288298 AQS 2.0: Device Analytics service
Resolved | | Atieno | T335505 Figure out what's outstanding to have device-analytics serving 100% Production data
Resolved | | hnowlan | T336158 Check metrics and build dashboards for device-analytics
Event Timeline
It appears that we currently only get the default Go application metrics: those about the binary itself and various internal execution metrics. We will need to annotate our handlers to get actual per-endpoint histograms etc. I'm looking into how to do this with mux.
Things I can see being of use:
- counters for requests
- histograms for latencies of requests by endpoint (in most cases this will be a single endpoint I guess)
- statistics around connections to Cassandra: failed connections, latencies on requests to Cassandra
Change 922075 had a related patch set uploaded (by Sg912; author: Sg912):
[generated-data-platform/aqs/device-analytics@main] Changing as per agreed naming conv Added prometheus metrics middleware for improved metrics
Change 922373 had a related patch set uploaded (by Sg912; author: Sg912):
[generated-data-platform/aqs/device-analytics@main] Renaming + Prom metrics
Change 922075 abandoned by Sg912:
[generated-data-platform/aqs/device-analytics@main] Changing as per agreed naming conv Added prometheus metrics middleware for improved metrics
Reason:
Change 922373 merged by BPirkle:
[generated-data-platform/aqs/device-analytics@main] Renaming + Prom metrics
I merged patchset 922373. If we decide we want/need more metrics info, we can add it in a separate change.
QA: tests should pass and /admin/metrics should return reasonable data.
Metrics verified.
Results:
- Metrics are available on the updated path /metrics (see ticket T337428)
- Requests can be seen per status code: 200, 400, 404 and 500
- Requests are bucketed by time interval
Status: QA pass for metrics, pending dashboards
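A check like the one above could be partly scripted. The sketch below sums the samples of one counter per status code from Prometheus text-format output. It is only an illustration under stated assumptions: the metric name `http_requests_total`, the label name `code` and the sample values are hypothetical and not taken from the service, and the parser assumes label values contain no commas or escaped quotes.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// sumByLabel scans Prometheus text-format output and sums the samples of
// one metric, grouped by the value of one label. Samples without labels
// and lines that do not parse are skipped.
func sumByLabel(exposition, metric, label string) map[string]float64 {
	totals := make(map[string]float64)
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // blank line, or a HELP/TYPE comment
		}
		// A labelled sample looks like: name{k1="v1",k2="v2"} value
		if !strings.HasPrefix(line, metric+"{") {
			continue
		}
		end := strings.LastIndex(line, "}")
		if end < 0 {
			continue
		}
		fields := strings.Fields(line[end+1:])
		if len(fields) == 0 {
			continue
		}
		value, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			continue
		}
		// Naive label split: assumes no commas inside label values.
		for _, pair := range strings.Split(line[len(metric)+1:end], ",") {
			k, v, ok := strings.Cut(pair, "=")
			if ok && strings.TrimSpace(k) == label {
				totals[strings.Trim(strings.TrimSpace(v), `"`)] += value
			}
		}
	}
	return totals
}

func main() {
	// Made-up sample in the text exposition format, for illustration only.
	sample := `# HELP http_requests_total Hypothetical request counter.
# TYPE http_requests_total counter
http_requests_total{code="200",method="GET"} 12
http_requests_total{code="200",method="POST"} 3
http_requests_total{code="404",method="GET"} 2
http_requests_total{code="500",method="GET"} 1
`
	fmt.Println(sumByLabel(sample, "http_requests_total", "code"))
}
```

In practice the input string would come from an HTTP GET against the service's /metrics path instead of an inline sample.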
Dashboard created here:
https://grafana-rw.wikimedia.org/d/UWuaaNl4k/device-analytics-aqs-2-0?orgId=1
Relevant notes document here:
https://docs.google.com/document/d/1UmVbdrDVLQclNrfCzuGmI-bXYAsWv6bqSjgUqWISKbQ/edit#
This new device-analytics dashboard is a copy of the image-suggestion dashboard:
https://grafana-rw.wikimedia.org/d/SUZQ6rWVz/image-suggestion?orgId=1
Longer term, we may want to transform the new device-analytics dashboard into a full AQS 2.0 dashboard, with a dropdown for selecting which service to see details for. We haven't done that yet, because we only have one deployed AQS 2.0 service at this time.
I'm going to call this particular task ready for testing. We can do further enhancements under a separate task.
@hnowlan Can you help us with a sign-off, please?
Cc.: @FJoseph-WMF @VirginiaPoundstone @SGupta-WMF @BPirkle