
Check metrics and build dashboards for device-analytics
Closed, Resolved (Public)

Description

We currently have some metrics for device-analytics. Find out whether they are fit for purpose and whether we require additional ones, then build useful dashboards from the information we have available.

Event Timeline

hnowlan added a subscriber: SGupta-WMF.

It appears that we currently only get the default Go application metrics about the binary itself, plus various internal execution metrics. We will need to annotate our handlers to get actual per-endpoint histograms and similar. I'm looking into how to do this with mux.

Things I can see being of use:

  • counters for requests
  • histograms for latencies of requests by endpoint (in most cases this will be a single endpoint I guess)
  • statistics around connections to Cassandra: failed connections, latencies on requests to Cassandra
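
As a rough illustration of the first two bullets, here is a minimal, dependency-free sketch of HTTP middleware that keeps per-endpoint request counters and latency histograms. It is not the code from the merged change, which may well use the Prometheus Go client instead of hand-rolled maps. The names `Metrics`, `Middleware`, `demo`, the bucket bounds and the `/ok` and `/bad` endpoints are all assumptions made up for this example.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sort"
	"sync"
	"time"
)

// latencyBuckets are histogram upper bounds in seconds (illustrative values only).
var latencyBuckets = []float64{0.005, 0.025, 0.1, 0.5, 1, 5}

// Metrics is a hypothetical in-memory store for request counters and latency histograms.
type Metrics struct {
	mu        sync.Mutex
	requests  map[string]int   // keyed by "METHOD endpoint status"
	latencies map[string][]int // keyed by endpoint; one slot per bucket plus one overflow slot
}

func NewMetrics() *Metrics {
	return &Metrics{requests: map[string]int{}, latencies: map[string][]int{}}
}

// statusRecorder captures the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// Middleware wraps next and records one counter increment and one
// latency observation per request, labelled with the given endpoint.
func (m *Metrics) Middleware(endpoint string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		start := time.Now()
		next.ServeHTTP(rec, r)
		m.observe(endpoint, r.Method, rec.status, time.Since(start).Seconds())
	})
}

func (m *Metrics) observe(endpoint, method string, status int, seconds float64) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.requests[fmt.Sprintf("%s %s %d", method, endpoint, status)]++
	counts, ok := m.latencies[endpoint]
	if !ok {
		counts = make([]int, len(latencyBuckets)+1)
		m.latencies[endpoint] = counts
	}
	// Index of the first bucket whose bound is >= seconds; len(latencyBuckets) means overflow.
	counts[sort.SearchFloat64s(latencyBuckets, seconds)]++
}

// RequestCount returns how many requests were seen for the given labels.
func (m *Metrics) RequestCount(method, endpoint string, status int) int {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.requests[fmt.Sprintf("%s %s %d", method, endpoint, status)]
}

// Observations returns the total number of latency samples for an endpoint.
func (m *Metrics) Observations(endpoint string) int {
	m.mu.Lock()
	defer m.mu.Unlock()
	total := 0
	for _, c := range m.latencies[endpoint] {
		total += c
	}
	return total
}

// demo wires the middleware around two made-up handlers and sends three requests.
func demo() *Metrics {
	m := NewMetrics()
	mux := http.NewServeMux()
	ok := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	})
	bad := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		http.Error(w, "bad request", http.StatusBadRequest)
	})
	mux.Handle("/ok", m.Middleware("/ok", ok))
	mux.Handle("/bad", m.Middleware("/bad", bad))
	for _, path := range []string{"/ok", "/ok", "/bad"} {
		mux.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest(http.MethodGet, path, nil))
	}
	return m
}
```

Calling `demo()` from a `main` function exercises the wrapper end to end. If "mux" above means gorilla/mux, a wrapper with the shape `func(http.Handler) http.Handler` can be registered through `Router.Use`; the Cassandra statistics in the third bullet would need separate instrumentation around the database client and are not covered here.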
SGupta-WMF changed the task status from Open to In Progress. May 22 2023, 9:15 AM
SGupta-WMF claimed this task.

Making code changes to device-analytics as per discussion

Change 922075 had a related patch set uploaded (by Sg912; author: Sg912):

[generated-data-platform/aqs/device-analytics@main] Changing as per agreed naming conv Added prometheus metrics middleware for improved metrics

https://gerrit.wikimedia.org/r/922075

Change 922373 had a related patch set uploaded (by Sg912; author: Sg912):

[generated-data-platform/aqs/device-analytics@main] Renaming + Prom metrics

https://gerrit.wikimedia.org/r/922373

Change 922075 abandoned by Sg912:

[generated-data-platform/aqs/device-analytics@main] Changing as per agreed naming conv Added prometheus metrics middleware for improved metrics

Reason:

https://gerrit.wikimedia.org/r/922075

Change 922373 merged by BPirkle:

[generated-data-platform/aqs/device-analytics@main] Renaming + Prom metrics

https://gerrit.wikimedia.org/r/922373

I merged patchset 922373. If we decide we want/need more metrics info, we can add it in a separate change.

QA: tests should pass and /admin/metrics should return reasonable data.
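
One way to make "reasonable data" checkable is to scan the endpoint's response body for the metric names you expect. The sketch below assumes the endpoint returns the Prometheus text exposition format; `metricNames`, `missingMetrics` and the sample body are hypothetical, and `http_requests_total` with its labels is a placeholder rather than the name used by the merged change.

```go
package main

import (
	"bufio"
	"strings"
)

// sampleExposition is a made-up response body for illustration.
// go_goroutines is a default Go runtime metric; http_requests_total is a placeholder name.
const sampleExposition = `# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 12
http_requests_total{code="200",path="/hypothetical"} 3
`

// metricNames extracts the set of metric names from a text exposition body,
// skipping blank lines and "#" comment lines (HELP/TYPE).
func metricNames(exposition string) map[string]bool {
	names := map[string]bool{}
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		// The metric name ends at the first '{' (labels) or space (value).
		end := strings.IndexAny(line, "{ ")
		if end < 0 {
			end = len(line)
		}
		names[line[:end]] = true
	}
	return names
}

// missingMetrics returns the required names that do not appear in the body.
func missingMetrics(exposition string, required []string) []string {
	seen := metricNames(exposition)
	var missing []string
	for _, name := range required {
		if !seen[name] {
			missing = append(missing, name)
		}
	}
	return missing
}
```

For example, `missingMetrics(body, []string{"go_goroutines", "http_requests_total"})` returns an empty slice when both names are present, so a QA script could fail on any non-empty result.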

Metrics verified.
Results:

  • Metrics are available on the new, updated path /metrics (see ticket T337428)
  • Requests are reported per status code: 200, 400, 404 and 500
  • Requests are bucketed by time interval

Status: QA pass for metrics, pending dashboards

@BPirkle Assigning this task to you for dashboard creation.

Dashboard created here:
https://grafana-rw.wikimedia.org/d/UWuaaNl4k/device-analytics-aqs-2-0?orgId=1

Relevant notes document here:
https://docs.google.com/document/d/1UmVbdrDVLQclNrfCzuGmI-bXYAsWv6bqSjgUqWISKbQ/edit#

This new device-analytics dashboard is a copy of the image-suggestion dashboard:
https://grafana-rw.wikimedia.org/d/SUZQ6rWVz/image-suggestion?orgId=1

Longer term, we may want to transform the new device-analytics dashboard into a full AQS 2.0 dashboard, with a dropdown for selecting which of the services to see details for. We didn't do that yet, because we only have one deployed AQS 2.0 service at this time.

I'm going to call this particular task ready for testing. We can do further enhancements under a separate task.

BPirkle triaged this task as Medium priority.
BPirkle moved this task from In Progress to Ready for Testing on the AQS2.0 (Sprint 10) board.

Looks good to me for the purposes of this ticket, thank you!