Improve CXServer Grafana dashboard
Closed, Resolved
Public
4 Estimated Story Points

Assigned To
Authored By
Nikerabbit
Nov 11 2024, 9:42 AM

Description

Dashboard is at https://grafana.wikimedia.org/d/F7rttgqmz/cxserver?orgId=1&refresh=30s

It can be updated by logging in to grafana-rw.wikimedia.org

Things that should be fixed:

  • Add missing MT services
  • The Quantiles graph looks wrong: it seems to mix up microseconds and milliseconds

image.png (338×624 px, 37 KB)

  • The "MT engine total character counts" graph resets often, which makes the numbers unreliable

Event Timeline

Nikerabbit triaged this task as Medium priority. Nov 11 2024, 9:49 AM
Nikerabbit set the point value for this task to 4.
Nikerabbit renamed this task from Improve CXServer dashboard to Improve CXServer Logstash dashboard. Dec 3 2024, 8:26 AM
Nikerabbit renamed this task from Improve CXServer Logstash dashboard to Improve CXServer Grafana dashboard. Dec 3 2024, 8:49 AM
abi_ subscribed.

Started working on this

I've made the following tweaks:

  1. Updated MT engine total character counts to display MinT
  2. Updated Quantiles: removed the p10 series, because it passed 10 as the quantile to histogram_quantile, which only accepts values between 0 and 1.
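
To illustrate why a quantile argument of 10 is meaningless, here is a simplified Python sketch of how a quantile can be estimated from cumulative histogram buckets. It is modeled loosely on the documented behavior of histogram_quantile (out-of-range quantiles yield infinities, values are linearly interpolated within a bucket), but it is not the Prometheus implementation. The function name and the bucket values are made up for the example.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` is assumed to be sorted by upper bound and to end with
    a (math.inf, total_count) entry. Hypothetical helper, for illustration.
    """
    # A quantile is a fraction of the observations, so it must lie in [0, 1].
    if q < 0:
        return -math.inf
    if q > 1:
        return math.inf

    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # No finite upper bound (or an empty bucket) to interpolate in.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Made-up latency buckets in seconds: 50 requests <= 0.1 s, 90 <= 0.5 s, 100 <= 1 s.
example_buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
p95 = estimate_quantile(0.95, example_buckets)
p10_mistake = estimate_quantile(10, example_buckets)  # 10 is not a valid quantile
```

With these made-up buckets, a p10 series should be requested as 0.1, not 10; passing 10 falls outside the valid range and yields an infinite value rather than a latency.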

Regarding the values in the Quantiles, while they look incorrect, I can't tell what's wrong by looking at the query.

In the Quantiles graph, all the values were present, but the count series (value 1) caused all the other series to be hidden. I disabled it, and the graph looks OK now.

image.png (685×920 px, 94 KB)

Cool, that makes sense.

Pending item:

The "MT engine total character counts" graph resets often, which makes the numbers unreliable.

Current values displayed as of 2025-01-13, 17:30 UTC:

image.png (396×1 px, 17 KB)

Graph appears to have reset:

image.png (408×1 px, 17 KB)

We did a production CXServer deployment today, and it appears that the MT engine total character counts graph got reset:

Pre-deployment

image.png (452×1 px, 28 KB)

Post-deployment
image.png (485×1 px, 28 KB)

Let's see whether this behavior recurs a couple more times.

The graph doesn't appear to have reset since Friday:

image.png (387×1 px, 17 KB)

Something like "expr": "sum(increase(translate_Apertium_charcount[$__interval]))" could help in Grafana to avoid resets caused by service restarts.

Nikerabbit changed the task status from Open to In Progress. Feb 5 2025, 1:22 PM

Tried that. It shows no data for any time range shorter than the last 2 days, which is weird.

image.png (1×2 px, 120 KB)

I ended up using: sum(increase(translate_Apertium_charcount[$__range]))

  • The increase() function calculates how much the counter increased over a given time window.
  • $__range is the user-selected time range (e.g., if the user selects "Last 3 hours", then $__range = 3h).

This does the following:

  • Finds how many characters were translated in the last $__range (e.g., last 3 hours).
  • Handles resets automatically (if the counter drops due to a service restart).

This appears to work well:

Last 15 minutes: image.png (402×1 px, 17 KB)
Last 3 hours: image.png (377×1 px, 17 KB)
Last 12 hours: image.png (384×1 px, 18 KB)
Last 24 hours: image.png (365×1 px, 16 KB)
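
The reset handling described above for increase() can be sketched in a few lines of Python. This is only an illustration of the idea, not how Prometheus computes it: the real function also extrapolates to the edges of the time window, which is omitted here. The function name and the sample values are invented for the example.

```python
def counter_increase(samples):
    """Total increase over a list of counter samples, tolerating resets.

    A drop between two consecutive samples is treated as a counter reset
    (for example after a service restart): the counter is assumed to have
    restarted from zero, so the whole new value counts as increase.
    Hypothetical helper, for illustration only.
    """
    total = 0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            # Reset detected: everything counted since the restart is new.
            total += current
    return total


# Made-up character counts sampled around a deployment that restarts the service.
samples = [1000, 1500, 200, 700]
naive_delta = samples[-1] - samples[0]      # what a raw counter panel reflects
reset_aware = counter_increase(samples)     # what an increase()-style query reflects
```

In this made-up series the raw counter ends lower than it started, while the reset-aware total still reflects the characters counted before and after the restart.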

I played around with the dashboard. It looks good to me.

Thanks, marking this as done.