[4hrs] Have a Grafana dashboard for Proton
Closed, Resolved | Public | 0 Estimated Story Points

Description

There is a Grafana dashboard for Proton in Beta, but there is none for production. It needs to be created before we can start sending requests from RESTBase.

Event Timeline

mobrovac created this task.

By simply exporting one from beta cluster grafana and importing it into production grafana I've created this https://grafana.wikimedia.org/dashboard/db/proton?orgId=1

However, a couple of questions for @pmiazga, who originally created the dashboard:

  1. What's queue management vs jobs management? Could you either clarify the names or add a legend? Why does the abandoned metric have .count.count in it?
  2. I believe that you wanted to use rates rather than counts. Counts are kinda meaningless because they get reset only on every flush, and the flush timing is kinda arbitrary, if I understand that correctly.
  3. Please add axis measurement units to all the graphs and select appropriate units for each axis (for example, the latency graph currently uses short while it should use ms).
  4. All the services dashboards expose heap and GC graphs. Please add them; you can find examples on the MCS dashboard https://grafana.wikimedia.org/dashboard/db/mobileapps?orgId=1
  5. I do not believe all the metrics are correct. The request rate in the last graph is 0.30, but the sums of request rates by type and by format are both 0.20. Where did the 0.10 requests/s go?
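The point in item 2 can be sketched as follows: a raw per-flush count depends on how long the flush window is, while a rate does not. This is only an illustration under the assumption of a fixed flush interval; the function name and the sample numbers are hypothetical and are not taken from Proton or from the dashboard.

```python
from typing import List


def counts_to_rates(counts: List[float], flush_interval_s: float) -> List[float]:
    """Convert per-flush event counts into events per second."""
    if flush_interval_s <= 0:
        raise ValueError("flush_interval_s must be positive")
    return [count / flush_interval_s for count in counts]


# Hypothetical samples: the same steady traffic observed with two flush intervals.
counts_10s = [3, 3, 3]  # events counted in each 10 s flush window
counts_60s = [18]       # events counted in one 60 s flush window

# The raw counts differ (3 vs 18), but the derived rates agree.
rates_10s = counts_to_rates(counts_10s, 10.0)
rates_60s = counts_to_rates(counts_60s, 60.0)
```

Plotting the rate rather than the count keeps the graph comparable if the flush interval ever changes.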

Thank you, @Pchelolo !

To add to the list, I took a look at the queue mgmt panel, and I believe there we are interested in mean and p99 queue sizes per host, not the overall accumulated queue size across all nodes since the health of the system directly depends on the individual sizes.
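A rough sketch of why per-host statistics matter here, assuming queue-size samples are available per host. The host names, sample values, and helper names are made up for illustration, and the percentile uses the simple nearest-rank method rather than whatever the metrics backend actually computes.

```python
import math
from statistics import mean
from typing import Dict, List


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile for pct in (0, 100]."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def per_host_summary(samples: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Mean and p99 queue size for each host separately."""
    return {
        host: {"mean": mean(sizes), "p99": percentile(sizes, 99)}
        for host, sizes in samples.items()
    }


# Hypothetical queue-size samples taken at the same four instants on two hosts.
samples = {
    "host-a": [1, 2, 1, 2],
    "host-b": [0, 0, 0, 40],
}
summary = per_host_summary(samples)

# Summing across hosts hides which host is in trouble.
total_per_sample = [sum(column) for column in zip(*samples.values())]
```

In this made-up data the summed series shows a spike of 42, but only the per-host p99 reveals that host-b alone is responsible.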

@Pchelolo

  1. What's queue management vs jobs management? Could you either clarify the names or add a legend? Why does the abandoned metric have .count.count in it?

Queue -> the time a task waits before it gets picked up for rendering.
Job -> the task got picked up and started rendering.
The main difference is that we can have a pretty long queue while rendering only a couple of pages at once. We specify one timeout and max size for the queue (tasks in this state do not require any resources) and a different timeout for the jobs (a task being processed requires a Chromium instance).
About the count.count -> I'm not sure, I need to check the code. Most probably I named the bucket something.count and then took the count of that.
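The queue/job distinction described above could be modelled roughly like this. It is a toy simulation with an explicit clock, not Proton's actual implementation: the class, its parameters, and the event labels (including the use of "abandoned" for a task that times out while waiting) are assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List, Tuple


@dataclass
class RenderQueue:
    """Toy model: tasks wait in a bounded queue, then occupy a limited number of job slots."""
    max_queue_size: int
    max_concurrent_jobs: int
    queue_timeout_s: float
    job_timeout_s: float
    waiting: Deque[Tuple[str, float]] = field(default_factory=deque)  # (task_id, enqueued_at)
    running: Dict[str, float] = field(default_factory=dict)           # task_id -> started_at
    events: List[str] = field(default_factory=list)

    def enqueue(self, task_id: str, now: float) -> bool:
        """Add a task to the queue; reject it if the queue is already full."""
        if len(self.waiting) >= self.max_queue_size:
            self.events.append(f"{task_id}: rejected (queue full)")
            return False
        self.waiting.append((task_id, now))
        return True

    def tick(self, now: float) -> None:
        """Advance the model to time `now`: expire jobs, expire waiters, start new jobs."""
        # Jobs hold a renderer, so they get their own (job) timeout.
        for task_id, started_at in list(self.running.items()):
            if now - started_at > self.job_timeout_s:
                del self.running[task_id]
                self.events.append(f"{task_id}: job timed out")
        # Waiting tasks hold no resources but still expire after the queue timeout.
        still_waiting: Deque[Tuple[str, float]] = deque()
        for task_id, enqueued_at in self.waiting:
            if now - enqueued_at > self.queue_timeout_s:
                self.events.append(f"{task_id}: abandoned in queue")
            else:
                still_waiting.append((task_id, enqueued_at))
        self.waiting = still_waiting
        # Promote waiting tasks into free job slots.
        while self.waiting and len(self.running) < self.max_concurrent_jobs:
            task_id, _ = self.waiting.popleft()
            self.running[task_id] = now
            self.events.append(f"{task_id}: job started")


q = RenderQueue(max_queue_size=2, max_concurrent_jobs=1, queue_timeout_s=5, job_timeout_s=10)
q.enqueue("a", now=0)
q.tick(now=0)           # "a" takes the only job slot
q.enqueue("b", now=1)
q.enqueue("c", now=1)
q.enqueue("d", now=1)   # queue already holds two tasks, so "d" is rejected
q.tick(now=7)           # "b" and "c" have waited longer than the queue timeout
q.tick(now=11)          # "a" has exceeded the job timeout
```

Keeping the two timeouts separate lets the queue be long and cheap while the number of concurrently running renderers stays small.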

  2. I believe that you wanted to use rates rather than counts. Counts are kinda meaningless because they get reset only on every flush, and the flush timing is kinda arbitrary, if I understand that correctly.

Thanks for the info, I'll check it

  3. Please add axis measurement units to all the graphs and select appropriate units for each axis (for example, the latency graph currently uses short while it should use ms).

Will do, thanks for the tip

  4. All the services dashboards expose heap and GC graphs. Please add them; you can find examples on the MCS dashboard https://grafana.wikimedia.org/dashboard/db/mobileapps?orgId=1

Will do, thanks for the tip

  5. I do not believe all the metrics are correct. The request rate in the last graph is 0.30, but the sums of request rates by type and by format are both 0.20. Where did the 0.10 requests/s go?

Those could be rejected requests (because of timeouts, failed jobs, etc.). I'll check the graphs once again.
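The arithmetic behind the question can be written down as a quick check. The total of 0.30 and the breakdown sum of 0.20 come from the discussion; how the 0.20 splits across individual series is not stated, so the split and the series labels below are hypothetical.

```python
from typing import Dict


def unaccounted_rate(total_rate: float, breakdown: Dict[str, float]) -> float:
    """Requests/s present in the total but missing from the breakdown."""
    return round(total_rate - sum(breakdown.values()), 6)


# Hypothetical split; only the sum (0.20) is taken from the discussion.
by_type = {"type_a": 0.15, "type_b": 0.05}

gap = unaccounted_rate(0.30, by_type)
```

If rejected requests really explain the gap, adding a rejected series to the breakdown should bring this difference to roughly zero.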

In general, I created those Grafana dashboards as a proof of concept, without spending too much time on them. First, we wanted to see that there is any traffic to the Proton service and that the system logs stats properly. I'll revisit each graph and verify that it works properly and shows valuable information.
Also, as I stated earlier, we currently log/track much more than necessary, mostly to verify the service health during rigorous testing. Once the service gets to production and stays healthy, we should be able to clean up the graphs and show only the important bits without unnecessary clutter.

Change 460113 had a related patch set uploaded (by Pmiazga; owner: Pmiazga):
[mediawiki/services/chromium-render@master] Hygiene: remove redundant .count

https://gerrit.wikimedia.org/r/460113

Moved to workboard as I'm actively working on it

ovasileva renamed this task from Have a Grafana dashboard for Proton to [4hrs] Have a Grafana dashboard for Proton. Sep 18 2018, 4:08 PM
ovasileva set the point value for this task to 0.

Change 462579 had a related patch set uploaded (by Pmiazga; owner: Pmiazga):
[mediawiki/services/chromium-render@master] Html2pdf route should return promise

https://gerrit.wikimedia.org/r/462579

We need a review from the services team.

Change 462579 merged by jenkins-bot:
[mediawiki/services/chromium-render@master] Html2pdf route should return promise

https://gerrit.wikimedia.org/r/462579

@mobrovac @Pchelolo could you update Proton to the latest version (including the html2pdf changes)?

We'll be doing that together tomorrow ;)

Mentioned in SAL (#wikimedia-operations) [2018-10-04T13:43:39Z] <pmiazga@deploy1001> Started deploy [proton/deploy@ecb9a0e]: Bugfix:handle undefined response and fix grafana stats (T186748,T201158)

Mentioned in SAL (#wikimedia-operations) [2018-10-04T13:46:35Z] <pmiazga@deploy1001> Finished deploy [proton/deploy@ecb9a0e]: Bugfix:handle undefined response and fix grafana stats (T186748,T201158) (duration: 02m 55s)

Change 460113 merged by Ppchelko:
[mediawiki/services/chromium-render@master] Hygiene: remove redundant .count

https://gerrit.wikimedia.org/r/460113

This task requires redeployment. I'll deploy the code on Monday, Oct 15th.

I did fix the latency graph: https://grafana.wikimedia.org/dashboard/db/proton?orgId=1&from=now-3h&to=now

However, it uncovered an even bigger issue: all I needed to do was restart the service manually via sudo service proton restart. Doing scap deploy --service-restart -f for some reason had no effect.

Indeed, that's really strange. I created T207263: Scap not restarting Proton to that effect.

quoting @Pchelolo from IRC:

Pchelolo: also at this point I'm pretty satisfied with what we got on the dashboard
Pchelolo: what we have now is stable enough

I think we can resolve this task.

Pchelolo removed a project: Patch-For-Review.

Ye. We can tune and tweak it indefinitely, but for now I think we're in a good state