There is a Grafana dashboard for Proton in Beta, but there is none for production. It needs to be created before we can start sending requests from RESTBase.
By exporting the dashboard from the beta cluster Grafana and importing it into production Grafana, I've created this: https://grafana.wikimedia.org/dashboard/db/proton?orgId=1
However, a couple of questions for @pmiazga, who originally created the dashboard:
- What's queue management vs jobs management? Could you either clarify the names or add a legend? Why does the abandoned metric have .count.count in it?
- I believe you want to use rates rather than counts. Counts are kinda meaningless because they only get reset on every flush, which is kinda arbitrary if I understand that correctly.
- Please add axis measurement units to all the graphs and select appropriate units for each axis (for example, the latency graph uses short now while it should use ms).
- All the services dashboards expose heap and GC graphs - please add them, you can find the examples on MCS dashboard https://grafana.wikimedia.org/dashboard/db/mobileapps?orgId=1
- I do not believe all the metrics are correct. The request rate from the last graph is 0.30 but the sum of request rates by type and by format are both 0.20 - where did 0.10 requests/s go?
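To make the rates-vs-counts point and the 0.30 vs 0.20 mismatch concrete, here is a minimal sketch. The 60-second flush interval and the raw count of 18 are assumptions picked for illustration, not values taken from the Proton setup; the function names are hypothetical too.

```python
def per_second_rate(count, flush_interval_s):
    """Turn a raw per-flush count into a rate that does not depend on the flush interval."""
    return count / flush_interval_s


def unaccounted_rate(total_rate, breakdown_rates):
    """Return the part of the total rate that the per-category breakdown does not explain."""
    return total_rate - sum(breakdown_rates)


# Assumed numbers: 18 requests counted during one (assumed) 60 s flush window.
total = per_second_rate(18, 60)

# The breakdown by type (or by format) sums to 0.20 req/s on the dashboard.
missing = unaccounted_rate(total, [0.20])
print(f"total={total:.2f} req/s, unaccounted={missing:.2f} req/s")
```

If the breakdown graphs only count successfully rendered requests, a non-zero remainder like this would be expected, but that is a guess that needs checking against the code.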
Thank you, @Pchelolo !
To add to the list, I took a look at the queue mgmt panel, and I believe what we are interested in there is the mean and p99 queue size per host, not the overall accumulated queue size across all nodes, since the health of the system directly depends on the individual sizes.
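As a rough sketch of what "mean and p99 per host" means, compared with one number summed across all nodes: the host names and sample values below are made up, and the nearest-rank percentile is just one common definition, not necessarily what Grafana or the metrics backend computes.

```python
import math
from statistics import mean


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def queue_stats_per_host(samples):
    """samples maps a host name to the list of queue sizes observed on that host."""
    return {
        host: {"mean": mean(sizes), "p99": percentile(sizes, 99)}
        for host, sizes in samples.items()
    }


# Hypothetical data: one host is mostly idle but had a single large spike.
samples = {
    "host-a": list(range(1, 101)),
    "host-b": [0, 0, 0, 40],
}
for host, stats in queue_stats_per_host(samples).items():
    print(host, stats)
```

A single sum over all hosts would hide the spike on host-b, which is the reason for wanting the per-host view.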
Queue -> the time a task waits before it gets picked up for rendering.
Job -> the task has been picked up and has started rendering.
The main difference is that we can have a pretty long queue while rendering only a couple of pages at once. We specify a timeout and max size for the queue (tasks in this state do not require any resources) and a separate timeout for the jobs (a task that is being processed requires a Chromium instance).
About the .count.count -> I'm not sure, I need to check the code. Most probably I named the bucket something.count and then took the count of that.
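The queue/job split described above could be modelled roughly like this. It is a toy sketch with explicit timestamps, not the actual chromium-render implementation; the class, its method names and all the limits are hypothetical.

```python
from collections import deque


class RenderQueue:
    """Toy model: tasks wait in a queue, then run as jobs, with separate limits."""

    def __init__(self, max_queue_size, queue_timeout, job_timeout, concurrency):
        self.max_queue_size = max_queue_size
        self.queue_timeout = queue_timeout  # max seconds a task may wait
        self.job_timeout = job_timeout      # max seconds a render may run
        self.concurrency = concurrency      # renders allowed at the same time
        self.waiting = deque()              # (task_id, enqueued_at) pairs
        self.running = {}                   # task_id -> started_at

    def enqueue(self, task_id, now):
        """Return False when the queue is full and the task is rejected."""
        if len(self.waiting) >= self.max_queue_size:
            return False
        self.waiting.append((task_id, now))
        return True

    def start_next(self, now):
        """Promote the oldest non-expired task to a job if a render slot is free."""
        if len(self.running) >= self.concurrency:
            return None
        while self.waiting:
            task_id, enqueued_at = self.waiting.popleft()
            if now - enqueued_at > self.queue_timeout:
                continue  # waited too long in the queue: abandoned
            self.running[task_id] = now
            return task_id
        return None

    def expire_jobs(self, now):
        """Cancel jobs that have been rendering for longer than job_timeout."""
        expired = [task_id for task_id, started_at in self.running.items()
                   if now - started_at > self.job_timeout]
        for task_id in expired:
            del self.running[task_id]
        return expired


q = RenderQueue(max_queue_size=2, queue_timeout=10, job_timeout=5, concurrency=1)
q.enqueue("a", now=0)
q.enqueue("b", now=0)
print(q.start_next(now=1))   # "a" becomes a job; "b" keeps waiting
print(q.expire_jobs(now=7))  # "a" has been rendering for 6 s, over the 5 s job timeout
```

Waiting tasks only cost a queue slot, so the queue limits can be generous, while running jobs each hold a browser instance and get the stricter timeout.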
- I believe you want to use rates rather than counts. Counts are kinda meaningless because they only get reset on every flush, which is kinda arbitrary if I understand that correctly.
Thanks for the info, I'll check it
- Please add axis measurement units to all the graphs and select appropriate units for each axis (for example, the latency graph uses short now while it should use ms).
Will do, thanks for the tip
- All the services dashboards expose heap and GC graphs - please add them, you can find the examples on MCS dashboard https://grafana.wikimedia.org/dashboard/db/mobileapps?orgId=1
Will do, thanks for the tip
- I do not believe all the metrics are correct. The request rate from the last graph is 0.30 but the sum of request rates by type and by format are both 0.20 - where did 0.10 requests/s go?
Those could be rejected requests (because of timeouts, failed jobs, etc.). I'll check the graphs once again.
In general, I created those Grafana dashboards as a proof of concept, without spending too much time on them. First, we wanted to see that there is traffic to the Proton service and that the system logs stats properly. I'll revisit each graph and verify that it works properly and shows valuable information.
Also, as I stated earlier, we currently log/track much more than necessary, mostly to verify the service health during rigorous testing. Once the service gets to production and stays healthy, we should be able to clean up the graphs and show only the important bits without unnecessary clutter.
Change 460113 had a related patch set uploaded (by Pmiazga; owner: Pmiazga):
[mediawiki/services/chromium-render@master] Hygiene: remove redundant .count
Change 462579 had a related patch set uploaded (by Pmiazga; owner: Pmiazga):
[mediawiki/services/chromium-render@master] Html2pdf route should return promise
Change 462579 merged by jenkins-bot:
[mediawiki/services/chromium-render@master] Html2pdf route should return promise
Mentioned in SAL (#wikimedia-operations) [2018-10-04T13:43:39Z] <pmiazga@deploy1001> Started deploy [proton/deploy@ecb9a0e]: Bugfix:handle undefined response and fix grafana stats (T186748,T201158)
Mentioned in SAL (#wikimedia-operations) [2018-10-04T13:46:35Z] <pmiazga@deploy1001> Finished deploy [proton/deploy@ecb9a0e]: Bugfix:handle undefined response and fix grafana stats (T186748,T201158) (duration: 02m 55s)
Change 460113 merged by Ppchelko:
[mediawiki/services/chromium-render@master] Hygiene: remove redundant .count
I fixed the latency graph: https://grafana.wikimedia.org/dashboard/db/proton?orgId=1&from=now-3h&to=now
However, it uncovered an even bigger issue: all I needed to do was restart the service manually via sudo service proton restart. Doing scap deploy --service-restart -f had no effect for some reason.
Indeed, that's really strange. I created T207263: Scap not restarting Proton to that effect.
Quoting @Pchelolo from IRC:
Pchelolo: also at this point I'm pretty satisfied with what we got on the dashboard
Pchelolo: what we have now is stable enough
I think we can resolve this task.