Monitor query / request concurrency on Blazegraph
Closed, Resolved · Public

Description

To be able to limit CPU consumption of Blazegraph (T206108), we need to better understand the factors involved. Query and / or request concurrency might be a good indicator of resource consumption.

Possible ways to measure it:

  1. Instrument Jetty to publish this metric over JMX, consumed by the current Prometheus JMX exporter
  2. Instrument Blazegraph (via a custom Servlet filter, or via Dropwizard Metrics) to publish HTTP request concurrency over JMX, consumed by the current Prometheus JMX exporter
  3. Publish the metric from Nginx via stub_status and nginx-prometheus-exporter
  4. Collect metrics already exposed by Blazegraph, adding them to our prometheus-blazegraph-exporter

Gehel created this task. Oct 3 2018, 11:48 AM
Gehel triaged this task as High priority.
Restricted Application added a subscriber: Aklapper. Oct 3 2018, 11:48 AM
Gehel added a comment. Oct 3 2018, 12:45 PM

Jetty

Jetty configuration is insane! Also, since we use jetty-runner to run Jetty, we would need to customize the classpath to add JMX support, which would substantially complicate the way we run Jetty. I'm not sure it is worth the added complexity.

For more context:

jetty-runner is a shaded jar containing the minimal subset of classes required to run Jetty (Jetty itself, javax.servlet.*, javax.el.*, ...). It supports only a minimal Jetty configuration. To enable JMX reporting, we would need to add at least jetty-jmx and potentially other supporting jars, depending on the functions we want to monitor. Note that those jars are present in the Blazegraph war file (and probably unused, which is not a great idea), but the war is loaded in a webapp-specific classloader, so they are not available to Jetty itself.

Instrument Blazegraph

Fairly trivial to implement. Instrumenting at the servlet / webapp level is a common way to collect metrics independently of the application server.
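A minimal sketch of the counting logic such a filter would need, written in Python purely for illustration. The real filter would be Java and would wrap the call to the rest of the filter chain in the same try/finally shape. The class name `InFlightGauge` is hypothetical and is not part of Blazegraph or of any existing patch.

```python
import threading
from contextlib import contextmanager


class InFlightGauge:
    """Thread-safe count of requests currently being handled (hypothetical helper)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._current = 0
        self._peak = 0

    @contextmanager
    def track(self):
        # Increment on entry; decrement in `finally` so that a failing
        # request does not leave the gauge permanently too high.
        with self._lock:
            self._current += 1
            self._peak = max(self._peak, self._current)
        try:
            yield
        finally:
            with self._lock:
                self._current -= 1

    @property
    def current(self):
        with self._lock:
            return self._current

    @property
    def peak(self):
        with self._lock:
            return self._peak
```

Each request handler would run inside `with gauge.track():`, and the value of `gauge.current` is what would be published as the concurrency metric.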

Nginx

This would require packaging prometheus-nginx-exporter and writing the appropriate integration in Puppet. It would only measure traffic going through nginx, so we would miss traffic from the updater. Collecting metrics at the nginx level could be interesting for other applications as well.
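For reference, a rough sketch of what reading concurrency out of stub_status involves, assuming the standard plain-text output of the nginx stub_status module. The function name is hypothetical, and this is not how the exporter itself is implemented. Note that "Active connections" includes idle keep-alive connections (Waiting), so Reading + Writing is probably closer to actual request concurrency.

```python
import re


def parse_stub_status(text):
    """Parse the plain-text output of nginx stub_status into a dict (sketch)."""
    active = re.search(r"Active connections:\s*(\d+)", text)
    totals = re.search(r"^\s*(\d+)\s+(\d+)\s+(\d+)\s*$", text, re.MULTILINE)
    states = re.search(
        r"Reading:\s*(\d+)\s+Writing:\s*(\d+)\s+Waiting:\s*(\d+)", text
    )
    if not (active and totals and states):
        raise ValueError("unrecognized stub_status output")
    reading, writing, waiting = (int(v) for v in states.groups())
    return {
        "active": int(active.group(1)),
        "accepts": int(totals.group(1)),
        "handled": int(totals.group(2)),
        "requests": int(totals.group(3)),
        "reading": reading,
        "writing": writing,
        "waiting": waiting,
        # Connections doing work right now, excluding idle keep-alives.
        "busy": reading + writing,
    }
```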

Existing Blazegraph metrics

Blazegraph exposes the number of running queries (BigdataRDFContext.getQueries()) through the StatusServlet. It does not seem to expose the same metric through the usual CountersServlet that we use to collect statistics. The CountersServlet exposes some similar metrics (/ Journal / Concurrency Manager / Read Service / Average Active Count for example), which are based on an exponentially decaying moving average of the number of active threads in the various journal executor services. It also exposes / Query Engine / operatorActiveCount, which is based on the number of running ChunkedRunningQuery instances.
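To illustrate what an exponentially decaying moving average implies for these counters, here is a small sketch. The weight and the sample values are assumptions chosen for readability, not the values Blazegraph uses. The point is that such an average lags behind the instantaneous count and flattens short bursts.

```python
def decayed_average(samples, weight=0.5):
    """Exponentially decaying moving average over successive samples.

    `weight` is the share given to the newest sample (assumed value,
    not taken from Blazegraph).
    """
    average = None
    history = []
    for sample in samples:
        if average is None:
            average = float(sample)
        else:
            average = (1 - weight) * average + weight * sample
        history.append(average)
    return history


# A short burst of 10 active threads between idle periods.
print(decayed_average([0, 0, 10, 10, 0]))
```

With these inputs the average reaches only 7.5 while the burst is at 10, and is still at 3.75 one sample after the burst has ended.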

Conclusion

We should probably start by collecting the various active counters already exposed by Blazegraph, and see if the data collected makes any sense, and if it correlates with high load. Once we have a better understanding, we can refine.
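As a rough sketch of that first step, the snippet below renders a couple of the counters named above as gauges in the Prometheus text exposition format. The metric names and the `render_gauges` helper are hypothetical and are not the names used by prometheus-blazegraph-exporter. Fetching and parsing the CountersServlet output is left out, since its exact format is not described here.

```python
# Counter paths as named in this task; the metric names on the right are
# hypothetical and chosen only for this example.
GAUGES = {
    "/Query Engine/operatorActiveCount": "blazegraph_operator_active_count",
    "/Journal/Concurrency Manager/Read Service/Average Active Count":
        "blazegraph_read_service_average_active_count",
}


def render_gauges(counter_values):
    """Render known counters as gauges in Prometheus text exposition format."""
    lines = []
    for path, metric in GAUGES.items():
        if path not in counter_values:
            continue  # counter not reported by this server
        lines.append("# HELP {} Blazegraph counter {}".format(metric, path))
        lines.append("# TYPE {} gauge".format(metric))
        lines.append("{} {}".format(metric, counter_values[path]))
    return "".join(line + "\n" for line in lines)
```

Gauges (rather than counters, in the Prometheus sense) fit here because the number of active operators or threads goes both up and down.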

Gehel updated the task description. Oct 3 2018, 8:03 PM

Change 464854 had a related patch set uploaded (by Mathew.onipe; owner: Mathew.onipe):
[operations/debs/prometheus-blazegraph-exporter@master] prometheus-blazegraph-exporter: added Query and Concurrency related counters

https://gerrit.wikimedia.org/r/464854

Change 464854 merged by Gehel:
[operations/debs/prometheus-blazegraph-exporter@master] prometheus-blazegraph-exporter: added Query and Concurrency related counters

https://gerrit.wikimedia.org/r/464854

Mentioned in SAL (#wikimedia-operations) [2018-10-09T10:35:53Z] <onimisionipe> deploying prometheus-blazegraph-exporter 0.6 on all wdqs clusters - T206123

I have added a counter for / Query Engine / runningQueriesCount. It can be seen on wdq9 now, and on the other servers after deployment. Can we get Prometheus to sample it and add it to the dashboards?

Restricted Application added a project: Wikidata. Wed, Nov 14, 7:44 PM

Change 473707 had a related patch set uploaded (by Mathew.onipe; owner: Mathew.onipe):
[operations/debs/prometheus-blazegraph-exporter@master] prometheus-blazegraph-exporter: added runningQueriesCount metric

https://gerrit.wikimedia.org/r/473707

Change 473707 merged by Gehel:
[operations/debs/prometheus-blazegraph-exporter@master] prometheus-blazegraph-exporter: added runningQueriesCount metric

https://gerrit.wikimedia.org/r/473707

Mentioned in SAL (#wikimedia-operations) [2018-11-15T17:20:56Z] <gehel> upgrade prometheus-blazegraph-exporter on all wdqs nodes - T206123

Gehel added a comment. Thu, Nov 15, 5:33 PM

Looks like the new metric is flowing to Prometheus.

Smalyshev closed this task as Resolved. Fri, Nov 16, 9:59 PM