
Decide on `service-runner` aggregated prometheus metrics and use of `service` label
Closed, Resolved · Public

Description

From T238658#5973616 (moving here to reduce noise there)

Sure it might very well be. I am fine with dropping it from statsd-exporter/service-runner itself as long as we expose it in some other way (e.g. a kubernetes label) so that we don't end up breaking all grafana dashboards.

Thanks for the context on service, @akosiaris; now it is much clearer in my mind what the status quo is. In the interest of compatibility and time (and picking our battles) I'd say let's go ahead and keep service as it is; the whole conversation is certainly interesting, but it's definitely one for another task.

Do we have enough information to decide whether to remove the service label from service-runner or leave it as is? I see the benefits of both approaches and am good either way.

I'd say yes. We can draft a plan to make sure we don't end up breaking dashboards. Below is such an effort.

Taking into account that

An off-the-top-of-my-head proposed plan would be:

  1. Deprecate the service label in service-runner.
  2. Scope this. I'd propose just kubernetes services that get deployed via helm, for the reasons above. This foregoes all service-runner services that are in other clusters, as they are due to migrate to kubernetes eventually anyway. Those should still use statsd-exporter as they currently do.
  3. Clarify in T242861 whether we want a service label in our charts and what it would stand for.
  4. Add the above label to all charts (and scaffolding). This probably includes non-service-runner services as well; the kubernetes service label is probably more generic anyway.
  5. Deploy new versions of said charts for all services.
  6. Proceed with the removal of the service label. Services are expected to pick that up as they upgrade dependencies. Non-kubernetes services should again have no problem, as they don't use the service label.

Thoughts?

Maybe tracking request duration, perhaps only for requests above a configurable lower bound, is a job for logging?

I agree, for troubleshooting purposes we should be logging which requests take longer than a configurable time.

For observability purposes, and keeping cardinality in mind, we could also aggregate status codes by hundreds to get [12345]xx counts. For paths we could maybe keep the label but aggregate on e.g. the first path level only.

Do we have enough information to decide whether or not to exchange this metric for logs emitted when requests take longer than a configurable time?

I don't feel like I do. In fact I am not sure at all that exchanging one for the other provides us with enough value. Currently, looking at those graphs can give insights that are not at all easy to obtain with kibana (that might change if kibana becomes more usable and easier to navigate/search, etc.). But as things stand, I feel I would lose functionality if we went down this way, to the point that I would trade retention for that level of granularity.

Switching to aggregating status codes by hundreds is easy enough. Keep in mind that this will apply to all consumers of hyperswitch, but I don't think it will be a problem.

What are we going to aggregate and where though? Some more information would be most welcome.

Event Timeline

Good idea forking the original task. Thanks for that!

I'd say yes. We can draft a plan to make sure we don't end up breaking dashboards. Below is such an effort.

Taking into account that

An off-the-top-of-my-head proposed plan would be:

  1. Deprecate the service label in service-runner.
  2. Scope this. I'd propose just kubernetes services that get deployed via helm, for the reasons above. This foregoes all service-runner services that are in other clusters, as they are due to migrate to kubernetes eventually anyway. Those should still use statsd-exporter as they currently do.
  3. Clarify in T242861 whether we want a service label in our charts and what it would stand for.
  4. Add the above label to all charts (and scaffolding). This probably includes non-service-runner services as well; the kubernetes service label is probably more generic anyway.
  5. Deploy new versions of said charts for all services.
  6. Proceed with the removal of the service label. Services are expected to pick that up as they upgrade dependencies. Non-kubernetes services should again have no problem, as they don't use the service label.

Thoughts?

Sounds good to me. Making the service label toggleable for service-runner and hyperswitch will be really easy if there is a way to detect that it's running in k8s. Is there an environment variable or config option we can inspect, or do we need to wire one up?

I don't feel like I do. In fact I am not sure at all that exchanging one for the other provides us with enough value. Currently, looking at those graphs can give insights that are not at all easy to obtain with kibana (that might change if kibana becomes more usable and easier to navigate/search, etc.). But as things stand, I feel I would lose functionality if we went down this way, to the point that I would trade retention for that level of granularity.

Unfortunately retention time is a global setting in Prometheus and affects all time series. Upstream recommends not storing unbounded sets of values, because each label:value pair is an additional time series. Having a path label can make sense if there are no path parameters used at all. However, when path parameters enter the mix, the path label could become unbounded and lead to Prometheus running out of disk space faster than anticipated. That said, I have not experienced this, so I can only surmise what the potential fallout would be; nor do I know exactly what our tolerance for unique time series is.
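
To make the cardinality math concrete, here's a rough back-of-the-envelope sketch (all numbers invented for illustration, not measurements of our setup):

// Invented numbers, purely illustrative of how label cardinality multiplies.
const paths = 50;        // distinct path label values (unbounded if parameters leak in)
const statusCodes = 10;  // distinct status codes actually emitted
const methods = 2;       // e.g. GET and POST
const instances = 8;     // pods being scraped

// Every unique label combination is its own time series:
console.log(paths * statusCodes * methods * instances); // 8000 series for a single metric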

Request duration and path are represented in access logs and, if available, tracing data. Kibana can graph numeric data, but it's certainly not as convenient as having it in grafana. This could change, though, as grafana does support ES as a data source. I did notice that this functionality would not work in logstash right now because duration fields are indexed as text fields. We hope to address this as we move forward implementing the common logging schema.

I agree with your point that this data is very valuable. Moving ahead with keeping it until a better storage and rendering solution is available would be agreeable to me, but I would ask @fgiunchedi to help with this decision, since it means operating contrary to upstream recommendations.

What are we going to aggregate and where though? Some more information would be most welcome.

If I understand correctly, @fgiunchedi proposed to have hyperswitch aggregate on the status code's hundreds place value. For example: status_code in ["1xx", "2xx", "3xx", "4xx", "5xx"] and not status_code in [200, 201, 302, 401, 413, 404, 502, 503, ... et al.]. Essentially doing away with specific status codes and grouping them into "families." This caps the status code label at a maximum of five time series.
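
For concreteness, a minimal sketch of that aggregation (illustrative only, not the actual hyperswitch implementation; the function name is made up):

// Collapse a specific HTTP status code into its "family", so the
// status label is bounded at five values (plus a fallback).
function statusFamily(statusCode) {
    const code = Number(statusCode);
    if (!Number.isInteger(code) || code < 100 || code > 599) {
        return 'unknown';
    }
    return `${Math.floor(code / 100)}xx`;
}

// statusFamily(200) === '2xx', statusFamily(503) === '5xx'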

colewhite triaged this task as Medium priority. Mar 17 2020, 9:45 PM

Good idea forking the original task. Thanks for that!

I'd say yes. We can draft a plan to make sure we don't end up breaking dashboards. Below is such an effort.

Taking into account that

An off-the-top-of-my-head proposed plan would be:

  1. Deprecate the service label in service-runner.
  2. Scope this. I'd propose just kubernetes services that get deployed via helm, for the reasons above. This foregoes all service-runner services that are in other clusters, as they are due to migrate to kubernetes eventually anyway. Those should still use statsd-exporter as they currently do.
  3. Clarify in T242861 whether we want a service label in our charts and what it would stand for.
  4. Add the above label to all charts (and scaffolding). This probably includes non-service-runner services as well; the kubernetes service label is probably more generic anyway.
  5. Deploy new versions of said charts for all services.
  6. Proceed with the removal of the service label. Services are expected to pick that up as they upgrade dependencies. Non-kubernetes services should again have no problem, as they don't use the service label.

Thoughts?

Sounds good to me. Making the service label toggleable for service-runner and hyperswitch will be really easy if there is a way to detect that it's running in k8s. Is there an environment variable or config option we can inspect, or do we need to wire one up?

Yes. MANY!

KUBERNETES_PORT=tcp://10.192.72.1:443
KUBERNETES_PORT_443_TCP=tcp://10.192.72.1:443
KUBERNETES_PORT_443_TCP_ADDR=10.192.72.1
KUBERNETES_PORT_443_TCP_PORT=443
KUBERNETES_PORT_443_TCP_PROTO=tcp
KUBERNETES_SERVICE_HOST=10.192.72.1
KUBERNETES_SERVICE_PORT=443
KUBERNETES_SERVICE_PORT_HTTPS=443

And of course we can always just add a new one if we feel like it. In fact, I would propose we go down that way, as we fully control it. Any naming preferences?
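
For illustration, a detection helper based on those variables could look roughly like this (hypothetical function, not existing service-runner code):

// The kubelet injects KUBERNETES_SERVICE_HOST/PORT into every container,
// so their presence is a reasonable "are we running in k8s?" signal.
function runningInKubernetes(env = process.env) {
    return Boolean(env.KUBERNETES_SERVICE_HOST && env.KUBERNETES_SERVICE_PORT);
}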

I don't feel like I do. In fact I am not sure at all that exchanging one for the other provides us with enough value. Currently, looking at those graphs can give insights that are not at all easy to obtain with kibana (that might change if kibana becomes more usable and easier to navigate/search, etc.). But as things stand, I feel I would lose functionality if we went down this way, to the point that I would trade retention for that level of granularity.

Unfortunately retention time is a global setting in Prometheus and affects all time series. Upstream recommends not storing unbounded sets of values, because each label:value pair is an additional time series. Having a path label can make sense if there are no path parameters used at all. However, when path parameters enter the mix, the path label could become unbounded and lead to Prometheus running out of disk space faster than anticipated. That said, I have not experienced this, so I can only surmise what the potential fallout would be; nor do I know exactly what our tolerance for unique time series is.

Good point. I guess we don't have to report metrics with path parameters embedded and can instead just stick with the paths?

Request duration and path are represented in access logs and, if available, tracing data. Kibana can graph numeric data, but it's certainly not as convenient as having it in grafana. This could change, though, as grafana does support ES as a data source. I did notice that this functionality would not work in logstash right now because duration fields are indexed as text fields. We hope to address this as we move forward implementing the common logging schema.

I agree with your point that this data is very valuable. Moving ahead with keeping it until a better storage and rendering solution is available would be agreeable to me, but I would ask @fgiunchedi to help with this decision, since it means operating contrary to upstream recommendations.

FWIW, I'd say that if we don't allow parameters to expand the reported paths (and I see no reason to do that anyway), we keep the set bounded and thus, I hope, acceptable?

What are we going to aggregate and where though? Some more information would be most welcome.

If I understand correctly, @fgiunchedi proposed to have hyperswitch aggregate on the status code's hundreds place value. For example: status_code in ["1xx", "2xx", "3xx", "4xx", "5xx"] and not status_code in [200, 201, 302, 401, 413, 404, 502, 503, ... et al.]. Essentially doing away with specific status codes and grouping them into "families." This caps the status code label at a maximum of five time series.

Ah cool, thanks for explaining it. service-runner already does that, albeit it emits the "[12345]xx" part as an additional metric, so it would mean we just drop the more specific ones. In principle, I am fine with 1xx and 2xx. I would be ok with 3xx as well, although I have found it useful in the past to be able to differentiate between 301 and 302 every now and then, but 4xx and 5xx would probably result in a loss of visibility for errors. Note, btw, that most apps currently don't really use that many statuses, and the set is currently bounded at 10 distinct and 5 aggregated statuses across the entirety of all apps; see https://w.wiki/Kh4. Just dropping "ALL" and the aggregated ones would achieve a not-so-dissimilar effect.

And of course we can always just add a new one if we feel like it. In fact, I would propose we go down that way, as we fully control it. Any naming preferences?

A service-runner config option to selectively enable the service label is reasonable. I'll look into making that happen.

Good point. I guess we don't have to report metrics with path parameters embedded and can instead just stick with the paths?
FWIW, I'd say that if we don't allow parameters to expand the reported paths (and I see no reason to do that anyway), we keep the set bounded and thus, I hope, acceptable?

TIL hyperswitch is doing this in a sane way already. Rather than submitting metrics for resolved path parameters, it submits metrics for usage of the unresolved path. This makes the path label bounded by the routes as they are defined in the OpenAPI spec. This may have been your point all along and I missed it. Sorry about that.
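
As a sketch of why this keeps the label bounded (made-up routes and helper, not hyperswitch code), the metric gets labelled with the route template that matched, never the resolved URL:

// Map a concrete request path back to the route template it matched, so the
// path label is bounded by the routes in the spec, not by every resolved URL.
const routes = ['/page/{title}', '/user/{id}/contributions']; // made-up routes

function templateFor(requestPath) {
    for (const template of routes) {
        const pattern = new RegExp('^' + template.replace(/\{[^}]+\}/g, '[^/]+') + '$');
        if (pattern.test(requestPath)) {
            return template;
        }
    }
    return 'unknown';
}

// templateFor('/page/Foo') === '/page/{title}', same as templateFor('/page/Bar')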

Ah cool, thanks for explaining it. service-runner already does that, albeit it emits the "[12345]xx" part as an additional metric, so it would mean we just drop the more specific ones. In principle, I am fine with 1xx and 2xx. I would be ok with 3xx as well, although I have found it useful in the past to be able to differentiate between 301 and 302 every now and then, but 4xx and 5xx would probably result in a loss of visibility for errors. Note, btw, that most apps currently don't really use that many statuses, and the set is currently bounded at 10 distinct and 5 aggregated statuses across the entirety of all apps; see https://w.wiki/Kh4. Just dropping "ALL" and the aggregated ones would achieve a not-so-dissimilar effect.

I agree that we don't gain much by aggregating the status code and can lose some valuable data. I'm good to use the full status code rather than aggregates (current behavior).

And of course we can always just add a new one if we feel like it. In fact, I would propose we go down that way, as we fully control it. Any naming preferences?

A service-runner config option to selectively enable the service label is reasonable. I'll look into making that happen.

Cool. Let me know the name of the env variable so I start adding it to charts. Thanks!

Good idea forking the original task. Thanks for that!

+1 !

I'd say yes. We can draft a plan to make sure we don't end up breaking dashboards. Below is such an effort.

Taking into account that

An off-the-top-of-my-head proposed plan would be:

  1. Deprecate the service label in service-runner.
  2. Scope this. I'd propose just kubernetes services that get deployed via helm, for the reasons above. This foregoes all service-runner services that are in other clusters, as they are due to migrate to kubernetes eventually anyway. Those should still use statsd-exporter as they currently do.
  3. Clarify in T242861 whether we want a service label in our charts and what it would stand for.
  4. Add the above label to all charts (and scaffolding). This probably includes non-service-runner services as well; the kubernetes service label is probably more generic anyway.
  5. Deploy new versions of said charts for all services.
  6. Proceed with the removal of the service label. Services are expected to pick that up as they upgrade dependencies. Non-kubernetes services should again have no problem, as they don't use the service label.

Thoughts?

Sounds good to me. Making the service label toggleable for service-runner and hyperswitch will be really easy if there is a way to detect that it's running in k8s. Is there an environment variable or config option we can inspect, or do we need to wire one up?

Sounds good to me as well, the crucial point in my mind being "Clarify in T242861 whether we want a service label in our charts and what it would stand for." It seems from that task that, between application name, release, etc., adding a service label would essentially duplicate some of those values?

I don't feel like I do. In fact I am not sure at all that exchanging one for the other provides us with enough value. Currently, looking at those graphs can give insights that are not at all easy to obtain with kibana (that might change if kibana becomes more usable and easier to navigate/search, etc.). But as things stand, I feel I would lose functionality if we went down this way, to the point that I would trade retention for that level of granularity.

Unfortunately retention time is a global setting in Prometheus and affects all time series. Upstream recommends not storing unbounded sets of values, because each label:value pair is an additional time series. Having a path label can make sense if there are no path parameters used at all. However, when path parameters enter the mix, the path label could become unbounded and lead to Prometheus running out of disk space faster than anticipated. That said, I have not experienced this, so I can only surmise what the potential fallout would be; nor do I know exactly what our tolerance for unique time series is.

Request duration and path are represented in access logs and, if available, tracing data. Kibana can graph numeric data, but it's certainly not as convenient as having it in grafana. This could change, though, as grafana does support ES as a data source. I did notice that this functionality would not work in logstash right now because duration fields are indexed as text fields. We hope to address this as we move forward implementing the common logging schema.

For the service-runner-in-k8s use case the logging should already be available in logstash/kibana, and I believe the access log is available too, so yes, the full paths are already available in Kibana for deep dives. So, tl;dr: yes, if we're reporting url paths already deduplicated and status codes non-aggregated, we should be fine I think!

@akosiaris, @colewhite, @Ottomata and I met today to bikesh^W better understand what service means, along with other related labels.

Notes from the meeting, please fix/correct as needed:

  • The idea of service is the name of a service we offer to internal and external users. Ideally this label is attached both to metrics and logs.
  • "service" as a name/concept is generally overloaded, and k8s is no exception; e.g. does k8s's "service" idea match ours?
  • We also have app in charts, although that is the software that runs the service; examples below.
  • Should the application add service among its labels for Prometheus metrics?
    • Filippo: no, the application shouldn't know its service. It will come from either static configuration (e.g. Prometheus in production) or other Prometheus discovery (e.g. k8s metadata).
    • Andrew: it would be nice to preview metrics as close as possible to what would result in production; for this we could make service optional for service-runner, e.g. activated via environment variables in testing.

Examples of chart / service / app

CP:
    chart: change-prop
    app: change-prop
    service: change-prop

JQ:
    chart: jobqueue
    app: change-prop
    service: cpjobqueue

Another example:

chart: eventgate
app: eventgate-wikimedia # (or maybe just eventgate?)
service: eventgate-analytics


chart: eventgate
app: eventgate-wikimedia # (or maybe just eventgate?)
service: eventgate-main
# etc.

Should the application add service among its labels for Prometheus metrics?

I'd argue that it is necessary for the work that Cole is doing in service-runner for the app to know the service name. Filippo has made the point that it might not be necessary for prometheus metrics, as the service label can be added by the prometheus server-side configs. However, the metrics abstraction makes the service-runner metrics API look the same for both statsd and prometheus, and statsd needs the service name in its metric names (since it doesn't have labels). We could manually avoid setting it as a prometheus label, but that seems a bit awkward and would require adding some special-case code to service-runner.

service-runner should be WMF-infrastructure agnostic, so I think adding the service name label here makes sense. We can and should make the labels added by the k8s exporter and the prometheus server override any app-provided ones.

The problem the service label solves is aggregation on "instance-level" metrics that may be useful to query in a multi-instance view.

In the libraries, nodejs_ and express_ prefixed metrics get the service label only because we aren't guaranteeing the prefix will match what the developer configures for a metric name.

This could be solved by making the service-runner PrometheusClient prefix every metric with the service name. By doing this we eliminate the need for the label altogether.
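
A rough sketch of that prefixing idea with prom-client (illustrative only; the metric and service names are made up, and note that the thread later decides against this approach):

const client = require('prom-client');

const serviceName = 'example_service'; // hypothetical; would come from service-runner config

// Bake the service name into the metric name instead of adding a service label:
const requestCounter = new client.Counter({
    name: `${serviceName}_requests_total`, // made-up metric name
    help: 'Total requests handled by this service instance.'
});

requestCounter.inc();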

I've put together a couple PRs (service-template-node, service-runner) for this functionality and incorporated this feedback into the hyperswitch PR.

Let me know what you think?

On further discussion, we decided against this.

Another example:

chart: eventgate
app: eventgate-wikimedia # (or maybe just eventgate?)
service: eventgate-analytics


chart: eventgate
app: eventgate-wikimedia # (or maybe just eventgate?)
service: eventgate-main
# etc.

Should the application add service among its labels for Prometheus metrics?

I'd argue that it is necessary for the work that Cole is doing in service-runner for the app to know the service name. Filippo has made the point that it might not be necessary for prometheus metrics, as the service label can be added by the prometheus server-side configs. However, the metrics abstraction makes the service-runner metrics API look the same for both statsd and prometheus, and statsd needs the service name in its metric names (since it doesn't have labels). We could manually avoid setting it as a prometheus label, but that seems a bit awkward and would require adding some special-case code to service-runner.

service-runner should be WMF-infrastructure agnostic, so I think adding the service name label here makes sense. We can and should make the labels added by the k8s exporter and the prometheus server override any app-provided ones.

I think we'll have to agree to disagree on this :) I don't get why we'd need to special case service-runner with service while other k8s services don't have it (e.g. golang services), and we seem to be fine without it?

I believe the goal is to have service be standard for all k8s services regardless of their language/framework, and IMHO it should be attached by Prometheus from k8s labels during discovery, without the services themselves knowing.

At any rate, I won't insist further; I'm just advising against it. Having said that, I'm ok with whichever consensus we reach.

e.g. golang services), and we seem to be fine without it?

If they were using statsd, they wouldn't be fine without it, right? If we weren't using k8s, they wouldn't be fine without it?

service-runner is a generic nodejs library that can be, and is, run without prometheus and without k8s.

I believe the goal is to have service be standard for all k8s services regardless of their language/framework, and IMHO it should be attached by Prometheus from k8s labels during discovery,

I agree with this! And I think we can do both. service-runner can still know its service name and so can k8s. k8s should set this prometheus label and it should override whatever service-runner might set.

@akosiaris in this PR, we selectively disable the service label based on whether the environment variable METRICS_SERVICE_LABEL_ENABLED is "false".
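
Roughly, the toggle amounts to something like this (a sketch, not necessarily the exact code in the PR):

// The service label is added unless the environment explicitly opts out.
const serviceName = 'example-service'; // hypothetical; normally the configured service name
const serviceLabelEnabled = process.env.METRICS_SERVICE_LABEL_ENABLED !== 'false';

const labels = { path: '/page/{title}', status: '200' }; // example label set
if (serviceLabelEnabled) {
    labels.service = serviceName;
}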

e.g. golang services), and we seem to be fine without it?

If they were using statsd, they wouldn't be fine without it, right? If we weren't using k8s, they wouldn't be fine without it?

service-runner is a generic nodejs library that can be, and is, run without prometheus and without k8s.

In the non-k8s case additional labels would come from Prometheus' configuration, like e.g. we do in production nowadays with the cluster label, which is supplied by Puppet(DB) discovery.

re: using statsd in general, in my mind it does raise the question of how long we keep supporting statsd, but that's a much bigger discussion and way out of scope for this task!

I believe the goal is to have service be standard for all k8s services regardless of their language/framework, and IMHO it should be attached by Prometheus from k8s labels during discovery,

I agree with this! And I think we can do both. service-runner can still know its service name and so can k8s. k8s should set this prometheus label and it should override whatever service-runner might set.

+1 on doing both, I think what @colewhite proposed in https://github.com/wikimedia/service-runner/pull/227 makes sense as a toggle.

Change to heap metrics merged into service-runner/prometheus_metrics branch. Thanks @Pchelolo!

The service label can be disabled by setting METRICS_SERVICE_LABEL_ENABLED to 'false'.

We'll need a review+merge of https://github.com/wikimedia/service-template-node/pull/131 to complete this task.

colewhite claimed this task.

Change to router metrics in service-template-node merged.