
cxserver missing important metrics after service-runner 2.8.1 upgrade
Open, Medium, Public

Description

After upgrading to service-runner 2.8.1, cxserver is missing important metrics, so we can't deploy the latest version to production.

To see the full list, compare the output of curl http://10.64.65.205:9102/metrics (production) with that of curl http://10.64.75.121:9090/metrics (staging).
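
A quick way to spot the missing names (a sketch, not part of the original report; hosts and ports as given above, scratch file names are arbitrary) is to strip the metric names from both endpoints and compare the sorted lists:

$ curl -s http://10.64.65.205:9102/metrics | awk '!/^#/ { sub(/[{ ].*/, ""); print }' | sort -u > prod-metrics.txt
$ curl -s http://10.64.75.121:9090/metrics | awk '!/^#/ { sub(/[{ ].*/, ""); print }' | sort -u > staging-metrics.txt
$ comm -23 prod-metrics.txt staging-metrics.txt   # metric names present in production but missing in staging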

Event Timeline

I'm not sure where we lost the metrics, but we are missing the following metrics after the upgrade. @akosiaris, have we seen a similar issue elsewhere after upgrading to Prometheus metrics?

  1. cxserver_robots_txt_GET_200 Metric autogenerated by statsd_exporter.
  2. cxserver_robots_txt_GET_2xx Metric autogenerated by statsd_exporter.
  3. cxserver_robots_txt_GET_ALL Metric autogenerated by statsd_exporter.
  4. go_gc_duration_seconds A summary of the GC invocation durations.
  5. go_goroutines Number of goroutines that currently exist.
  6. go_info Information about the Go environment.
  7. go_memstats_alloc_bytes Number of bytes allocated and still in use.
  8. go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
  9. go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table.
  10. go_memstats_frees_total Total number of frees.
  11. go_memstats_gc_cpu_fraction The fraction of this program's available CPU time used by the GC since the program started.
  12. go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.
  13. go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use.
  14. go_memstats_heap_idle_bytes Number of heap bytes waiting to be used.
  15. go_memstats_heap_inuse_bytes Number of heap bytes that are in use.
  16. go_memstats_heap_objects Number of allocated objects.
  17. go_memstats_heap_released_bytes Number of heap bytes released to OS.
  18. go_memstats_heap_sys_bytes Number of heap bytes obtained from system.
  19. go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.
  20. go_memstats_lookups_total Total number of pointer lookups.
  21. go_memstats_mallocs_total Total number of mallocs.
  22. go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures.
  23. go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system.
  24. go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures.
  25. go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system.
  26. go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place.
  27. go_memstats_other_sys_bytes Number of bytes used for other system allocations.
  28. go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator.
  29. go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator.
  30. go_memstats_sys_bytes Number of bytes obtained from system.
  31. go_threads Number of OS threads created.
  32. http_request_duration_microseconds The HTTP request latencies in microseconds.
  33. http_request_size_bytes The HTTP request sizes in bytes.
  34. http_requests_total Total number of HTTP requests made.
  35. http_response_size_bytes The HTTP response sizes in bytes.
  36. process_cpu_seconds_total Total user and system CPU time spent in seconds.
  37. process_max_fds Maximum number of open file descriptors.
  38. process_open_fds Number of open file descriptors.
  39. process_resident_memory_bytes Resident memory size in bytes.
  40. process_start_time_seconds Start time of the process since unix epoch in seconds.
  41. process_virtual_memory_bytes Virtual memory size in bytes.
  42. process_virtual_memory_max_bytes Maximum amount of virtual memory available in bytes.
  43. service_runner_gc_microseconds Metric autogenerated by statsd_exporter.
  44. service_runner_request_duration_seconds Metric autogenerated by statsd_exporter.
  45. service_runner_rss_heap_kilobytes Metric autogenerated by statsd_exporter.
  46. service_runner_total_heap_kilobytes Metric autogenerated by statsd_exporter.
  47. service_runner_used_heap_kilobytes Metric autogenerated by statsd_exporter.
  48. statsd_exporter_build_info A metric with a constant '1' value labeled by version, revision, branch, and goversion from which statsd_exporter was built.
  49. statsd_exporter_events_total The total number of StatsD events seen.
  50. statsd_exporter_events_unmapped_total The total number of StatsD events no mapping was found for.
  51. statsd_exporter_lines_total The total number of StatsD lines received.
  52. statsd_exporter_loaded_mappings The current number of configured metric mappings.
  53. statsd_exporter_samples_total The total number of StatsD samples received.
  54. statsd_exporter_tag_errors_total The number of errors parsing DogStatsD tags.
  55. statsd_exporter_tags_total The total number of DogStatsD tags processed.
  56. statsd_exporter_tcp_connection_errors_total The number of errors encountered reading from TCP.
  57. statsd_exporter_tcp_connections_total The total number of TCP connections handled.
  58. statsd_exporter_tcp_too_long_lines_total The number of lines discarded due to being too long.
  59. statsd_exporter_udp_packets_total The total number of StatsD packets received over UDP.

> cxserver_robots_txt_GET_200 Metric autogenerated by statsd_exporter.
> cxserver_robots_txt_GET_2xx Metric autogenerated by statsd_exporter.
> cxserver_robots_txt_GET_ALL Metric autogenerated by statsd_exporter.

These are robots.txt accesses, triggered not by cxserver but by service-runner. I'm not sure whether this is an intended side effect of the service-runner 2.8 upgrade or not. @colewhite, @Pchelolo do you know?

I think it is important to clarify that the change in T271711: Update cxserver to service-runner 2.8.1 just switched to service-runner's new metrics interface as per the migration guide. I could not find any documentation about which metrics are new or removed in service-runner 2.8. We may need help from the service-runner developers to find that out.

I would assume that this is not a case of wrong tooling (server configuration, ports, etc.) since we are not losing every metric, only a small subset. Please correct me if that is not the case.

@KartikMistry, all the metrics that are reported by the cxserver codebase are coming out correctly, right? Mainly machine translation usage, access to our routes, internal errors, etc.

> @KartikMistry, all the metrics that are reported by the cxserver codebase are coming out correctly, right? Mainly machine translation usage, access to our routes, internal errors, etc.

Yes. That's correct.

translate_*_charcount, translate_*_200, translate_*_500 are correctly generated.
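
A quick way to confirm that (a sketch, using the staging endpoint from the description; adjust host/port as needed):

$ curl -s http://10.64.75.121:9090/metrics | grep -E 'translate_.*_(charcount|200|500)'   # cxserver-generated counters should be listed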

> I'm not sure where we lost the metrics, but we are missing the following metrics after the upgrade. @akosiaris, have we seen a similar issue elsewhere after upgrading to Prometheus metrics?

Not that I know of. T277857 is somewhat similar, but maybe not related.

Here is a quick rundown of the various metrics.

These, while not strictly necessary, could be useful to have around:

  1. cxserver_robots_txt_GET_200 Metric autogenerated by statsd_exporter.
  2. cxserver_robots_txt_GET_2xx Metric autogenerated by statsd_exporter.
  3. cxserver_robots_txt_GET_ALL Metric autogenerated by statsd_exporter.

Those are OK not to have around; they are internal to prometheus-statsd-exporter (we don't really rely on them in any way that I know of):

> # go_gc_duration_seconds A summary of the GC invocation durations.
> # go_goroutines Number of goroutines that currently exist.
> # go_info Information about the Go environment.
> # go_memstats_alloc_bytes Number of bytes allocated and still in use.
> # go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
> # go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table.
> # go_memstats_frees_total Total number of frees.
> # go_memstats_gc_cpu_fraction The fraction of this program's available CPU time used by the GC since the program started.
> # go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.
> # go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use.
> # go_memstats_heap_idle_bytes Number of heap bytes waiting to be used.
> # go_memstats_heap_inuse_bytes Number of heap bytes that are in use.
> # go_memstats_heap_objects Number of allocated objects.
> # go_memstats_heap_released_bytes Number of heap bytes released to OS.
> # go_memstats_heap_sys_bytes Number of heap bytes obtained from system.
> # go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.
> # go_memstats_lookups_total Total number of pointer lookups.
> # go_memstats_mallocs_total Total number of mallocs.
> # go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures.
> # go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system.
> # go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures.
> # go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system.
> # go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place.
> # go_memstats_other_sys_bytes Number of bytes used for other system allocations.
> # go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator.
> # go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator.
> # go_memstats_sys_bytes Number of bytes obtained from system.
> # go_threads Number of OS threads created.
> # http_request_duration_microseconds The HTTP request latencies in microseconds.
> # http_request_size_bytes The HTTP request sizes in bytes.
> # http_requests_total Total number of HTTP requests made.
> # http_response_size_bytes The HTTP response sizes in bytes.
> # process_cpu_seconds_total Total user and system CPU time spent in seconds.
> # process_max_fds Maximum number of open file descriptors.
> # process_open_fds Number of open file descriptors.
> # process_resident_memory_bytes Resident memory size in bytes.
> # process_start_time_seconds Start time of the process since unix epoch in seconds.
> # process_virtual_memory_bytes Virtual memory size in bytes.
> # process_virtual_memory_max_bytes Maximum amount of virtual memory available in bytes.

These are service-runner specific; we have some of them, but others are missing. The most important is service_runner_request_duration_seconds. This probably needs some cooperation between cxserver and service-runner to work fine (cxserver using the correct library call of service-runner):

  1. service_runner_gc_microseconds Metric autogenerated by statsd_exporter.
  2. service_runner_request_duration_seconds Metric autogenerated by statsd_exporter.
  3. service_runner_rss_heap_kilobytes Metric autogenerated by statsd_exporter.
  4. service_runner_total_heap_kilobytes Metric autogenerated by statsd_exporter.
  5. service_runner_used_heap_kilobytes Metric autogenerated by statsd_exporter.

These are also internal to statsd_exporter; it's OK not to have them after the upgrade:

> # statsd_exporter_build_info A metric with a constant '1' value labeled by version, revision, branch, and goversion from which statsd_exporter was built.
> # statsd_exporter_events_total The total number of StatsD events seen.
> # statsd_exporter_events_unmapped_total The total number of StatsD events no mapping was found for.
> # statsd_exporter_lines_total The total number of StatsD lines received.
> # statsd_exporter_loaded_mappings The current number of configured metric mappings.
> # statsd_exporter_samples_total The total number of StatsD samples received.
> # statsd_exporter_tag_errors_total The number of errors parsing DogStatsD tags.
> # statsd_exporter_tags_total The total number of DogStatsD tags processed.
> # statsd_exporter_tcp_connection_errors_total The number of errors encountered reading from TCP.
> # statsd_exporter_tcp_connections_total The total number of TCP connections handled.
> # statsd_exporter_tcp_too_long_lines_total The number of lines discarded due to being too long.
> # statsd_exporter_udp_packets_total The total number of StatsD packets received over UDP.

So, getting service_runner_request_duration_seconds to work will probably fix this.

Did you update the entire service-template-node or just the service-runner? You need this patch to be present in service-template-node: https://github.com/wikimedia/service-template-node/pull/127

Did you, after that, follow the upgrade guide? https://github.com/wikimedia/service-runner/blob/master/doc/2.7-2.8_Migration_Guide.md

@Pchelolo We just updated the service-runner library to 2.8 and then followed the migration guide to update the metrics reporting interface (https://gerrit.wikimedia.org/r/c/mediawiki/services/cxserver/+/670441/). cxserver is an existing project (since 2014), so the template project cannot be used as such, but it can be referenced for these changes.

https://github.com/wikimedia/service-template-node/pull/127 is confusing because it does not upgrade master to service-runner 2.8 but targets the prometheus_metrics branch. I see the pull request was merged to that branch and not to master.

If you are suggesting that we refer to https://github.com/wikimedia/service-template-node/pull/127 and make the same change in cxserver, please confirm and we can do it.

Oh right, it's still in the branch in the template. You can use this code https://github.com/wikimedia/service-template-node/pull/148 to see what needs to be done.

Alternatively, you can still upgrade your service onto newer service-template - there's a lot of goodies that come with it. You can follow https://www.mediawiki.org/wiki/ServiceTemplateNode/Updating to do that - it will require quite a lot of rebasing and conflict resolution, but it's a more automatic way of being up to date.

> Oh right, it's still in the branch in the template. You can use this code https://github.com/wikimedia/service-template-node/pull/148 to see what needs to be done.

I tried it and there were JS errors with that PR. I have given details there. I will wait till the PR is reviewed, tested, and merged.

> Alternatively, you can still upgrade your service onto newer service-template - there's a lot of goodies that come with it. You can follow https://www.mediawiki.org/wiki/ServiceTemplateNode/Updating to do that - it will require quite a lot of rebasing and conflict resolution, but it's a more automatic way of being up to date.

To be honest, that looks like a complicated workflow. If service-runner releases came with a changelog and a brief migration guide, even one that just points to an example update in ServiceTemplateNode, it would help us more. Since cxserver has been in active development for the last 6+ years, it has diverged too significantly from the template code for clean merges and rebases with it. The spacing overrides on top of the "wikimedia/server" eslint rules in ServiceTemplateNode make it even harder for projects that try to follow "wikimedia/server" with few overrides.

@santhosh I could not replicate the problem. Could you advise on how to replicate it?

Here is the process I followed:

$ cd cxserver
$ cp ../config.yaml config.yaml
$ git apply ../util.js.patch
$ docker pull node:10-buster
$ docker run -it --rm --net host --user node -v $(pwd):/home/node/src --workdir /home/node/src node:10-buster /bin/bash
$ npm install
$ npm run test
$ npm start dev
$ curl -s -i localhost:8080/v2?doc # 200
$ curl -s -i localhost:8080/_info # 200
$ curl -s -i localhost:8080/_info/name # 200
$ curl -s -i localhost:9090 | grep express # 200 - expected metrics present

> So, getting service_runner_request_duration_seconds to work will probably fix this.

The metric was renamed from service_runner_request_duration_seconds to express_router_request_duration_seconds_bucket per the request for histograms.

The service_runner prefix on the metric name was ultimately unclear, since the metric is part of service-template-node and is more closely aligned with metrics generated during request handling in the Express web framework. Ideally, this prefix should be replaced with the service name (cxserver), since this metric is generated by the service itself and lives in the service's source code. The distinction is blurred by the expected workflow, however: service-runner applications are normally expected to re-apply service-template-node on each service-runner upgrade.

The metric can be renamed by modifying util.js and replacing the express_router prefix with a more desirable one. If you want to remove the _bucket suffix, the metric should be changed from a histogram to a counter or gauge.
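
For example (a rough sketch, using the ports from the repro steps below and the renamed metric described here), one can exercise a route and then check that samples with the new prefix actually appear on the Prometheus endpoint:

$ curl -s -o /dev/null localhost:8080/_info                                           # drive one request through the express router
$ curl -s localhost:9090/metrics | grep '^express_router_request_duration_seconds'    # sample lines (buckets, sum, count) should show up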

> @santhosh I could not replicate the problem. Could you advise on how to replicate it?

Same steps as you followed, except: look at the server logs after the first curl. Here is the full console output from your steps: P15088 servicerunner-PR-148

$ cd cxserver
$ cp ../config.yaml config.yaml
$ git apply ../util.js.patch
$ docker pull node:10-buster
$ docker run -it --rm --net host --user node -v $(pwd):/home/node/src --workdir /home/node/src node:10-buster /bin/bash
$ npm install
$ npm run test
$ npm start dev
$ curl -s -i localhost:8080/v2?doc # 200

Take a look at the server logs now.

grep express will give a result because:

curl -s -i localhost:9090 | grep express
# HELP express_router_request_duration_seconds request duration handled by router in seconds
# TYPE express_router_request_duration_seconds histogram

But there are no actual metric samples logged.
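
In other words (a small sketch with the same ports as above), filtering out the # HELP / # TYPE comment lines shows whether any samples were actually recorded:

$ curl -s localhost:9090/metrics | grep -v '^#' | grep express_router
# empty output here means the metric is declared but no samples have been recorded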

> Same steps as you followed, except: look at the server logs after the first curl. Here is the full console output from your steps: P15088 servicerunner-PR-148

Thank you for the logs! I see possible differences in config.yaml and in node_modules.

The metric prefix in my environment is cxserver, possibly indicating a configuration option in config.yaml or elsewhere that overrides the metric prefix. Would you kindly paste your config.yaml?

The stack trace from the logs indicates an error generated by service-runner version 2.8.0. Would you kindly try refreshing your cached copy of cxserver's dependencies?

> Would you kindly paste your config.yaml?

This is my metrics configuration. It should be OK, since Prometheus metrics are coming out, except for this new one added by the util.js patch:

metrics:
  - type: log
    name: servicerunner
  - type: prometheus
    host: localhost
    port: 9090
  - type: statsd
    host: localhost
    port: 8125

> The stack trace from the logs indicates an error generated by service-runner version 2.8.0.

This is from the latest cxserver checkout, which has a dependency on service-runner 2.8.1.

> Would you kindly try refreshing your cached copy of cxserver's dependencies?

The whole point of using Docker is to avoid all these issues, right? I used the node:10-buster image just as you gave in the steps.

> The whole point of using Docker is to avoid all these issues, right? I used the node:10-buster image just as you gave in the steps.

Indeed, but the -v $(pwd):/home/node/src flag bind-mounts the source directory and persists node_modules to disk. This makes the result of npm install survive across docker runs.
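
So a clean reinstall inside the checkout (a sketch; this just forces npm to re-resolve the dependencies declared in package.json, including service-runner 2.8.1) should refresh the cached copy:

$ rm -rf node_modules
$ npm install    # reinstalls dependencies from package.json inside the bind-mounted source directory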

> This is my metrics configuration. It should be OK, since Prometheus metrics are coming out, except for this new one added by the util.js patch

Applying your metrics configuration surfaced the problem. The log client appears to be mutating the labels and that mutation leaks to other clients. Fix submitted: https://github.com/wikimedia/service-runner/pull/235

Thank you for the assistance!

Change 677421 had a related patch set uploaded (by Santhosh; author: Santhosh):

[mediawiki/services/cxserver@master] Update servicerunner to 2.8.2

https://gerrit.wikimedia.org/r/677421

Change 677421 merged by jenkins-bot:

[mediawiki/services/cxserver@master] Update servicerunner to 2.8.2

https://gerrit.wikimedia.org/r/677421

Change 677557 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] Update cxserver to 2021-04-07-062518-production

https://gerrit.wikimedia.org/r/677557

Change 677557 merged by jenkins-bot:

[operations/deployment-charts@master] Update cxserver to 2021-04-07-062518-production

https://gerrit.wikimedia.org/r/677557