
Monitoring and data collection for session storage service
Closed, Resolved (Public)

Description

The session storage service needs to export a sensible set of metrics. Grafana dashboards that visualize these metrics and provide operational insight also need to be created.

Additionally, as is the convention for services hosted at the WMF, it should be possible to run Icinga checks that perform a synthetic transaction according to a (discovered) specification.

See also: operations/software/service-checker

Event Timeline

Eevans triaged this task as Medium priority. Nov 8 2018, 8:45 PM
Eevans created this task.

Change 484315 had a related patch set uploaded (by Eevans; owner: Eevans):
[mediawiki/services/kask@master] Basic prometheus metrics support

https://gerrit.wikimedia.org/r/484315

For posterity's sake, here are the Prometheus metrics provided by default:

HTTP/1.1 200 OK
Content-Length: 5007
Content-Type: text/plain; version=0.0.4
Date: Tue, 15 Jan 2019 13:44:03 GMT

# HELP go_gc_duration_seconds A summary of the GC invocation durations.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 0
go_gc_duration_seconds{quantile="0.25"} 0
go_gc_duration_seconds{quantile="0.5"} 0
go_gc_duration_seconds{quantile="0.75"} 0
go_gc_duration_seconds{quantile="1"} 0
go_gc_duration_seconds_sum 0
go_gc_duration_seconds_count 0
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 13
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 900928
# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
# TYPE go_memstats_alloc_bytes_total counter
go_memstats_alloc_bytes_total 900928
# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table.
# TYPE go_memstats_buck_hash_sys_bytes gauge
go_memstats_buck_hash_sys_bytes 2473
# HELP go_memstats_frees_total Total number of frees.
# TYPE go_memstats_frees_total counter
go_memstats_frees_total 500
# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.
# TYPE go_memstats_gc_sys_bytes gauge
go_memstats_gc_sys_bytes 131072
# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use.
# TYPE go_memstats_heap_alloc_bytes gauge
go_memstats_heap_alloc_bytes 900928
# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used.
# TYPE go_memstats_heap_idle_bytes gauge
go_memstats_heap_idle_bytes 57344
# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use.
# TYPE go_memstats_heap_inuse_bytes gauge
go_memstats_heap_inuse_bytes 1.581056e+06
# HELP go_memstats_heap_objects Number of allocated objects.
# TYPE go_memstats_heap_objects gauge
go_memstats_heap_objects 8709
# HELP go_memstats_heap_released_bytes_total Total number of heap bytes released to OS.
# TYPE go_memstats_heap_released_bytes_total counter
go_memstats_heap_released_bytes_total 0
# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system.
# TYPE go_memstats_heap_sys_bytes gauge
go_memstats_heap_sys_bytes 1.6384e+06
# HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.
# TYPE go_memstats_last_gc_time_seconds gauge
go_memstats_last_gc_time_seconds 0
# HELP go_memstats_lookups_total Total number of pointer lookups.
# TYPE go_memstats_lookups_total counter
go_memstats_lookups_total 39
# HELP go_memstats_mallocs_total Total number of mallocs.
# TYPE go_memstats_mallocs_total counter
go_memstats_mallocs_total 9209
# HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures.
# TYPE go_memstats_mcache_inuse_bytes gauge
go_memstats_mcache_inuse_bytes 4800
# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system.
# TYPE go_memstats_mcache_sys_bytes gauge
go_memstats_mcache_sys_bytes 16384
# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures.
# TYPE go_memstats_mspan_inuse_bytes gauge
go_memstats_mspan_inuse_bytes 27360
# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system.
# TYPE go_memstats_mspan_sys_bytes gauge
go_memstats_mspan_sys_bytes 32768
# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place.
# TYPE go_memstats_next_gc_bytes gauge
go_memstats_next_gc_bytes 4.194304e+06
# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations.
# TYPE go_memstats_other_sys_bytes gauge
go_memstats_other_sys_bytes 1.066583e+06
# HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator.
# TYPE go_memstats_stack_inuse_bytes gauge
go_memstats_stack_inuse_bytes 458752
# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator.
# TYPE go_memstats_stack_sys_bytes gauge
go_memstats_stack_sys_bytes 458752
# HELP go_memstats_sys_bytes Number of bytes obtained by system. Sum of all system allocations.
# TYPE go_memstats_sys_bytes gauge
go_memstats_sys_bytes 3.346432e+06
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 0.02
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1024
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 11
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 8.876032e+06
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.54755983088e+09
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 4.645888e+08
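
As a reading aid for the scrape output above, here is a minimal Go sketch of how samples in this text format (version 0.0.4) can be read back, for instance to derive file descriptor utilization from process_open_fds and process_max_fds. It is not part of kask; parseSamples is a hypothetical helper, and it assumes sample lines carry no trailing timestamp (true for the output above, but not guaranteed by the format).

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseSamples reads Prometheus text-format lines and returns a map from each
// series (metric name plus label set, exactly as written) to its value.
// Simplification (assumption): no trailing timestamps, so the value is
// whatever follows the last space on the line.
func parseSamples(text string) (map[string]float64, error) {
	samples := make(map[string]float64)
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and # HELP / # TYPE comments
		}
		i := strings.LastIndexByte(line, ' ')
		if i < 0 {
			return nil, fmt.Errorf("malformed sample line: %q", line)
		}
		v, err := strconv.ParseFloat(line[i+1:], 64)
		if err != nil {
			return nil, fmt.Errorf("bad value in %q: %w", line, err)
		}
		samples[strings.TrimSpace(line[:i])] = v
	}
	return samples, sc.Err()
}

func main() {
	// A few lines copied from the output above.
	const scrape = `# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1024
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 11
go_memstats_heap_inuse_bytes 1.581056e+06
go_gc_duration_seconds{quantile="0.5"} 0
`
	s, err := parseSamples(scrape)
	if err != nil {
		panic(err)
	}
	fmt.Printf("fd utilization: %.2f%%\n",
		100*s["process_open_fds"]/s["process_max_fds"])
}
```

Values such as 1.581056e+06 are ordinary floats in scientific notation, so strconv.ParseFloat handles them without special casing.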

Documentation pointers on Prometheus' data model, metric types, etc: https://prometheus.io/docs/concepts/data_model/ https://prometheus.io/docs/concepts/metric_types/ https://prometheus.io/docs/practices/naming/

Also, regarding percentiles: compared to statsd, there is a change in how we should aggregate timings across service instances. I've started documenting it here (in the context of k8s, but it applies generally), and feedback/improvements are very much welcome: https://wikitech.wikimedia.org/wiki/Prometheus/statsd_k8s#Global_aggregation_considerations
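
To make the aggregation point concrete, here is a small Go sketch (not taken from kask or from the linked page). It assumes timings are recorded as cumulative histogram buckets with identical bounds on every instance, sums the bucket counts across instances, and only then estimates the percentile. The quantile function is a simplified stand-in for the idea behind PromQL's histogram_quantile (linear interpolation inside the bucket, lowest bound taken as 0, no +Inf bucket); the type, function names, and all bucket values are invented for illustration.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// bucket is one cumulative histogram bucket: the number of observations
// whose duration is <= le seconds.
type bucket struct {
	le    float64
	count float64
}

// merge sums the cumulative counts of buckets that share the same upper
// bound across instances. Assumes all instances use the same bounds.
func merge(instances ...[]bucket) []bucket {
	sums := map[float64]float64{}
	for _, inst := range instances {
		for _, b := range inst {
			sums[b.le] += b.count
		}
	}
	out := make([]bucket, 0, len(sums))
	for le, c := range sums {
		out = append(out, bucket{le, c})
	}
	sort.Slice(out, func(i, j int) bool { return out[i].le < out[j].le })
	return out
}

// quantile estimates the q-quantile (0 <= q <= 1) from buckets sorted by le,
// interpolating linearly inside the bucket that contains the target rank.
func quantile(q float64, bs []bucket) float64 {
	if len(bs) == 0 || bs[len(bs)-1].count == 0 {
		return math.NaN()
	}
	rank := q * bs[len(bs)-1].count
	prevLe, prevCount := 0.0, 0.0
	for _, b := range bs {
		if b.count >= rank {
			if b.count == prevCount {
				return prevLe // empty bucket, nothing to interpolate
			}
			return prevLe + (b.le-prevLe)*(rank-prevCount)/(b.count-prevCount)
		}
		prevLe, prevCount = b.le, b.count
	}
	return bs[len(bs)-1].le
}

func main() {
	// Hypothetical data: a busy, fast instance and a quiet, slow one.
	fast := []bucket{{0.1, 90}, {0.5, 100}, {1, 100}}
	slow := []bucket{{0.1, 0}, {0.5, 2}, {1, 10}}

	avg := (quantile(0.9, fast) + quantile(0.9, slow)) / 2
	merged := quantile(0.9, merge(fast, slow))

	fmt.Printf("average of per-instance p90: %.3fs\n", avg)
	fmt.Printf("p90 of merged buckets:       %.3fs\n", merged)
}
```

With these made-up numbers the two results differ noticeably (roughly 0.52 s for the averaged figure versus roughly 0.4 s for the merged one), because averaging gives the quiet instance the same weight as the busy one regardless of how many requests each served.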

Change 484315 merged by Eevans:
[mediawiki/services/kask@master] Basic prometheus metrics support

https://gerrit.wikimedia.org/r/484315

Change 486502 had a related patch set uploaded (by Clarakosi; owner: Clarakosi):
[mediawiki/services/kask@master] HTTP Prometheus metrics

https://gerrit.wikimedia.org/r/486502

Change 486502 merged by Eevans:
[mediawiki/services/kask@master] HTTP Prometheus metrics

https://gerrit.wikimedia.org/r/486502

Change 493249 had a related patch set uploaded (by Eevans; owner: Eevans):
[mediawiki/services/kask@master] Implement a /healthz endpoint for k8s readiness probe

https://gerrit.wikimedia.org/r/493249

Change 493249 merged by Clarakosi:
[mediawiki/services/kask@master] Implement a /healthz endpoint for k8s readiness probe

https://gerrit.wikimedia.org/r/493249
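
For readers unfamiliar with readiness probes: Kubernetes treats an HTTP probe as successful when the response status is at least 200 and below 400, and stops routing traffic to a pod whose readiness probe fails. The following Go sketch shows the general shape of such an endpoint. It is not the code from the change above; the check callback, errBackendDown, and the probe helper are hypothetical, and whether the real endpoint consults the storage backend is not stated in this task.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// errBackendDown is a hypothetical error simulating a failed dependency.
var errBackendDown = errors.New("storage backend unreachable")

// healthzHandler returns a handler that runs check on every request and
// answers 200 when it passes, 503 (Service Unavailable) when it fails.
func healthzHandler(check func() error) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if err := check(); err != nil {
			http.Error(w, err.Error(), http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
		fmt.Fprintln(w, "ok")
	}
}

// probe issues an in-memory GET /healthz against the handler and returns
// the status code, standing in for the kubelet's HTTP probe.
func probe(check func() error) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/healthz", nil)
	healthzHandler(check).ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println("healthy:", probe(func() error { return nil }))
	fmt.Println("backend down:", probe(func() error { return errBackendDown }))
}
```

In a real service the handler would be registered on the server's mux under /healthz; the in-memory probe is used here only so the sketch runs without opening a port.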

Eevans renamed this task from Metrics for session storage service to Monitoring and data collection for session storage service. Mar 7 2019, 5:10 PM
Eevans updated the task description.
Eevans added subscribers: jijiki, akosiaris.

Change 497848 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/puppet@production] prometheus: collect session storage Cassandra metrics

https://gerrit.wikimedia.org/r/497848

Change 497848 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: collect session storage Cassandra metrics

https://gerrit.wikimedia.org/r/497848

Change 507397 had a related patch set uploaded (by Eevans; owner: Eevans):
[mediawiki/services/kask@master] [WIP] Serve OpenAPI specification

https://gerrit.wikimedia.org/r/507397