Page MenuHomePhabricator

Port elasticsearch metrics to Prometheus
Closed, ResolvedPublic

Description

I'm testing https://github.com/justwatchcom/elasticsearch_exporter as an alternative to the diamond collector we have, example output from deployment-logstash2.

1# HELP elasticsearch_breakers_estimated_size_bytes Estimated size in bytes of breaker
2# TYPE elasticsearch_breakers_estimated_size_bytes gauge
3elasticsearch_breakers_estimated_size_bytes{breaker="fielddata",cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 0
4elasticsearch_breakers_estimated_size_bytes{breaker="in_flight_requests",cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 0
5elasticsearch_breakers_estimated_size_bytes{breaker="parent",cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 0
6elasticsearch_breakers_estimated_size_bytes{breaker="request",cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 0
7# HELP elasticsearch_breakers_limit_size_bytes Limit size in bytes for breaker
8# TYPE elasticsearch_breakers_limit_size_bytes gauge
9elasticsearch_breakers_limit_size_bytes{breaker="fielddata",cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 3.087217459e+09
10elasticsearch_breakers_limit_size_bytes{breaker="in_flight_requests",cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 5.145362432e+09
11elasticsearch_breakers_limit_size_bytes{breaker="parent",cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 3.601753702e+09
12elasticsearch_breakers_limit_size_bytes{breaker="request",cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 3.087217459e+09
13# HELP elasticsearch_breakers_tripped tripped for breaker
14# TYPE elasticsearch_breakers_tripped gauge
15elasticsearch_breakers_tripped{breaker="fielddata",cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 0
16elasticsearch_breakers_tripped{breaker="in_flight_requests",cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 0
17elasticsearch_breakers_tripped{breaker="parent",cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 0
18elasticsearch_breakers_tripped{breaker="request",cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 0
19# HELP elasticsearch_cluster_health_active_primary_shards Tthe number of primary shards in your cluster. This is an aggregate total across all indices.
20# TYPE elasticsearch_cluster_health_active_primary_shards gauge
21elasticsearch_cluster_health_active_primary_shards{cluster="labs-logstash-eqiad"} 42
22# HELP elasticsearch_cluster_health_active_shards Aggregate total of all shards across all indices, which includes replica shards.
23# TYPE elasticsearch_cluster_health_active_shards gauge
24elasticsearch_cluster_health_active_shards{cluster="labs-logstash-eqiad"} 42
25# HELP elasticsearch_cluster_health_delayed_unassigned_shards Shards delayed to reduce reallocation overhead
26# TYPE elasticsearch_cluster_health_delayed_unassigned_shards gauge
27elasticsearch_cluster_health_delayed_unassigned_shards{cluster="labs-logstash-eqiad"} 0
28# HELP elasticsearch_cluster_health_initializing_shards Count of shards that are being freshly created.
29# TYPE elasticsearch_cluster_health_initializing_shards gauge
30elasticsearch_cluster_health_initializing_shards{cluster="labs-logstash-eqiad"} 0
31# HELP elasticsearch_cluster_health_json_parse_failures Number of errors while parsing JSON.
32# TYPE elasticsearch_cluster_health_json_parse_failures counter
33elasticsearch_cluster_health_json_parse_failures 0
34# HELP elasticsearch_cluster_health_number_of_data_nodes Number of data nodes in the cluster.
35# TYPE elasticsearch_cluster_health_number_of_data_nodes gauge
36elasticsearch_cluster_health_number_of_data_nodes{cluster="labs-logstash-eqiad"} 1
37# HELP elasticsearch_cluster_health_number_of_in_flight_fetch The number of ongoing shard info requests.
38# TYPE elasticsearch_cluster_health_number_of_in_flight_fetch gauge
39elasticsearch_cluster_health_number_of_in_flight_fetch{cluster="labs-logstash-eqiad"} 0
40# HELP elasticsearch_cluster_health_number_of_nodes Number of nodes in the cluster.
41# TYPE elasticsearch_cluster_health_number_of_nodes gauge
42elasticsearch_cluster_health_number_of_nodes{cluster="labs-logstash-eqiad"} 1
43# HELP elasticsearch_cluster_health_number_of_pending_tasks Cluster level changes which have not yet been executed
44# TYPE elasticsearch_cluster_health_number_of_pending_tasks gauge
45elasticsearch_cluster_health_number_of_pending_tasks{cluster="labs-logstash-eqiad"} 0
46# HELP elasticsearch_cluster_health_relocating_shards The number of shards that are currently moving from one node to another node.
47# TYPE elasticsearch_cluster_health_relocating_shards gauge
48elasticsearch_cluster_health_relocating_shards{cluster="labs-logstash-eqiad"} 0
49# HELP elasticsearch_cluster_health_status Whether all primary and replica shards are allocated.
50# TYPE elasticsearch_cluster_health_status gauge
51elasticsearch_cluster_health_status{cluster="labs-logstash-eqiad",color="green"} 0
52elasticsearch_cluster_health_status{cluster="labs-logstash-eqiad",color="red"} 0
53elasticsearch_cluster_health_status{cluster="labs-logstash-eqiad",color="yellow"} 1
54# HELP elasticsearch_cluster_health_timed_out Number of cluster health checks timed out
55# TYPE elasticsearch_cluster_health_timed_out gauge
56elasticsearch_cluster_health_timed_out{cluster="labs-logstash-eqiad"} 0
57# HELP elasticsearch_cluster_health_total_scrapes Current total ElasticSearch cluster health scrapes.
58# TYPE elasticsearch_cluster_health_total_scrapes counter
59elasticsearch_cluster_health_total_scrapes 3
60# HELP elasticsearch_cluster_health_unassigned_shards The number of shards that exist in the cluster state, but cannot be found in the cluster itself.
61# TYPE elasticsearch_cluster_health_unassigned_shards gauge
62elasticsearch_cluster_health_unassigned_shards{cluster="labs-logstash-eqiad"} 72
63# HELP elasticsearch_cluster_health_up Was the last scrape of the ElasticSearch cluster health endpoint successful.
64# TYPE elasticsearch_cluster_health_up gauge
65elasticsearch_cluster_health_up 1
66# HELP elasticsearch_filesystem_data_available_bytes Available space on block device in bytes
67# TYPE elasticsearch_filesystem_data_available_bytes gauge
68elasticsearch_filesystem_data_available_bytes{cluster="labs-logstash-eqiad",host="10.68.16.147",mount="/srv (/dev/mapper/vd-second--local--disk)",name="deployment-logstash2",path="/srv/elasticsearch/labs-logstash-eqiad/nodes/0"} 6.7750191104e+10
69# HELP elasticsearch_filesystem_data_free_bytes Free space on block device in bytes
70# TYPE elasticsearch_filesystem_data_free_bytes gauge
71elasticsearch_filesystem_data_free_bytes{cluster="labs-logstash-eqiad",host="10.68.16.147",mount="/srv (/dev/mapper/vd-second--local--disk)",name="deployment-logstash2",path="/srv/elasticsearch/labs-logstash-eqiad/nodes/0"} 7.530979328e+10
72# HELP elasticsearch_filesystem_data_size_bytes Size of block device in bytes
73# TYPE elasticsearch_filesystem_data_size_bytes gauge
74elasticsearch_filesystem_data_size_bytes{cluster="labs-logstash-eqiad",host="10.68.16.147",mount="/srv (/dev/mapper/vd-second--local--disk)",name="deployment-logstash2",path="/srv/elasticsearch/labs-logstash-eqiad/nodes/0"} 1.48355293184e+11
75# HELP elasticsearch_indices_docs Count of documents on this node
76# TYPE elasticsearch_indices_docs gauge
77elasticsearch_indices_docs{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 1.51419768e+08
78# HELP elasticsearch_indices_docs_deleted Count of deleted documents on this node
79# TYPE elasticsearch_indices_docs_deleted gauge
80elasticsearch_indices_docs_deleted{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 5
81# HELP elasticsearch_indices_fielddata_evictions Evictions from field data
82# TYPE elasticsearch_indices_fielddata_evictions counter
83elasticsearch_indices_fielddata_evictions{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 0
84# HELP elasticsearch_indices_fielddata_memory_size_bytes Field data cache memory usage in bytes
85# TYPE elasticsearch_indices_fielddata_memory_size_bytes gauge
86elasticsearch_indices_fielddata_memory_size_bytes{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 0
87# HELP elasticsearch_indices_filter_cache_evictions Evictions from filter cache
88# TYPE elasticsearch_indices_filter_cache_evictions counter
89elasticsearch_indices_filter_cache_evictions{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 0
90# HELP elasticsearch_indices_filter_cache_memory_size_bytes Filter cache memory usage in bytes
91# TYPE elasticsearch_indices_filter_cache_memory_size_bytes gauge
92elasticsearch_indices_filter_cache_memory_size_bytes{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 0
93# HELP elasticsearch_indices_flush_time_seconds Cumulative flush time in seconds
94# TYPE elasticsearch_indices_flush_time_seconds counter
95elasticsearch_indices_flush_time_seconds{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 71
96# HELP elasticsearch_indices_flush_total Total flushes
97# TYPE elasticsearch_indices_flush_total counter
98elasticsearch_indices_flush_total{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 295
99# HELP elasticsearch_indices_get_exists_time_seconds Total time get exists in seconds
100# TYPE elasticsearch_indices_get_exists_time_seconds counter
101elasticsearch_indices_get_exists_time_seconds{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 0
102# HELP elasticsearch_indices_get_exists_total Total get exists operations
103# TYPE elasticsearch_indices_get_exists_total counter
104elasticsearch_indices_get_exists_total{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 2070
105# HELP elasticsearch_indices_get_missing_time_seconds Total time of get missing in seconds
106# TYPE elasticsearch_indices_get_missing_time_seconds counter
107elasticsearch_indices_get_missing_time_seconds{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 0
108# HELP elasticsearch_indices_get_missing_total Total get missing
109# TYPE elasticsearch_indices_get_missing_total counter
110elasticsearch_indices_get_missing_total{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 11
111# HELP elasticsearch_indices_get_time_seconds Total get time in seconds
112# TYPE elasticsearch_indices_get_time_seconds counter
113elasticsearch_indices_get_time_seconds{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 0
114# HELP elasticsearch_indices_get_total Total get
115# TYPE elasticsearch_indices_get_total counter
116elasticsearch_indices_get_total{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 2081
117# HELP elasticsearch_indices_indexing_delete_time_seconds_total Total time indexing delete in seconds
118# TYPE elasticsearch_indices_indexing_delete_time_seconds_total counter
119elasticsearch_indices_indexing_delete_time_seconds_total{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 0
120# HELP elasticsearch_indices_indexing_delete_total Total indexing deletes
121# TYPE elasticsearch_indices_indexing_delete_total counter
122elasticsearch_indices_indexing_delete_total{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 73
123# HELP elasticsearch_indices_indexing_index_time_seconds_total Total index calls
124# TYPE elasticsearch_indices_indexing_index_time_seconds_total counter
125elasticsearch_indices_indexing_index_time_seconds_total{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 61508
126# HELP elasticsearch_indices_indexing_index_total Cumulative index time in seconds
127# TYPE elasticsearch_indices_indexing_index_total counter
128elasticsearch_indices_indexing_index_total{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 1.49285547e+08
129# HELP elasticsearch_indices_merges_docs_total Cumulative docs merged
130# TYPE elasticsearch_indices_merges_docs_total counter
131elasticsearch_indices_merges_docs_total{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 7.82369423e+08
132# HELP elasticsearch_indices_merges_total Total merges
133# TYPE elasticsearch_indices_merges_total counter
134elasticsearch_indices_merges_total{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 57445
135# HELP elasticsearch_indices_merges_total_size_bytes_total Total merge size in bytes
136# TYPE elasticsearch_indices_merges_total_size_bytes_total counter
137elasticsearch_indices_merges_total_size_bytes_total{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 4.1708001883e+11
138# HELP elasticsearch_indices_merges_total_time_seconds_total Total time spent merging in seconds
139# TYPE elasticsearch_indices_merges_total_time_seconds_total counter
140elasticsearch_indices_merges_total_time_seconds_total{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 68130
141# HELP elasticsearch_indices_query_cache_evictions Evictions from query cache
142# TYPE elasticsearch_indices_query_cache_evictions counter
143elasticsearch_indices_query_cache_evictions{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 13858
144# HELP elasticsearch_indices_query_cache_memory_size_bytes Query cache memory usage in bytes
145# TYPE elasticsearch_indices_query_cache_memory_size_bytes gauge
146elasticsearch_indices_query_cache_memory_size_bytes{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 8.111796e+07
147# HELP elasticsearch_indices_refresh_time_seconds_total Total refreshes
148# TYPE elasticsearch_indices_refresh_time_seconds_total counter
149elasticsearch_indices_refresh_time_seconds_total{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 18392
150# HELP elasticsearch_indices_refresh_total Total time spent refreshing in seconds
151# TYPE elasticsearch_indices_refresh_total counter
152elasticsearch_indices_refresh_total{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 517508
153# HELP elasticsearch_indices_request_cache_evictions Evictions from request cache
154# TYPE elasticsearch_indices_request_cache_evictions counter
155elasticsearch_indices_request_cache_evictions{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 0
156# HELP elasticsearch_indices_request_cache_memory_size_bytes Request cache memory usage in bytes
157# TYPE elasticsearch_indices_request_cache_memory_size_bytes gauge
158elasticsearch_indices_request_cache_memory_size_bytes{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 1.024764e+07
159# HELP elasticsearch_indices_search_fetch_time_seconds Total search fetch time in seconds
160# TYPE elasticsearch_indices_search_fetch_time_seconds counter
161elasticsearch_indices_search_fetch_time_seconds{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 865
162# HELP elasticsearch_indices_search_fetch_total Total number of fetches
163# TYPE elasticsearch_indices_search_fetch_total counter
164elasticsearch_indices_search_fetch_total{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 1.038091e+06
165# HELP elasticsearch_indices_search_query_time_seconds Total search query time in seconds
166# TYPE elasticsearch_indices_search_query_time_seconds counter
167elasticsearch_indices_search_query_time_seconds{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 567
168# HELP elasticsearch_indices_search_query_total Total number of queries
169# TYPE elasticsearch_indices_search_query_total counter
170elasticsearch_indices_search_query_total{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 1.177098e+06
171# HELP elasticsearch_indices_segments_count Count of index segments on this node
172# TYPE elasticsearch_indices_segments_count gauge
173elasticsearch_indices_segments_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 85
174# HELP elasticsearch_indices_segments_memory_bytes Current memory size of segments in bytes
175# TYPE elasticsearch_indices_segments_memory_bytes gauge
176elasticsearch_indices_segments_memory_bytes{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 2.85187717e+08
177# HELP elasticsearch_indices_store_size_bytes Current size of stored index data in bytes
178# TYPE elasticsearch_indices_store_size_bytes gauge
179elasticsearch_indices_store_size_bytes{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 7.280255942e+10
180# HELP elasticsearch_indices_store_throttle_time_seconds_total Throttle time for index store in seconds
181# TYPE elasticsearch_indices_store_throttle_time_seconds_total counter
182elasticsearch_indices_store_throttle_time_seconds_total{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 0
183# HELP elasticsearch_indices_translog_operations Total translog operations
184# TYPE elasticsearch_indices_translog_operations counter
185elasticsearch_indices_translog_operations{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 65766
186# HELP elasticsearch_indices_translog_size_in_bytes Total translog size in bytes
187# TYPE elasticsearch_indices_translog_size_in_bytes counter
188elasticsearch_indices_translog_size_in_bytes{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 4.7832211e+07
189# HELP elasticsearch_jvm_gc_collection_seconds_count Count of JVM GC runs
190# TYPE elasticsearch_jvm_gc_collection_seconds_count counter
191elasticsearch_jvm_gc_collection_seconds_count{cluster="labs-logstash-eqiad",gc="old",host="10.68.16.147",name="deployment-logstash2"} 4
192elasticsearch_jvm_gc_collection_seconds_count{cluster="labs-logstash-eqiad",gc="young",host="10.68.16.147",name="deployment-logstash2"} 16388
193# HELP elasticsearch_jvm_gc_collection_seconds_sum GC run time in seconds
194# TYPE elasticsearch_jvm_gc_collection_seconds_sum counter
195elasticsearch_jvm_gc_collection_seconds_sum{cluster="labs-logstash-eqiad",gc="old",host="10.68.16.147",name="deployment-logstash2"} 1
196elasticsearch_jvm_gc_collection_seconds_sum{cluster="labs-logstash-eqiad",gc="young",host="10.68.16.147",name="deployment-logstash2"} 607
197# HELP elasticsearch_jvm_memory_committed_bytes JVM memory currently committed by area
198# TYPE elasticsearch_jvm_memory_committed_bytes gauge
199elasticsearch_jvm_memory_committed_bytes{area="heap",cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 5.351931904e+09
200elasticsearch_jvm_memory_committed_bytes{area="non-heap",cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 1.46210816e+08
201# HELP elasticsearch_jvm_memory_max_bytes JVM memory max
202# TYPE elasticsearch_jvm_memory_max_bytes gauge
203elasticsearch_jvm_memory_max_bytes{area="heap",cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 5.351931904e+09
204# HELP elasticsearch_jvm_memory_used_bytes JVM memory currently used by area
205# TYPE elasticsearch_jvm_memory_used_bytes gauge
206elasticsearch_jvm_memory_used_bytes{area="heap",cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 3.806239808e+09
207elasticsearch_jvm_memory_used_bytes{area="non-heap",cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 1.39041656e+08
208# HELP elasticsearch_node_stats_json_parse_failures Number of errors while parsing JSON.
209# TYPE elasticsearch_node_stats_json_parse_failures counter
210elasticsearch_node_stats_json_parse_failures 0
211# HELP elasticsearch_node_stats_total_scrapes Current total ElasticSearch node scrapes.
212# TYPE elasticsearch_node_stats_total_scrapes counter
213elasticsearch_node_stats_total_scrapes 3
214# HELP elasticsearch_node_stats_up Was the last scrape of the ElasticSearch nodes endpoint successful.
215# TYPE elasticsearch_node_stats_up gauge
216elasticsearch_node_stats_up 1
217# HELP elasticsearch_process_cpu_percent Percent CPU used by process
218# TYPE elasticsearch_process_cpu_percent gauge
219elasticsearch_process_cpu_percent{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 0
220# HELP elasticsearch_process_cpu_time_seconds_sum Process CPU time in seconds
221# TYPE elasticsearch_process_cpu_time_seconds_sum counter
222elasticsearch_process_cpu_time_seconds_sum{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="sys"} 0
223elasticsearch_process_cpu_time_seconds_sum{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="total"} 211083
224elasticsearch_process_cpu_time_seconds_sum{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="user"} 0
225# HELP elasticsearch_process_mem_resident_size_bytes Resident memory in use by process in bytes
226# TYPE elasticsearch_process_mem_resident_size_bytes gauge
227elasticsearch_process_mem_resident_size_bytes{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 0
228# HELP elasticsearch_process_mem_share_size_bytes Shared memory in use by process in bytes
229# TYPE elasticsearch_process_mem_share_size_bytes gauge
230elasticsearch_process_mem_share_size_bytes{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 0
231# HELP elasticsearch_process_mem_virtual_size_bytes Total virtual memory used in bytes
232# TYPE elasticsearch_process_mem_virtual_size_bytes gauge
233elasticsearch_process_mem_virtual_size_bytes{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 8.4163043328e+10
234# HELP elasticsearch_process_open_files_count Open file descriptors
235# TYPE elasticsearch_process_open_files_count gauge
236elasticsearch_process_open_files_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 352
237# HELP elasticsearch_thread_pool_active_count Thread Pool threads active
238# TYPE elasticsearch_thread_pool_active_count gauge
239elasticsearch_thread_pool_active_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="bulk"} 0
240elasticsearch_thread_pool_active_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="fetch_shard_started"} 0
241elasticsearch_thread_pool_active_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="fetch_shard_store"} 0
242elasticsearch_thread_pool_active_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="flush"} 0
243elasticsearch_thread_pool_active_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="force_merge"} 0
244elasticsearch_thread_pool_active_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="generic"} 0
245elasticsearch_thread_pool_active_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="get"} 0
246elasticsearch_thread_pool_active_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="index"} 0
247elasticsearch_thread_pool_active_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="listener"} 0
248elasticsearch_thread_pool_active_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="management"} 1
249elasticsearch_thread_pool_active_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="refresh"} 0
250elasticsearch_thread_pool_active_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="search"} 0
251elasticsearch_thread_pool_active_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="snapshot"} 0
252elasticsearch_thread_pool_active_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="warmer"} 0
253# HELP elasticsearch_thread_pool_completed_count Thread Pool operations completed
254# TYPE elasticsearch_thread_pool_completed_count counter
255elasticsearch_thread_pool_completed_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="bulk"} 2.1300098e+07
256elasticsearch_thread_pool_completed_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="fetch_shard_started"} 42
257elasticsearch_thread_pool_completed_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="fetch_shard_store"} 0
258elasticsearch_thread_pool_completed_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="flush"} 435
259elasticsearch_thread_pool_completed_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="force_merge"} 30
260elasticsearch_thread_pool_completed_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="generic"} 260537
261elasticsearch_thread_pool_completed_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="get"} 1018
262elasticsearch_thread_pool_completed_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="index"} 43
263elasticsearch_thread_pool_completed_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="listener"} 0
264elasticsearch_thread_pool_completed_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="management"} 2.350606e+06
265elasticsearch_thread_pool_completed_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="refresh"} 1.6937872e+07
266elasticsearch_thread_pool_completed_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="search"} 2.220301e+06
267elasticsearch_thread_pool_completed_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="snapshot"} 0
268elasticsearch_thread_pool_completed_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="warmer"} 517679
269# HELP elasticsearch_thread_pool_largest_count Thread Pool largest threads count
270# TYPE elasticsearch_thread_pool_largest_count gauge
271elasticsearch_thread_pool_largest_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="bulk"} 8
272elasticsearch_thread_pool_largest_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="fetch_shard_started"} 16
273elasticsearch_thread_pool_largest_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="fetch_shard_store"} 0
274elasticsearch_thread_pool_largest_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="flush"} 4
275elasticsearch_thread_pool_largest_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="force_merge"} 1
276elasticsearch_thread_pool_largest_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="generic"} 4
277elasticsearch_thread_pool_largest_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="get"} 8
278elasticsearch_thread_pool_largest_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="index"} 8
279elasticsearch_thread_pool_largest_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="listener"} 0
280elasticsearch_thread_pool_largest_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="management"} 5
281elasticsearch_thread_pool_largest_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="refresh"} 4
282elasticsearch_thread_pool_largest_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="search"} 13
283elasticsearch_thread_pool_largest_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="snapshot"} 0
284elasticsearch_thread_pool_largest_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="warmer"} 3
285# HELP elasticsearch_thread_pool_queue_count Thread Pool operations queued
286# TYPE elasticsearch_thread_pool_queue_count gauge
287elasticsearch_thread_pool_queue_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="bulk"} 0
288elasticsearch_thread_pool_queue_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="fetch_shard_started"} 0
289elasticsearch_thread_pool_queue_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="fetch_shard_store"} 0
290elasticsearch_thread_pool_queue_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="flush"} 0
291elasticsearch_thread_pool_queue_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="force_merge"} 0
292elasticsearch_thread_pool_queue_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="generic"} 0
293elasticsearch_thread_pool_queue_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="get"} 0
294elasticsearch_thread_pool_queue_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="index"} 0
295elasticsearch_thread_pool_queue_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="listener"} 0
296elasticsearch_thread_pool_queue_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="management"} 0
297elasticsearch_thread_pool_queue_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="refresh"} 0
298elasticsearch_thread_pool_queue_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="search"} 0
299elasticsearch_thread_pool_queue_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="snapshot"} 0
300elasticsearch_thread_pool_queue_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="warmer"} 0
301# HELP elasticsearch_thread_pool_rejected_count Thread Pool operations rejected
302# TYPE elasticsearch_thread_pool_rejected_count counter
303elasticsearch_thread_pool_rejected_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="bulk"} 0
304elasticsearch_thread_pool_rejected_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="fetch_shard_started"} 0
305elasticsearch_thread_pool_rejected_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="fetch_shard_store"} 0
306elasticsearch_thread_pool_rejected_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="flush"} 0
307elasticsearch_thread_pool_rejected_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="force_merge"} 0
308elasticsearch_thread_pool_rejected_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="generic"} 0
309elasticsearch_thread_pool_rejected_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="get"} 0
310elasticsearch_thread_pool_rejected_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="index"} 0
311elasticsearch_thread_pool_rejected_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="listener"} 0
312elasticsearch_thread_pool_rejected_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="management"} 0
313elasticsearch_thread_pool_rejected_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="refresh"} 0
314elasticsearch_thread_pool_rejected_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="search"} 0
315elasticsearch_thread_pool_rejected_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="snapshot"} 0
316elasticsearch_thread_pool_rejected_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="warmer"} 0
317# HELP elasticsearch_thread_pool_threads_count Thread Pool current threads count
318# TYPE elasticsearch_thread_pool_threads_count gauge
319elasticsearch_thread_pool_threads_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="bulk"} 8
320elasticsearch_thread_pool_threads_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="fetch_shard_started"} 1
321elasticsearch_thread_pool_threads_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="fetch_shard_store"} 0
322elasticsearch_thread_pool_threads_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="flush"} 1
323elasticsearch_thread_pool_threads_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="force_merge"} 1
324elasticsearch_thread_pool_threads_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="generic"} 4
325elasticsearch_thread_pool_threads_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="get"} 8
326elasticsearch_thread_pool_threads_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="index"} 8
327elasticsearch_thread_pool_threads_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="listener"} 0
328elasticsearch_thread_pool_threads_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="management"} 5
329elasticsearch_thread_pool_threads_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="refresh"} 4
330elasticsearch_thread_pool_threads_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="search"} 13
331elasticsearch_thread_pool_threads_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="snapshot"} 0
332elasticsearch_thread_pool_threads_count{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2",type="warmer"} 3
333# HELP elasticsearch_transport_rx_packets_total Count of packets received
334# TYPE elasticsearch_transport_rx_packets_total counter
335elasticsearch_transport_rx_packets_total{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 0
336# HELP elasticsearch_transport_rx_size_bytes_total Total number of bytes received
337# TYPE elasticsearch_transport_rx_size_bytes_total counter
338elasticsearch_transport_rx_size_bytes_total{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 0
339# HELP elasticsearch_transport_tx_packets_total Count of packets sent
340# TYPE elasticsearch_transport_tx_packets_total counter
341elasticsearch_transport_tx_packets_total{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 0
342# HELP elasticsearch_transport_tx_size_bytes_total Total number of bytes sent
343# TYPE elasticsearch_transport_tx_size_bytes_total counter
344elasticsearch_transport_tx_size_bytes_total{cluster="labs-logstash-eqiad",host="10.68.16.147",name="deployment-logstash2"} 0
345# HELP go_gc_duration_seconds A summary of the GC invocation durations.
346# TYPE go_gc_duration_seconds summary
347go_gc_duration_seconds{quantile="0"} 0
348go_gc_duration_seconds{quantile="0.25"} 0
349go_gc_duration_seconds{quantile="0.5"} 0
350go_gc_duration_seconds{quantile="0.75"} 0
351go_gc_duration_seconds{quantile="1"} 0
352go_gc_duration_seconds_sum 0
353go_gc_duration_seconds_count 0
354# HELP go_goroutines Number of goroutines that currently exist.
355# TYPE go_goroutines gauge
356go_goroutines 16
357# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
358# TYPE go_memstats_alloc_bytes gauge
359go_memstats_alloc_bytes 2.225704e+06
360# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
361# TYPE go_memstats_alloc_bytes_total counter
362go_memstats_alloc_bytes_total 2.225704e+06
363# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table.
364# TYPE go_memstats_buck_hash_sys_bytes gauge
365go_memstats_buck_hash_sys_bytes 1.444605e+06
366# HELP go_memstats_frees_total Total number of frees.
367# TYPE go_memstats_frees_total counter
368go_memstats_frees_total 3614
369# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.
370# TYPE go_memstats_gc_sys_bytes gauge
371go_memstats_gc_sys_bytes 202752
372# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use.
373# TYPE go_memstats_heap_alloc_bytes gauge
374go_memstats_heap_alloc_bytes 2.225704e+06
375# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used.
376# TYPE go_memstats_heap_idle_bytes gauge
377go_memstats_heap_idle_bytes 245760
378# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use.
379# TYPE go_memstats_heap_inuse_bytes gauge
380go_memstats_heap_inuse_bytes 3.391488e+06
381# HELP go_memstats_heap_objects Number of allocated objects.
382# TYPE go_memstats_heap_objects gauge
383go_memstats_heap_objects 22705
384# HELP go_memstats_heap_released_bytes_total Total number of heap bytes released to OS.
385# TYPE go_memstats_heap_released_bytes_total counter
386go_memstats_heap_released_bytes_total 0
387# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system.
388# TYPE go_memstats_heap_sys_bytes gauge
389go_memstats_heap_sys_bytes 3.637248e+06
390# HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.
391# TYPE go_memstats_last_gc_time_seconds gauge
392go_memstats_last_gc_time_seconds 0
393# HELP go_memstats_lookups_total Total number of pointer lookups.
394# TYPE go_memstats_lookups_total counter
395go_memstats_lookups_total 45
396# HELP go_memstats_mallocs_total Total number of mallocs.
397# TYPE go_memstats_mallocs_total counter
398go_memstats_mallocs_total 26319
399# HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures.
400# TYPE go_memstats_mcache_inuse_bytes gauge
401go_memstats_mcache_inuse_bytes 13888
402# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system.
403# TYPE go_memstats_mcache_sys_bytes gauge
404go_memstats_mcache_sys_bytes 16384
405# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures.
406# TYPE go_memstats_mspan_inuse_bytes gauge
407go_memstats_mspan_inuse_bytes 49704
408# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system.
409# TYPE go_memstats_mspan_sys_bytes gauge
410go_memstats_mspan_sys_bytes 65536
411# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place.
412# TYPE go_memstats_next_gc_bytes gauge
413go_memstats_next_gc_bytes 4.473924e+06
414# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations.
415# TYPE go_memstats_other_sys_bytes gauge
416go_memstats_other_sys_bytes 1.549819e+06
417# HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator.
418# TYPE go_memstats_stack_inuse_bytes gauge
419go_memstats_stack_inuse_bytes 557056
420# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator.
421# TYPE go_memstats_stack_sys_bytes gauge
422go_memstats_stack_sys_bytes 557056
423# HELP go_memstats_sys_bytes Number of bytes obtained by system. Sum of all system allocations.
424# TYPE go_memstats_sys_bytes gauge
425go_memstats_sys_bytes 7.4734e+06
426# HELP http_request_duration_microseconds The HTTP request latencies in microseconds.
427# TYPE http_request_duration_microseconds summary
428http_request_duration_microseconds{handler="prometheus",quantile="0.5"} 43769.45
429http_request_duration_microseconds{handler="prometheus",quantile="0.9"} 86554.334
430http_request_duration_microseconds{handler="prometheus",quantile="0.99"} 86554.334
431http_request_duration_microseconds_sum{handler="prometheus"} 130323.784
432http_request_duration_microseconds_count{handler="prometheus"} 2
433# HELP http_request_size_bytes The HTTP request sizes in bytes.
434# TYPE http_request_size_bytes summary
435http_request_size_bytes{handler="prometheus",quantile="0.5"} 63
436http_request_size_bytes{handler="prometheus",quantile="0.9"} 63
437http_request_size_bytes{handler="prometheus",quantile="0.99"} 63
438http_request_size_bytes_sum{handler="prometheus"} 126
439http_request_size_bytes_count{handler="prometheus"} 2
440# HELP http_requests_total Total number of HTTP requests made.
441# TYPE http_requests_total counter
442http_requests_total{code="200",handler="prometheus",method="get"} 2
443# HELP http_response_size_bytes The HTTP response sizes in bytes.
444# TYPE http_response_size_bytes summary
445http_response_size_bytes{handler="prometheus",quantile="0.5"} 41416
446http_response_size_bytes{handler="prometheus",quantile="0.9"} 41630
447http_response_size_bytes{handler="prometheus",quantile="0.99"} 41630
448http_response_size_bytes_sum{handler="prometheus"} 83046
449http_response_size_bytes_count{handler="prometheus"} 2
450# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
451# TYPE process_cpu_seconds_total counter
452process_cpu_seconds_total 0.03
453# HELP process_max_fds Maximum number of open file descriptors.
454# TYPE process_max_fds gauge
455process_max_fds 1024
456# HELP process_open_fds Number of open file descriptors.
457# TYPE process_open_fds gauge
458process_open_fds 9
459# HELP process_resident_memory_bytes Resident memory size in bytes.
460# TYPE process_resident_memory_bytes gauge
461process_resident_memory_bytes 8.482816e+06
462# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
463# TYPE process_start_time_seconds gauge
464process_start_time_seconds 1.51196497511e+09
465# HELP process_virtual_memory_bytes Virtual memory size in bytes.
466# TYPE process_virtual_memory_bytes gauge
467process_virtual_memory_bytes 2.9872128e+08

Event Timeline

I tried jmx_exporter on deployment-logstash2 with the results below. A few notes: the exporter config needs to be somewhere accessible by elasticsearch (e.g. /srv/elasticsearch) or asking for metrics fails with something like

<title>Error 500 access denied ("java.io.FilePermission"  "/etc/prometheus/jmx_exporter.yaml" "read")</title>

In later versions of jmx_exporter the permission denied error isn't even reported and jetty just closes the connection without a status code.
Also the scrape is marked as failed by jmx_exporter though some metrics _have_ been scraped.

$ curl localhost:9499/metrics  -s
# HELP jmx_scrape_duration_seconds Time this JMX scrape took, in seconds.
# TYPE jmx_scrape_duration_seconds gauge
jmx_scrape_duration_seconds 0.001576552
# HELP jmx_scrape_error Non-zero if this scrape failed.
# TYPE jmx_scrape_error gauge
jmx_scrape_error 1.0
# HELP jvm_gc_collection_seconds Time spent in a given JVM garbage collector in seconds.
# TYPE jvm_gc_collection_seconds summary
jvm_gc_collection_seconds_count{gc="PS Scavenge",} 12.0
jvm_gc_collection_seconds_sum{gc="PS Scavenge",} 0.945
jvm_gc_collection_seconds_count{gc="PS MarkSweep",} 3.0
jvm_gc_collection_seconds_sum{gc="PS MarkSweep",} 0.414
# HELP jmx_config_reload_failure_total Number of times configuration have failed to be reloaded.
# TYPE jmx_config_reload_failure_total counter
jmx_config_reload_failure_total 0.0
# HELP jvm_threads_current Current thread count of a JVM
# TYPE jvm_threads_current gauge
jvm_threads_current 77.0
# HELP jvm_threads_daemon Daemon thread count of a JVM
# TYPE jvm_threads_daemon gauge
jvm_threads_daemon 69.0
# HELP jvm_threads_peak Peak thread count of a JVM
# TYPE jvm_threads_peak gauge
jvm_threads_peak 89.0
# HELP jvm_threads_started_total Started thread count of a JVM
# TYPE jvm_threads_started_total counter
jvm_threads_started_total 122.0
# HELP jvm_threads_deadlocked Cycles of JVM-threads that are in deadlock waiting to acquire object monitors or ownable synchronizers
# TYPE jvm_threads_deadlocked gauge
jvm_threads_deadlocked 0.0
# HELP jvm_threads_deadlocked_monitor Cycles of JVM-threads that are in deadlock waiting to acquire object monitors
# TYPE jvm_threads_deadlocked_monitor gauge
jvm_threads_deadlocked_monitor 0.0
# HELP jvm_info JVM version info
# TYPE jvm_info gauge
jvm_info{version="1.8.0_151-8u151-b12-1~bpo8+1-b12",vendor="Oracle Corporation",} 1.0
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.512145079031E9
# HELP jvm_classes_loaded The number of classes that are currently loaded in the JVM
# TYPE jvm_classes_loaded gauge
jvm_classes_loaded 11670.0
# HELP jvm_classes_loaded_total The total number of classes that have been loaded since the JVM has started execution
# TYPE jvm_classes_loaded_total counter
jvm_classes_loaded_total 11672.0
# HELP jvm_classes_unloaded_total The total number of classes that have been unloaded since the JVM has started execution
# TYPE jvm_classes_unloaded_total counter
jvm_classes_unloaded_total 2.0
# HELP jmx_config_reload_success_total Number of times configuration have successfully been reloaded.
# TYPE jmx_config_reload_success_total counter
jmx_config_reload_success_total 0.0
# HELP jvm_memory_bytes_used Used bytes of a given JVM memory area.
# TYPE jvm_memory_bytes_used gauge
jvm_memory_bytes_used{area="heap",} 1.51883844E9
jvm_memory_bytes_used{area="nonheap",} 1.06190416E8
# HELP jvm_memory_bytes_committed Committed (bytes) of a given JVM memory area.
# TYPE jvm_memory_bytes_committed gauge
jvm_memory_bytes_committed{area="heap",} 5.00170752E9
jvm_memory_bytes_committed{area="nonheap",} 1.11788032E8
# HELP jvm_memory_bytes_max Max (bytes) of a given JVM memory area.
# TYPE jvm_memory_bytes_max gauge
jvm_memory_bytes_max{area="heap",} 5.00170752E9
jvm_memory_bytes_max{area="nonheap",} -1.0
# HELP jvm_memory_pool_bytes_used Used bytes of a given JVM memory pool.
# TYPE jvm_memory_pool_bytes_used gauge
jvm_memory_pool_bytes_used{pool="Code Cache",} 3.7606336E7
jvm_memory_pool_bytes_used{pool="Metaspace",} 6.10214E7
jvm_memory_pool_bytes_used{pool="Compressed Class Space",} 7562680.0
jvm_memory_pool_bytes_used{pool="PS Eden Space",} 8.73788184E8
jvm_memory_pool_bytes_used{pool="PS Survivor Space",} 2.30681872E8
jvm_memory_pool_bytes_used{pool="PS Old Gen",} 4.14368384E8
# HELP jvm_memory_pool_bytes_committed Committed bytes of a given JVM memory pool.
# TYPE jvm_memory_pool_bytes_committed gauge
jvm_memory_pool_bytes_committed{pool="Code Cache",} 3.7945344E7
jvm_memory_pool_bytes_committed{pool="Metaspace",} 6.4970752E7
jvm_memory_pool_bytes_committed{pool="Compressed Class Space",} 8871936.0
jvm_memory_pool_bytes_committed{pool="PS Eden Space",} 1.052246016E9
jvm_memory_pool_bytes_committed{pool="PS Survivor Space",} 3.70147328E8
jvm_memory_pool_bytes_committed{pool="PS Old Gen",} 3.579314176E9
# HELP jvm_memory_pool_bytes_max Max bytes of a given JVM memory pool.
# TYPE jvm_memory_pool_bytes_max gauge
jvm_memory_pool_bytes_max{pool="Code Cache",} 2.5165824E8
jvm_memory_pool_bytes_max{pool="Metaspace",} -1.0
jvm_memory_pool_bytes_max{pool="Compressed Class Space",} 1.073741824E9
jvm_memory_pool_bytes_max{pool="PS Eden Space",} 1.052246016E9
jvm_memory_pool_bytes_max{pool="PS Survivor Space",} 3.70147328E8
jvm_memory_pool_bytes_max{pool="PS Old Gen",} 3.579314176E9

@Gehel I still have to figure out what's wrong with jmx_exporter above, but what do you think re: metrics in P6392 ?

I tried jmx_exporter on deployment-logstash2 with the results below. A few notes: the exporter config needs to be somewhere accessible by elasticsearch (e.g. /srv/elasticsearch) or asking for metrics fails with something like [...]

/srv/elasticsearch/jmx_exporter.yaml looks more like a config file, I would expect to find it in /etc/elasticsearch.

@Gehel I still have to figure out what's wrong with jmx_exporter above, but what do you think re: metrics in P6392 ?

I did not do a full comparision with what we have in Graphite atm, but it looks like the most important metrics are there. We should probably keep both while we migrate the dashboards and see if we are missing something important.

A few notes about P6392:

  • we seem to have some duplication with system metrics (like the free space on /srv/)
  • we have some duplication with jmx_exporter (as in T181627#3803312)
    • GC is reported both in elasticsearch_exporter and in jmx_exporter. I much prefer to have GC metrics standardized across application (so exported through JMX, not through the elasticsearch specific exporter) (side note, the naming seems weird: elasticsearch_jvm_gc_collection_seconds_count is it a count of the number of GC, or a duration in seconds?)
  • there seems to be some metrics about go garbage collection. As far as I know, we don't have anything running on go for the elasticsearch statck (or is it the elasticsaerch_exporter itself? it that's the case, do we really need those metrics?)
  • we have both node level metrics (example: all the thread pools) and cluster level metrics (example: all the *cluster_health*). We probably don't want to collect cluster level metrics from all nodes, but to collect them from the LVS endpoint.

Mentioned in SAL (#wikimedia-operations) [2017-12-13T09:41:57Z] <godog> upload prometheus-elasticsearch-exporter to jessie-wikimedia - T181627

I tried jmx_exporter on deployment-logstash2 with the results below. A few notes: the exporter config needs to be somewhere accessible by elasticsearch (e.g. /srv/elasticsearch) or asking for metrics fails with something like [...]

/srv/elasticsearch/jmx_exporter.yaml looks more like a config file, I would expect to find it in /etc/elasticsearch.

Indeed, that's what I tried first but didn't work. I'm not 100% sure why but I bet it has to do with jvm restrictions set up to avoid navigating the filesystem at will.

@Gehel I still have to figure out what's wrong with jmx_exporter above, but what do you think re: metrics in P6392 ?

I did not do a full comparision with what we have in Graphite atm, but it looks like the most important metrics are there. We should probably keep both while we migrate the dashboards and see if we are missing something important.

+1

A few notes about P6392:

  • we seem to have some duplication with system metrics (like the free space on /srv/)
  • we have some duplication with jmx_exporter (as in T181627#3803312)
    • GC is reported both in elasticsearch_exporter and in jmx_exporter. I much prefer to have GC metrics standardized across application (so exported through JMX, not through the elasticsearch specific exporter) (side note, the naming seems weird: elasticsearch_jvm_gc_collection_seconds_count is it a count of the number of GC, or a duration in seconds?)
  • there seems to be some metrics about go garbage collection. As far as I know, we don't have anything running on go for the elasticsearch statck (or is it the elasticsaerch_exporter itself? it that's the case, do we really need those metrics?)
  • we have both node level metrics (example: all the thread pools) and cluster level metrics (example: all the *cluster_health*). We probably don't want to collect cluster level metrics from all nodes, but to collect them from the LVS endpoint.

After a chat with @Gehel to shed some light we'll proceed with the following:

  1. Start with deploying elasticsearch-exporter in each node (logstash first), collecting cluster-level metrics as well from all nodes. Get feedback for alerting and dashboards. Alerting will be adjusted/tested accordingly (e.g. for unallocated shards take the maximum, to avoid a shower of alerts from all hosts.
  2. Deploy elasticsearch-exporter to cirrus es cluster as well.
  3. [Likely next year] Figure out how to deploy jmx_exporter as well (more complicated, requires cluster restart)
    • Investigate the jmx_exporter config path issue (can't be under /etc/elasticsearch/ ?)
    • Investigate the fact that metrics are being reported by jmx_exporter but an error during scrape is reported jmx_scrape_error 1.0

Change 398025 had a related patch set uploaded (by Gehel; owner: Guillaume Lederrey):
[operations/puppet@production] elasticsearch: deploy prometheus-elasticsearch-exporter

https://gerrit.wikimedia.org/r/398025

Change 398026 had a related patch set uploaded (by Gehel; owner: Guillaume Lederrey):
[operations/puppet@production] logstash: activate prometheus elasticsearch exporter

https://gerrit.wikimedia.org/r/398026

Change 398027 had a related patch set uploaded (by Gehel; owner: Guillaume Lederrey):
[operations/puppet@production] elasticsearch: activate prometheus elasticsearch exporter

https://gerrit.wikimedia.org/r/398027

Change 398051 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] elasticsearch: configure prometheus to collect metrics from elasticsearch

https://gerrit.wikimedia.org/r/398051

Change 398025 merged by Gehel:
[operations/puppet@production] elasticsearch: deploy prometheus-elasticsearch-exporter

https://gerrit.wikimedia.org/r/398025

Change 398026 merged by Gehel:
[operations/puppet@production] logstash: activate prometheus elasticsearch exporter

https://gerrit.wikimedia.org/r/398026

Change 398027 merged by Gehel:
[operations/puppet@production] elasticsearch: activate prometheus elasticsearch exporter

https://gerrit.wikimedia.org/r/398027

Change 398059 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] elasticsearch: configure prometheus to collect metrics from both logstash and elasticsearch

https://gerrit.wikimedia.org/r/398059

Change 398051 merged by Gehel:
[operations/puppet@production] elasticsearch: configure prometheus to collect metrics from logstash

https://gerrit.wikimedia.org/r/398051

Change 398059 merged by Gehel:
[operations/puppet@production] elasticsearch: configure prometheus to collect metrics from elasticsearch

https://gerrit.wikimedia.org/r/398059

elasticsearch_exporter is deployed on all elasticsearch nodes. Still to do:

  • same work on jmx_exporter
  • update grafana dashboards

Now that I understand a bit better how prometheus works, the jmx_exporter starts to be scary. If I understand correctly, it exposes all MBeans and filtering is done after the fact. We should really never use the jmx_exporter without a whitelist.

Querying all MBeans can be dangerous. MBean can expose anything, including stuff that is expensive to compute, blocking, or even potentially crash the application. JMX is (too) powerful...

While updating the Shards graph, I found that the way shards are exposed through prometheus is not optimal. We have different metric names for different shard states:

  • elasticsearch_cluster_health_active_shards
  • elasticsearch_cluster_health_delayed_unassigned_shards
  • elasticsearch_cluster_health_initializing_shards
  • ...

It seems that it would be simpler to query them if we had a metric named elasticsearch_cluster_health_shards with labels for the different states (active, delayed_unassigned, initializing, ...).

The way the metrics are exposed seems to be linked to way the elasticsearch_exporter exposes them, but there might be an easy way to rename those. @fgiunchedi your input would be welcomed.

I'm not sure it applies here, but the general rule of thumb for putting a dimension in a label rather than in the metric name is whether sum() over the metric would yield a meaningful number (e.g. no double counting). Does the split make more sense with that in mind?
WRT dashboarding and working with metrics like the above, you can also use __name__ as a label to match the metric name, for example {__name__=~'elasticsearch_cluster_health_(active|initializing)_shards'}
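
For example (a sketch using only the states that appear in the exporter output earlier in this task), this gives one series per cluster summing those states:

sum by (cluster) ({__name__=~'elasticsearch_cluster_health_(active|delayed_unassigned|initializing)_shards'})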

The sum of all shard states does make some kind of sense: the sum of all states should give the total number of shards, but the total is also exposed... so I'm not sure.

The __name__ trick should work in my case. So let's leave it at that atm. Thanks!

Looking at the prometheus-jmx-exporter .deb, it seems to depend on default-jre, which is openjdk-7-jre on Jessie. We use OpenJDK 8 for elasticsearch (and wdqs, and hopefully for most of our Java based stack). Having 2 different JREs installed is a source of unnecessary pain. My understanding of Debian packaging is low enough that I don't really know how this should be fixed. Maybe depend on java7-runtime instead of default-jre, since both openjdk-7 and openjdk-8 provide it? Or remove the JRE dependency completely, since the application being monitored is responsible for depending on the appropriate JRE (this is true for the agent part of the exporter, not for the HTTP server part).
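
For illustration, the first option (depending on the virtual package) might look like this in debian/control; this is a hypothetical fragment, not the actual prometheus-jmx-exporter packaging:

# hypothetical fragment, only to illustrate the virtual-package dependency
Depends: ${misc:Depends}, java7-runtime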

I noticed the same re: jre dependencies and fixed it in https://gerrit.wikimedia.org/r/#/c/394322/ though that version isn't built/uploaded yet. @Ottomata what's the procedure you are using for those? I see versions with ~jessie1 and ~stretch1 uploaded ATM

While migrating the existing grafana dashboards, I found that some of them are broken and most probably unused. We should delete them instead of taking the time to migrate them:

@EBernhardson do you want to keep one of those?

Load testing and percentiles could certainly go away. Cluster recovery might be useful at some point in the future, but that's hard to say. All of that data is available in other dashboards anyway, just not broken out by server and gathered in a single board. It certainly won't be immediately useful, and since the data is there, it can be recreated as necessary.

Unrelated to dashboards, but relevant for prometheus: we will likely need a fork (or an additional custom collector) to collect extra metrics that are only reported by our cluster. Specifically, we collect per-node latency percentiles through a custom API endpoint on the elasticsearch servers. This isn't even in diamond yet, as we only recently upgraded the plugin version on the cluster to expose these metrics.

I'm probably doing something wrong with the ~jessie1 and ~stretch1 versions. It's JVM, right? So the same build should be good for both distros? Not sure. I've just always done that and didn't bother to check whether I shouldn't.

Hm, looks like the last line of README.debian got cut off. Should be:

USENETWORK=yes GIT_PBUILDER_AUTOCONF=no DIST=stretch WIKIMEDIA=yes gbp buildpackage -sa -us -uc --git-builder=git-pbuilder

I ported elasticsearch-memory and elasticsearch-indexing.

Some metrics were unavailable in prometheus (not counting the mediawiki counters), so I flagged these graphs as (prometheus todo).

Available in graphite but missing in prometheus (elasticsearch-specific):

  • elasticsearch.indices.segments.terms_memory_in_bytes
  • elasticsearch.indices.segments.index_writer_memory_in_bytes
  • elasticsearch.indices.completion.size_in_bytes
  • elasticsearch.indices.segments.norms_memory_in_bytes
  • elasticsearch.indices.segments.stored_fields_memory_in_bytes
  • elasticsearch.indices.segments.doc_values_memory_in_bytes
  • elasticsearch.indices.segments.fixed_bit_set_memory_in_bytes

iostats:

  • iostat.{device}.iops: used node_disk_reads_completed + node_disk_writes_completed instead (see the query sketch after this list)
  • iostat.average_queue_length: used node_disk_io_now instead, but it is not very precise and the curve is always flat
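
The IOPS replacement boils down to a query like this (untested sketch, 5-minute rates over the node_exporter counters mentioned above):

sum by (instance) (rate(node_disk_reads_completed[5m]) + rate(node_disk_writes_completed[5m]))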

Additional missing metrics:

  • elasticsearch.indices.search.groups.prefix.query_total
  • elasticsearch.indices.search.groups.prefix.query_time_in_millis
  • elasticsearch.indices.suggest.total
  • elasticsearch.indices.suggest.time_in_millis
  • elasticsearch.indices.search.groups.full_text.query_total
  • elasticsearch.indices.search.groups.full_text.query_time_in_millis
  • elasticsearch.indices.search.groups.more_like.query_total
  • elasticsearch.indices.search.groups.more_like.query_time_in_millis

While elasticsearch is JVM, the exporter is Go: it runs outside the JVM and uses the elasticsearch HTTP API to collect metrics. As I understand it, Go statically links everything, so the same build should still be good for both distros (but please check me on that).

That's correct. For jessie we couldn't even build these since the toolchain is fairly immature there.

"While elasticsearch is JVM, the exporter is Go"?

Huh, I thought we were talking about prometheus-jmx-exporter, which is a java project: https://github.com/wikimedia/operations-debs-prometheus-jmx-exporter/tree/master/jmx_prometheus_httpserver/src/main/java/io/prometheus/jmx

Unfortunately, none of the elasticsearch-specific metrics are exposed over JMX. We can get generic JVM info that way, but for the specialized stats we have to query the elasticsearch APIs.

Ah ok, my response was to Filippo asking how to build prometheus-jmx-exporter.

Short summary of the status of this task (since quite a lot of discussion happened):

  • the prometheus elasticsearch_exporter is deployed everywhere
  • some metrics are missing; an issue has been opened upstream (https://github.com/justwatchcom/elasticsearch_exporter/issues/115)
  • some metrics are specific to us (per-node latency percentiles); we need to implement a specific collector for those (this should be done in another task: T183451)
  • copies of all existing dashboards have been created, using the Prometheus metrics
  • the old dashboards should be deleted once we have enough historical data (1 month?)
  • jmx_exporter has not yet been deployed or used in any way

A .deb of the prometheus jmx_exporter is now available. I started to experiment on deployment-elastic06. Elasticsearch installs a fairly strict security manager that prevents the jmx_exporter agent from working properly. Adding a /home/elasticsearch/.java.policy file with the content below solves the issue:

grant codeBase "file:/usr/share/java/prometheus/-" {
  permission javax.management.MBeanServerPermission "createMBeanServer";
  permission javax.management.MBeanPermission "*", "*";
  permission java.lang.RuntimePermission "accessClassInPackage.sun.management";
  permission java.io.FilePermission "/etc/elasticsearch/prometheus_jmx_exporter.yaml", "read";
  permission java.io.FilePermission "/proc/self/status", "read";
};

It isn't very clear from the jmx_exporter documentation, but in addition to the MBeans that you configure, the exporter always exposes some standard JVM metrics. Those seem sufficient as JVM-level metrics, and they come with good naming and help messages.
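
For example, the standard JVM metrics look roughly like this (illustrative sample only, values made up):

# TYPE jvm_memory_bytes_used gauge
jvm_memory_bytes_used{area="heap"} 1.234e+09
jvm_memory_bytes_used{area="nonheap"} 9.8e+07
# TYPE jvm_gc_collection_seconds summary
jvm_gc_collection_seconds_count{gc="G1 Young Generation"} 42
jvm_gc_collection_seconds_sum{gc="G1 Young Generation"} 1.5
# TYPE jvm_threads_current gauge
jvm_threads_current 87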

Change 402095 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] elasticsearch / prometheus: enable prometheus jmx_exporter

https://gerrit.wikimedia.org/r/402095

Playing with jmx_exporter and elasticsearch, it looks like the metrics exposed through the elasticsearch API are already sufficient, and the naming is consistent between jmx_exporter and elasticsearch_exporter. So we can probably drop the jmx_exporter for elasticsearch.

Upstream has released version 1.0.2 with the additional elasticsearch metrics that we need: https://github.com/justwatchcom/elasticsearch_exporter/tree/v1.0.2

The elasticsearch_exporter has been upgraded across all elasticsearch nodes. We should now have all the metrics we need. We can close this task and reopen it if we find something missing.

Change 402095 abandoned by Gehel:
elasticsearch / prometheus: enable prometheus jmx_exporter

Reason:
Correct, we have all the metrics we need through the elasticsearch_exporter. Dropping this change.

https://gerrit.wikimedia.org/r/402095