Page MenuHomePhabricator

Prometheus etherpad scrape failure
Closed, ResolvedPublic

Description

This is a followup from JobDown alert for etherpad1003, the /metrics endpoint can't be scraped anymore:

prometheus1005:~$ curl etherpad1003:9198/metrics -v
* Uses proxy env variable no_proxy == '.wmnet'
*   Trying 2620:0:861:103:10:64:32:181:9198...
* connect to 2620:0:861:103:10:64:32:181 port 9198 failed: Connection refused
*   Trying 10.64.32.181:9198...
* Connected to etherpad1003 (10.64.32.181) port 9198 (#0)
> GET /metrics HTTP/1.1
> Host: etherpad1003:9198
> User-Agent: curl/7.74.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
* HTTP 1.0, assume close after body
< HTTP/1.0 500 Internal Server Error
< Date: Wed, 16 Feb 2022 13:07:17 GMT
< Server: WSGIServer/0.2 CPython/3.9.2
< Content-Type: text/plain
< Content-Length: 59
< 
* Closing connection 0
A server error occurred.  Please contact the administrator.prometheus1005:~$ 
prometheus1005:~$

Event Timeline

Hm, this shows up in the logs

Feb 16 13:40:29 etherpad1003 prometheus-etherpad-exporter[805482]: UnboundLocalError: local variable 'metric_name' referenced before assignment
Feb 16 13:40:29 etherpad1003 prometheus-etherpad-exporter[805482]:     if metric_name is None:
Feb 16 13:40:29 etherpad1003 prometheus-etherpad-exporter[805482]:   File "/usr/bin/prometheus-etherpad-exporter", line 98, in collect

It's from https://gerrit.wikimedia.org/r/plugins/gitiles/operations/debs/prometheus-etherpad-exporter/+/refs/heads/master/prometheus-etherpad-exporter#98

Looks like some new metric was added, none of the previous if clauses match and metric_name ends up being undefined. Easy to fix, I 'll upload a new version shortly

Change 763231 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/debs/prometheus-etherpad-exporter@master] Make sure that metric_name is defined in all cases

https://gerrit.wikimedia.org/r/763231

There seem to be quite a few new metrics around.

{
  "httpStartTime": 1644710413335,
  "memoryUsage": 328769536,
  "memoryUsageHeap": 170679824,
  "ueberdb_lockAwaits": 7301490,
  "ueberdb_lockAcquires": 10998304,
  "ueberdb_lockReleases": 10998304,
  "ueberdb_reads": 9734693,
  "ueberdb_readsFailed": 0,
  "ueberdb_readsFinished": 9734693,
  "ueberdb_readsFromCache": 8657615,
  "ueberdb_readsFromDb": 1077078,
  "ueberdb_readsFromDbFailed": 0,
  "ueberdb_readsFromDbFinished": 1077078,
  "ueberdb_writes": 1766897,
  "ueberdb_writesFailed": 0,
  "ueberdb_writesFinished": 1766897,
  "ueberdb_writesObsoleted": 2260,
  "ueberdb_writesToDb": 1764637,
  "ueberdb_writesToDbFailed": 0,
  "ueberdb_writesToDbFinished": 1764637,
  "ueberdb_writesToDbRetried": 0,
  "totalUsers": 252,
  "activePads": 178,
  "connects": {
    "mean": 0.28224445351117344,
    "count": 87207,
    "currentRate": 0.7787125514684425,
    "1MinuteRate": 1.0583791312777027,
    "5MinuteRate": 1.0369410626406488,
    "15MinuteRate": 0.8715532876866717
  },
  "httpRequests": {
    "meter": {
      "mean": 0.08954479707058248,
      "count": 27667,
      "currentRate": 0.38936207733384914,
      "1MinuteRate": 0.18853907914441984,
      "5MinuteRate": 0.17310127132335698,
      "15MinuteRate": 0.16664917651700692
    },
    "histogram": {
      "min": 0.4181770086288452,
      "max": 47645.5064420104,
      "sum": 303403.0886002779,
      "variance": 98463.8855161045,
      "mean": 10.966244572966996,
      "stddev": 313.7895560978799,
      "count": 27667,
      "median": 4.337095528841019,
      "p75": 6.655615255236626,
      "p95": 12.588126784563048,
      "p99": 34.504288427829785,
      "p999": 178.42954433608062
    }
  },
  "pendingEdits": 0,
  "edits": {
    "meter": {
      "mean": 1.3253110001104356,
      "count": 409465,
      "currentRate": 1.1675256595203307,
      "1MinuteRate": 2.8448791339993447,
      "5MinuteRate": 2.3583360562928872,
      "15MinuteRate": 2.2805906636154085
    },
    "histogram": {
      "min": 0.7772610187530518,
      "max": 5701.008677005768,
      "sum": 42942831.51661429,
      "variance": 2079.1729035240323,
      "mean": 104.87546314487022,
      "stddev": 45.59794845740357,
      "count": 409465,
      "median": 103.72353249788284,
      "p75": 108.33816875517368,
      "p95": 166.6283551812172,
      "p99": 180.4083765709402,
      "p999": 530.7297073833348
    }
  },
  "lastDisconnect": 1645019390274,
  "disconnects": {
    "mean": 0.2814571619698628,
    "count": 86955,
    "currentRate": 0.7770007401700038,
    "1MinuteRate": 1.0230088887220086,
    "5MinuteRate": 1.0306174352352657,
    "15MinuteRate": 0.8690210290804122
  },
  "failedChangesets": {
    "mean": 0.00023967715862466122,
    "count": 70,
    "currentRate": 0,
    "1MinuteRate": 1.6366468134818745e-69,
    "5MinuteRate": 1.3221474277977508e-16,
    "15MinuteRate": 3.790005778299575e-08
  }
}

Change 763238 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/debs/prometheus-etherpad-exporter@master] Add new metrics present in 1.8.16

https://gerrit.wikimedia.org/r/763238

Change 763239 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/debs/prometheus-etherpad-exporter@master] Release 0.6 prometheus-etherpad-exporter

https://gerrit.wikimedia.org/r/763239

I 've hotpatched this in production to stop the bleeding for now but the proper way to solve this is of course to add support for the new metrics and make sure to not throw exceptions when new metrics show up. The 3 patches linked in this task do exactly that.

@Dzahn waiting for your reviews!

akosiaris triaged this task as Medium priority.Feb 16 2022, 3:18 PM

Change 763231 merged by Dzahn:

[operations/debs/prometheus-etherpad-exporter@master] Make sure that metric_name is defined in all cases

https://gerrit.wikimedia.org/r/763231

Change 763238 merged by Dzahn:

[operations/debs/prometheus-etherpad-exporter@master] Add new metrics present in 1.8.16

https://gerrit.wikimedia.org/r/763238

Change 763239 merged by Dzahn:

[operations/debs/prometheus-etherpad-exporter@master] Release 0.6 prometheus-etherpad-exporter

https://gerrit.wikimedia.org/r/763239

Mentioned in SAL (#wikimedia-operations) [2022-02-16T21:54:40Z] <mutante> merged Alex's changes, built prometheus-etherpad-exporter_0.6 on deneb, imported on apt1001, ran reprepro export, installed new version on etherpad1003 T301872

[apt1001:~] $ sudo -E reprepro ls prometheus-etherpad-exporter
prometheus-etherpad-exporter | 0.3 |   buster-wikimedia | amd64, i386, source
prometheus-etherpad-exporter | 0.6 | bullseye-wikimedia | amd64, i386, source
[etherpad1003:~] $ dpkg -l | grep prometheus
ii  prometheus-etherpad-exporter         0.6                            all          Prometheus exporter for Etherpad Lite
[prometheus1005:~] $ curl etherpad1003:9198/metrics -v
*   Trying 2620:0:861:103:10:64:32:181:9198...
* connect to 2620:0:861:103:10:64:32:181 port 9198 failed: Connection refused
*   Trying 10.64.32.181:9198...
* Connected to etherpad1003 (10.64.32.181) port 9198 (#0)
> GET /metrics HTTP/1.1
> Host: etherpad1003:9198
> User-Agent: curl/7.74.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
* HTTP 1.0, assume close after body
< HTTP/1.0 200 OK
< Date: Wed, 16 Feb 2022 21:55:44 GMT
< Server: WSGIServer/0.2 CPython/3.9.2
< Content-Type: text/plain; version=0.0.4; charset=utf-8
< Content-Length: 4036
< 
# HELP python_gc_objects_collected_total Objects collected during gc
# TYPE python_gc_objects_collected_total counter
python_gc_objects_collected_total{generation="0"} 405.0
python_gc_objects_collected_total{generation="1"} 6.0
python_gc_objects_collected_total{generation="2"} 0.0
# HELP python_gc_objects_uncollectable_total Uncollectable object found during GC
# TYPE python_gc_objects_uncollectable_total counter
python_gc_objects_uncollectable_total{generation="0"} 0.0
python_gc_objects_uncollectable_total{generation="1"} 0.0
python_gc_objects_uncollectable_total{generation="2"} 0.0
# HELP python_gc_collections_total Number of times this generation was collected
# TYPE python_gc_collections_total counter
python_gc_collections_total{generation="0"} 39.0
python_gc_collections_total{generation="1"} 3.0
python_gc_collections_total{generation="2"} 0.0
# HELP python_info Python platform information
# TYPE python_info gauge
python_info{implementation="CPython",major="3",minor="9",patchlevel="2",version="3.9.2"} 1.0
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 1.78855936e+08
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 2.0381696e+07
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.64504835701e+09
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 0.16999999999999998
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 6.0
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1024.0
# HELP etherpad_scrape_duration_seconds Etherpad exporter scrape duration
# TYPE etherpad_scrape_duration_seconds summary
etherpad_scrape_duration_seconds_count 8.0
etherpad_scrape_duration_seconds_sum 4.258903209120035e-05
# HELP etherpad_scrape_duration_seconds_created Etherpad exporter scrape duration
# TYPE etherpad_scrape_duration_seconds_created gauge
etherpad_scrape_duration_seconds_created 1.645048357271543e+09
# HELP etherpad_up Etherpad is running
# TYPE etherpad_up gauge
etherpad_up 1.0
# HELP etherpad_pending_edits Number of pending edits
# TYPE etherpad_pending_edits gauge
etherpad_pending_edits 0.0
# HELP etherpad_memory_usage Memory consumption in bytes
# TYPE etherpad_memory_usage gauge
etherpad_memory_usage 2.93195776e+08
# HELP etherpad_memory_usage_heap Heap memory consumption in bytes
# TYPE etherpad_memory_usage_heap gauge
etherpad_memory_usage_heap 1.40815008e+08
# HELP etherpad_started Timestamp of etherpad startup
# TYPE etherpad_started gauge
etherpad_started 1.644710413335e+012
# HELP etherpad_last_disconnect Timestamp of the last disconnect
# TYPE etherpad_last_disconnect gauge
etherpad_last_disconnect 1.645048544362e+012
# HELP etherpad_total_users Number of total users
# TYPE etherpad_total_users gauge
etherpad_total_users 141.0
# HELP etherpad_active_pads Number of active pads
# TYPE etherpad_active_pads gauge
etherpad_active_pads 122.0
# HELP etherpad_connects_total Number of connections
# TYPE etherpad_connects_total counter
etherpad_connects_total 101185.0
# HELP etherpad_disconnects_total Number of disconnects
# TYPE etherpad_disconnects_total counter
etherpad_disconnects_total 101044.0
# HELP etherpad_ratelimits_total Number of rate limits
# TYPE etherpad_ratelimits_total counter
# HELP etherpad_failed_changesets_total Number of failed changesets
# TYPE etherpad_failed_changesets_total counter
etherpad_failed_changesets_total 72.0
# HELP etherpad_edits_total Number of edits
# TYPE etherpad_edits_total counter
etherpad_edits_total 483369.0
# HELP etherpad_http_reqs_total Number of HTTP requests
# TYPE etherpad_http_reqs_total counter
etherpad_http_reqs_total 30984.0
* Closing connection 0
[prometheus1005:~] $

@akosiaris :) reviewed / merged patches, built package on deneb, uploaded package on apt1001, imported with reprepro, exported indices, installed package on etherpad1003

@fgiunchedi see above and the curl output ^ :) I can't find instance=etherpad1003 in alertmanager (and have some questions about that) but it should mean there is no active alert and it's not just my silence I had tried add.

I kind of miss being able to paste a screenshot or link that shows it "green" in alertmanager, but I will optimistically call it resolved.