Page MenuHomePhabricator

prometheus on bast3002 misbehaving
Closed, InvalidPublic

Description

ops channel being spammed with CRITICAL icinga alerts from seemingly all the prometheus-based checks in esams, due to some issue with the prometheus sever instance on bast3002 (shouldn't these be UNKNOWNs in a case like this?).

Loadavg, iowait%, disk read traffic all seem high on bast3002. prometheus@ops had these in the journalctl log from near the time of the start of problems (but not exactly):

Apr 19 22:33:21 bast3002 prometheus@ops[9406]: time="2018-04-19T22:33:21Z" level=warning msg="Error while evaluating rule \"ALERT PrometheusReloadFailed\n\tIF prometheus_config_last_reload_successful == 0\n\tFOR 1h\n\tLABELS {severity=\\"warn\\"}\n\tANNOTATIONS {SUMMARY=\\"Prometheus {{$labels.instance}} config reload fail\\"}\": query timed out in query queue" source="manager.go:282"
Apr 19 22:33:21 bast3002 prometheus@ops[9406]: time="2018-04-19T22:33:21Z" level=warning msg="Error while evaluating rule \"ALERT InstanceDown\n\tIF up == 0\n\tFOR 3m\n\tLABELS {severity=\\"warn\\"}\n\tANNOTATIONS {SUMMARY=\\"Instance {{$labels.instance}}/{{$labels.job}} down\\"}\": query timed out in query queue" source="manager.go:282"

If it flaps again, I'll try restarting the prometheus service there and see what happens.

Event Timeline

BBlack triaged this task as High priority.Apr 19 2018, 10:49 PM
BBlack created this task.

It did keep spamming by the time I got done writing the above. Attempting to stop it now, but the basic daemon "stop" operation via systemctl is taking quite a long time (over 3 minutes and counting, even though parts of it have logged that they're gone):

● prometheus@ops.service - prometheus server (instance ops)
   Loaded: loaded (/lib/systemd/system/prometheus@ops.service; static)
   Active: deactivating (stop-sigterm) since Thu 2018-04-19 22:51:00 UTC; 3min 38s ago

Logs a bit before and then after requesting the shutdown above:

Apr 19 22:33:21 bast3002 prometheus@ops[9406]: time="2018-04-19T22:33:21Z" level=warning msg="Error while evaluating rule \"ALERT PrometheusRel
Apr 19 22:33:21 bast3002 prometheus@ops[9406]: time="2018-04-19T22:33:21Z" level=warning msg="Error while evaluating rule \"ALERT InstanceDown\
Apr 19 22:34:24 bast3002 prometheus@ops[9406]: time="2018-04-19T22:34:24Z" level=info msg="Checkpointing in-memory metrics and chunks..." sourc
Apr 19 22:39:42 bast3002 prometheus@ops[9406]: time="2018-04-19T22:39:42Z" level=info msg="Done checkpointing in-memory metrics and chunks in 5
Apr 19 22:44:59 bast3002 prometheus@ops[9406]: time="2018-04-19T22:44:59Z" level=info msg="Checkpointing in-memory metrics and chunks..." sourc
Apr 19 22:47:11 bast3002 prometheus@ops[9406]: time="2018-04-19T22:47:11Z" level=warning msg="Error while evaluating rule \"ALERT InstanceDown\
Apr 19 22:51:00 bast3002 systemd[1]: Stopping prometheus server (instance ops)...
-- Subject: Unit prometheus@ops.service has begun shutting down
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit prometheus@ops.service has begun shutting down.
Apr 19 22:51:00 bast3002 prometheus@ops[9406]: time="2018-04-19T22:51:00Z" level=warning msg="Received SIGTERM, exiting gracefully..." source="
Apr 19 22:51:00 bast3002 prometheus@ops[9406]: time="2018-04-19T22:51:00Z" level=info msg="See you next time!" source="main.go:243"
Apr 19 22:51:00 bast3002 prometheus@ops[9406]: time="2018-04-19T22:51:00Z" level=info msg="Stopping target manager..." source="targetmanager.go
Apr 19 22:51:04 bast3002 prometheus@ops[9406]: time="2018-04-19T22:51:04Z" level=info msg="Target manager stopped." source="targetmanager.go:89
Apr 19 22:51:04 bast3002 prometheus@ops[9406]: time="2018-04-19T22:51:04Z" level=info msg="Stopping rule manager..." source="manager.go:388"
Apr 19 22:51:10 bast3002 prometheus@ops[9406]: time="2018-04-19T22:51:10Z" level=info msg="Done checkpointing in-memory metrics and chunks in 6
Apr 19 22:52:02 bast3002 prometheus@ops[9406]: time="2018-04-19T22:52:02Z" level=info msg="Rule manager stopped." source="manager.go:394"
Apr 19 22:52:02 bast3002 prometheus@ops[9406]: time="2018-04-19T22:52:02Z" level=info msg="Stopping notification handler..." source="notifier.g
Apr 19 22:52:02 bast3002 prometheus@ops[9406]: time="2018-04-19T22:52:02Z" level=info msg="Stopping local storage..." source="storage.go:457"
Apr 19 22:52:02 bast3002 prometheus@ops[9406]: time="2018-04-19T22:52:02Z" level=info msg="Stopping maintenance loop..." source="storage.go:459
Apr 19 22:52:03 bast3002 prometheus@ops[9406]: time="2018-04-19T22:52:03Z" level=info msg="Maintenance loop stopped." source="storage.go:1463"
Apr 19 22:52:03 bast3002 prometheus@ops[9406]: time="2018-04-19T22:52:03Z" level=info msg="Stopping series quarantining..." source="storage.go:
Apr 19 22:52:03 bast3002 prometheus@ops[9406]: time="2018-04-19T22:52:03Z" level=info msg="Series quarantining stopped." source="storage.go:191
Apr 19 22:52:03 bast3002 prometheus@ops[9406]: time="2018-04-19T22:52:03Z" level=info msg="Stopping chunk eviction..." source="storage.go:467"
Apr 19 22:52:03 bast3002 prometheus@ops[9406]: time="2018-04-19T22:52:03Z" level=info msg="Chunk eviction stopped." source="storage.go:1158"
Apr 19 22:52:03 bast3002 prometheus@ops[9406]: time="2018-04-19T22:52:03Z" level=info msg="Checkpointing in-memory metrics and chunks..." sourc

And that was followed by this, by the time it finally stopped itself ~5 minutes later:

Apr 19 22:55:47 bast3002 prometheus@ops[9406]: time="2018-04-19T22:55:47Z" level=info msg="Done checkpointing in-memory metrics and chunks in 3
Apr 19 22:55:47 bast3002 prometheus@ops[9406]: time="2018-04-19T22:55:47Z" level=info msg="Checkpointing fingerprint mappings..." source="persi
Apr 19 22:55:48 bast3002 prometheus@ops[9406]: time="2018-04-19T22:55:48Z" level=info msg="Done checkpointing fingerprint mappings in 690.40272
Apr 19 22:55:49 bast3002 prometheus@ops[9406]: time="2018-04-19T22:55:49Z" level=error msg="Error unarchiving fingerprint 05d932c8af845af9 (met
Apr 19 22:55:49 bast3002 prometheus@ops[9406]: time="2018-04-19T22:55:49Z" level=error msg="Error unarchiving fingerprint 92fa66fb4cf5b4ac (met
Apr 19 22:55:49 bast3002 prometheus@ops[9406]: time="2018-04-19T22:55:49Z" level=error msg="Error unarchiving fingerprint b3bc5c659c5701f2 (met
Apr 19 22:55:49 bast3002 prometheus@ops[9406]: time="2018-04-19T22:55:49Z" level=error msg="Error unarchiving fingerprint 67cc3bbc9996f80a (met
Apr 19 22:55:49 bast3002 prometheus@ops[9406]: time="2018-04-19T22:55:49Z" level=error msg="Error unarchiving fingerprint 6c00f4c49f4d7f44 (met
Apr 19 22:55:49 bast3002 prometheus@ops[9406]: time="2018-04-19T22:55:49Z" level=error msg="The storage is now inconsistent. Restart Prometheus
Apr 19 22:55:49 bast3002 prometheus@ops[9406]: time="2018-04-19T22:55:49Z" level=error msg="Error unarchiving fingerprint 4b2269a7a2bc0903 (met
Apr 19 22:55:49 bast3002 prometheus@ops[9406]: time="2018-04-19T22:55:49Z" level=error msg="Error unarchiving fingerprint 77b10b7d10ad866b (met
Apr 19 22:55:49 bast3002 prometheus@ops[9406]: time="2018-04-19T22:55:49Z" level=error msg="Error unarchiving fingerprint 3f98cf3aa324de5b (met
Apr 19 22:55:49 bast3002 prometheus@ops[9406]: time="2018-04-19T22:55:49Z" level=error msg="Error unarchiving fingerprint 31f74eca61e77fc7 (met
Apr 19 22:55:49 bast3002 prometheus@ops[9406]: time="2018-04-19T22:55:49Z" level=info msg="Local storage stopped." source="storage.go:484"
Apr 19 22:55:50 bast3002 systemd[1]: Stopped prometheus server (instance ops).
-- Subject: Unit prometheus@ops.service has finished shutting down
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit prometheus@ops.service has finished shutting down.

It seems to be having problems coming up cleanly too, so more spam. First chunk of startup logs:

Apr 19 22:58:02 bast3002 systemd[1]: Starting prometheus server (instance ops)...
-- Subject: Unit prometheus@ops.service has begun with start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit prometheus@ops.service has begun starting up.
Apr 19 22:58:02 bast3002 systemd[1]: Started prometheus server (instance ops).
-- Subject: Unit prometheus@ops.service has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit prometheus@ops.service has finished starting up.
-- 
-- The start-up result is done.
Apr 19 22:58:03 bast3002 prometheus@ops[32101]: time="2018-04-19T22:58:03Z" level=warning msg="Flag -storage.local.max-chunks-to-persist is deprecated. It has no effect." source="config.go:321"
Apr 19 22:58:03 bast3002 prometheus@ops[32101]: time="2018-04-19T22:58:03Z" level=warning msg="Flag -storage.local.memory-chunks is deprecated. Its value 1048576 is used to override -storage.local.target-heap-size to 3221225472." source="config.go:325"
Apr 19 22:58:03 bast3002 prometheus@ops[32101]: time="2018-04-19T22:58:03Z" level=info msg="Starting prometheus (version=1.8.1+ds+k8s, branch=debian/sid, revision=1.8.1+ds+k8s-1)" source="main.go:87"
Apr 19 22:58:03 bast3002 prometheus@ops[32101]: time="2018-04-19T22:58:03Z" level=info msg="Build context (go=go1.9.2, user=pkg-go-maintainers@lists.alioth.debian.org, date=20171110-08:34:34)" source="main.go:88"
Apr 19 22:58:03 bast3002 prometheus@ops[32101]: time="2018-04-19T22:58:03Z" level=info msg="Host details (Linux 4.9.0-0.bpo.3-amd64 #1 SMP Debian 4.9.25-1~bpo8+3 (2017-06-15) x86_64 bast3002 (none))" source="main.go:89"
Apr 19 22:58:03 bast3002 prometheus@ops[32101]: time="2018-04-19T22:58:03Z" level=info msg="Loading configuration file /srv/prometheus/ops/prometheus.yml" source="main.go:254"
Apr 19 22:58:03 bast3002 prometheus@ops[32101]: time="2018-04-19T22:58:03Z" level=info msg="Listening on 127.0.0.1:9900" source="web.go:317"
Apr 19 22:58:07 bast3002 prometheus@ops[32101]: time="2018-04-19T22:58:07Z" level=info msg="Loading series map and head chunks..." source="storage.go:428"
Apr 19 22:58:11 bast3002 prometheus@ops[32101]: time="2018-04-19T22:58:11Z" level=warning msg="Persistence layer appears dirty." source="persistence.go:846"
Apr 19 22:58:11 bast3002 prometheus@ops[32101]: time="2018-04-19T22:58:11Z" level=warning msg="Starting crash recovery. Prometheus is inoperational until complete." source="crashrecovery.go:40"
Apr 19 22:58:11 bast3002 prometheus@ops[32101]: time="2018-04-19T22:58:11Z" level=warning msg="To avoid crash recovery in the future, shut down Prometheus with SIGTERM or a HTTP POST to /-/quit." source="crashrecovery.go:41"
Apr 19 22:58:11 bast3002 prometheus@ops[32101]: time="2018-04-19T22:58:11Z" level=info msg="Scanning files." source="crashrecovery.go:55"
Apr 19 22:58:15 bast3002 prometheus@ops[32101]: time="2018-04-19T22:58:15Z" level=warning msg="Fingerprint 00826b4e106cc773 assumed archived but couldn't be found in archived index." source="crashrecovery.go:394"
Apr 19 22:58:17 bast3002 prometheus@ops[32101]: time="2018-04-19T22:58:17Z" level=warning msg="Fingerprint 00a3235f9b53186c assumed archived but couldn't be found in archived index." source="crashrecovery.go:394"
Apr 19 22:58:20 bast3002 prometheus@ops[32101]: time="2018-04-19T22:58:20Z" level=warning msg="Fingerprint 0019c85972311a74 assumed archived but couldn't be found in archived index." source="crashrecovery.go:394"
Apr 19 22:58:20 bast3002 prometheus@ops[32101]: time="2018-04-19T22:58:20Z" level=warning msg="Fingerprint 006529155f8574df assumed archived but couldn't be found in archived index." source="crashrecovery.go:394"
Apr 19 22:58:20 bast3002 prometheus@ops[32101]: time="2018-04-19T22:58:20Z" level=warning msg="Fingerprint 00dd7e63e03c6a44 assumed archived but couldn't be found in archived index." source="crashrecovery.go:394"
Apr 19 22:58:20 bast3002 prometheus@ops[32101]: time="2018-04-19T22:58:20Z" level=warning msg="Fingerprint 0062b726b1ceb527 assumed archived but couldn't be found in archived index." source="crashrecovery.go:394"
Apr 19 22:58:21 bast3002 prometheus@ops[32101]: time="2018-04-19T22:58:21Z" level=warning msg="Fingerprint 00c787cb9c8d522c assumed archived but couldn't be found in archived index." source="crashrecovery.go:394"
Apr 19 22:58:22 bast3002 prometheus@ops[32101]: time="2018-04-19T22:58:22Z" level=warning msg="Fingerprint 00d4bcc480935c7e assumed archived but couldn't be found in archived index." source="crashrecovery.go:394"
Apr 19 22:58:22 bast3002 prometheus@ops[32101]: time="2018-04-19T22:58:22Z" level=info msg="10000 files scanned." source="crashrecovery.go:77"
Apr 19 22:58:24 bast3002 prometheus@ops[32101]: time="2018-04-19T22:58:24Z" level=warning msg="Fingerprint 0092c0979f8958c4 assumed archived but couldn't be found in archived index." source="crashrecovery.go:394"
Apr 19 22:58:24 bast3002 prometheus@ops[32101]: time="2018-04-19T22:58:24Z" level=warning msg="Fingerprint 0050273a81303f71 assumed archived but couldn't be found in archived index." source="crashrecovery.go:394"
Apr 19 22:58:24 bast3002 prometheus@ops[32101]: time="2018-04-19T22:58:24Z" level=warning msg="Fingerprint 0075b28cc6a9e4a5 assumed archived but couldn't be found in archived index." source="crashrecovery.go:394"
Apr 19 22:58:24 bast3002 prometheus@ops[32101]: time="2018-04-19T22:58:24Z" level=warning msg="Fingerprint 00d7a8b11de09a8c assumed archived but couldn't be found in archived index." source="crashrecovery.go:394"
Apr 19 22:58:25 bast3002 prometheus@ops[32101]: time="2018-04-19T22:58:25Z" level=warning msg="Fingerprint 01c2739ffabec5bd assumed archived but couldn't be found in archived index." source="crashrecovery.go:394"
Apr 19 22:58:25 bast3002 prometheus@ops[32101]: time="2018-04-19T22:58:25Z" level=warning msg="Fingerprint 013ea435a8018e4a assumed archived but couldn't be found in archived index." source="crashrecovery.go:394"
Apr 19 22:58:26 bast3002 prometheus@ops[32101]: time="2018-04-19T22:58:26Z" level=warning msg="Fingerprint 01df166b363f1223 assumed archived but couldn't be found in archived index." source="crashrecovery.go:394"
Apr 19 22:58:26 bast3002 prometheus@ops[32101]: time="2018-04-19T22:58:26Z" level=warning msg="Fingerprint 0164d109a00631aa assumed archived but couldn't be found in archived index." source="crashrecovery.go:394"
Apr 19 22:58:26 bast3002 prometheus@ops[32101]: time="2018-04-19T22:58:26Z" level=warning msg="Fingerprint 0190afb306860d4c assumed archived but couldn't be found in archived index." source="crashrecovery.go:394"
Apr 19 22:58:27 bast3002 prometheus@ops[32101]: time="2018-04-19T22:58:27Z" level=warning msg="Fingerprint 01b96a3900a084ef assumed archived but couldn't be found in archived index." source="crashrecovery.go:394"
Apr 19 22:58:27 bast3002 prometheus@ops[32101]: time="2018-04-19T22:58:27Z" level=warning msg="Fingerprint 01e2f1aa2d9367a7 assumed archived but couldn't be found in archived index." source="crashrecovery.go:394"
Apr 19 22:58:27 bast3002 prometheus@ops[32101]: time="2018-04-19T22:58:27Z" level=warning msg="Fingerprint 0140a17a6cf72255 assumed archived but couldn't be found in archived index." source="crashrecovery.go:394"
Apr 19 22:58:27 bast3002 prometheus@ops[32101]: time="2018-04-19T22:58:27Z" level=warning msg="Fingerprint 01d444924e37a6dc assumed archived but couldn't be found in archived index." source="crashrecovery.go:394"
Apr 19 22:58:27 bast3002 prometheus@ops[32101]: time="2018-04-19T22:58:27Z" level=warning msg="Fingerprint 01bd45fef7ccfe69 assumed archived but couldn't be found in archived index." source="crashrecovery.go:394"
Apr 19 22:58:27 bast3002 prometheus@ops[32101]: time="2018-04-19T22:58:27Z" level=warning msg="Fingerprint 0186b28d9a5daeea assumed archived but couldn't be found in archived index." source="crashrecovery.go:394"
Apr 19 22:58:27 bast3002 prometheus@ops[32101]: time="2018-04-19T22:58:27Z" level=warning msg="Fingerprint 01d85e1bb88194b2 assumed archived but couldn't be found in archived index." source="crashrecovery.go:394"
Apr 19 22:58:27 bast3002 prometheus@ops[32101]: time="2018-04-19T22:58:27Z" level=warning msg="Fingerprint 010a9e68d5f2a46c assumed archived but couldn't be found in archived index." source="crashrecovery.go:394"
Apr 19 22:58:27 bast3002 prometheus@ops[32101]: time="2018-04-19T22:58:27Z" level=warning msg="Fingerprint 0167faf9c599db46 assumed archived but couldn't be found in archived index." source="crashrecovery.go:394"
Apr 19 22:58:27 bast3002 prometheus@ops[32101]: time="2018-04-19T22:58:27Z" level=warning msg="Fingerprint 011cb1d44e00a5ee assumed archived but couldn't be found in archived index." source="crashrecovery.go:394"
Apr 19 22:58:27 bast3002 prometheus@ops[32101]: time="2018-04-19T22:58:27Z" level=info msg="20000 files scanned." source="crashrecovery.go:77"
Apr 19 22:58:27 bast3002 prometheus@ops[32101]: time="2018-04-19T22:58:27Z" level=warning msg="Fingerprint 01a70ff1b864b3f2 assumed archived but couldn't be found in archived index." source="crashrecovery.go:394"
Apr 19 22:58:27 bast3002 prometheus@ops[32101]: time="2018-04-19T22:58:27Z" level=warning msg="Fingerprint 015085fcbc7e15fb assumed archived but couldn't be found in archived index." source="crashrecovery.go:394"
Apr 19 22:58:28 bast3002 prometheus@ops[32101]: time="2018-04-19T22:58:28Z" level=warning msg="Fingerprint 014fb20fd9b70a1b assumed archived but couldn't be found in archived index." source="crashrecovery.go:394"

... like that for a long time ...

As of this writing, the tail is at:

Apr 19 23:08:01 bast3002 prometheus@ops[32101]: time="2018-04-19T23:08:01Z" level=info msg="620000 archived metrics checked." source="crashreco
Apr 19 23:08:01 bast3002 prometheus@ops[32101]: time="2018-04-19T23:08:01Z" level=info msg="630000 archived metrics checked." source="crashreco
Apr 19 23:08:01 bast3002 prometheus@ops[32101]: time="2018-04-19T23:08:01Z" level=info msg="640000 archived metrics checked." source="crashreco
Apr 19 23:08:02 bast3002 prometheus@ops[32101]: time="2018-04-19T23:08:02Z" level=info msg="650000 archived metrics checked." source="crashreco

[also, I've downtimed all the esams-specific prometheus-based alerts in icinga for 24h now (varnish child-counting checks and pybal bgp checks)]

Crash recovery appears to have completed at about 23:10:33 and things came back online. We'll see if it remains stable. Leaving the downtimes in place to avoid more spamming of IRC.

Thanks @BBlack for taking a look! Looks like an "heavy" query from varnish-failed-fetches drove disk utilization to 100% starting at 04/19T22:17 and things snowballed from there, with the icinga spam starting at 22:30. I suspect the disk (singular, sdb isn't in the array as per T169035) was busy loading the requested timeseries.

Unpacking the issues I see so far:

  1. CRITICAL vs UNKNOWN from check_prometheus_metric when running into a timeout, I'd argue CRITICAL is more visible but the resulting icinga spam isn't really a good thing so it makes sense to switch to UNKNOWN instead.
  2. Slow shutdown, this is expected as data is checkpointed and flushed to disk, possibly longer than usual given the disk was busy.
  3. Prometheus bailing/timing out on loading many series from disk and/or for long timespans. AIUI this is better/fixed in Prometheus 2 (upgrade task is T187987)

Actionables in the short term:

  1. Timeouts in check_prometheus_metric should result in UNKNOWN
  2. Audit varnish-failed-fetches to load less time series

Change 428725 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] s/CRITICAL/UNKNOWN/ for icinga->prometheus fetch/decode errors

https://gerrit.wikimedia.org/r/428725

Change 428725 merged by BBlack:
[operations/puppet@production] s/CRITICAL/UNKNOWN/ for icinga->prometheus fetch/decode errors

https://gerrit.wikimedia.org/r/428725

We haven't seen this reoccurring afaik, also we're upgrading to Prometheus 2.6, tentatively resolving.