
[tools] prometheus out of space
Closed, Resolved · Public

Description


We are getting a lot of errors from cron jobs due to Prometheus being out of space:

Date: Tue, 13 Apr 2021 07:50:01 +0000
From: Cron Daemon <root@tools.wmflabs.org>
To: prometheus@tools.wmflabs.org
Subject: Cron <prometheus@tools-prometheus-03> /usr/local/bin/prometheus-labs-targets --port 22 > /srv/prometheus/tools/targets/ssh_banner.$$ && mv /srv/prometheus/tools/targets/ssh_banner.$$ /srv/prometheus/tools/targets/ssh_banner.yml

/bin/sh: 1: cannot create /srv/prometheus/tools/targets/ssh_banner.21306: No space left on device

This task is to apply a first fix and, if needed, identify and create follow-up tasks for a more permanent fix and/or for preventing the issue from recurring.
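
A quick way to confirm the situation and see where the space is going (paths taken from the error above; this is a sketch of the checks, not actual output from the host):

# Overall usage of the filesystem holding the Prometheus data
df -h /srv
# Per-directory breakdown under the Prometheus data directory
du -hs /srv/prometheus/tools/metrics/* | sort -h | tail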

Event Timeline

dcaro triaged this task as High priority. Apr 13 2021, 8:15 AM
dcaro created this task.

This is the first entry in the Prometheus logs complaining about disk space:

Apr 12 03:02:24 tools-prometheus-03 prometheus@tools[621]: level=warn ts=2021-04-12T03:02:24.98240282Z caller=scrape.go:835 component="scrape manager" scrape_pool=node target=http://tools-redis-1003:9100/metrics msg="append failed" err="write to WAL: log samples: write /srv/prometheus/tools/metrics/wal/00079683: no space left on device"
Apr 12 03:02:25 tools-prometheus-03 prometheus@tools[621]: panic: runtime error: slice bounds out of range
Apr 12 03:02:25 tools-prometheus-03 prometheus@tools[621]: goroutine 13354038 [running]:
Apr 12 03:02:25 tools-prometheus-03 prometheus@tools[621]: github.com/prometheus/tsdb/wal.(*WAL).log(0xc0003086c0, 0xc1acb3e000, 0x43, 0x3c000, 0xc083dde001, 0xc1acb3e000, 0x43)
Apr 12 03:02:25 tools-prometheus-03 prometheus@tools[621]: #011github.com/prometheus/tsdb/wal/wal.go:502 +0x406
Apr 12 03:02:25 tools-prometheus-03 prometheus@tools[621]: github.com/prometheus/tsdb/wal.(*WAL).Log(0xc0003086c0, 0xc296aa37a0, 0x1, 0x1, 0x0, 0x0)
Apr 12 03:02:25 tools-prometheus-03 prometheus@tools[621]: #011github.com/prometheus/tsdb/wal/wal.go:471 +0xd4
Apr 12 03:02:25 tools-prometheus-03 prometheus@tools[621]: github.com/prometheus/tsdb.(*headAppender).log(0xc14f9bcbe0, 0x0, 0x0)
Apr 12 03:02:25 tools-prometheus-03 prometheus@tools[621]: #011github.com/prometheus/tsdb/head.go:785 +0x1c0
Apr 12 03:02:25 tools-prometheus-03 prometheus@tools[621]: github.com/prometheus/tsdb.(*headAppender).Commit(0xc14f9bcbe0, 0x0, 0x0)
Apr 12 03:02:25 tools-prometheus-03 prometheus@tools[621]: #011github.com/prometheus/tsdb/head.go:796 +0xd4
Apr 12 03:02:25 tools-prometheus-03 prometheus@tools[621]: github.com/prometheus/tsdb.dbAppender.Commit(0x2029600, 0xc14f9bcbe0, 0xc0005a9ba0, 0x20000c296aa3900, 0xffffffffffffffff)
Apr 12 03:02:25 tools-prometheus-03 prometheus@tools[621]: #011github.com/prometheus/tsdb/db.go:369 +0x35
Apr 12 03:02:25 tools-prometheus-03 prometheus@tools[621]: github.com/prometheus/prometheus/storage/tsdb.appender.Commit(0x2029b80, 0xc259b5c820, 0xc2196bf8c0, 0x3194290)
Apr 12 03:02:25 tools-prometheus-03 prometheus@tools[621]: #011github.com/prometheus/prometheus/storage/tsdb/tsdb.go:306 +0x31
Apr 12 03:02:25 tools-prometheus-03 prometheus@tools[621]: github.com/prometheus/prometheus/storage.(*fanoutAppender).Commit(0xc10f5cfec0, 0x0, 0xffffffffffffffff)
Apr 12 03:02:25 tools-prometheus-03 prometheus@tools[621]: #011github.com/prometheus/prometheus/storage/fanout.go:163 +0x49
Apr 12 03:02:25 tools-prometheus-03 prometheus@tools[621]: github.com/prometheus/prometheus/scrape.(*scrapeLoop).append(0xc107345000, 0xc205c6c000, 0x2c6, 0x3e8, 0xc0a78f7aa0, 0x28, 0xc0150bf0473166a2, 0xd95ba087c581f, 0x31a4000, 0x0, ...)
Apr 12 03:02:25 tools-prometheus-03 prometheus@tools[621]: #011github.com/prometheus/prometheus/scrape/scrape.go:1113 +0xa62
Apr 12 03:02:25 tools-prometheus-03 prometheus@tools[621]: github.com/prometheus/prometheus/scrape.(*scrapeLoop).run(0xc107345000, 0xdf8475800, 0x2540be400, 0x0)
Apr 12 03:02:25 tools-prometheus-03 prometheus@tools[621]: #011github.com/prometheus/prometheus/scrape/scrape.go:833 +0x592
Apr 12 03:02:25 tools-prometheus-03 prometheus@tools[621]: created by github.com/prometheus/prometheus/scrape.(*scrapePool).sync
Apr 12 03:02:25 tools-prometheus-03 prometheus@tools[621]: #011github.com/prometheus/prometheus/scrape/scrape.go:336 +0x45d

Before that, there are many entries like:

Apr 12 03:02:24 tools-prometheus-03 prometheus@tools[621]: level=warn ts=2021-04-12T03:02:24.2560459Z caller=scrape.go:1088 component="scrape manager" scrape_pool=new-k8s-cadvisor target=https://k8s.tools.eqiad1.wikimedia.cloud:6443/api/v1/namespaces/metrics/pods/cadvisor-q762b/proxy/metrics msg="Error on ingesting out-of-order samples" num_dropped=193

Most of them are from that same pool/target. That seems to be what made the WAL grow; I'm looking into what causes it.
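
To see which targets produce those errors, something along these lines can be used (the unit name is taken from the log prefix above; a sketch):

# Count out-of-order ingestion warnings per scrape target
journalctl -u prometheus@tools --since "2021-04-09" \
    | grep 'Error on ingesting out-of-order samples' \
    | grep -o 'target=[^ ]*' \
    | sort | uniq -c | sort -rn | head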

The oldest of the Prometheus logs already has that error entry:

Apr  9 00:01:10 tools-prometheus-03 prometheus@tools[621]: level=warn ts=2021-04-09T00:01:10.192386627Z caller=scrape.go:1088 component="scrape manager" scrape_pool=new-k8s-cadvisor target=https://k8s.tools.eqiad1.wikimedia.cloud:6443/api/v1/namespaces/metrics/pods/cadvisor-kl5vd/proxy/metrics msg="Error on ingesting out-of-order samples" num_dropped=193

It seems very similar to https://github.com/prometheus/prometheus/issues/3487; I will try removing the *.tmp directory.
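
For the record, the workaround from that issue boils down to stopping the service, removing the leftover temporary compaction directories, and starting it again (unit name and data path taken from the logs above; a sketch, not the exact commands run):

systemctl stop prometheus@tools
# *.tmp directories are leftovers from interrupted compactions and are safe to remove
rm -rf /srv/prometheus/tools/metrics/*.tmp
systemctl start prometheus@tools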

Ok, seems promising:

Apr 13 08:31:10 tools-prometheus-03 prometheus@tools[3121]: level=warn ts=2021-04-13T08:31:10.012202798Z caller=head.go:488 component=tsdb msg="encountered WAL error, attempting repair" err="read records: corruption in segment /srv/prometheus/tools/metrics/wal/00078200 at 133791744: unexpected non-zero byte in padded page"
Apr 13 08:31:10 tools-prometheus-03 prometheus@tools[3121]: level=warn ts=2021-04-13T08:31:10.013505853Z caller=wal.go:291 component=tsdb msg="starting corruption repair" segment=78200 offset=133791744
Apr 13 08:31:10 tools-prometheus-03 prometheus@tools[3121]: level=warn ts=2021-04-13T08:31:10.015384337Z caller=wal.go:299 component=tsdb msg="deleting all segments behind corruption" segment=78200
Apr 13 08:31:12 tools-prometheus-03 prometheus@tools[3121]: level=warn ts=2021-04-13T08:31:12.983490871Z caller=wal.go:321 component=tsdb msg="rewrite corrupted segment" segment=78200
Apr 13 08:31:17 tools-prometheus-03 prometheus@tools[3121]: level=info ts=2021-04-13T08:31:17.903618654Z caller=main.go:638 msg="TSDB started"
Apr 13 08:31:17 tools-prometheus-03 prometheus@tools[3121]: level=info ts=2021-04-13T08:31:17.903710893Z caller=main.go:698 msg="Loading configuration file" filename=/srv/prometheus/tools/prometheus.yml
Apr 13 08:31:17 tools-prometheus-03 prometheus@tools[3121]: level=info ts=2021-04-13T08:31:17.926651351Z caller=main.go:725 msg="Completed loading of configuration file" filename=/srv/prometheus/tools/prometheus.yml
Apr 13 08:31:17 tools-prometheus-03 prometheus@tools[3121]: level=info ts=2021-04-13T08:31:17.926691502Z caller=main.go:592 msg="Server is ready to receive web requests."
Apr 13 08:31:26 tools-prometheus-03 prometheus@tools[3121]: level=info ts=2021-04-13T08:31:26.264035986Z caller=head.go:526 component=tsdb msg="head GC completed" duration=45.742775ms
Apr 13 08:31:36 tools-prometheus-03 prometheus@tools[3121]: level=info ts=2021-04-13T08:31:36.532913118Z caller=head.go:573 component=tsdb msg="WAL checkpoint complete" first=78186 last=78190 duration=10.268730132s

Yep:

root@tools-prometheus-03:/srv/prometheus/tools/metrics# du -hs wal/
467M    wal/

root@tools-prometheus-03:/srv/prometheus/tools/metrics# df -h /srv
Filesystem                          Size  Used Avail Use% Mounted on
/dev/mapper/vd-second--local--disk  276G   76G  187G  29% /srv

Will keep an eye on it for a bit, but that seems to have done the trick.

The errors about out-of-order samples keep happening, and the WAL is steadily growing:

root@tools-prometheus-03:/srv/prometheus/tools/metrics# du -hs wal/
1.3G    wal/
root@tools-prometheus-03:/srv/prometheus/tools/metrics# du -hs wal/
1.4G    wal/
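
To keep an eye on the growth rate from the shell (a sketch):

# Re-check the WAL size every 5 minutes
watch -n 300 'du -hs /srv/prometheus/tools/metrics/wal/'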

It seems we have lost ~1 day of metrics.

I've added a new Grafana dashboard with details on Prometheus metrics (imported from the official dashboard); I will tweak it and add/remove panels.

https://grafana-labs.wikimedia.org/d/tools-prometheus-details/tools-prometheus-detailed-dashboard?orgId=1&refresh=30s&from=now-2d&to=now

The top ten submitters, i.e. the metric names with the most time series (see the query sketch after the table):

Element     Value
nginx_ingress_controller_request_duration_seconds_bucket{}    144588
nginx_ingress_controller_response_size_bucket{}    144588
node_systemd_unit_state{}    139465
nginx_ingress_controller_response_duration_seconds_bucket{}    139032
nginx_ingress_controller_request_size_bucket{}    132539
nginx_ingress_controller_bytes_sent_bucket{}    96392
kube_resourcequota{}    63580
node_systemd_unit_start_time_seconds{}    27893
container_network_tcp_usage_total{}    27115
kube_limitrange{}    23120
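
A query along these lines produces that kind of per-metric series count (a sketch; the exact form and any label filters are assumptions):

# Top 10 metric names by number of time series
topk(10, count by (__name__)({__name__=~".+"}))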

We can see clearly in the graph of used space that the issue started around the 5th of April and continuously drained the space over 6 days, so at that rate we can expect to be out of space again around the 20th of April:

https://grafana-labs.wikimedia.org/goto/IeWmH0XMk

Mentioned in SAL (#wikimedia-cloud) [2021-04-16T14:38:42Z] <dcaro> added 'will get out of space in X days' panel to the dashboard https://grafana-labs.wikimedia.org/goto/kBlGd0uGk (T279990), we got <5 days xd
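
The 'days until out of space' panel is presumably built on deriv/predict_linear over the node_exporter filesystem metrics; a sketch of such a query (metric and label names are assumptions, not copied from the actual panel):

# Rough "days until /srv is full", extrapolating the last 24h consumption rate
node_filesystem_avail_bytes{instance=~"tools-prometheus-03.*", mountpoint="/srv"}
  / -deriv(node_filesystem_avail_bytes{instance=~"tools-prometheus-03.*", mountpoint="/srv"}[24h])
  / 86400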

Also added the estimated max size of the WAL directory to the dashboard; it's currently ~64GB, which is also the directory's actual size on disk:

# du -hs /srv/prometheus/tools/metrics/wal/
64G     /srv/prometheus/tools/metrics/wal/

So let's see on Monday whether the disk usage stabilizes or continues growing :crossingfingers:

Top 10 metrics before the 5th of April:

node_systemd_unit_state{}       140480
nginx_ingress_controller_response_size_bucket{} 136740
nginx_ingress_controller_request_duration_seconds_bucket{}      136740
nginx_ingress_controller_response_duration_seconds_bucket{}     131508
nginx_ingress_controller_request_size_bucket{}  125345
nginx_ingress_controller_bytes_sent_bucket{}    91160
kube_resourcequota{}    63404
node_systemd_unit_start_time_seconds{}  28096
container_network_tcp_usage_total{}     26994
kube_limitrange{}       23056

That is very similar. The next step is to try to see whether anyone is creating metrics with numbered keys, like metric_1, metric_2 and so on.
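
One way to check for that from the Prometheus expression browser (the name pattern is an assumption about what such metrics would look like; a sketch):

# Series whose metric name ends in a number, counted per name
count by (__name__)({__name__=~".*_[0-9]+"})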

Mentioned in SAL (#wikimedia-cloud) [2021-04-20T10:18:16Z] <dcaro> setting the retention on the tools-prometheus VMs to 250GB (they have 276GB total, leaving some space for online data operations if needed) (T279990)
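
For reference, that size-based retention maps to Prometheus's --storage.tsdb.retention.size flag (available since Prometheus 2.7; how it is actually wired up on these VMs is not shown here, so this is only a sketch of the resulting invocation):

# Cap the on-disk TSDB at 250GB, leaving headroom on the 276GB volume
prometheus --storage.tsdb.path=/srv/prometheus/tools/metrics \
           --storage.tsdb.retention.size=250GB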

Storage usage has been stable for a couple of days now; declaring this resolved. Will reopen if it happens again.