
Effects on adjusting Prometheus retention
Closed, Resolved (Public)

Description

Prometheus in codfw was running low on disk space, so I adjusted the retention period in https://gerrit.wikimedia.org/r/#/c/342810/. However, the additional I/O bandwidth required for maintenance exceeded what was available from the spinning disks in ganeti codfw.

Also, for reasons not yet known, Prometheus in esams was found upon restart with one of its LevelDB databases corrupted.

  • I've band-aided prometheus2001 by giving it more memory for chunks and applying other tuning; this needs to be puppetized as well
  • esams Prometheus had accumulated a large number of metrics over time, likely due to the varnish/vcl metric churn documented in T150479, and was taking a long time to recover, so I've moved its metrics aside and started fresh
  • The alerts on Prometheus machines should catch these conditions ("rushed mode")
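
As an illustration of the "rushed mode" check mentioned above, here is a minimal sketch that scans Prometheus text exposition output for a rushed-mode gauge. The metric name `prometheus_local_storage_rushed_mode` is an assumption based on the Prometheus 1.x local storage metrics, and the function name `rushed_mode` is hypothetical. This is not the alerting rule actually deployed on these hosts.

```python
def rushed_mode(metrics_text, name="prometheus_local_storage_rushed_mode"):
    """Return True/False if the gauge is present in the text, else None.

    `name` is an assumed metric name (Prometheus 1.x local storage);
    adjust it to whatever the running version actually exports.
    """
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # Sample lines look like: <name>[{labels}] <value> [timestamp].
        # This simple split assumes no spaces inside label values,
        # which holds for a label-less gauge like the one checked here.
        parts = line.split()
        metric = parts[0].split("{", 1)[0]
        if metric == name and len(parts) >= 2:
            return float(parts[1]) == 1.0
    return None


sample = (
    "# TYPE prometheus_local_storage_rushed_mode gauge\n"
    "prometheus_local_storage_rushed_mode 1\n"
)
print(rushed_mode(sample))
```

In practice the text would come from the server's own `/metrics` endpoint; fetching it is left out so the sketch stays self-contained.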

Event Timeline

fgiunchedi triaged this task as Medium priority. Apr 10 2017, 1:01 PM

Change 404434 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: bump global retention to 15 months

https://gerrit.wikimedia.org/r/404434

Change 404434 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: bump global retention to 15 months

https://gerrit.wikimedia.org/r/404434

Mentioned in SAL (#wikimedia-operations) [2018-01-31T14:37:13Z] <godog> bump prometheus global instance retention to 15 months - T160677

Change 407011 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: default to storage encoding version 2

https://gerrit.wikimedia.org/r/407011

Change 407011 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: default to storage encoding version 2

https://gerrit.wikimedia.org/r/407011

Change 412660 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: default retention to 24 weeks

https://gerrit.wikimedia.org/r/412660

Change 412660 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: default retention to 24 weeks

https://gerrit.wikimedia.org/r/412660

fgiunchedi claimed this task.

I bumped the minimum retention period to six months for all instances, and no adverse effects have been observed so far. I'm tentatively resolving this task, as the behaviour described hasn't recurred in recent Prometheus versions.
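
For reference, a small sketch of how the 24-week default from change 412660 compares with six months. The helper name `weeks_to_hours` is hypothetical, and treating "six months" as half of a 365-day year is an assumption made only for this comparison.

```python
HOURS_PER_DAY = 24
DAYS_PER_WEEK = 7


def weeks_to_hours(weeks):
    """Convert a retention period in weeks to hours."""
    return weeks * DAYS_PER_WEEK * HOURS_PER_DAY


retention_hours = weeks_to_hours(24)               # 24-week default
retention_days = retention_hours // HOURS_PER_DAY  # same period in days
half_year_days = 365 / 2                           # assumed "six months"

print(retention_hours, retention_days, half_year_days)
```

Under these assumptions, 24 weeks is 168 days, about two weeks short of half a calendar year.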