
Effects on adjusting Prometheus retention
Closed, Resolved · Public

Description

Prometheus in codfw was running low on disk space, so I adjusted the retention period in https://gerrit.wikimedia.org/r/#/c/342810/. However, the additional I/O bandwidth required for the resulting maintenance exceeded what the spinning disks in ganeti codfw could provide.

Separately, and for reasons not yet understood, Prometheus in esams was found after a restart with one of its LevelDB databases corrupted.

  • I've band-aided prometheus2001 by giving it more memory for chunks and applying other tuning; this still needs to be puppetized.
  • The esams Prometheus had accumulated a large number of metrics over time, likely due to the varnish/VCL metric churn documented at T150479, and was taking a long time to recover. I've moved its metrics aside and started fresh.
  • The alerts on Prometheus machines should catch these conditions ("rushed mode").
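
As a rough illustration of the disk space versus retention trade-off described above, the sketch below estimates on-disk footprint as retention × ingestion rate × bytes per sample. This is a general rule of thumb, not the calculation used for these hosts: the ingestion rate and bytes-per-sample figures are hypothetical placeholders, and the helper names are made up for this example. The `as_hours` helper assumes the retention setting is given as a duration whose largest unit is hours.

```python
from datetime import timedelta


def estimate_disk_bytes(retention: timedelta,
                        samples_per_second: float,
                        bytes_per_sample: float) -> float:
    """Rough on-disk footprint for a given retention period.

    All inputs are assumptions supplied by the caller; real usage
    depends on chunk encoding, series churn and compression.
    """
    return retention.total_seconds() * samples_per_second * bytes_per_sample


def as_hours(retention: timedelta) -> str:
    """Express a retention period as a whole number of hours, e.g. '4032h'."""
    return f"{int(retention.total_seconds() // 3600)}h"


if __name__ == "__main__":
    # Hypothetical workload figures, NOT measured on any Prometheus host.
    samples_per_second = 50_000.0
    bytes_per_sample = 2.0

    for label, retention in [("12 weeks", timedelta(weeks=12)),
                             ("24 weeks", timedelta(weeks=24))]:
        size = estimate_disk_bytes(retention, samples_per_second, bytes_per_sample)
        print(f"{label} ({as_hours(retention)}): ~{size / 1e9:.0f} GB")
```

Because the estimate is linear in the retention period, doubling retention doubles the projected footprint, which is why a retention change is worth checking against free disk before it is merged.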

Event Timeline

Restricted Application added a subscriber: Aklapper. · Mar 16 2017, 6:21 PM
fgiunchedi updated the task description. · Mar 16 2017, 7:01 PM
fgiunchedi moved this task from Backlog to Doing on the User-fgiunchedi board. · Mar 23 2017, 11:48 AM
fgiunchedi triaged this task as Normal priority. · Apr 10 2017, 1:01 PM
fgiunchedi moved this task from Doing to Backlog on the User-fgiunchedi board. · May 8 2017, 2:11 PM
fgiunchedi moved this task from Backlog to Doing on the User-fgiunchedi board. · Jan 16 2018, 11:21 AM

Change 404434 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: bump global retention to 15 months

https://gerrit.wikimedia.org/r/404434

Change 404434 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: bump global retention to 15 months

https://gerrit.wikimedia.org/r/404434

Mentioned in SAL (#wikimedia-operations) [2018-01-31T14:37:13Z] <godog> bump prometheus global instance retention to 15 months - T160677

Change 407011 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: default to storage encoding version 2

https://gerrit.wikimedia.org/r/407011

Change 407011 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: default to storage encoding version 2

https://gerrit.wikimedia.org/r/407011

Change 412660 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: default retention to 24 weeks

https://gerrit.wikimedia.org/r/412660

Change 412660 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: default retention to 24 weeks

https://gerrit.wikimedia.org/r/412660

fgiunchedi closed this task as Resolved. · Feb 20 2018, 10:28 AM
fgiunchedi claimed this task.

I bumped the minimum retention period to six months for all instances and have observed no adverse effects so far. I'm tentatively resolving this task, as the behaviour described hasn't reoccurred in recent Prometheus versions.