
prometheus: current query limits are insufficient to prevent OOMs
Closed, Resolved · Public

Description

(part of https://wikitech.wikimedia.org/wiki/Incident_documentation/20190425-prometheus)

It's possible to nearly OOM the eqiad ops Prometheus hosts just by loading a long enough history on certain Grafana dashboards.

This is despite the fact that we're using the default settings for query.timeout, query.max-concurrency, and query.max-samples, which should be more than sufficient given a 94G server (some explanation at https://www.robustperception.io/limiting-promql-resource-usage).

Probably the first thing to try is cutting query.max-samples to something like a third of its current value?
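For reference, a rough sketch of the limits in play and the worst-case math (flag defaults as of Prometheus 2.x; the per-sample memory figure is an assumption taken from the robustperception post above, not something measured here):

    # Relevant prometheus flags and their upstream defaults (illustrative, not our exact config)
    --query.timeout=2m             # a single query is aborted after 2 minutes
    --query.max-concurrency=20     # at most 20 queries evaluate concurrently
    --query.max-samples=50000000   # at most 50M samples loaded into memory per query

    # Worst case, assuming roughly 16 bytes per sample held in memory during evaluation:
    #   50,000,000 samples * 16 B * 20 concurrent queries ~= 16 GiB
    # which is why the defaults look comfortable on a 94G host; cutting
    # query.max-samples to a third (or to 10M, as below) shrinks that bound proportionally.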

Details

Related Gerrit Patches:
[operations/puppet@production] prometheus: 10M max-samples for all instances

Event Timeline

CDanis created this task. · Apr 29 2019, 8:19 PM
Restricted Application added a subscriber: Aklapper. · Apr 29 2019, 8:19 PM

Change 507210 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] prometheus: 10M max-samples for all instances

https://gerrit.wikimedia.org/r/507210
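In effect (a sketch of the end state, not the literal puppet diff; the flag is templated via the prometheus::server puppetization rather than set by hand), the change lowers the per-query sample ceiling on each instance's command line:

    # Hypothetical resulting invocation on a prometheus host after the change
    # (config path and other flags illustrative and unchanged):
    /usr/bin/prometheus \
        --config.file=... \
        --query.max-samples=10000000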

Mentioned in SAL (#wikimedia-operations) [2019-04-30T12:32:41Z] <cdanis> cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin 'R:prometheus::server' 'disable-puppet "staged rollout T222105 by cdanis"'

Change 507210 merged by CDanis:
[operations/puppet@production] prometheus: 10M max-samples for all instances

https://gerrit.wikimedia.org/r/507210

Mentioned in SAL (#wikimedia-operations) [2019-04-30T12:39:05Z] <cdanis> cdanis@prometheus1004.eqiad.wmnet ~ % sudo run-puppet-agent --enable "staged rollout T222105 by cdanis"

Mentioned in SAL (#wikimedia-operations) [2019-04-30T12:47:01Z] <cdanis> cdanis@prometheus1003.eqiad.wmnet ~ % sudo run-puppet-agent --enable "staged rollout T222105 by cdanis"

Mentioned in SAL (#wikimedia-operations) [2019-04-30T13:15:09Z] <cdanis> cdanis@prometheus1003.eqiad.wmnet ~ % sudo disable-puppet 'cdanis testing original query.max-samples T222105'

Mentioned in SAL (#wikimedia-operations) [2019-04-30T14:15:16Z] <cdanis> cdanis@prometheus1003.eqiad.wmnet ~ % sudo enable-puppet 'cdanis testing original query.max-samples T222105'

Mentioned in SAL (#wikimedia-operations) [2019-04-30T14:17:04Z] <cdanis> cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin 'prometheus2003*' 'run-puppet-agent --enable "staged rollout T222105 by cdanis"'

Mentioned in SAL (#wikimedia-operations) [2019-04-30T14:24:30Z] <cdanis> cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin 'prometheus2004*' 'run-puppet-agent --enable "staged rollout T222105 by cdanis"'

Mentioned in SAL (#wikimedia-operations) [2019-04-30T14:43:25Z] <cdanis> cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin 'bast5001*' 'run-puppet-agent --enable "staged rollout T222105 by cdanis"'

Mentioned in SAL (#wikimedia-operations) [2019-04-30T14:49:18Z] <cdanis> cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin 'labmon1001*' 'run-puppet-agent --enable "staged rollout T222105 by cdanis"'
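The rollout above follows the usual stage-and-verify pattern: freeze puppet across the whole prometheus::server fleet, then re-enable and run the agent one host at a time, checking each instance before moving on. Condensed from the SAL entries:

    # 1. Freeze puppet everywhere so the merged change doesn't apply fleet-wide at once
    sudo cumin 'R:prometheus::server' 'disable-puppet "staged rollout T222105 by cdanis"'
    # 2. Re-enable and apply per host, verifying each instance before the next
    sudo cumin 'prometheus2003*' 'run-puppet-agent --enable "staged rollout T222105 by cdanis"'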

CDanis closed this task as Resolved. · Apr 30 2019, 3:08 PM
CDanis claimed this task.

As documented in T222112#5147131, this didn't actually fix the dashboard at fault in this particular incident, but I've heard from another large-scale Prometheus user (and Prometheus dev) that they've had similar problems and recommend 10M as a value for query.max-samples.