
prometheus: current query limits are insufficient to prevent OOMs
Closed, Resolved · Public

Description

(part of https://wikitech.wikimedia.org/wiki/Incident_documentation/20190425-prometheus)

It's possible to nearly OOM the eqiad ops Prometheus instances just by loading a long enough history view of certain Grafana dashboards.

This is despite the fact that we're using the default settings for query.timeout, query.max-concurrency, and query.max-samples, which should be more than sufficient on a 94G server (some explanation at https://www.robustperception.io/limiting-promql-resource-usage).
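For reference, a sketch of the upstream defaults behind those three flags, plus the back-of-the-envelope memory math; the ~16 bytes per loaded sample figure is the rough estimate from the linked article, not something measured on our hosts:

  # Upstream Prometheus 2.x defaults for the query-limiting flags
  # (illustrative command line, not our actual unit file):
  prometheus \
    --query.timeout=2m \
    --query.max-concurrency=20 \
    --query.max-samples=50000000

  # Rough worst case for samples held in memory by the query engine:
  #   20 concurrent queries * 50,000,000 samples * ~16 bytes/sample ≈ 16 GiB
  # which is why the defaults look comfortable on a 94G server, on paper at least.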

Probably the first thing to try is cutting query.max-samples to something like a third of its current value?
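For concreteness, this is roughly what that amounts to at the flag level (a third of the 50M upstream default would be ~16.7M; the patch below rounds down to 10M). Illustrative only, not the actual Puppet diff:

  # before (upstream default)
  --query.max-samples=50000000
  # after
  --query.max-samples=10000000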

Event Timeline

Change 507210 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] prometheus: 10M max-samples for all instances

https://gerrit.wikimedia.org/r/507210

Mentioned in SAL (#wikimedia-operations) [2019-04-30T12:32:41Z] <cdanis> cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin 'R:prometheus::server' 'disable-puppet "staged rollout T222105 by cdanis"'

Change 507210 merged by CDanis:
[operations/puppet@production] prometheus: 10M max-samples for all instances

https://gerrit.wikimedia.org/r/507210

Mentioned in SAL (#wikimedia-operations) [2019-04-30T12:39:05Z] <cdanis> cdanis@prometheus1004.eqiad.wmnet ~ % sudo run-puppet-agent --enable "staged rollout T222105 by cdanis"

Mentioned in SAL (#wikimedia-operations) [2019-04-30T12:47:01Z] <cdanis> cdanis@prometheus1003.eqiad.wmnet ~ % sudo run-puppet-agent --enable "staged rollout T222105 by cdanis"

Mentioned in SAL (#wikimedia-operations) [2019-04-30T13:15:09Z] <cdanis> cdanis@prometheus1003.eqiad.wmnet ~ % sudo disable-puppet 'cdanis testing original query.max-samples T222105'

Mentioned in SAL (#wikimedia-operations) [2019-04-30T14:15:16Z] <cdanis> cdanis@prometheus1003.eqiad.wmnet ~ % sudo enable-puppet 'cdanis testing original query.max-samples T222105'

Mentioned in SAL (#wikimedia-operations) [2019-04-30T14:17:04Z] <cdanis> cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin 'prometheus2003*' 'run-puppet-agent --enable "staged rollout T222105 by cdanis"'

Mentioned in SAL (#wikimedia-operations) [2019-04-30T14:24:30Z] <cdanis> cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin 'prometheus2004*' 'run-puppet-agent --enable "staged rollout T222105 by cdanis"'

Mentioned in SAL (#wikimedia-operations) [2019-04-30T14:43:25Z] <cdanis> cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin 'bast5001*' 'run-puppet-agent --enable "staged rollout T222105 by cdanis"'

Mentioned in SAL (#wikimedia-operations) [2019-04-30T14:49:18Z] <cdanis> cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin 'labmon1001*' 'run-puppet-agent --enable "staged rollout T222105 by cdanis"'
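One way to confirm the new limit actually took effect on a host after its puppet run is to read the runtime flags back from the API. The host, port, and path below are placeholders; the multi-instance setup exposes each instance on its own port/path:

  curl -s http://localhost:9090/api/v1/status/flags \
    | jq -r '.data."query.max-samples"'
  # expected output after the rollout: 10000000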

CDanis claimed this task.

As documented in T222112#5147131, this didn't actually fix the dashboard at fault in this particular incident, but I've heard from another large-scale Prometheus user (and Prometheus dev) that they've hit similar problems and recommend 10M as a value.
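A quick way to sanity-check the behaviour change is to fire a deliberately heavy range query at an instance: with the lower limit in place, Prometheus should reject it with an error along the lines of "query processing would load too many samples into memory" rather than growing until the kernel OOM killer steps in. The query, host, and time range here are placeholders:

  curl -sG 'http://localhost:9090/api/v1/query_range' \
    --data-urlencode 'query=sum by (instance) (rate(node_cpu_seconds_total[5m]))' \
    --data-urlencode 'start=2019-03-01T00:00:00Z' \
    --data-urlencode 'end=2019-04-30T00:00:00Z' \
    --data-urlencode 'step=600s'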