Review of log level settings for prometheus-blackbox-exporter and thanos-query
Closed, Resolved, Public

Description

Problem:

The storage drives on our centrallog instances are filling up rapidly, primarily due to the volume of debug-level logs generated by prometheus-blackbox-exporter, thanos-fe-query, and thanos-query.

This behavior is expected because it's the level assigned in the code for:

Proposed Solutions:

  1. Reduce the log level to error.
  2. Remove logs older than 30 days.

Additionally, I think that option 2 can be implemented with two conditions in mind: the free space left on the instance and the age of the logs.
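To make option 2 concrete, here is a minimal Python sketch of a cleanup that checks both conditions. It is only an illustration, not the mechanism actually used on the centrallog hosts: the function names, the 20% free-space threshold, and the choice to delete only when both conditions hold (low free space and files older than the cutoff) are all assumptions made for this example.

```python
import shutil
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def find_expired_logs(log_dir, max_age_days, now=None):
    """Return files under log_dir whose mtime is older than max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    return sorted(
        p for p in Path(log_dir).rglob("*")
        if p.is_file() and p.stat().st_mtime < cutoff
    )


def prune_logs(log_dir, max_age_days=30, min_free_fraction=0.2,
               now=None, free_fraction=None):
    """Delete expired logs, but only when free space is below the threshold.

    Hypothetical policy: both conditions must hold. free_fraction can be
    passed explicitly (e.g. for testing); otherwise it is measured on the
    filesystem that holds log_dir.
    """
    if free_fraction is None:
        usage = shutil.disk_usage(log_dir)
        free_fraction = usage.free / usage.total
    if free_fraction >= min_free_fraction:
        return []  # enough space left, keep everything
    removed = []
    for path in find_expired_logs(log_dir, max_age_days, now):
        path.unlink()
        removed.append(path)
    return removed
```

If the intent is instead to delete whenever either condition is met, the early return on free space would be dropped and the age check applied unconditionally.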

Your input on this matter would be greatly appreciated.

Event Timeline

Thank you @andrea.denisse for filing the task! I'm thinking of reverting the thanos debug logging in T356788: thanos-query probedown due to OOM of both eqiad titan frontends since we have a better idea of problematic queries now. The blackbox-exporter logs, though, will need to stay at debug level, since they are used for debugging the alerts themselves (e.g. ProbeDown has a link to the blackbox-exporter logs in logstash).

The revert of the thanos debug logging will be done with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1007332, after which we should be back to normal space utilization (once the retention period has passed, that is).