prometheus-openstack-exporter: collected data shows regular null intervals
Closed, Resolved (Public)

Description

See screenshot:

image.png (887×1 px, 96 KB)

This happens with other metrics generated by the same exporter.

My only hint is that currently the exporter takes a lot of time to return the data, see:

aborrero@cloudcontrol1007:~ $ curl localhost:12345/metrics -o metrics.prom
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2980k    0 2980k    0     0  45469      0 --:--:--  0:01:07 --:--:--  774k

curl took 1 minute 7 seconds to complete the request. I wonder whether the Prometheus scraper is intermittently timing out.
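
To put that timing in perspective, here is a minimal Python sketch that converts the `Time Spent` value from curl's progress meter into seconds and compares it with a scrape timeout. The 10-second figure is the Prometheus default for `scrape_timeout`; the timeout actually configured for this job is not shown in this task, so it is an assumption, and the function names are hypothetical.

```python
# Assumption: the Prometheus default scrape_timeout (10s). The value
# configured for the openstack job is not stated in this task.
ASSUMED_SCRAPE_TIMEOUT_S = 10


def curl_time_to_seconds(value: str) -> int:
    """Convert a curl progress-meter time such as '0:01:07' to seconds."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def scrape_would_time_out(duration_s: float,
                          timeout_s: float = ASSUMED_SCRAPE_TIMEOUT_S) -> bool:
    """Return True if a scrape lasting duration_s exceeds timeout_s."""
    return duration_s > timeout_s


spent = curl_time_to_seconds("0:01:07")  # 'Time Spent' from the curl run above
print(spent, scrape_would_time_out(spent))
```

If the effective timeout is anywhere below 67 seconds, a response this slow would cause the whole scrape to be dropped rather than partially recorded.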

Event Timeline

aborrero changed the task status from Open to In Progress. May 4 2023, 11:38 AM
aborrero triaged this task as Low priority.

I just discovered these two earlier changes: https://gerrit.wikimedia.org/r/c/operations/puppet/+/802434 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/802956

As of this writing the scrape interval is set to 15m. Maybe the gaps in the data come directly from the long scrape interval.
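
A rough way to check this hypothesis: Prometheus only returns a sample for a series if one exists within the query lookback window, which defaults to 5 minutes (`--query.lookback-delta`). The Python sketch below assumes that default is in effect here (not confirmed in this task) and estimates how long each scrape cycle would show no data. The function names are hypothetical.

```python
# Assumption: Prometheus default --query.lookback-delta of 5 minutes.
LOOKBACK_DELTA_S = 300


def gap_per_cycle(scrape_interval_s: int,
                  lookback_s: int = LOOKBACK_DELTA_S) -> int:
    """Seconds per scrape cycle in which the last sample is already stale."""
    return max(0, scrape_interval_s - lookback_s)


def gap_fraction(scrape_interval_s: int,
                 lookback_s: int = LOOKBACK_DELTA_S) -> float:
    """Fraction of each scrape cycle with no visible sample."""
    return gap_per_cycle(scrape_interval_s, lookback_s) / scrape_interval_s


for label, interval in (("15m", 15 * 60), ("4m", 4 * 60)):
    print(label, gap_per_cycle(interval), round(gap_fraction(interval), 2))
```

Under that assumption, a 15m interval leaves roughly 10 minutes of every cycle without data, while a 4m interval leaves none, which would match regularly spaced null intervals in the graph.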

Change 915385 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] Revert "Revert "P:wmcs::prometheus: set openstack scrape_interval to 4m""

https://gerrit.wikimedia.org/r/915385

Change 915385 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] Revert "Revert "P:wmcs::prometheus: set openstack scrape_interval to 4m""

https://gerrit.wikimedia.org/r/915385

The data shows no gap after merging the patch:

image.png (909×2 px, 94 KB)

@Andrew, please be aware of this change, as it might impact OpenStack stability. I don't think you need to do anything at the moment; just keep an eye out for any signs of instability.

I am pretty sure that the openstack APIs have gotten much slower... horizon times out for me now and then. Is it possible to reduce the number of metrics gathered rather than the frequency of checks?

Ok, I'll revert while we find a solution.

Change 915713 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] Revert "Revert "Revert "P:wmcs::prometheus: set openstack scrape_interval to 4m"""

https://gerrit.wikimedia.org/r/915713

Change 915713 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] Revert "Revert "Revert "P:wmcs::prometheus: set openstack scrape_interval to 4m"""

https://gerrit.wikimedia.org/r/915713

What about increasing the number of API HTTP workers for the busiest endpoints? @Andrew, what do you think is the bottleneck here?

aborrero added a project: User-aborrero.

Hey @Andrew, do you think we are ready to experiment with increasing the frequency again, given that T336379: Openstack API slowdowns is now completed?

WMCS meeting: get a clear view of API performance before turning this on again, for example by looking at the HAProxy backend response time.

Change 933451 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] wmcs: prometheus: increase scrape frequency for openstack APIs

https://gerrit.wikimedia.org/r/933451

Change 933451 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] wmcs: prometheus: increase scrape frequency for openstack APIs

https://gerrit.wikimedia.org/r/933451

Change 933477 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/alerts@master] team-wmcs: add openstack_apis_response.yaml

https://gerrit.wikimedia.org/r/933477

Change 933477 merged by Arturo Borrero Gonzalez:

[operations/alerts@master] team-wmcs: add openstack_apis_response.yaml

https://gerrit.wikimedia.org/r/933477