
Prometheus: queries matching on {__name__} error out on larger instances
Closed, Declined, Public

Description

While experimenting with a metrics cardinality view dashboard, I noticed that some panels fail to render: https://grafana-rw.wikimedia.org/d/b0b89a23-8f37-4bfe-962a-f329a654e987/prometheus-metrics-cardinality-management?orgId=1&var-datasource=drmrs%20prometheus%2Fops&var-metric=ALERTS&var-job=All&from=now-5m&to=now

These are queries for overall counts of metrics, etc. For example, fetching a count of all metrics via {__name__!=""} on ops eqiad, or matching istio.* with {__name__=~"istio.*"} on k8s eqiad. They fail to execute when run directly against the Prometheus instance via the Prometheus web UI, with the error:

Error executing query: Unexpected token '<', " <!DOCTYPE "... is not valid JSON

I can also confirm that these queries do complete on smaller instances such as ops drmrs.
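The "Unexpected token '<'" message indicates that the web UI received an HTML document where it expected JSON. The task does not say what produced that HTML; an error page from something in front of Prometheus is one possibility, but that is an assumption. A minimal sketch of a client-side check that surfaces the real body instead of a JSON parse error (the function name `decode_query_response` is hypothetical, not part of any Prometheus client):

```python
import json


def decode_query_response(content_type, body):
    """Parse a query API response body, refusing anything that is not JSON.

    content_type: value of the Content-Type response header (may be empty).
    body: raw response bytes.
    """
    if "json" not in content_type.lower():
        # Show the first bytes of the body so an HTML error page is obvious.
        snippet = body[:60].decode("utf-8", errors="replace").strip()
        label = content_type or "no content type"
        raise ValueError(f"expected JSON, got {label}: {snippet!r}")
    return json.loads(body)
```

With a check like this, an HTML error page is reported as such together with its first few bytes, which makes it easier to tell which layer returned it.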

Event Timeline

$ curl -g 'http://prometheus.svc.eqiad.wmnet/ops/api/v1/query?query={__name__!=""}' | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   128  100   128    0     0      1      0  0:02:08  0:01:54  0:00:14    28
{
  "status": "error",
  "errorType": "execution",
  "error": "query processing would load too many samples into memory in query execution"
}

Querying k8s eqiad for {__name__=~"istio.*"} in the Prometheus web UI, I'm seeing two different errors:

Sometimes, after ~30s: Error executing query: Unexpected token '<', " <!DOCTYPE "... is not valid JSON

Sometimes, after ~50s: an "aw snap this page failed to load" browser error

However, via the API the query does return after ~35s, and the results are huge (~1.6G):

$ curl -g 'http://prometheus.svc.eqiad.wmnet/k8s/api/v1/query?query={__name__=~"istio.*"}' -o istio_metrics.json
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1564M    0 1564M    0     0  44.0M      0 --:--:--  0:00:35 --:--:--  366M
herron changed the task status from Open to Stalled. Jun 18 2025, 2:41 PM
herron triaged this task as Low priority.

Declining this, since I doubt we'll make changes to support these queries, and we can hack around it using results from the labels API.
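A minimal sketch of that workaround, assuming the "labels API" here means Prometheus's label values endpoint (`/api/v1/label/<label_name>/values`) queried for `__name__`. It returns only the distinct metric names, not samples, so it yields a count of metric names rather than a count of series. The helper names are hypothetical, and the base URL is taken from the curl commands above.

```python
import json
import re
import urllib.request
from urllib.parse import quote


def label_values_url(base_url, label="__name__"):
    """Build the label values endpoint URL for one Prometheus instance."""
    return f"{base_url.rstrip('/')}/api/v1/label/{quote(label, safe='')}/values"


def matching_names(payload, pattern):
    """Return the sorted metric names in a label values response that match pattern.

    fullmatch is used because Prometheus regex matchers are fully anchored,
    so "istio.*" here behaves like {__name__=~"istio.*"}.
    """
    if payload.get("status") != "success":
        raise RuntimeError(payload.get("error", "unknown error"))
    regex = re.compile(pattern)
    return sorted(name for name in payload["data"] if regex.fullmatch(name))


def fetch_metric_names(base_url, timeout=60):
    """Fetch all metric names from the instance (performs a network request)."""
    with urllib.request.urlopen(label_values_url(base_url), timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    payload = fetch_metric_names("http://prometheus.svc.eqiad.wmnet/k8s")
    print(len(payload["data"]), "metric names in total")
    print(len(matching_names(payload, "istio.*")), "metric names matching istio.*")
```

Filtering on the client keeps the request itself cheap; per-metric series counts would still need a separate query for each name.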