
Audit dashboards using histogram_quantile on big envoy metrics and move to recording rules
Closed, Declined (Public)

Description

The following dashboards and panels apply quantile functions to Envoy metrics and are candidates for moving to recording rules. Note that the existing recording rules may not yet preserve every label these queries select on; where that is the case, we can create new recording rules.

Summary

Searched for quantile.*envoy and found 30 matching dashboards and 0 matching alerts.
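
A minimal sketch of how such a search could be reproduced, assuming the dashboards are available as Grafana-style JSON with queries under `panels[].targets[].expr`. That layout, the function name `matching_queries`, and the sample dashboard are assumptions for illustration, not the tooling actually used for this audit.

```python
import re

# The search pattern quoted in the Summary.
PATTERN = re.compile(r"quantile.*envoy")


def matching_queries(dashboard):
    """Return {panel title: [matching expressions]} for one dashboard dict."""
    hits = {}
    for panel in dashboard.get("panels", []):
        exprs = [target.get("expr", "") for target in panel.get("targets", [])]
        matched = [expr for expr in exprs if PATTERN.search(expr)]
        if matched:
            hits[panel.get("title", "")] = matched
    return hits


# Hypothetical miniature dashboard; only the first panel should match.
sample = {
    "title": "Example",
    "panels": [
        {
            "title": "p99",
            "targets": [
                {"expr": 'histogram_quantile(0.99, sum(rate(envoy_cluster_upstream_rq_time_bucket{app="x"}[2m])) by (le))'}
            ],
        },
        {
            "title": "Requests",
            "targets": [
                {"expr": 'sum(rate(envoy_cluster_upstream_rq_total{app="x"}[2m]))'}
            ],
        },
    ],
}
print(matching_queries(sample))
```

Because the pattern is unanchored, it matches plain `quantile(...)` calls as well as `histogram_quantile(...)`, which is why the Istio Control Plane Dashboard panel appears in the list below.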

  • Dashboard API and REST Gateway
    • Panel Request time quantiles
      • histogram_quantile(0.1, sum(rate(envoy_http_downstream_rq_time_bucket{envoy_http_conn_manager_prefix="ingress_http", kubernetes_namespace="$instance"}[5m])) by (le))
      • histogram_quantile(0.5, sum(rate(envoy_http_downstream_rq_time_bucket{envoy_http_conn_manager_prefix="ingress_http", kubernetes_namespace="$instance"}[5m])) by (le))
      • histogram_quantile(0.9, sum(rate(envoy_http_downstream_rq_time_bucket{envoy_http_conn_manager_prefix="ingress_http", kubernetes_namespace="$instance"}[5m])) by (le))
      • histogram_quantile(0.95, sum(rate(envoy_http_downstream_rq_time_bucket{envoy_http_conn_manager_prefix="ingress_http", kubernetes_namespace="$instance"}[5m])) by (le))
    • Panel Request latency - 99th percentile read
      • histogram_quantile(0.99, sum(rate(envoy_vhost_vcluster_upstream_rq_time_bucket{envoy_virtual_cluster="r",kubernetes_namespace="api-gateway"}[5m])) by (le))
    • Panel Request latency - 99th percentile write
      • histogram_quantile(0.99, sum(irate(envoy_vhost_vcluster_upstream_rq_time_bucket{envoy_virtual_cluster="rw",kubernetes_namespace="api-gateway", envoy_cluster_name!~"admin_cluster|rate_limit_cluster|admin"}[5m])) by (le))
    • Panel Proxy time
      • histogram_quantile(0.99, sum(irate(envoy_cluster_internal_upstream_rq_time_bucket{kubernetes_namespace="api-gateway", envoy_cluster_name!~"admin_cluster|admin"}[5m])) by (le))
    • Panel Upstream latency percentiles (10m avg)
      • histogram_quantile(0.5, sum(rate(envoy_cluster_upstream_rq_time_bucket{kubernetes_namespace="$instance"}[10m])) by (le, envoy_cluster_name))
      • histogram_quantile(0.75, sum(rate(envoy_cluster_upstream_rq_time_bucket{kubernetes_namespace="$instance"}[10m])) by (le, envoy_cluster_name))
      • histogram_quantile(0.9, sum(rate(envoy_cluster_upstream_rq_time_bucket{kubernetes_namespace="$instance"}[10m])) by (le, envoy_cluster_name))
      • histogram_quantile(0.99, sum(rate(envoy_cluster_upstream_rq_time_bucket{kubernetes_namespace="$instance"}[10m])) by (le, envoy_cluster_name))
  • Dashboard API and REST Gateway - jgiannelos
    • Panel Request time quantiles
      • histogram_quantile(0.1, sum(rate(envoy_http_downstream_rq_time_bucket{envoy_http_conn_manager_prefix="ingress_http", kubernetes_namespace="$instance"}[5m])) by (le))
      • histogram_quantile(0.5, sum(rate(envoy_http_downstream_rq_time_bucket{envoy_http_conn_manager_prefix="ingress_http", kubernetes_namespace="$instance"}[5m])) by (le))
      • histogram_quantile(0.9, sum(rate(envoy_http_downstream_rq_time_bucket{envoy_http_conn_manager_prefix="ingress_http", kubernetes_namespace="$instance"}[5m])) by (le))
      • histogram_quantile(0.95, sum(rate(envoy_http_downstream_rq_time_bucket{envoy_http_conn_manager_prefix="ingress_http", kubernetes_namespace="$instance"}[5m])) by (le))
    • Panel Request latency - 99th percentile read
      • histogram_quantile(0.99, sum(rate(envoy_vhost_vcluster_upstream_rq_time_bucket{envoy_virtual_cluster="r",kubernetes_namespace="api-gateway"}[5m])) by (le))
    • Panel Request latency - 99th percentile write
      • histogram_quantile(0.99, sum(irate(envoy_vhost_vcluster_upstream_rq_time_bucket{envoy_virtual_cluster="rw",kubernetes_namespace="api-gateway", envoy_cluster_name!~"admin_cluster|rate_limit_cluster|admin"}[5m])) by (le))
    • Panel Proxy time
      • histogram_quantile(0.99, sum(irate(envoy_cluster_internal_upstream_rq_time_bucket{kubernetes_namespace="api-gateway", envoy_cluster_name!~"admin_cluster|admin"}[5m])) by (le))
    • Panel Upstream latency percentiles (10m avg)
      • histogram_quantile(0.5, sum(rate(envoy_cluster_upstream_rq_time_bucket{kubernetes_namespace="$instance"}[10m])) by (le, envoy_cluster_name))
      • histogram_quantile(0.75, sum(rate(envoy_cluster_upstream_rq_time_bucket{kubernetes_namespace="$instance", cluster=~"$origin",envoy_cluster_name=~"$destination", instance=~"$origin_instance"}[10m])) by (le, envoy_cluster_name))
      • histogram_quantile(0.9, sum(rate(envoy_cluster_upstream_rq_time_bucket{cluster=~"$origin",envoy_cluster_name=~"$destination", instance=~"$origin_instance"}[10m])) by (le, envoy_cluster_name))
      • histogram_quantile(0.99, sum(rate(envoy_cluster_upstream_rq_time_bucket{cluster=~"$origin",envoy_cluster_name=~"$destination", instance=~"$origin_instance"}[10m])) by (le, envoy_cluster_name))
  • Dashboard API Gateway SLO (Draft)
    • Panel Request latency - 99th percentile read
      • histogram_quantile(0.99, sum(rate(envoy_vhost_vcluster_upstream_rq_time_bucket{envoy_virtual_cluster="r",kubernetes_namespace="api-gateway"}[5m])) by (le))
    • Panel Request latency - 99th percentile write
      • histogram_quantile(0.99, sum(rate(envoy_vhost_vcluster_upstream_rq_time_bucket{envoy_virtual_cluster="rw",kubernetes_namespace="api-gateway"}[5m])) by (le))
  • Dashboard Cache Hosts Comparison
  • Dashboard Envoy Telemetry
    • Panel Upstream latency percentiles
      • histogram_quantile(0.5, sum(rate(envoy_cluster_upstream_rq_time_bucket{cluster=~"$origin",envoy_cluster_name=~"$destination", instance=~"$origin_instance"}[2m])) by (le))
      • histogram_quantile(0.75, sum(rate(envoy_cluster_upstream_rq_time_bucket{cluster=~"$origin",envoy_cluster_name=~"$destination", instance=~"$origin_instance"}[2m])) by (le))
      • histogram_quantile(0.9, sum(rate(envoy_cluster_upstream_rq_time_bucket{cluster=~"$origin",envoy_cluster_name=~"$destination", instance=~"$origin_instance"}[2m])) by (le))
      • histogram_quantile(0.99, sum(rate(envoy_cluster_upstream_rq_time_bucket{cluster=~"$origin",envoy_cluster_name=~"$destination", instance=~"$origin_instance"}[2m])) by (le))
    • Panel Downstream latency percentiles
      • histogram_quantile(0.5, sum(rate(envoy_http_downstream_rq_time_bucket{cluster=~"$origin",envoy_cluster_name=~"$destination", instance=~"$origin_instance"}[2m])) by (le))
      • histogram_quantile(0.75, sum(rate(envoy_http_downstream_rq_time_bucket{cluster=~"$origin",envoy_cluster_name=~"$destination", instance=~"$origin_instance"}[2m])) by (le))
      • histogram_quantile(0.9, sum(rate(envoy_http_downstream_rq_time_bucket{cluster=~"$origin",envoy_cluster_name=~"$destination", instance=~"$origin_instance"}[2m])) by (le))
      • histogram_quantile(0.99, sum(rate(envoy_http_downstream_rq_time_bucket{cluster=~"$origin",envoy_cluster_name=~"$destination", instance=~"$origin_instance"}[2m])) by (le))
    • Panel Upstream latency percentiles by destination
      • histogram_quantile(0.5, sum(rate(envoy_cluster_upstream_rq_time_bucket{cluster=~"$origin",envoy_cluster_name=~"$destination", instance=~"$origin_instance"}[2m])) by (le, envoy_cluster_name))
      • histogram_quantile(0.75, sum(rate(envoy_cluster_upstream_rq_time_bucket{cluster=~"$origin",envoy_cluster_name=~"$destination", instance=~"$origin_instance"}[2m])) by (le, envoy_cluster_name))
      • histogram_quantile(0.9, sum(rate(envoy_cluster_upstream_rq_time_bucket{cluster=~"$origin",envoy_cluster_name=~"$destination", instance=~"$origin_instance"}[2m])) by (le, envoy_cluster_name))
      • histogram_quantile(0.99, sum(rate(envoy_cluster_upstream_rq_time_bucket{cluster=~"$origin",envoy_cluster_name=~"$destination", instance=~"$origin_instance"}[2m])) by (le, envoy_cluster_name))
    • Panel Downstream latency percentiles by destination
      • histogram_quantile(0.5, sum(rate(envoy_http_downstream_rq_time_bucket{cluster=~"$origin",envoy_cluster_name=~"$destination", instance=~"$origin_instance"}[2m])) by (le, envoy_cluster_name))
      • histogram_quantile(0.75, sum(rate(envoy_http_downstream_rq_time_bucket{cluster=~"$origin",envoy_cluster_name=~"$destination", instance=~"$origin_instance"}[2m])) by (le, envoy_cluster_name))
      • histogram_quantile(0.9, sum(rate(envoy_http_downstream_rq_time_bucket{cluster=~"$origin",envoy_cluster_name=~"$destination", instance=~"$origin_instance"}[2m])) by (le, envoy_cluster_name))
      • histogram_quantile(0.99, sum(rate(envoy_http_downstream_rq_time_bucket{cluster=~"$origin",envoy_cluster_name=~"$destination", instance=~"$origin_instance"}[2m])) by (le, envoy_cluster_name))
  • Dashboard Envoy Telemetry (k8s)
    • Panel Upstream latency percentiles
      • histogram_quantile(0.5, sum(rate(envoy_cluster_upstream_rq_time_bucket{site="$site", prometheus="$prometheus", app="$app", kubernetes_namespace=~"$kubernetes_namespace", envoy_cluster_name=~"$destination", envoy_cluster_name!="admin_interface"}[2m])) by (l…
      • histogram_quantile(0.75, sum(rate(envoy_cluster_upstream_rq_time_bucket{site="$site", prometheus="$prometheus", app="$app", kubernetes_namespace=~"$kubernetes_namespace", envoy_cluster_name=~"$destination", envoy_cluster_name!="admin_interface"}[2m])) by (…
      • histogram_quantile(0.9, sum(rate(envoy_cluster_upstream_rq_time_bucket{site="$site", prometheus="$prometheus", app="$app", kubernetes_namespace=~"$kubernetes_namespace", envoy_cluster_name=~"$destination", envoy_cluster_name!="admin_interface"}[2m])) by (l…
      • histogram_quantile(0.99, sum(rate(envoy_cluster_upstream_rq_time_bucket{site="$site", prometheus="$prometheus", app="$app", kubernetes_namespace=~"$kubernetes_namespace", envoy_cluster_name=~"$destination", envoy_cluster_name!="admin_interface"}[2m])) by (…
    • Panel Downstream latency percentiles
      • histogram_quantile(0.5, sum(rate(envoy_http_downstream_rq_time_bucket{site="$site", prometheus="$prometheus", app="$app", kubernetes_namespace=~"$kubernetes_namespace", envoy_http_conn_manager_prefix!~"admin|admin_interface"}[2m])) by (le))
      • histogram_quantile(0.75, sum(rate(envoy_http_downstream_rq_time_bucket{site="$site", prometheus="$prometheus", app="$app", kubernetes_namespace=~"$kubernetes_namespace", envoy_http_conn_manager_prefix!~"admin|admin_interface"}[2m])) by (le))
      • histogram_quantile(0.9, sum(rate(envoy_http_downstream_rq_time_bucket{site="$site", prometheus="$prometheus", app="$app", kubernetes_namespace=~"$kubernetes_namespace", envoy_http_conn_manager_prefix!~"admin|admin_interface"}[2m])) by (le))
      • histogram_quantile(0.99, sum(rate(envoy_http_downstream_rq_time_bucket{site="$site", prometheus="$prometheus", app="$app", kubernetes_namespace=~"$kubernetes_namespace", envoy_http_conn_manager_prefix!~"admin|admin_interface"}[2m])) by (le))
    • Panel Upstream latency percentiles by destination
      • histogram_quantile(0.5, sum(rate(envoy_cluster_upstream_rq_time_bucket{site="$site", prometheus="$prometheus", app="$app", kubernetes_namespace=~"$kubernetes_namespace", envoy_cluster_name=~"$destination"}[2m])) by (le, envoy_cluster_name))
      • histogram_quantile(0.75, sum(rate(envoy_cluster_upstream_rq_time_bucket{site="$site", prometheus="$prometheus", app="$app", kubernetes_namespace=~"$kubernetes_namespace", envoy_cluster_name=~"$destination"}[2m])) by (le, envoy_cluster_name))
      • histogram_quantile(0.9, sum(rate(envoy_cluster_upstream_rq_time_bucket{site="$site", prometheus="$prometheus", app="$app", kubernetes_namespace=~"$kubernetes_namespace", envoy_cluster_name=~"$destination"}[2m])) by (le, envoy_cluster_name))
      • histogram_quantile(0.99, sum(rate(envoy_cluster_upstream_rq_time_bucket{site="$site", prometheus="$prometheus", app="$app", kubernetes_namespace=~"$kubernetes_namespace", envoy_cluster_name=~"$destination"}[2m])) by (le, envoy_cluster_name))
    • Panel Downstream latency percentiles by destination
      • histogram_quantile(0.5, sum(rate(envoy_http_downstream_rq_time_bucket{site="$site", prometheus="$prometheus", app="$app", kubernetes_namespace=~"$kubernetes_namespace"}[2m])) by (le, envoy_http_conn_manager_prefix))
      • histogram_quantile(0.75, sum(rate(envoy_http_downstream_rq_time_bucket{site="$site", prometheus="$prometheus", app="$app", kubernetes_namespace=~"$kubernetes_namespace"}[2m])) by (le, envoy_http_conn_manager_prefix))
      • histogram_quantile(0.9, sum(rate(envoy_http_downstream_rq_time_bucket{site="$site", prometheus="$prometheus", app="$app", kubernetes_namespace=~"$kubernetes_namespace"}[2m])) by (le, envoy_http_conn_manager_prefix))
      • histogram_quantile(0.99, sum(rate(envoy_http_downstream_rq_time_bucket{site="$site", prometheus="$prometheus", app="$app", kubernetes_namespace=~"$kubernetes_namespace"}[2m])) by (le, envoy_http_conn_manager_prefix))
  • Dashboard EventGate
    • Panel HTTPS p99 by source (via envoy)
      • histogram_quantile(0.99, sum(rate(envoy_cluster_upstream_rq_time_bucket{envoy_cluster_name=~"$service", site=~"$site", prometheus="ops"}[2m])) by (cluster, site,le))
      • histogram_quantile(0.99, sum(rate(envoy_cluster_upstream_rq_time_bucket{envoy_cluster_name=~"$service", site=~"$site", prometheus="k8s"}[2m])) by (deployment, site,le))
  • Dashboard hnowlan - API gateway historical SLO stats
    • Panel Request latency - 99th percentile read
      • histogram_quantile(0.99, sum(rate(envoy_cluster_external_upstream_rq_time_bucket{envoy_cluster_name="mwapi_cluster", kubernetes_namespace="api-gateway"}[5m])) by (le))
    • Panel Request latency - 99th percentile write
      • histogram_quantile(0.99, sum(rate(envoy_cluster_external_upstream_rq_time_bucket{envoy_cluster_name="mwapi_rw_cluster", kubernetes_namespace="api-gateway"}[5m])) by (le))
  • Dashboard hnowlan - Envoy proxy timings
    • Panel Proxy time
      • histogram_quantile(0.99, sum(irate(envoy_cluster_internal_upstream_rq_time_bucket{kubernetes_namespace="api-gateway"}[5m])) by (le))
    • Panel 99th Percentile responses
      • histogram_quantile(0.99, sum(rate(envoy_http_downstream_rq_time_bucket{kubernetes_namespace="api-gateway", envoy_http_conn_manager_prefix="ingress_http"}[5m])) by (le))
      • histogram_quantile(0.99, sum(rate(envoy_cluster_upstream_rq_time_bucket{app="api-gateway", envoy_cluster_name!~"admin_cluster|rate_limit_cluster|admin"}[5m])) by (le))
      • histogram_quantile(0.99, sum(irate(envoy_cluster_internal_upstream_rq_time_bucket{kubernetes_namespace="api-gateway"}[5m])) by (le))
  • Dashboard iPoid
    • Panel Upstream latency percentiles
      • histogram_quantile(0.5, sum(rate(envoy_cluster_upstream_rq_time_bucket{site="$site", prometheus="$prometheus", app="$app", kubernetes_namespace=~"$kubernetes_namespace", envoy_cluster_name=~"$envoy_cluster_name|LOCAL.* [ ]", envoy_cluster_name!="admin_interfac…
      • histogram_quantile(0.75, sum(rate(envoy_cluster_upstream_rq_time_bucket{site="$site", prometheus="$prometheus", app="$app", kubernetes_namespace=~"$kubernetes_namespace", envoy_cluster_name=~"$destination||LOCAL.* [ ]", envoy_cluster_name!="admin_interface"}[2…
      • histogram_quantile(0.9, sum(rate(envoy_cluster_upstream_rq_time_bucket{site="$site", prometheus="$prometheus", app="$app", kubernetes_namespace=~"$kubernetes_namespace", envoy_cluster_name=~"$destination", envoy_cluster_name!="admin_interface"}[2m])) by (l…
      • histogram_quantile(0.99, sum(rate(envoy_cluster_upstream_rq_time_bucket{site="$site", prometheus="$prometheus", app="$app", kubernetes_namespace=~"$kubernetes_namespace", envoy_cluster_name=~"$destination||LOCAL.* [ ]", envoy_cluster_name!="admin_interface"}[2…
    • Panel Downstream latency percentiles
      • histogram_quantile(0.5, sum(rate(envoy_http_downstream_rq_time_bucket{site="$site", prometheus="$prometheus", app="$app", kubernetes_namespace=~".* [ ]", envoy_http_conn_manager_prefix!~"admin|admin_interface"}[2m])) by (le))
      • histogram_quantile(0.75, sum(rate(envoy_http_downstream_rq_time_bucket{site="$site", prometheus="$prometheus", app="$app", kubernetes_namespace=~".* [ ]", envoy_http_conn_manager_prefix!~"admin|admin_interface"}[2m])) by (le))
      • histogram_quantile(0.9, sum(rate(envoy_http_downstream_rq_time_bucket{site="$site", prometheus="$prometheus", app="$app", kubernetes_namespace=~".* [ ]", envoy_http_conn_manager_prefix!~"admin|admin_interface"}[2m])) by (le))
      • histogram_quantile(0.99, sum(rate(envoy_http_downstream_rq_time_bucket{site="$site", prometheus="$prometheus", app="$app", kubernetes_namespace=~".* [ ]", envoy_http_conn_manager_prefix!~"admin|admin_interface"}[2m])) by (le))
  • Dashboard Istio Control Plane Dashboard
    • Panel XDS Requests Size
      • quantile(0.5, rate(envoy_cluster_upstream_cx_rx_bytes_total{cluster_name="xds-grpc", site="$site", prometheus="$prometheus"}[1m]))
      • quantile(.5, rate(envoy_cluster_upstream_cx_tx_bytes_total{cluster_name="xds-grpc"}[1m]))
  • Dashboard jayme: container_cpu_cfs_throttled_seconds_total
    • Panel p99 latency by app cross cluster
      • histogram_quantile(0.99, sum(rate(envoy_cluster_upstream_rq_time_bucket{envoy_cluster_name="local_service", site="$site", prometheus="$prometheus"}[2m])) by (le, envoy_cluster_name, app))
  • Dashboard jgiannelos-restbase-1week-migration
    • Panel Upstream latency percentiles (5m avg)
      • histogram_quantile(0.5, sum(rate(envoy_cluster_upstream_rq_time_bucket{kubernetes_namespace="rest-gateway", envoy_cluster_name="mobileapps_cluster"}[5m])) by (le, envoy_cluster_name))
      • histogram_quantile(0.75, sum(rate(envoy_cluster_upstream_rq_time_bucket{kubernetes_namespace="rest-gateway", envoy_cluster_name="mobileapps_cluster"}[5m])) by (le, envoy_cluster_name))
      • histogram_quantile(0.9, sum(rate(envoy_cluster_upstream_rq_time_bucket{kubernetes_namespace="rest-gateway", envoy_cluster_name="mobileapps_cluster"}[5m])) by (le, envoy_cluster_name))
      • histogram_quantile(0.99, sum(rate(envoy_cluster_upstream_rq_time_bucket{kubernetes_namespace="rest-gateway", envoy_cluster_name="mobileapps_cluster"}[5m])) by (le, envoy_cluster_name))
  • Dashboard jgiannelos-restbase-hewiki-migration
    • Panel Upstream latency percentiles (5m avg)
      • histogram_quantile(0.5, sum(rate(envoy_cluster_upstream_rq_time_bucket{kubernetes_namespace="rest-gateway", envoy_cluster_name="mobileapps_cluster"}[5m])) by (le, envoy_cluster_name))
      • histogram_quantile(0.75, sum(rate(envoy_cluster_upstream_rq_time_bucket{kubernetes_namespace="rest-gateway", envoy_cluster_name="mobileapps_cluster"}[5m])) by (le, envoy_cluster_name))
      • histogram_quantile(0.9, sum(rate(envoy_cluster_upstream_rq_time_bucket{kubernetes_namespace="rest-gateway", envoy_cluster_name="mobileapps_cluster"}[5m])) by (le, envoy_cluster_name))
      • histogram_quantile(0.99, sum(rate(envoy_cluster_upstream_rq_time_bucket{kubernetes_namespace="rest-gateway", envoy_cluster_name="mobileapps_cluster"}[5m])) by (le, envoy_cluster_name))
  • Dashboard Jobrunners
    • Panel p50 latency (envoy)
      • histogram_quantile(0.5, sum(rate(envoy_cluster_upstream_rq_time_bucket{cluster=~"jobrunner",envoy_cluster_name=~"local_port_9006", instance=~".* [ ]"}[2m])) by (le))
    • Panel 75th percentile (envoy)
      • histogram_quantile(0.75, sum(rate(envoy_cluster_upstream_rq_time_bucket{cluster=~"jobrunner",envoy_cluster_name=~"local_port_9006", instance=~".* [ ]"}[2m])) by (le))
    • Panel 95th percentile (envoy)
      • histogram_quantile(0.95, sum(rate(envoy_cluster_upstream_rq_time_bucket{cluster=~"jobrunner",envoy_cluster_name=~"local_port_9006", instance=~".* [ ]"}[2m])) by (le))
  • Dashboard Kartotherian
    • Panel TLS quantiles for inbound traffic (LOCAL upstream)
      • histogram_quantile(0.5, sum(rate(envoy_cluster_upstream_rq_time_bucket{site="$site", prometheus="$prometheus", app="$service", envoy_cluster_name=~"$envoy_cluster_name|LOCAL.* [ ]", envoy_cluster_name!="admin_interface"}[2m])) by (le))
      • histogram_quantile(0.90, sum(rate(envoy_cluster_upstream_rq_time_bucket{site="$site", prometheus="$prometheus", app="$service", envoy_cluster_name=~"$envoy_cluster_name|LOCAL.* [ ]", envoy_cluster_name!="admin_interface"}[2m])) by (le))
      • histogram_quantile(0.99, sum(rate(envoy_cluster_upstream_rq_time_bucket{site="$site", prometheus="$prometheus", app="$service", envoy_cluster_name=~"$envoy_cluster_name|LOCAL.* [ ]", envoy_cluster_name!="admin_interface"}[2m])) by (le))
  • Dashboard MediaWiki on k8s
    • Panel p50
      • histogram_quantile(0.5, sum(rate(envoy_cluster_upstream_rq_time_bucket{ app="$service", deployment="$namespace", release="$release", release="$release", envoy_cluster_name=~"(local_service|LOCAL_m(ediawiki|w).* [ ])"}[2m])) by (le, envoy_cluster_name))
    • Panel p75
      • histogram_quantile(0.75, sum(rate(envoy_cluster_upstream_rq_time_bucket{ app="$service", deployment="$namespace", release="$release", release="$release", envoy_cluster_name=~"(local_service|LOCAL_m(ediawiki|w).* [ ])"}[2m])) by (le, envoy_cluster_name))
    • Panel p99
      • histogram_quantile(0.99, sum(rate(envoy_cluster_upstream_rq_time_bucket{ app="$service", deployment="$namespace", release="$release", release="$release", envoy_cluster_name=~"(local_service|LOCAL_m(ediawiki|w).* [ ])"}[2m])) by (le, envoy_cluster_name))
  • Dashboard MediaWiki Pods
    • Panel p50 ${kubernetes_pod_name}
      • histogram_quantile(0.5, sum(rate(envoy_cluster_upstream_rq_time_bucket{ app="$service", kubernetes_pod_name="$kubernetes_pod_name" , deployment="$namespace", release="$release", release="$release", envoy_cluster_name=~"(local_service|LOCAL_m(ediawiki|w).* [ ])…
    • Panel p75 ${kubernetes_pod_name}
      • histogram_quantile(0.75, sum(rate(envoy_cluster_upstream_rq_time_bucket{ app="$service", kubernetes_pod_name="$kubernetes_pod_name" , deployment="$namespace", release="$release", release="$release", envoy_cluster_name=~"(local_service|LOCAL_m(ediawiki|w).* [ ]…
    • Panel p99 ${kubernetes_pod_name}
      • histogram_quantile(0.99, sum(rate(envoy_cluster_upstream_rq_time_bucket{ app="$service", kubernetes_pod_name="$kubernetes_pod_name" , deployment="$namespace", release="$release", release="$release", envoy_cluster_name=~"(local_service|LOCAL_m(ediawiki|w).* [ ]…
  • Dashboard Miscweb k8s
    • Panel Downstream latency percentiles by destination
      • histogram_quantile(0.5, sum(rate(envoy_http_downstream_rq_time_bucket{site="$site", prometheus="$prometheus", app="$app"}[2m])) by (le, envoy_http_conn_manager_prefix))
      • histogram_quantile(0.75, sum(rate(envoy_http_downstream_rq_time_bucket{site="$site", prometheus="$prometheus", app="$app"}[2m])) by (le, envoy_http_conn_manager_prefix))
      • histogram_quantile(0.9, sum(rate(envoy_http_downstream_rq_time_bucket{site="$site", prometheus="$prometheus", app="$app"}[2m])) by (le, envoy_http_conn_manager_prefix))
      • histogram_quantile(0.99, sum(rate(envoy_http_downstream_rq_time_bucket{site="$site", prometheus="$prometheus", app="$app"}[2m])) by (le, envoy_http_conn_manager_prefix))
  • Dashboard Miscweb legacy
    • Panel Upstream latency percentiles by destination
      • histogram_quantile(0.5, sum(rate(envoy_cluster_upstream_rq_time_bucket{cluster=~"$origin", instance=~"$server:9631"}[2m])) by (le, envoy_cluster_name))
      • histogram_quantile(0.75, sum(rate(envoy_cluster_upstream_rq_time_bucket{cluster=~"$origin", instance=~"$server:9631"}[2m])) by (le, envoy_cluster_name))
      • histogram_quantile(0.9, sum(rate(envoy_cluster_upstream_rq_time_bucket{cluster=~"$origin", instance=~"$server:9631"}[2m])) by (le, envoy_cluster_name))
      • histogram_quantile(0.99, sum(rate(envoy_cluster_upstream_rq_time_bucket{cluster=~"$origin",envoy_cluster_name=~"$destination", instance=~"$server:9631"}[2m])) by (le, envoy_cluster_name))
  • Dashboard mw on k8s - WIP ServiceOps
    • Panel p50
      • histogram_quantile(0.5, sum(rate(envoy_cluster_upstream_rq_time_bucket{prometheus="k8s", site="$site", app="$service", deployment="$namespace", release="$release", release="$release", envoy_cluster_name=~"(local_service|LOCAL_m(ediawiki|w).* [ ])"}[2m])) by (l…
      • histogram_quantile(0.5, sum(rate(envoy_http_downstream_rq_time_bucket{prometheus="k8s", site="$site", app="$service", deployment="$namespace", release="$release"}[2m])) by (le))
    • Panel p75
      • histogram_quantile(0.75, sum(rate(envoy_cluster_upstream_rq_time_bucket{prometheus="k8s", site="$site", app="$service", deployment="$namespace", release="$release", release="$release", envoy_cluster_name=~"(local_service|LOCAL_m(ediawiki|w).* [ ])"}[2m])) by (…
      • histogram_quantile(0.75, sum(rate(envoy_http_downstream_rq_time_bucket{prometheus="k8s", site="$site", app="$service", deployment="$namespace", release="$release"}[2m])) by (le))
    • Panel p99
      • histogram_quantile(0.99, sum(rate(envoy_cluster_upstream_rq_time_bucket{prometheus="k8s", site="$site", app="$service", deployment="$namespace", release="$release", release="$release", envoy_cluster_name=~"(local_service|LOCAL_m(ediawiki|w).* [ ])"}[2m])) by (…
      • histogram_quantile(0.99, sum(rate(envoy_http_downstream_rq_time_bucket{prometheus="k8s", site="$site", app="$service", deployment="$namespace", release="$release"}[2m])) by (le))
  • Dashboard mw-api-ext
    • Panel p50
      • histogram_quantile(0.5, sum(rate(envoy_cluster_upstream_rq_time_bucket{ app="$service", deployment="$namespace", release="$release", release="$release", envoy_cluster_name=~"(local_service|LOCAL_m(ediawiki|w).* [ ])"}[2m])) by (le, envoy_cluster_name))
      • histogram_quantile(0.5, sum(rate(envoy_http_downstream_rq_time_bucket{app="$service", deployment="$namespace", release="$release"}[2m])) by (le))
    • Panel p75
      • histogram_quantile(0.75, sum(rate(envoy_cluster_upstream_rq_time_bucket{ app="$service", deployment="$namespace", release="$release", release="$release", envoy_cluster_name=~"(local_service|LOCAL_m(ediawiki|w).* [ ])"}[2m])) by (le, envoy_cluster_name))
      • histogram_quantile(0.75, sum(rate(envoy_http_downstream_rq_time_bucket{app="$service", deployment="$namespace", release="$release"}[2m])) by (le))
    • Panel p99
      • histogram_quantile(0.99, sum(rate(envoy_cluster_upstream_rq_time_bucket{ app="$service", deployment="$namespace", release="$release", release="$release", envoy_cluster_name=~"(local_service|LOCAL_m(ediawiki|w).* [ ])"}[2m])) by (le, envoy_cluster_name))
      • histogram_quantile(0.99, sum(rate(envoy_http_downstream_rq_time_bucket{app="$service", deployment="$namespace", release="$release"}[2m])) by (le))
  • Dashboard mw-api-int
    • Panel p50
      • histogram_quantile(0.5, sum(rate(envoy_cluster_upstream_rq_time_bucket{ app="$service", deployment="$namespace", release="$release", release="$release", envoy_cluster_name=~"(local_service|LOCAL_m(ediawiki|w).* [ ])"}[2m])) by (le, envoy_cluster_name))
      • histogram_quantile(0.5, sum(rate(envoy_http_downstream_rq_time_bucket{app="$service", deployment="$namespace", release="$release"}[2m])) by (le))
    • Panel p75
      • histogram_quantile(0.75, sum(rate(envoy_cluster_upstream_rq_time_bucket{ app="$service", deployment="$namespace", release="$release", release="$release", envoy_cluster_name=~"(local_service|LOCAL_m(ediawiki|w).* [ ])"}[2m])) by (le, envoy_cluster_name))
      • histogram_quantile(0.75, sum(rate(envoy_http_downstream_rq_time_bucket{app="$service", deployment="$namespace", release="$release"}[2m])) by (le))
    • Panel p99
      • histogram_quantile(0.99, sum(rate(envoy_cluster_upstream_rq_time_bucket{ app="$service", deployment="$namespace", release="$release", release="$release", envoy_cluster_name=~"(local_service|LOCAL_m(ediawiki|w).* [ ])"}[2m])) by (le, envoy_cluster_name))
      • histogram_quantile(0.99, sum(rate(envoy_http_downstream_rq_time_bucket{app="$service", deployment="$namespace", release="$release"}[2m])) by (le))
  • Dashboard mw-jobrunner
    • Panel p50
      • histogram_quantile(0.5, sum(rate(envoy_cluster_upstream_rq_time_bucket{ app="$service", deployment="$namespace", release="$release", release="$release", envoy_cluster_name=~"(local_service|LOCAL_m(ediawiki|w).* [ ])"}[2m])) by (le, envoy_cluster_name))
      • histogram_quantile(0.5, sum(rate(envoy_http_downstream_rq_time_bucket{app="$service", deployment="$namespace", release="$release"}[2m])) by (le))
    • Panel p75
      • histogram_quantile(0.75, sum(rate(envoy_cluster_upstream_rq_time_bucket{ app="$service", deployment="$namespace", release="$release", release="$release", envoy_cluster_name=~"(local_service|LOCAL_m(ediawiki|w).* [ ])"}[2m])) by (le, envoy_cluster_name))
      • histogram_quantile(0.75, sum(rate(envoy_http_downstream_rq_time_bucket{app="$service", deployment="$namespace", release="$release"}[2m])) by (le))
    • Panel p99
      • histogram_quantile(0.99, sum(rate(envoy_cluster_upstream_rq_time_bucket{ app="$service", deployment="$namespace", release="$release", release="$release", envoy_cluster_name=~"(local_service|LOCAL_m(ediawiki|w).* [ ])"}[2m])) by (le, envoy_cluster_name))
      • histogram_quantile(0.99, sum(rate(envoy_http_downstream_rq_time_bucket{app="$service", deployment="$namespace", release="$release"}[2m])) by (le))
  • Dashboard mw-parsoid
    • Panel p50
      • histogram_quantile(0.5, sum(rate(envoy_cluster_upstream_rq_time_bucket{ app="$service", deployment="$namespace", release="$release", release="$release", envoy_cluster_name=~"(local_service|LOCAL_m(ediawiki|w).* [ ])"}[2m])) by (le, envoy_cluster_name))
      • histogram_quantile(0.5, sum(rate(envoy_http_downstream_rq_time_bucket{app="$service", deployment="$namespace", release="$release"}[2m])) by (le))
    • Panel p75
      • histogram_quantile(0.75, sum(rate(envoy_cluster_upstream_rq_time_bucket{ app="$service", deployment="$namespace", release="$release", release="$release", envoy_cluster_name=~"(local_service|LOCAL_m(ediawiki|w).* [ ])"}[2m])) by (le, envoy_cluster_name))
      • histogram_quantile(0.75, sum(rate(envoy_http_downstream_rq_time_bucket{app="$service", deployment="$namespace", release="$release"}[2m])) by (le))
    • Panel p99
      • histogram_quantile(0.99, sum(rate(envoy_cluster_upstream_rq_time_bucket{ app="$service", deployment="$namespace", release="$release", release="$release", envoy_cluster_name=~"(local_service|LOCAL_m(ediawiki|w).* [ ])"}[2m])) by (le, envoy_cluster_name))
      • histogram_quantile(0.99, sum(rate(envoy_http_downstream_rq_time_bucket{app="$service", deployment="$namespace", release="$release"}[2m])) by (le))
  • Dashboard mw-web
    • Panel p50
      • histogram_quantile(0.5, sum(rate(envoy_cluster_upstream_rq_time_bucket{ app="$service", deployment="$namespace", release="$release", release="$release", envoy_cluster_name=~"(local_service|LOCAL_m(ediawiki|w).* [ ])"}[2m])) by (le, envoy_cluster_name))
      • histogram_quantile(0.5, sum(rate(envoy_http_downstream_rq_time_bucket{app="$service", deployment="$namespace", release="$release"}[2m])) by (le))
    • Panel p75
      • histogram_quantile(0.75, sum(rate(envoy_cluster_upstream_rq_time_bucket{ app="$service", deployment="$namespace", release="$release", release="$release", envoy_cluster_name=~"(local_service|LOCAL_m(ediawiki|w).* [ ])"}[2m])) by (le, envoy_cluster_name))
      • histogram_quantile(0.75, sum(rate(envoy_http_downstream_rq_time_bucket{app="$service", deployment="$namespace", release="$release"}[2m])) by (le))
    • Panel p99
      • histogram_quantile(0.99, sum(rate(envoy_cluster_upstream_rq_time_bucket{ app="$service", deployment="$namespace", release="$release", release="$release", envoy_cluster_name=~"(local_service|LOCAL_m(ediawiki|w).* [ ])"}[2m])) by (le, envoy_cluster_name))
      • histogram_quantile(0.99, sum(rate(envoy_http_downstream_rq_time_bucket{app="$service", deployment="$namespace", release="$release"}[2m])) by (le))
  • Dashboard mw-wikifunctions
    • Panel p50
      • histogram_quantile(0.5, sum(rate(envoy_cluster_upstream_rq_time_bucket{ app="$service", deployment="$namespace", release="$release", release="$release", envoy_cluster_name=~"(local_service|LOCAL_m(ediawiki|w).* [ ])"}[2m])) by (le, envoy_cluster_name))
      • histogram_quantile(0.5, sum(rate(envoy_http_downstream_rq_time_bucket{app="$service", deployment="$namespace", release="$release"}[2m])) by (le))
    • Panel p75
      • histogram_quantile(0.75, sum(rate(envoy_cluster_upstream_rq_time_bucket{ app="$service", deployment="$namespace", release="$release", release="$release", envoy_cluster_name=~"(local_service|LOCAL_m(ediawiki|w).* [ ])"}[2m])) by (le, envoy_cluster_name))
      • histogram_quantile(0.75, sum(rate(envoy_http_downstream_rq_time_bucket{app="$service", deployment="$namespace", release="$release"}[2m])) by (le))
    • Panel p99
      • histogram_quantile(0.99, sum(rate(envoy_cluster_upstream_rq_time_bucket{ app="$service", deployment="$namespace", release="$release", release="$release", envoy_cluster_name=~"(local_service|LOCAL_m(ediawiki|w).* [ ])"}[2m])) by (le, envoy_cluster_name))
      • histogram_quantile(0.99, sum(rate(envoy_http_downstream_rq_time_bucket{app="$service", deployment="$namespace", release="$release"}[2m])) by (le))
  • Dashboard Ratelimit
    • Panel Latency percentiles
      • histogram_quantile(0.5, sum(rate(envoy_cluster_upstream_rq_time_bucket{site="$site", prometheus="$cluster", envoy_cluster_name="ratelimit"}[2m])) by (le, envoy_response_code))
      • histogram_quantile(0.75, sum(rate(envoy_cluster_upstream_rq_time_bucket{site="$site", prometheus="$cluster", envoy_cluster_name="ratelimit"}[2m])) by (le))
      • histogram_quantile(0.9, sum(rate(envoy_cluster_upstream_rq_time_bucket{site="$site", prometheus="$cluster", envoy_cluster_name="ratelimit"}[2m])) by (le))
      • histogram_quantile(0.99, sum(rate(envoy_cluster_upstream_rq_time_bucket{site="$site", prometheus="$cluster", envoy_cluster_name="ratelimit"}[2m])) by (le))
  • Dashboard Shellbox
    • Panel Latency
      • histogram_quantile(0.99, sum(rate(envoy_cluster_upstream_rq_time_bucket{ app="$service", release="$release", release="$release", envoy_cluster_name=~"(local_service|LOCAL_.* [ ])",kubernetes_namespace="$namespace"} [5m])) by (le))
      • histogram_quantile(0.95, sum(rate(envoy_cluster_upstream_rq_time_bucket{ app="$service", release="$release", release="$release", envoy_cluster_name=~"(local_service|LOCAL_.* [ ])",kubernetes_namespace="$namespace"} [5m])) by (le))
  • Dashboard xxxx effie - k8s mwdebug
    • Panel p50
      • histogram_quantile(0.5, sum(rate(envoy_cluster_upstream_rq_time_bucket{ app="$service", deployment="$namespace", release="$release", release="$release", envoy_cluster_name=~"local_service"}[2m])) by (le, envoy_cluster_name))
      • histogram_quantile(0.5, sum(rate(envoy_http_downstream_rq_time_bucket{app="$service", deployment="$namespace", release="$release"}[2m])) by (le))
    • Panel p75
      • histogram_quantile(0.75, sum(rate(envoy_cluster_upstream_rq_time_bucket{ app="$service", deployment="$namespace", release="$release", release="$release", envoy_cluster_name=~"local_service"}[2m])) by (le, envoy_cluster_name))
      • histogram_quantile(0.75, sum(rate(envoy_http_downstream_rq_time_bucket{app="$service", deployment="$namespace", release="$release"}[2m])) by (le))
    • Panel p99
      • histogram_quantile(0.99, sum(rate(envoy_cluster_upstream_rq_time_bucket{ app="$service", deployment="$namespace", release="$release", release="$release", envoy_cluster_name=~"local_service"}[2m])) by (le, envoy_cluster_name))
      • histogram_quantile(0.99, sum(rate(envoy_http_downstream_rq_time_bucket{app="$service", deployment="$namespace", release="$release"}[2m])) by (le))

Event Timeline

After some examination and thought, the current recording rules unfortunately work only for very specific dashboards: the ones using Prometheus datasources (not Thanos!), because we can't meaningfully aggregate the recorded values over multiple dimensions.

The correct approach is to have recording rules for sum(rate()) that keep the buckets (the le label), and then apply histogram_quantile over the recorded series.
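
A minimal sketch of what such a rule could look like, based on the mw-web style queries listed in the description. The group name, the recorded series name, and the choice of labels to keep are assumptions made for illustration, not existing rules; the label set would need to match whatever the dashboards actually filter and group on.

```yaml
groups:
  - name: envoy_latency_buckets  # hypothetical group name
    rules:
      # Record the per-bucket request rate. `le` is kept so that
      # histogram_quantile can still be applied at query time, and the
      # other labels are kept so dashboards can filter on them.
      - record: app_deployment_release_cluster:envoy_cluster_upstream_rq_time_bucket:rate2m  # hypothetical name
        expr: |
          sum by (le, app, deployment, release, envoy_cluster_name) (
            rate(envoy_cluster_upstream_rq_time_bucket[2m])
          )
```

Under those assumptions, a panel such as p99 would then query the recorded series instead of the raw buckets:

  • histogram_quantile(0.99, sum by (le, envoy_cluster_name) (app_deployment_release_cluster:envoy_cluster_upstream_rq_time_bucket:rate2m{app="$service", deployment="$namespace", release="$release"}))

Because the recorded values are bucket rates rather than quantiles, they can still be summed across further dimensions before the quantile is computed.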

Although this remains a worthwhile effort, enforcing query limits has proven effective in the meantime. We can reopen this task if these queries become a problem.