
Alert on unscrapable pods
Closed, Declined (Public)

Description

As evidenced in https://phabricator.wikimedia.org/T371885 there are pods which can't be scraped by prometheus k8s, either because of misconfiguration or actual problems.

Therefore we should be alerting on pods that are not available for scraping, i.e. the JobUnavailable equivalent we have in production, but for pods.

The query is:

1 - (count by (kubernetes_namespace, app, prometheus) (up{app!=""} == 0) / count by (kubernetes_namespace, app, prometheus) (up)) < 0.8

https://w.wiki/DW5V
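To make the semantics of the query easier to follow, here is a minimal Python sketch of the same arithmetic. It assumes the `up` samples are already available as (labels, value) pairs; the function name `availability_below` and the sample data are hypothetical and only for illustration. Like the query, it only yields groups that have at least one down target with a non-empty `app` label.

```python
from collections import Counter

def availability_below(samples, threshold=0.8):
    """Return {(namespace, app, prometheus): fraction of targets up} for
    groups whose fraction is below `threshold`.

    Mirrors the query: groups without any down target (or with an empty
    `app` label) produce no result at all.
    """
    down = Counter()
    total = Counter()
    for labels, value in samples:
        key = (
            labels.get("kubernetes_namespace", ""),
            labels.get("app", ""),
            labels.get("prometheus", ""),
        )
        total[key] += 1
        if value == 0 and labels.get("app", "") != "":
            down[key] += 1

    result = {}
    for key, n_down in down.items():
        ratio = 1 - n_down / total[key]
        if ratio < threshold:
            result[key] = ratio
    return result

# Hypothetical samples, not real scrape data.
def _target(ns, app, value, prom="k8s"):
    return ({"kubernetes_namespace": ns, "app": app, "prometheus": prom}, value)

samples = (
    [_target("thumbor", "thumbor", 0)] * 4 + [_target("thumbor", "thumbor", 1)]
    + [_target("mw-mcrouter", "mcrouter", 0), _target("mw-mcrouter", "mcrouter", 1)]
    + [_target("toolhub", "toolhub", 1)] * 3
    + [_target("mathoid", "", 0)]
)

low = availability_below(samples)
for key, ratio in sorted(low.items()):
    print(key, round(ratio, 4))
```

In this sketch the all-up `toolhub` group and the `mathoid` group with an empty `app` label do not show up, which matches how the query behaves.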

As of March 20th the list looks like this:

{app="mediawiki", kubernetes_namespace="mw-videoscaler", prometheus="k8s"} 0.0916
{app="thumbor", kubernetes_namespace="thumbor", prometheus="k8s"} 0.1999
{app="developer-portal", kubernetes_namespace="developer-portal", prometheus="k8s"} 0.3333
{app="api-gateway", kubernetes_namespace="rest-gateway", prometheus="k8s"} 0.5
{app="mcrouter", kubernetes_namespace="mw-mcrouter", prometheus="k8s"} 0.5
{app="toolhub", kubernetes_namespace="toolhub", prometheus="k8s"} 0.6

{app="thumbor", kubernetes_namespace="thumbor", prometheus="k8s-staging"} 0.4
{app="api-gateway", kubernetes_namespace="rest-gateway", prometheus="k8s-staging"} 0.5
{app="toolhub", kubernetes_namespace="toolhub", prometheus="k8s-staging"} 0.6

{app="spark-history", kubernetes_namespace="spark-history", prometheus="k8s-dse"} 0
{app="spark-history", kubernetes_namespace="spark-history-test", prometheus="k8s-dse"} 0
{app="airflow", kubernetes_namespace="airflow-platform-eng", prometheus="k8s-dse"} 0.25
{app="airflow", kubernetes_namespace="airflow-analytics-product", prometheus="k8s-dse"} 0.3000
{app="airflow", kubernetes_namespace="airflow-analytics-test", prometheus="k8s-dse"} 0.3000
{app="airflow", kubernetes_namespace="airflow-main", prometheus="k8s-dse"} 0.3000
{app="airflow", kubernetes_namespace="airflow-research", prometheus="k8s-dse"} 0.3000
{app="airflow", kubernetes_namespace="airflow-search", prometheus="k8s-dse"} 0.3000
{app="airflow", kubernetes_namespace="airflow-test-k8s", prometheus="k8s-dse"} 0.3000
{app="airflow", kubernetes_namespace="airflow-wmde", prometheus="k8s-dse"} 0.3000
{app="airflow", kubernetes_namespace="airflow-ml", prometheus="k8s-dse"} 0.3333
{app="mpic", kubernetes_namespace="mpic", prometheus="k8s-dse"} 0.6666
{app="mpic", kubernetes_namespace="mpic-next", prometheus="k8s-dse"} 0.6666

{app="ores-legacy", kubernetes_namespace="ores-legacy", prometheus="k8s-mlserve"} 0.3333
{app="recommendation-api-ng", kubernetes_namespace="recommendation-api-ng", prometheus="k8s-mlserve"} 0.3333
{app="net-istio-controller", kubernetes_namespace="knative-serving", prometheus="k8s-mlserve"} 0.5

{app="ores-legacy", kubernetes_namespace="ores-legacy", prometheus="k8s-mlstaging"} 0.3333
{app="recommendation-api-ng", kubernetes_namespace="recommendation-api-ng", prometheus="k8s-mlstaging"} 0.3333
{app="net-istio-controller", kubernetes_namespace="knative-serving", prometheus="k8s-mlstaging"} 0.5

{app="developer-portal", kubernetes_namespace="developer-portal", prometheus="k8s-staging"} 0.3333

Event Timeline

With how the Prometheus service discovery currently works (e.g. scraping every container port by default), we have a large number of "okay to be down" targets, so an alert like this would fire quite often. It's also pretty common for pods to go away, which might produce a flurry of alerts as well.

Indeed, at pod granularity the alert would be noisy. I checked the data in terms of "percentage of reported up" by namespace + app; maybe this has more signal? https://w.wiki/AtrU

{app="activator", kubernetes_namespace="knative-serving"} 1
{app="airflow", kubernetes_namespace="airflow-test-k8s"} 0.6666666666666666
{app="api-gateway", kubernetes_namespace="rest-gateway"} 0.5
{app="autoscaler", kubernetes_namespace="knative-serving"} 1
{app="controller", kubernetes_namespace="knative-serving"} 1
{app="developer-portal", kubernetes_namespace="developer-portal"} 0.6666666666666666
{app="domain-mapping", kubernetes_namespace="knative-serving"} 1
{app="domainmapping-webhook", kubernetes_namespace="knative-serving"} 1
{app="mcrouter", kubernetes_namespace="mw-mcrouter"} 0.5
{app="net-istio-controller", kubernetes_namespace="knative-serving"} 1
{app="ores-legacy", kubernetes_namespace="ores-legacy"} 0.6666666666666666
{app="recommendation-api-ng", kubernetes_namespace="recommendation-api-ng"} 0.6666666666666666
{app="shellbox", kubernetes_namespace="shellbox-media"} 0.1
{app="spark-history", kubernetes_namespace="spark-history-test"} 1
{app="spark-history", kubernetes_namespace="spark-history"} 1
{app="thumbor", kubernetes_namespace="thumbor"} 0.7978547854785478
{app="toolhub", kubernetes_namespace="toolhub"} 0.43333333333333335
{app="webhook", kubernetes_namespace="knative-serving"} 1
{} 0.002410755679184052
{kubernetes_namespace="mathoid"} 0.5

The problem I see here is that we configure the k8s-pods job to scrape all configured containerPorts of a pod if more than one needs to be scraped (https://phabricator.wikimedia.org/T318707#8878939). That does not go well with the suggested alert, and we would need to make exceptions. We could use this to alert on misconfiguration (e.g. no target being up), but I'm not sure which other problems we could detect without much noise.
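As a rough illustration of the "no target being up" case, the sketch below flags only those namespace + app groups in which every scraped target is down, so a pod with one working and one non-metrics port would not trigger it. The function name and the sample data are hypothetical. In PromQL this would be roughly what an expression such as `sum by (kubernetes_namespace, app) (up) == 0` expresses, though that exact form is an assumption and not something proposed in this task.

```python
from collections import defaultdict

def groups_with_no_target_up(samples):
    """Return sorted (namespace, app) pairs for which every target is down."""
    by_group = defaultdict(list)
    for labels, value in samples:
        key = (labels.get("kubernetes_namespace", ""), labels.get("app", ""))
        by_group[key].append(value)
    return sorted(
        key for key, values in by_group.items() if all(v == 0 for v in values)
    )

# Hypothetical samples: spark-history has no target up, while mcrouter has
# one port that is scraped fine and one that is "okay to be down".
example_samples = [
    ({"kubernetes_namespace": "spark-history", "app": "spark-history"}, 0),
    ({"kubernetes_namespace": "spark-history", "app": "spark-history"}, 0),
    ({"kubernetes_namespace": "mw-mcrouter", "app": "mcrouter"}, 1),
    ({"kubernetes_namespace": "mw-mcrouter", "app": "mcrouter"}, 0),
]

misconfigured = groups_with_no_target_up(example_samples)
print(misconfigured)
```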

RLazarus subscribed.

From the serviceops triage meeting:

We probably won't do it this way, because it's impossible to construct in a way that isn't flaky. We could, in principle, do something else: run a periodic check to see which pods are unreachable, record which ones, and only alert if the same pods are unreachable twice in a row, and even then only if there are a lot of them.

That lets us get alerted to persistently unscraped pods without creating a lot of noise, but it would take a fair amount of work and we don't think it's justified right now.
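For reference if this gets revisited, the periodic check described above could be sketched along these lines. This is only an illustration under stated assumptions: the class name, the `min_pods` threshold, and the way the set of unreachable pods is obtained are all hypothetical, and nothing like this exists today.

```python
class PersistentUnreachableCheck:
    """Alert only on pods that were unreachable in two consecutive runs,
    and only if at least `min_pods` of them are affected."""

    def __init__(self, min_pods=5):
        self.min_pods = min_pods  # hypothetical threshold for "a lot of them"
        self.previous = set()     # pods unreachable in the previous run

    def observe(self, unreachable):
        """Record this run's unreachable pods and return the pods to alert on
        (an empty list means no alert)."""
        current = set(unreachable)
        persistent = current & self.previous
        self.previous = current
        if len(persistent) >= self.min_pods:
            return sorted(persistent)
        return []

# Hypothetical pod names, with a low threshold to keep the example short.
check = PersistentUnreachableCheck(min_pods=2)
first = check.observe({"pod-a", "pod-b", "pod-c"})   # nothing recorded yet
second = check.observe({"pod-b", "pod-c", "pod-d"})  # pod-b and pod-c persist
third = check.observe({"pod-d"})                     # only pod-d persists
print(first, second, third)
```

Pods that disappear between two runs (the common case of pods going away) drop out of the intersection, which is what keeps the noise down.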

Closing, but we can always revisit if this problem recurs.