n.b. this might require using more than one Prometheus project (e.g. alertmanager etc.)
A/C:
- Provide a PR with an example setup to be looked at
- add some rough docs here of your findings
Tarrow | Jun 15 2022, 1:19 PM
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | None | T308220 Trigger alert on low disk space on k8s PVs
Resolved | | None | T310697 [timebox 16hrs] Investigate using Prometheus locally to monitor k8s PV utilisation
So starting off I had a wee look at https://github.com/bitnami/charts/tree/master/bitnami/kube-prometheus and realized I'd have to connect all the pieces myself to get the full stack with alertmanager etc.
I then found prometheus-community/kube-prometheus-stack which offers the full stack including
These are quite nicely bundled in one chart and were fairly easy to set up, with some initial quirks that are documented as comments in this commit.
It also comes bundled with a wide range of Kubernetes dashboards ready to use.
Now to the sad part: with all that free good stuff there seem to be some major problems with monitoring volume usage for k8s, and I haven't fully understood why this just doesn't work. It seems to be related either to the volume drivers not using a supported interface for extracting these metrics (but we should already have these commits AFAICT), or to us lagging behind in the k8s version we're using, see T311205: 🔷 Upgrade kubernetes from 1.21 to 1.22 (EOL on the 28th this month). That being said, I haven't tried this out on staging, and there is still a chance I'm only seeing these problems because it's running in docker on minikube. It could be a good idea to use the same PR and try it out there before giving up on this. I've tried locally updating the kubernetes version to 1.23 (manually disabling any deprecated resources) but without any success.
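For reference, a rough sketch of the kind of check this is all meant to enable. It assumes the kubelet exposes the `kubelet_volume_stats_used_bytes` and `kubelet_volume_stats_capacity_bytes` series with `namespace` and `persistentvolumeclaim` labels, and that results come back in the instant-vector JSON shape of the Prometheus HTTP API (`/api/v1/query`). The function names, the 0.8 threshold and the sample payload are made up for illustration and are not taken from the PR.

```python
import json

# PromQL ratio of used to total bytes per PVC (assumes kubelet volume stats are scraped).
PV_USAGE_QUERY = (
    "kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes"
)


def parse_pv_usage(response_body: str) -> dict:
    """Map (namespace, pvc) -> usage ratio from an instant-vector query response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query failed")
    usage = {}
    for sample in payload["data"]["result"]:
        labels = sample["metric"]
        key = (labels.get("namespace", ""), labels.get("persistentvolumeclaim", ""))
        # Each sample value is a pair: [unix_timestamp, "value-as-string"].
        usage[key] = float(sample["value"][1])
    return usage


def nearly_full(usage: dict, threshold: float = 0.8) -> list:
    """Return the PVCs whose usage ratio is at or above the threshold, sorted."""
    return sorted(key for key, ratio in usage.items() if ratio >= threshold)


# Hypothetical response body, hand-written for illustration only.
SAMPLE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"namespace": "default", "persistentvolumeclaim": "sql-data"},
             "value": [1655300000, "0.91"]},
            {"metric": {"namespace": "default", "persistentvolumeclaim": "redis-data"},
             "value": [1655300000, "0.35"]},
        ],
    },
})

if __name__ == "__main__":
    print(nearly_full(parse_pv_usage(SAMPLE)))
```

An empty `result` list gives an empty dict, which is roughly what the missing series look like from the client side when the kubelet does not report volume stats.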
OK, so I tried installing this on staging for a while and it indeed seems to work pretty much out of the box. Some dashboards still do not report the metrics I suppose they should, but overall it seems promising.
{F35268273}
As for running this locally it seems we might have to live with a reduced set of metrics if we were to use this.
I also gave these scripts a go: https://github.com/prometheus-community/helm-charts/blob/main/charts/kube-prometheus-stack/hack/minikube/cmd.sh but they seem terribly outdated and broken; I wasn't even able to run all of the scripts on minikube v1.25.2.
Notably, that script also supplies some additional flags to minikube, but it seems the defaults would already apply:
-b, --bootstrapper string The name of the cluster bootstrapper that will set up the Kubernetes cluster. (default "kubeadm")
As for the Hyper-V flag, that only applies to Windows machines; maybe that's a clue to where it originally worked.
This looks good to me; the draft PR is here and the documentation of findings too. I could not see this image {F35268273} though. Could we please try this locally when you are around, @toan? I don't know if running make apply on the patch is enough to get it working.
I tried this locally and can confirm that it works, apart from some metrics like the PV usage. I propose to deploy it to staging and see how it goes, but before that we should at least change the default Grafana credentials. Does anyone see anything more we should do here before deploying?
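One possible way to handle the credentials, as a sketch only: generate a random password and write it into a small values override. This assumes the Grafana subchart reads the admin password from the `grafana.adminPassword` key; the function name is made up, and the generated file should not be committed to the repo.

```python
import json
import secrets


def grafana_credentials_override(password_bytes: int = 24) -> dict:
    """Build a values override that replaces the default Grafana admin password."""
    # token_urlsafe returns a random URL-safe string built from password_bytes random bytes.
    return {"grafana": {"adminPassword": secrets.token_urlsafe(password_bytes)}}


if __name__ == "__main__":
    # JSON is a subset of YAML, so the printed document can be saved and used as a values file.
    print(json.dumps(grafana_credentials_override(), indent=2))
```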
Yeah, good suggestion. I think in general we need to fix persistence, because currently I don't think any of the metrics are actually stored between restarts. We should probably also spec out the values file to list all the deployments it creates with enabled: true.
But I don't think this should be done now or in this ticket; it should rather be described and done when T310233: [8hrs] Investigate using Google Cloud provided Prometheus to store metrics is picked up.
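To make the persistence and explicit `enabled: true` points a bit more concrete, here is a rough sketch of what a spec'd-out values file could contain. It assumes the kube-prometheus-stack chart's top-level component keys (`alertmanager`, `grafana`, `kubeStateMetrics`, `nodeExporter`, `prometheusOperator`, `prometheus`) and its `prometheus.prometheusSpec.storageSpec.volumeClaimTemplate` setting. The storage size and storage class are placeholders, not values anyone has agreed on.

```python
import json

# Components the chart deploys; listed explicitly so the values file documents them.
COMPONENTS = [
    "alertmanager",
    "grafana",
    "kubeStateMetrics",
    "nodeExporter",
    "prometheusOperator",
    "prometheus",
]


def build_values(storage_size: str = "10Gi", storage_class: str = "standard") -> dict:
    """Build chart values with every component enabled and Prometheus data on a PVC."""
    values = {name: {"enabled": True} for name in COMPONENTS}
    # Without a storageSpec, Prometheus data lives in the pod and is lost on restart.
    values["prometheus"]["prometheusSpec"] = {
        "storageSpec": {
            "volumeClaimTemplate": {
                "spec": {
                    "storageClassName": storage_class,
                    "accessModes": ["ReadWriteOnce"],
                    "resources": {"requests": {"storage": storage_size}},
                }
            }
        }
    }
    return values


if __name__ == "__main__":
    # JSON is a subset of YAML, so the output can be saved as a values file.
    print(json.dumps(build_values(), indent=2))
```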
@Evelien_WMDE: The project tag got archived and this open task has no other active project tags. Could you please either add an active project tag so this task can be found, or update the task status? Thanks a lot!