
Trigger alert on low disk space on k8s PVs
Closed, Resolved, Public, 8 Estimated Story Points

Description

As we migrate data and users add more data, we need to know, and be able to act, if we are running out of space.

Of particular note are:

  • SQL
  • Query Service
  • Elasticsearch

Using https://cloud.google.com/stackdriver/docs/solutions/agents/ops-agent could be an option. This would allow us to monitor and alert as described at https://serverfault.com/questions/1012300/gcp-vm-disk-space-alert
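
For reference, the alert described in that answer is driven by the agent's disk usage metric; a Cloud Monitoring filter along these lines would select it (a sketch only, assuming the Ops Agent is installed on the nodes; a real filter would likely also restrict metric labels such as the device):

metric.type="agent.googleapis.com/disk/percent_used"
resource.type="gce_instance"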

Alternatively, wbstack checks its storage with: https://github.com/wbstack/deploy/blob/main/k8s/cmd/storage-check.sh

A/C:

  • Alert when there is <15% free disk on the Persistent Volumes that are essential for the operation of the service

Event Timeline

Jan_Dittrich renamed this task from Alert on low disk space on k8s PVs to Trigger alert on low disk space on k8s PVs. Jul 4 2022, 8:37 AM

I'm not entirely convinced we should go with this Ops Agent approach, because it measures disk space at the VM level. While that isn't a wrong value, we are probably more interested in PV space usage than in the "bare" disks.

Here is an example of using the metric named kubernetes.io/pod/volume/utilization. This might be the most useful one, as it gives us a percentage of disk utilized (the other two options are 'total capacity' and 'bytes used').

View example query for SQL data volumes on staging
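
For reference, a filter along these lines should select the utilization of the SQL primary's data volume on staging (a sketch only: resource.type="k8s_pod" is my assumption for this pod-level metric, and the cluster, pod and volume names are those used on staging):

metric.type="kubernetes.io/pod/volume/utilization"
resource.type="k8s_pod"
resource.label."cluster_name"="wbaas-2"
resource.label."pod_name"="sql-mariadb-primary-0"
metric.label."volume_name"="data"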

I added a draft PR to sketch things out (haven't tested this yet): https://github.com/wmde/wbaas-deploy/pull/455

I am interested in trying out this patch on staging.

Tried applying on staging and got this error:

╷
│ Error: Error creating AlertPolicy: googleapi: Error 400: Field alert_policy.conditions[0].condition_threshold.filter had an invalid value of "metric.type="kubernetes.io/pod/volume/utilization"
│ resource.label."cluster_name"=wbaas-2
│ resource.label."pod_name"="sql-mariadb-primary-0"
│ metric.label."volume_name"="data"
│ ": must specify a restriction on "resource.type" in the filter; see "https://cloud.google.com/monitoring/api/resources" for a list of available resource types.
│ 
│   with module.staging-monitoring.google_monitoring_alert_policy.alert_policy_sql_primary_pv_critical_utilization,
│   on ../../modules/monitoring/sql-pv.tf line 5, in resource "google_monitoring_alert_policy" "alert_policy_sql_primary_pv_critical_utilization":
│    5: resource "google_monitoring_alert_policy" "alert_policy_sql_primary_pv_critical_utilization" {
│ 
╵
╷
│ Error: Error creating AlertPolicy: googleapi: Error 400: Field alert_policy.conditions[0].condition_threshold.filter had an invalid value of "metric.type="kubernetes.io/pod/volume/utilization"
│ resource.label."cluster_name"=wbaas-2
│ resource.label."pod_name"="sql-mariadb-secondary-0"
│ metric.label."volume_name"="data"
│ ": must specify a restriction on "resource.type" in the filter; see "https://cloud.google.com/monitoring/api/resources" for a list of available resource types.
│ 
│   with module.staging-monitoring.google_monitoring_alert_policy.alert_policy_sql_secondary_pv_critical_utilization,
│   on ../../modules/monitoring/sql-pv.tf line 43, in resource "google_monitoring_alert_policy" "alert_policy_sql_secondary_pv_critical_utilization":
│   43: resource "google_monitoring_alert_policy" "alert_policy_sql_secondary_pv_critical_utilization" {
│ 
╵

The filter seems to need some adjustment; per the error message, a restriction on resource.type is missing.
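
A sketch of an adjusted condition (not the actual fix in the PR), assuming "k8s_pod" is the monitored resource type to restrict on for this metric; display names are illustrative, and the 0.85 threshold corresponds to the <15% free disk from the A/C:

resource "google_monitoring_alert_policy" "alert_policy_sql_primary_pv_critical_utilization" {
  display_name = "SQL primary PV critical utilization"
  combiner     = "OR"

  conditions {
    display_name = "data volume more than 85% full"
    condition_threshold {
      # The fix is the added resource.type restriction; "k8s_pod" is assumed.
      filter = <<-EOT
        resource.type="k8s_pod"
        metric.type="kubernetes.io/pod/volume/utilization"
        resource.label."cluster_name"="wbaas-2"
        resource.label."pod_name"="sql-mariadb-primary-0"
        metric.label."volume_name"="data"
      EOT
      comparison      = "COMPARISON_GT"
      threshold_value = 0.85 # >85% used, i.e. <15% free, per the A/C
      duration        = "300s"
    }
  }
}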

Rosalie_WMDE claimed this task.
Rosalie_WMDE moved this task from Doing to In Review on the Wikibase Cloud (WB Cloud Sprint 0) board.
Rosalie_WMDE moved this task from In Review to Doing on the Wikibase Cloud (WB Cloud Sprint 0) board.
Rosalie_WMDE subscribed.

I added a commit to the draft PR to fix the above error.

https://github.com/wmde/wbaas-deploy/pull/455

Tested it on staging with the SQL, Elasticsearch and Query Service pods.

I merged the PR and tagged the new module version tf-module-monitoring-10.