
Trigger alert on low disk space on k8s PVs
Closed, Resolved, Public, 8 Estimated Story Points

Description

As we migrate data and users add more data, we need to know, and be able to act, if we are running out of space.

Of particular note are:

  • SQL
  • Query Service
  • Elasticsearch

Using https://cloud.google.com/stackdriver/docs/solutions/agents/ops-agent could be an option. This would allow us to monitor and alert as described at https://serverfault.com/questions/1012300/gcp-vm-disk-space-alert
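
For reference, the alert described in that answer is driven by the agent's disk usage metric; a Cloud Monitoring filter along these lines would select it (a sketch only, assuming the Ops Agent is installed on the nodes; a real filter would likely also restrict metric labels such as the device):

metric.type="agent.googleapis.com/disk/percent_used"
resource.type="gce_instance"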

Alternatively, wbstack checks its storage with: https://github.com/wbstack/deploy/blob/main/k8s/cmd/storage-check.sh

A/C:

  • Alert when there is <15% free disk on the Persistent Volumes that are essential for the operation of the service

Event Timeline

Jan_Dittrich renamed this task from Alert on low disk space on k8s PVs to Trigger alert on low disk space on k8s PVs. Jul 4 2022, 8:37 AM

I'm not entirely convinced we should go with this Ops Agent approach, because it measures disk space at the VM level. While that isn't a wrong value, we are probably more interested in PV space usage than in the "bare" disks.

Here is an example of using the metric named kubernetes.io/pod/volume/utilization. This might be the most useful one, as it gives us a percentage of disk utilized (the other two options are 'total capacity' and 'bytes used').

View example query for SQL data volumes on staging
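
For reference, a filter along these lines should select the utilization of the SQL primary's data volume on staging (a sketch only: resource.type="k8s_pod" is my assumption for this pod-level metric, and the cluster, pod and volume names are those used on staging):

metric.type="kubernetes.io/pod/volume/utilization"
resource.type="k8s_pod"
resource.label."cluster_name"="wbaas-2"
resource.label."pod_name"="sql-mariadb-primary-0"
metric.label."volume_name"="data"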

I added a draft PR to sketch things out (haven't tested this yet): https://github.com/wmde/wbaas-deploy/pull/455

I am interested in trying out this patch on staging.

Tried applying on staging and got this error:

╷
│ Error: Error creating AlertPolicy: googleapi: Error 400: Field alert_policy.conditions[0].condition_threshold.filter had an invalid value of "metric.type="kubernetes.io/pod/volume/utilization"
│ resource.label."cluster_name"=wbaas-2
│ resource.label."pod_name"="sql-mariadb-primary-0"
│ metric.label."volume_name"="data"
│ ": must specify a restriction on "resource.type" in the filter; see "https://cloud.google.com/monitoring/api/resources" for a list of available resource types.
│ 
│   with module.staging-monitoring.google_monitoring_alert_policy.alert_policy_sql_primary_pv_critical_utilization,
│   on ../../modules/monitoring/sql-pv.tf line 5, in resource "google_monitoring_alert_policy" "alert_policy_sql_primary_pv_critical_utilization":
│    5: resource "google_monitoring_alert_policy" "alert_policy_sql_primary_pv_critical_utilization" {
│ 
╵
╷
│ Error: Error creating AlertPolicy: googleapi: Error 400: Field alert_policy.conditions[0].condition_threshold.filter had an invalid value of "metric.type="kubernetes.io/pod/volume/utilization"
│ resource.label."cluster_name"=wbaas-2
│ resource.label."pod_name"="sql-mariadb-secondary-0"
│ metric.label."volume_name"="data"
│ ": must specify a restriction on "resource.type" in the filter; see "https://cloud.google.com/monitoring/api/resources" for a list of available resource types.
│ 
│   with module.staging-monitoring.google_monitoring_alert_policy.alert_policy_sql_secondary_pv_critical_utilization,
│   on ../../modules/monitoring/sql-pv.tf line 43, in resource "google_monitoring_alert_policy" "alert_policy_sql_secondary_pv_critical_utilization":
│   43: resource "google_monitoring_alert_policy" "alert_policy_sql_secondary_pv_critical_utilization" {
│ 
╵

The filter seems to need some adjustment; per the error message, a restriction on resource.type is missing.
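
A sketch of an adjusted condition (not the actual fix in the PR), assuming "k8s_pod" is the monitored resource type to restrict on for this metric; display names are illustrative, and the 0.85 threshold corresponds to the <15% free disk from the A/C:

resource "google_monitoring_alert_policy" "alert_policy_sql_primary_pv_critical_utilization" {
  display_name = "SQL primary PV critical utilization"
  combiner     = "OR"

  conditions {
    display_name = "data volume more than 85% full"
    condition_threshold {
      # The fix is the added resource.type restriction; "k8s_pod" is assumed.
      filter = <<-EOT
        resource.type="k8s_pod"
        metric.type="kubernetes.io/pod/volume/utilization"
        resource.label."cluster_name"="wbaas-2"
        resource.label."pod_name"="sql-mariadb-primary-0"
        metric.label."volume_name"="data"
      EOT
      comparison      = "COMPARISON_GT"
      threshold_value = 0.85 # >85% used, i.e. <15% free, per the A/C
      duration        = "300s"
    }
  }
}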

Rosalie_WMDE claimed this task.
Rosalie_WMDE moved this task from Doing to In Review on the Wikibase Cloud (WB Cloud Sprint 0) board.
Rosalie_WMDE moved this task from In Review to Doing on the Wikibase Cloud (WB Cloud Sprint 0) board.
Rosalie_WMDE subscribed.

I added a commit to the draft PR to fix the above error.

https://github.com/wmde/wbaas-deploy/pull/455

Tested it on staging with the SQL, Elasticsearch and Query Service pods.

I merged the PR and tagged the new module version tf-module-monitoring-10.