
[4h timebox] SQL secondary is regularly being killed due to OOM
Closed, Declined · Public

Description

The secondary DB falls over every 3–4 days, and it generally seems to do this around midnight UTC, which is annoyingly also when our backups run. Users probably see this as a brief bit of downtime; for us it shows up as failed backup jobs.

Identified as an outcome of T328923 / T333669.

Evidence of this happening:

Screenshot from 2023-04-06 12-21-23.png (OOM kill events in Cloud Logging)

{
  "insertId": "1m02krng1vag3u7",
  "jsonPayload": {
    "source": {
      "component": "kernel-monitor",
      "host": "gke-wbaas-3-n2-highmem-16-pool-85ebecb9-i2om"
    },
    "reason": "OOMKilling",
    "type": "Warning",
    "message": "Memory cgroup out of memory: Killed process 276391 (mysqld) total-vm:11713196kB, anon-rss:4304992kB, file-rss:22508kB, shmem-rss:0kB, UID:1001 pgtables:9204kB oom_score_adj:969",
    "kind": "Event",
    "lastTimestamp": "2023-04-04T00:01:21Z",
    "reportingComponent": "",
    "eventTime": null,
    "apiVersion": "v1",
    "reportingInstance": "",
    "involvedObject": {
      "kind": "Node",
      "name": "gke-wbaas-3-n2-highmem-16-pool-85ebecb9-i2om",
      "uid": "gke-wbaas-3-n2-highmem-16-pool-85ebecb9-i2om"
    },
    "metadata": {
      "resourceVersion": "3730507",
      "creationTimestamp": "2023-04-04T00:01:22Z",
      "namespace": "default",
      "managedFields": [
        {
          "manager": "node-problem-detector",
          "fieldsV1": {
            "f:firstTimestamp": {},
            "f:message": {},
            "f:lastTimestamp": {},
            "f:involvedObject": {},
            "f:type": {},
            "f:reason": {},
            "f:count": {},
            "f:source": {
              "f:component": {},
              "f:host": {}
            }
          },
          "operation": "Update",
          "apiVersion": "v1",
          "time": "2023-04-04T00:01:22Z",
          "fieldsType": "FieldsV1"
        }
      ],
      "uid": "30ebf589-f148-48b2-a1f5-ce94cf37df09",
      "name": "gke-wbaas-3-n2-highmem-16-pool-85ebecb9-i2om.1752924fb2d6c4ee"
    }
  },
  "resource": {
    "type": "k8s_node",
    "labels": {
      "node_name": "gke-wbaas-3-n2-highmem-16-pool-85ebecb9-i2om",
      "location": "europe-west3-a",
      "cluster_name": "wbaas-3",
      "project_id": "wikibase-cloud"
    }
  },
  "timestamp": "2023-04-04T00:01:21Z",
  "severity": "WARNING",
  "logName": "projects/wikibase-cloud/logs/events",
  "receiveTimestamp": "2023-04-04T00:01:27.048651171Z"
}
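
Interpreting the kill message above (an inference, not something stated in the logs): anon-rss was 4304992 kB, about 4.1 GiB, and the reason is "Memory cgroup out of memory", i.e. the container hit its own cgroup memory limit rather than the node running out of RAM. That is consistent with a ~4 Gi pod memory limit at the time, and an oom_score_adj of 969 is in the range the kubelet assigns to Burstable pods, making mysqld a preferred kill target the moment it crossed the limit.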

A nice graph of memory usage increasing and then plummeting at the kill:

image.png (memory usage graph)

This can be seen happening every few days by running a Cloud Logging (Logs Explorer) query similar to:

logName="projects/wikibase-cloud/logs/events"
jsonPayload.message:"Memory cgroup out of memory"
jsonPayload.message:"mysqld"

A/C:

  • Increase the RAM (double it); a sketch of the change follows this list
  • After this, investigate whether there is a fixable cause for the RAM problem
    • This could involve finding out whether MariaDB is poorly configured (see the second sketch below)
  • If there is still time left: investigate what "the right amount" of RAM to allocate for our purposes would be
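
For the first A/C item, the change amounts to doubling the memory request/limit on the MariaDB pods. A minimal sketch of a Kubernetes resources stanza (the structure and its exact location in wbaas-deploy are assumptions; only the doubled numbers come from this task):

resources:
  requests:
    memory: "8Gi"   # doubled from the ~4Gi implied by the OOM event above
    cpu: "2"
  limits:
    memory: "8Gi"   # request == limit (see note below)
    cpu: "2"

Keeping requests equal to limits for every container in the pod puts it in the Guaranteed QoS class, which also makes it a far less preferred OOM-kill target than the oom_score_adj of 969 seen above.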
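
For the configuration investigation, the usual first suspect is an InnoDB buffer pool sized too close to the cgroup limit. A sketch of what a tuned override could look like, assuming a Bitnami-style MariaDB chart; the primary.configuration key path and the values are illustrative assumptions, not the repo's actual settings:

mariadb:
  primary:
    # Assumption: Bitnami-style values layout; adjust to the repo's charts.
    configuration: |-
      [mysqld]
      # Keep the buffer pool well under the 8Gi cgroup limit, leaving
      # headroom for per-connection buffers and other allocations.
      innodb_buffer_pool_size=4G
      max_connections=100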

Event Timeline

Tarrow renamed this task from "SQL secondary is regularly being killed due to OOM" to "[4h timebox] SQL secondary is regularly being killed due to OOM". Apr 20 2023, 2:01 PM
Tarrow updated the task description.

RAM has now been raised to 8 GB for both primary and secondary, and both pods are allowed to use the full 2 CPUs: https://github.com/wmde/wbaas-deploy/pull/904 and https://github.com/wmde/wbaas-deploy/pull/905