
Production Error: Secondary mariadb fell over because OOM
Closed, Resolved · Public

Assigned To
Authored By
toan
Aug 5 2022, 2:11 PM
Referenced Files
F35419581: image.png
Aug 10 2022, 3:14 PM
F35419751: image.png
Aug 10 2022, 3:14 PM
F35418927: image.png
Aug 10 2022, 3:14 PM
F35400120: image.png
Aug 5 2022, 3:39 PM
F35399405: image.png
Aug 5 2022, 2:11 PM

Description

Yesterday the secondary mariadb pod fell over because the backup made it allocate more memory than allowed.

image.png (276×1 px, 22 KB)

AC:

  • Investigate ways for the backup to not cause the secondary database to fall over and report back here.
  • (Optional) Fix the problem

Event Timeline

image.png (246×1 px, 26 KB)

Did some manual backup runs on staging, but saw no dramatic increases in memory usage.

table_open_cache

So after doing some testing and reviewing previous memory usage on staging, it seems to be related to the open table cache (a lot of memory that was still being held on to was released when FLUSH TABLES was issued).
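Roughly, the effect can be observed from a SQL client on the secondary with something along these lines (a sketch of the kind of check, not the exact commands used):

-- How many tables are currently held open in the table cache, and how many
-- have been opened in total since startup.
SHOW GLOBAL STATUS LIKE 'Open%tables';

-- Close all open tables and empty the cache; the memory held for the cached
-- table handles should be released after this.
FLUSH TABLES;

SHOW GLOBAL STATUS LIKE 'Open%tables';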

The picture below shows two previous backups being taken (the spikes on the CPU chart correlate with the steps up in memory usage).

image.png (339×1 px, 27 KB)

Since it isn't actual user traffic generating these caches, I tried some experiments running the backups with a lower table_open_cache value set for the secondary.

This value is currently not configured anywhere, so it sits at the default of 2000.

table_open_cache has defaulted to 2000 since MariaDB 10.1.7. Before that, the default was 400.

Seeing that we have many databases, each with many tables, lowering this number to 200 was attempted.
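table_open_cache is a dynamic variable, so for the experiment it can be lowered at runtime without a restart, roughly like this (a persistent change would still need to go into the my.cnf handed to the chart):

-- Current value (defaults to 2000 on MariaDB >= 10.1.7).
SHOW GLOBAL VARIABLES LIKE 'table_open_cache';

-- Lower it for the experiment; this is lost on the next server restart.
SET GLOBAL table_open_cache = 200;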

image.png (230×1 px, 27 KB)

Here I was able to run a whole bunch of backups without the pod exceeding its requested memory.

Restoring the value to 2000 and trying again gives the same behavior we saw before, and the memory expands in much bigger increments.

image.png (223×1 px, 28 KB)

Here the memory usage expands past the request in fewer backups.

I'm not entirely sure this is the only culprit, but it really does seem like lowering the table_open_cache value would reduce the memory usage, or at least lessen the impact of taking a dump. I couldn't notice any degraded performance on staging with this setting.

Increase replica count

Another thing we probably want to do either way as the platform grows is to add another secondary replica. That way it would matter less if one fell over.

Dedicated dump replica

This is something we've talked about earlier, but I can't see a really straightforward way of doing it with the chart we are currently using. We would probably have to configure a separate mariadb instance that just replicates from the primary of the first deployment. That way the backup load wouldn't sit on top of normal usage, and it wouldn't matter to users if the dump replica fell over every once in a while.

I'd propose we try lowering the table_open_cache (to 200 maybe? 400?) for the secondary on production and see how that affects it.

I've been working through this and trying to make sure I understand the situation and I'm still slightly confused.

Do we think that the total memory limit is less than the amount MariaDB "should" be able to grow to if the full 2000 tables are in the table open cache? If that is the case then maybe we should either increase the limit or decrease the innodb_buffer_pool_size to give it a little more headroom.

Or do we think that there is some bug/memory leak that means the memory will *keep* growing by some fraction of the table_open_cache size, and that this is why you suggest drastically reducing it (to 10% of its previous value)?

I'm slightly worried that even after a few hours of reading the documentation I still don't actually understand how "big" the table_open_cache should grow to be in bytes. I also see that the recommendation for tuning it (https://mariadb.com/kb/en/optimizing-table_open_cache/) seems to be entirely focused on keeping an eye on the Open_tables and Opened_tables status variables rather than on reducing memory consumption.
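For reference, the settings that feed into that sizing question can be listed in one go with something like:

-- The big fixed allocation (innodb_buffer_pool_size) plus the settings that
-- scale with open tables and connections; sizes are in bytes.
SHOW GLOBAL VARIABLES WHERE Variable_name IN
    ('innodb_buffer_pool_size', 'key_buffer_size', 'table_open_cache',
     'table_definition_cache', 'tmp_table_size', 'max_connections');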

Tarrow subscribed.

This memory usage seems to be directly linked to the backups being taken, when every table is read and dumped. That makes the tuning suggestions hard, if not impossible, to follow, since the Open_tables / Opened_tables values aren't coming from normal usage of the platform. My suspicion is that this cache is simply being filled by the backups; I don't think it's a memory leak or anything like that, but rather the backups filling it up in an unnecessary way (that being said, we are using a fairly old version of mydumper).

I think my preferred solution would be to set up a dedicated dump replica and use that. This way we would only have "normal" usage on the secondary.

I tried setting up a dump replica by using the following PR.

It uses the same bitnami/mariadb chart but in a standalone configuration re-using the same secret names.

  1. Install the chart

Installing the chart gives the following output for the new SQL deployment:

default, sql-dump-replica-mariadb, ConfigMap (v1) has been added:
- 
+ # Source: mariadb/templates/primary/configmap.yaml
+ apiVersion: v1
+ kind: ConfigMap
+ metadata:
+   name: sql-dump-replica-mariadb
+   namespace: "default"
+   labels:
+     app.kubernetes.io/name: mariadb
+     helm.sh/chart: mariadb-10.5.0
+     app.kubernetes.io/instance: sql-dump-replica
+     app.kubernetes.io/managed-by: Helm
+     app.kubernetes.io/component: primary
+ data:
+   my.cnf: |-
+     [mysqld]
+     server_id=9000
+     skip-name-resolve
+     explicit_defaults_for_timestamp
+     basedir=/opt/bitnami/mariadb
+     port=3306
+     socket=/opt/bitnami/mariadb/tmp/mysql.sock
+     tmpdir=/opt/bitnami/mariadb/tmp
+     max_allowed_packet=16M
+     bind-address=0.0.0.0
+     pid-file=/opt/bitnami/mariadb/tmp/mysqld.pid
+     log-error=/opt/bitnami/mariadb/logs/mysqld.log
+     character-set-server=UTF8
+     collation-server=utf8_general_ci
+     
+     # Custom
+     # 16MB, default is 8MB, bitnami was 128
+     key_buffer_size=16777216
+     # 67MB, default & bitnami was 134MB (134217728)
+     innodb_buffer_pool_size=67108864
+     # 8MB, default & bitnami was 16MB (16777216)
+     tmp_table_size=8388608
+     # 80, default & bitnami was 151
+     max_connections=80
+     # 7 days, default & bitnami is 0 (keep forever.)
+     expire_logs_days=7
+     # 100MB, default & bitnami was 1GB (1073741824)
+     max_binlog_size=107374182
+     
+     [mysql]
+     # https://forums.mysql.com/read.php?103,189835,192421#msg-192421
+     default-character-set=UTF8
+     
+     [client]
+     port=3306
+     socket=/opt/bitnami/mariadb/tmp/mysql.sock
+     
+     [manager]
+     port=3306
+     socket=/opt/bitnami/mariadb/tmp/mysql.sock
+     pid-file=/opt/bitnami/mariadb/tmp/mysqld.pid
default, sql-dump-replica-mariadb, Service (v1) has been added:
- 
+ # Source: mariadb/templates/primary/svc.yaml
+ apiVersion: v1
+ kind: Service
+ metadata:
+   name: sql-dump-replica-mariadb
+   namespace: "default"
+   labels:
+     app.kubernetes.io/name: mariadb
+     helm.sh/chart: mariadb-10.5.0
+     app.kubernetes.io/instance: sql-dump-replica
+     app.kubernetes.io/managed-by: Helm
+     app.kubernetes.io/component: primary
+   annotations:
+ spec:
+   type: ClusterIP
+   sessionAffinity: None
+   ports:
+     - name: mysql
+       port: 3306
+       protocol: TCP
+       targetPort: mysql
+       nodePort: null
+   selector: 
+     app.kubernetes.io/name: mariadb
+     app.kubernetes.io/instance: sql-dump-replica
+     app.kubernetes.io/component: primary
default, sql-dump-replica-mariadb, ServiceAccount (v1) has been added:
- 
+ # Source: mariadb/templates/serviceaccount.yaml
+ apiVersion: v1
+ kind: ServiceAccount
+ metadata:
+   name: sql-dump-replica-mariadb
+   namespace: "default"
+   labels:
+     app.kubernetes.io/name: mariadb
+     helm.sh/chart: mariadb-10.5.0
+     app.kubernetes.io/instance: sql-dump-replica
+     app.kubernetes.io/managed-by: Helm
+   annotations:
+ automountServiceAccountToken: false
default, sql-dump-replica-mariadb, StatefulSet (apps) has been added:
- 
+ # Source: mariadb/templates/primary/statefulset.yaml
+ apiVersion: apps/v1
+ kind: StatefulSet
+ metadata:
+   name: sql-dump-replica-mariadb
+   namespace: default
+   labels:
+     app.kubernetes.io/name: mariadb
+     helm.sh/chart: mariadb-10.5.0
+     app.kubernetes.io/instance: sql-dump-replica
+     app.kubernetes.io/managed-by: Helm
+     app.kubernetes.io/component: primary
+ spec:
+   replicas: 1
+   revisionHistoryLimit: 10
+   selector:
+     matchLabels: 
+       app.kubernetes.io/name: mariadb
+       app.kubernetes.io/instance: sql-dump-replica
+       app.kubernetes.io/component: primary
+   serviceName: sql-dump-replica-mariadb
+   updateStrategy:
+     type: RollingUpdate
+   template:
+     metadata:
+       annotations:
+         checksum/configuration: 6cc4a78a3730ea8992be76e492428c19e83e5fe38260813dfc6d7465073a3063
+       labels:
+         app.kubernetes.io/name: mariadb
+         helm.sh/chart: mariadb-10.5.0
+         app.kubernetes.io/instance: sql-dump-replica
+         app.kubernetes.io/managed-by: Helm
+         app.kubernetes.io/component: primary
+     spec:
+       
+       serviceAccountName: sql-dump-replica-mariadb
+       affinity:
+         podAffinity:
+           
+         podAntiAffinity:
+           preferredDuringSchedulingIgnoredDuringExecution:
+             - podAffinityTerm:
+                 labelSelector:
+                   matchLabels:
+                     app.kubernetes.io/name: mariadb
+                     app.kubernetes.io/instance: sql-dump-replica
+                     app.kubernetes.io/component: primary
+                 namespaces:
+                   - "default"
+                 topologyKey: kubernetes.io/hostname
+               weight: 1
+         nodeAffinity:
+           
+       securityContext:
+         fsGroup: 1001
+       containers:
+         - name: mariadb
+           image: docker.io/bitnami/mariadb:10.5.15-debian-10-r52
+           imagePullPolicy: "IfNotPresent"
+           securityContext:
+             runAsNonRoot: true
+             runAsUser: 1001
+           env:
+             - name: BITNAMI_DEBUG
+               value: "false"
+             - name: MARIADB_ROOT_PASSWORD
+               valueFrom:
+                 secretKeyRef:
+                   name: sql-secrets-passwords
+                   key: mariadb-root-password
+             - name: MARIADB_DATABASE
+               value: "my_database"
+           envFrom:
+             - secretRef:
+                 name: sql-secrets-init-passwords
+           ports:
+             - name: mysql
+               containerPort: 3306
+           livenessProbe:
+             failureThreshold: 3
+             initialDelaySeconds: 120
+             periodSeconds: 10
+             successThreshold: 1
+             timeoutSeconds: 1
+             exec:
+               command:
+                 - /bin/bash
+                 - -ec
+                 - |
+                   password_aux="${MARIADB_ROOT_PASSWORD:-}"
+                   if [[ -f "${MARIADB_ROOT_PASSWORD_FILE:-}" ]]; then
+                       password_aux=$(cat "$MARIADB_ROOT_PASSWORD_FILE")
+                   fi
+                   mysqladmin status -uroot -p"${password_aux}"
+           readinessProbe:
+             failureThreshold: 3
+             initialDelaySeconds: 30
+             periodSeconds: 10
+             successThreshold: 1
+             timeoutSeconds: 1
+             exec:
+               command:
+                 - /bin/bash
+                 - -ec
+                 - |
+                   password_aux="${MARIADB_ROOT_PASSWORD:-}"
+                   if [[ -f "${MARIADB_ROOT_PASSWORD_FILE:-}" ]]; then
+                       password_aux=$(cat "$MARIADB_ROOT_PASSWORD_FILE")
+                   fi
+                   mysqladmin status -uroot -p"${password_aux}"
+           resources: 
+             limits:
+               cpu: 750m
+               memory: 1000Mi
+             requests:
+               cpu: 40m
+               memory: 700Mi
+           volumeMounts:
+             - name: data
+               mountPath: /bitnami/mariadb
+             - name: custom-init-scripts
+               mountPath: /docker-entrypoint-initdb.d
+             - name: config
+               mountPath: /opt/bitnami/mariadb/conf/my.cnf
+               subPath: my.cnf
+       volumes:
+         - name: config
+           configMap:
+             name: sql-dump-replica-mariadb
+         - name: custom-init-scripts
+           configMap:
+             name: sql-dump-replica-mariadb-init-scripts
+   volumeClaimTemplates:
+     - metadata:
+         name: data
+         labels: 
+           app.kubernetes.io/name: mariadb
+           app.kubernetes.io/instance: sql-dump-replica
+           app.kubernetes.io/component: primary
+       spec:
+         accessModes:
+           - "ReadWriteOnce"
+         resources:
+           requests:
+             storage: "1Gi"
+         storageClassName: standard
default, sql-dump-replica-mariadb-init-scripts, ConfigMap (v1) has been added:
- 
+ # Source: mariadb/templates/primary/initialization-configmap.yaml
+ apiVersion: v1
+ kind: ConfigMap
+ metadata:
+   name: sql-dump-replica-mariadb-init-scripts
+   namespace: "default"
+   labels:
+     app.kubernetes.io/name: mariadb
+     helm.sh/chart: mariadb-10.5.0
+     app.kubernetes.io/instance: sql-dump-replica
+     app.kubernetes.io/managed-by: Helm
+     app.kubernetes.io/component: primary
+ data:
+   init-primary-only.sh: |
+     #!/bin/sh
+     if [[ $(hostname) == *primary* ]]; then
+         echo "Primary node, executing"
+     else
+         echo "NOT Primary node, EXITING"
+         exit
+     fi
  2. Drop existing data from dump replica
  3. Take a backup and restore it

After that, the restore pod should already be running:

(minikube-wbaas) $ kubectl exec -it restore-sql-logic-backup -- bash
notroot@restore-sql-logic-backup:/app$ ./backup.sh 
/backups/tmp/backup-2022-08-16_154729/metadata exists!
/backups/tmp/backup-2022-08-16_154729/apidb.users.sql exists!
/backups/tmp/backup-2022-08-16_154729/apidb.users-schema.sql exists!
/backups/tmp/backup-2022-08-16_154729/apidb.wikis.sql exists!
/backups/tmp/backup-2022-08-16_154729/apidb.wikis-schema.sql exists!
/backups/tmp/backup-2022-08-16_154729/mysql.db.sql exists!
/backups/tmp/backup-2022-08-16_154729/mysql.db-schema.sql exists!
notroot@restore-sql-logic-backup:/app$ echo $DB_HOST 
sql-mariadb-primary.default.svc.cluster.local
notroot@restore-sql-logic-backup:/app$ export DB_HOST=sql-dump-replica-mariadb.default.svc.cluster.local
notroot@restore-sql-logic-backup:/app$ ./restore.sh /backups/tmp/backup-2022-08-16_154729/
  4. Configure and start the replica (ideally as the replication user)
CHANGE MASTER TO
  MASTER_HOST='sql-mariadb-primary.default.svc.cluster.local',
  MASTER_USER='root',
  MASTER_PASSWORD='secretpassword',
  MASTER_PORT=3306,
  MASTER_LOG_FILE='mysql-bin.000116',
  MASTER_LOG_POS=723834,
  MASTER_CONNECT_RETRY=10;
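
The statement above uses root for replication; a dedicated replication account could instead be created on the primary with something along these lines (the user name and password here are placeholders):

-- On the primary: a minimal account that can only be used for replication.
CREATE USER 'repl'@'%' IDENTIFIED BY 'change-me';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%';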

Release any locks and start the replica

START SLAVE;
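
To confirm the replica actually catches up afterwards, the replication threads and lag can be checked on the dump replica, e.g.:

-- Slave_IO_Running and Slave_SQL_Running should both say "Yes", and
-- Seconds_Behind_Master should trend towards 0.
SHOW SLAVE STATUS\G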

This would require us to lock the databases for a little while during the maintenance.
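
If the binlog coordinates aren't taken from the dump's metadata file, one way to grab a consistent position during that short window is roughly:

-- On the primary: block writes while the position is recorded.
FLUSH TABLES WITH READ LOCK;
-- Note File and Position; they become MASTER_LOG_FILE / MASTER_LOG_POS above.
SHOW MASTER STATUS;
-- ...seed the dump replica / run CHANGE MASTER TO with that position...
UNLOCK TABLES;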

We could try it out on staging for a bit?

@toan and I have concluded that this does look like a good direction to go in.

Next steps are to:

  • Deploy a dedicated backup replica to staging using something similar to the chart above.
  • Seed the backup replica with data from the most recent backup and re-enable replication.
  • Adjust the staging TF code so that the backup checks (e.g. for replication lag to the backup replica) are correct.
  • Leave it running for a few days to confirm backups exist and could be restored from.
  • If all successful, repeat this pattern for production.

Sounds like a good plan to me. I'll create a placeholder ticket with those steps so we can sketch it out more in refinement: https://phabricator.wikimedia.org/T316214

Evelien_WMDE claimed this task.