Phabricator

High MariaDB memory usage on es1035, es2038 and es2039
Closed, Resolved, Public

Description

The three hosts triggered a warning on Icinga approximately 25-26 days ago.

The memory usage seems to have grown linearly over the last 30 days while other hosts remain constant:
https://grafana.wikimedia.org/goto/LCahmFBHR?orgId=1
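The "grew linearly" observation can be checked programmatically by fitting a least-squares slope to daily memory readings. A minimal sketch, using hypothetical sample data rather than the actual es1035 metrics from Grafana:

```python
# Sketch: distinguish steadily growing memory usage from a flat baseline
# by fitting a least-squares slope (percent used per day).
# The sample series below are hypothetical, not real es1035 readings.

def fitted_slope(samples):
    """Least-squares slope of evenly spaced samples."""
    n = len(samples)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(samples) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

leaking = [60 + 1.2 * day for day in range(30)]  # grows ~1.2%/day
healthy = [70.0] * 30                            # stays constant

print(fitted_slope(leaking))  # ≈ 1.2
print(fitted_slope(healthy))  # 0.0
```

A slope persistently above zero over a 30-day window would match the pattern seen on these three hosts while the other es* hosts remain flat.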

Event Timeline

Marostegui triaged this task as Medium priority.
Marostegui subscribed.

Please upgrade and reboot them so you also pick up the latest kernel.

All the es* hosts need an upgrade as part of T395241, so that task will resolve this one.

As discussed on IRC, I suspect a memory leak (perhaps related to connections being restarted?). We could consider lowering the threshold for the memory usage warning and also introducing an alert on IRC.

Marostegui raised the priority of this task from Medium to High. May 28 2025, 1:14 PM

This became a CRIT. Please restart them proactively; do not wait for the other task.

[15:13:05]  <+icinga-wm> PROBLEM - MariaDB memory on es1035 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (1614) = 93.5% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
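The check behind that alert compares the percentage of memory used against warning and critical thresholds. A minimal sketch of that logic, where the 90%/95% thresholds are assumptions for illustration (the actual check configuration may differ):

```python
# Sketch of the threshold logic behind a memory check like the one above.
# WARN/CRIT values of 90%/95% are assumed, not taken from the real config.

WARN, CRIT = 90.0, 95.0

def memory_state(percent_used):
    """Map a memory-used percentage to a Nagios/Icinga-style state."""
    if percent_used >= CRIT:
        return "CRITICAL"
    if percent_used >= WARN:
        return "WARNING"
    return "OK"

print(memory_state(95.0))  # CRITICAL, matching the es1035 alert above
```

Lowering the warning threshold, as suggested earlier in the thread, would simply mean reducing the assumed `WARN` value so the alert fires earlier in the linear growth.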

Mentioned in SAL (#wikimedia-operations) [2025-05-29T10:07:06Z] <fceratto@cumin1002> dbctl commit (dc=all): 'Depool es2039 T395294', diff saved to https://phabricator.wikimedia.org/P76665 and previous config saved to /var/cache/conftool/dbconfig/20250529-100704-fceratto.json

Completed depool of es2039 - Upgrading es2039.codfw.wmnet - fceratto@cumin1002

es2038 had to be restarted during an emergency, so it is done

Start pool of es2039 gradually with 4 steps - Ready - fceratto@cumin1002

es2039 is being pooled to alleviate load on the other hosts (see T395551); the Icinga downtime has been removed.

es2038 was rebooted during T395551; only es1035 is left, pending the switchover followed by depool and upgrade.

Completed pool of es2039 gradually with 4 steps - Ready - fceratto@cumin1002
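The "gradually with 4 steps" repool ramps the host's pooled percentage up in increments rather than returning it to full traffic at once. A sketch of such a schedule, assuming equal 25% increments (the actual dbctl/cookbook step sizes may differ):

```python
# Sketch of a 4-step gradual repool schedule. Equal increments are an
# assumption for illustration; the real cookbook may use other fractions.

def repool_steps(n_steps=4):
    """Pooled-percentage targets for each step, ending at 100%."""
    return [round(100 * (i + 1) / n_steps) for i in range(n_steps)]

print(repool_steps())  # [25, 50, 75, 100]
```

Ramping gradually lets the freshly restarted host warm its buffer pool before taking full production load.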

I had to restart es1035 today due to issues, so this is effectively fixed.