
[wikireplicas] clouddb* free memory decreases over time
Open, Low, Public

Description

Screenshot 2024-05-16 at 16.25.13.png (3,002×948 px, 477 KB)

Many clouddb* hosts show a pattern where free memory decreases very slowly over time. This triggers a warning alert once free memory drops below 5%.
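For reference, a minimal sketch of the kind of check behind the alert. The task does not say which metric the alert uses, so the choice of `MemAvailable` from `/proc/meminfo` is an assumption, and the function names are hypothetical. Only the 5% threshold comes from the description above.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB for most fields)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def free_memory_percent(meminfo, field="MemAvailable"):
    """Return the chosen field as a percentage of MemTotal.

    Assumption: the real alert may use MemFree or another metric instead.
    """
    return 100.0 * meminfo[field] / meminfo["MemTotal"]


def is_low_memory(meminfo, threshold_percent=5.0, field="MemAvailable"):
    """True when free memory is below the threshold (5% per the task description)."""
    return free_memory_percent(meminfo, field) < threshold_percent
```

On a host this could be fed with the contents of `/proc/meminfo`, for example `is_low_memory(parse_meminfo(open("/proc/meminfo").read()))`.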

I wonder if this could be something similar to T353093: [toolsdb] MariaDB process is killed by OOM killer (December 2023), which was fixed by switching to jemalloc (although the decrease rate there was much faster).
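To compare with the T353093 fix, it would help to know which allocator the clouddb* MariaDB processes currently use. A rough sketch, assuming jemalloc would be loaded as a shared library (via `LD_PRELOAD` or linking) and would therefore show up in `/proc/<pid>/maps`. The function name is hypothetical, and a statically linked jemalloc would not be detected this way.

```python
def uses_jemalloc(maps_text):
    """Return True if a libjemalloc shared object appears in /proc/<pid>/maps text.

    Assumption: jemalloc is loaded dynamically; a statically linked
    allocator leaves no separate mapping and is not detected here.
    """
    for line in maps_text.splitlines():
        # Columns: address, perms, offset, dev, inode, then an optional pathname.
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping, no pathname
        filename = fields[5].strip().rsplit("/", 1)[-1]
        if filename.startswith("libjemalloc"):
            return True
    return False
```

The input would be the contents of `/proc/<pid>/maps` for the relevant mariadbd process. MariaDB also exposes a `version_malloc_library` system variable that can be queried from a client session for the same purpose.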

Event Timeline

fnegri renamed this task from [wikireplicas] clouddb hosts free memory decreases over time to [wikireplicas] clouddb* free memory decreases over time. May 16 2024, 2:40 PM
fnegri triaged this task as Low priority.
fnegri moved this task from Backlog to Wiki replicas on the Data-Services board.

Mentioned in SAL (#wikimedia-operations) [2024-05-16T15:45:12Z] <dhinus> systemctl restart mariadb@s4.service on clouddb1015 (using too much RAM) T365164

Restarting mariadb@s4.service freed up about 300G of RAM:

Screenshot 2024-05-16 at 17.48.28.png (1,864×958 px, 184 KB)

I've added a note on this procedure to the alert runbook.

I also found a SAL log that indicates a similar thing happened on the same host a few months ago: https://sal.toolforge.org/log/wbX_F40BGiVuUzOdbGXw

New memory alerts started triggering for clouddb1016 and clouddb1020. I will restart them on Monday.

I did not restart the services, but the alerts disappeared from alerts.wikimedia.org. They are still in status WARNING in Icinga, though, and I'm not sure why they are no longer visible in alerts.wikimedia.org.

The alerts are visible again. I will restart the services.