cloudcontrol2006-dev struggling with memory
Closed, Resolved | Public

Description

I saw an alert today:

FIRING: OOM: OOM killer active on cloudcontrol2006-dev:9100 - TODO - https://grafana.wikimedia.org/d/-OcleDKIz/oom-kill - https://alerts.wikimedia.org/?q=alertname%3DOOM

The logs showed that the OOM killer killed a number of processes owned by the user with id 111. This is rabbitmq.
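For anyone repeating this check, here is a rough sketch of how the OOM kills could be tallied per uid from kernel log lines. It assumes the "Killed process ... UID:..." line format that recent Linux kernels print; the sample lines, host name, pids, and uid 999 are made up for illustration and are not taken from cloudcontrol2006-dev.

```python
import re
from collections import Counter

# Assumed format of the kernel's OOM kill line (recent Linux kernels):
#   Out of memory: Killed process <pid> (<comm>) ... UID:<uid> pgtables:...
OOM_KILL_RE = re.compile(r"Killed process (\d+) \(([^)]+)\).*?UID:(\d+)")


def count_oom_kills_by_uid(lines):
    """Return a Counter mapping uid -> number of OOM kills seen in lines."""
    counts = Counter()
    for line in lines:
        match = OOM_KILL_RE.search(line)
        if match:
            counts[int(match.group(3))] += 1
    return counts


# Hypothetical log lines, for illustration only.
SAMPLE_LOG = [
    "example-host kernel: Out of memory: Killed process 4242 (beam.smp) "
    "total-vm:9000000kB, anon-rss:4000000kB, file-rss:0kB, shmem-rss:0kB, "
    "UID:111 pgtables:9000kB oom_score_adj:0",
    "example-host kernel: Out of memory: Killed process 4250 (beam.smp) "
    "total-vm:8000000kB, anon-rss:3500000kB, file-rss:0kB, shmem-rss:0kB, "
    "UID:111 pgtables:8000kB oom_score_adj:0",
    "example-host kernel: Out of memory: Killed process 5001 (mariadbd) "
    "total-vm:7000000kB, anon-rss:3000000kB, file-rss:0kB, shmem-rss:0kB, "
    "UID:999 pgtables:7000kB oom_score_adj:0",
    "example-host kernel: some unrelated message",
]

if __name__ == "__main__":
    for uid, kills in count_oom_kills_by_uid(SAMPLE_LOG).most_common():
        print(f"uid {uid}: {kills} OOM kill(s)")
```

The lines would normally come from something like `journalctl -k` or the kernel log file; the uid can then be mapped to a user name with `getent passwd`.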

The system had a very high load average as well.

I noticed the physical memory in the system is low (32GB) compared to other similar servers (cloudcontrol2004-dev 128GB, cloudcontrol2005-dev 500GB).

This server is definitely under memory pressure: on a normal day it has only 1GB available.
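A minimal sketch of how the available-memory figure could be checked on the host, assuming the standard Linux `/proc/meminfo` layout (`Key: value kB`). The sample text below is illustrative only; its values were chosen to mirror the 32GB total and 1GB available mentioned in this task and are not measured output.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of key -> integer value.

    Values are in kB for the lines that carry a kB unit.
    """
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value"
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def kb_to_gib(kilobytes):
    """Convert kB (as reported by /proc/meminfo) to GiB."""
    return kilobytes / (1024 * 1024)


# Illustrative sample, not real output from the server.
SAMPLE_MEMINFO = """\
MemTotal:       33554432 kB
MemFree:          262144 kB
MemAvailable:    1048576 kB
"""

if __name__ == "__main__":
    # On a Linux host, read the live values instead of the sample.
    with open("/proc/meminfo") as handle:
        info = parse_meminfo(handle.read())
    total = kb_to_gib(info["MemTotal"])
    available = kb_to_gib(info["MemAvailable"])
    print(f"total: {total:.1f} GiB, available: {available:.1f} GiB")
```

`MemAvailable` is the kernel's estimate of memory usable without swapping, so it is a better indicator of pressure than `MemFree`.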

Event Timeline

aborrero changed the task status from Open to In Progress. Oct 18 2024, 10:50 AM
aborrero raised the priority of this task from Medium to High.
aborrero added a project: User-aborrero.

Today the OOM killer victim was mariadb, which is maybe even more concerning than rabbitmq getting killed.

I'll raise priority, and see if DCops can help us replace this server.

aborrero added a parent task: Unknown Object (Task). Oct 18 2024, 11:07 AM
aborrero added a project: ops-codfw.
aborrero moved this task from Backlog to Blocked/waiting on the User-aborrero board.
aborrero added subscribers: Jhancock.wm, Papaul.

hey @Papaul and/or @Jhancock.wm per https://phabricator.wikimedia.org/T377568#10247882 you should be receiving memory expansion parts soon. Please upgrade the memory on this server to 128GB.

I got the memory in. Is it safe to proceed with the upgrade at this time? I didn't see if it got depooled already.

thanks @Jhancock.wm, the server can be upgraded anytime.

Jhancock.wm claimed this task.

Upgraded to 128GB. Powering up now. Please let us know if you need additional help!