
db2146 memory warning
Closed, Resolved, Public

Description

On icinga:

WARN Memory 90% used. Largest process: mysqld (1274) = 89.7%
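For anyone who wants to pull the numbers out of an alert line like the one above, here is a minimal sketch. It assumes the alert text always follows the exact shape quoted in this task. The function name `parse_memory_alert` and the dictionary keys are hypothetical and are not part of Icinga or any Wikimedia tooling.

```python
import re

# Regex for the alert shape quoted in this task (an assumption: other
# checks or Icinga versions may format the message differently).
ALERT_RE = re.compile(
    r"^(?P<state>\w+) Memory (?P<used>\d+(?:\.\d+)?)% used\. "
    r"Largest process: (?P<name>\S+) \((?P<pid>\d+)\) = (?P<share>\d+(?:\.\d+)?)%$"
)


def parse_memory_alert(line):
    """Return the alert fields as a dict, or None if the line does not match."""
    match = ALERT_RE.match(line.strip())
    if match is None:
        return None
    return {
        "state": match.group("state"),
        "used_pct": float(match.group("used")),
        "process": match.group("name"),
        "pid": int(match.group("pid")),
        "process_pct": float(match.group("share")),
    }


alert = parse_memory_alert(
    "WARN Memory 90% used. Largest process: mysqld (1274) = 89.7%"
)
```

On this alert, the parsed fields make it plain that mysqld alone accounts for nearly all of the used memory.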

Event Timeline

Marostegui triaged this task as Medium priority. Thu, Nov 9, 8:29 PM
Marostegui moved this task from Triage to In progress on the DBA board.

Mentioned in SAL (#wikimedia-operations) [2023-11-09T20:54:47Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db2146 T350916', diff saved to https://phabricator.wikimedia.org/P53248 and previous config saved to /var/cache/conftool/dbconfig/20231109-205445-root.json

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host db2146.codfw.wmnet with OS bookworm

I have not found an obvious reason why this host raised this warning. The last 30 days of memory usage show no obvious increase, so I am going to go ahead and upgrade it to bookworm and a newer MariaDB version.
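As a rough illustration of what "no obvious increase" over a 30-day window can mean, here is a minimal sketch that compares the average of the earlier half of a series with the average of the later half. The function name `shows_increase`, the 5-point threshold, and the sample series are all assumptions made up for this example. They are not the actual check or data used for this host.

```python
from statistics import mean


def shows_increase(daily_used_pct, threshold_pct=5.0):
    """Return True if the later half of the series averages more than
    threshold_pct percentage points above the earlier half."""
    if len(daily_used_pct) < 2:
        return False  # too little data to compare two halves
    half = len(daily_used_pct) // 2
    earlier = mean(daily_used_pct[:half])
    recent = mean(daily_used_pct[half:])
    return recent - earlier > threshold_pct


# Made-up series for illustration only, not measurements from db2146.
flat_series = [89.0] * 30
rising_series = [60.0 + day for day in range(30)]

flat_result = shows_increase(flat_series)
rising_result = shows_increase(rising_series)
```

A steadily high but flat series is not flagged by this comparison, which matches the situation described above: the host was near the warning threshold without any recent growth.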

Change 973236 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db2146: Disable notifications

https://gerrit.wikimedia.org/r/973236

Change 973236 merged by Marostegui:

[operations/puppet@production] db2146: Disable notifications

https://gerrit.wikimedia.org/r/973236

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db2146.codfw.wmnet with OS bookworm completed:

  • db2146 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311092116_marostegui_1848067_db2146.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

I am going to start repooling this host.

Host back to production.