
Cloud Ceph misbehaving on Debian Bookworm
Closed, Resolved · Public

Description

While working on T306820: [ceph] Upgrade to v16, we observed Ceph misbehaving on the hosts that had been upgraded to Debian Bookworm.

This task is to investigate the issue that caused Ceph to misbehave on the upgraded hosts.

Some graphs from the incident doc:

CPU usage on the affected hosts is high. This shows up both in the percentiles graph and in the per-host graphs:

image5.png

image8.png

Memory usage is also high on the affected hosts, explaining the swap usage and the md resync (which happens on first access). See the diagnostic sketch after these graphs.

image1.png

The number of running processes looks anomalous on the affected hosts:

image2.png

Disk utilization shows a similar pattern:

image4.png
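For reference, the host-level signals shown in these graphs can be cross-checked directly on an affected host with standard tools; a minimal sketch (generic commands, nothing here is specific to this incident):

```
# Memory and swap pressure
free -h

# Software-RAID state; an md resync in progress shows up here
cat /proc/mdstat

# Top memory consumers, to see whether the OSD daemons are the ones growing
ps -eo pid,rss,comm --sort=-rss | head -n 15

# Per-device utilization, matching the disk-utilization panels
iostat -x 5 3
```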

Details

Related Objects

Event Timeline

Restricted Application added a subscriber: Aklapper.

cloudcephosd1006 was reimaged again to bookworm on 2025-07-16 at 18:21 UTC. The disk utilization graph shows a similar dip:

Screenshot 2025-07-17 at 17.32.35.png

Memory utilization on cloudcephosd1006 is increasing and it looks like it might crash the server in a few hours:

Screenshot 2025-07-17 at 17.48.44.png

This is how it looks on cloudcephosd1007:

Screenshot 2025-07-17 at 17.50.09.png

fnegri changed the task status from Open to In Progress. · Jul 18 2025, 8:16 AM
fnegri claimed this task.

Memory usage on cloudcephosd1006 did reset at 18:00 UTC yesterday (I think it was due to @dcaro tweaking a setting on the host). Now it's growing again... I'll keep an eye on it today.

Screenshot 2025-07-18 at 10.15.36.png
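The setting that was tweaked isn't recorded in this comment. One plausible knob for this kind of growth is the OSD memory target, which caps how much memory the BlueStore caches are allowed to use; a hedged sketch of how that would be inspected and lowered (the 4 GiB value is purely illustrative, not what was actually applied here):

```
# Current cluster-wide memory target for the OSDs
ceph config get osd osd_memory_target

# Lower it, e.g. to 4 GiB, so the OSDs start trimming their caches sooner
ceph config set osd osd_memory_target 4294967296

# On the OSD host: where the memory of a given daemon actually sits
ceph daemon osd.0 dump_mempools
```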

I'm claiming this task temporarily as @dcaro is out today.

The growth rate is slowing, but it's not flatlining as I hoped... So the server is likely to crash in a few hours.

Screenshot 2025-07-18 at 16.50.52.png

Re-assigning to @Andrew as I will log off shortly.

cloudcephosd1006 alerted overnight; I'm going to reboot it so that we get another 36 or so hours of good behavior.

Mentioned in SAL (#wikimedia-cloud) [2025-07-20T10:44:19Z] <andrewbogott> rebooting cloudcephosd1006 to give us another few days before the memory runs out. T399858

Mentioned in SAL (#wikimedia-cloud-feed) [2025-07-21T13:59:08Z] <andrew@cloudcumin1001> START - Cookbook wmcs.ceph.osd.depool_and_destroy (T399858)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-07-21T15:04:02Z] <andrew@cloudcumin1001> END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=99) (T399858)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-07-21T15:07:36Z] <andrew@cloudcumin1001> START - Cookbook wmcs.ceph.osd.depool_and_destroy (T399858)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-07-21T15:07:42Z] <andrew@cloudcumin1001> END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=99) (T399858)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-07-21T15:51:03Z] <andrew@cloudcumin1001> START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T399858)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-07-21T15:51:10Z] <andrew@cloudcumin1001> END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99) (T399858)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-07-21T15:53:08Z] <andrew@cloudcumin1001> START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T399858)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-07-21T15:53:14Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=0) (T399858)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-07-21T15:56:34Z] <andrew@cloudcumin1001> START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T399858)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-07-21T19:44:22Z] <andrew@cloudcumin1001> END (ERROR) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=97) (T399858)
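For context, these SAL entries correspond to Spicerack cookbook runs launched from cloudcumin1001; the exact arguments used are not reproduced in the log, so only the generic invocation is sketched here:

```
# On the cumin host; check each cookbook's own help for its required arguments
sudo cookbook wmcs.ceph.osd.depool_and_destroy --help
sudo cookbook wmcs.ceph.osd.bootstrap_and_add --help
```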

Cookbook cookbooks.sre.hosts.reimage was started by dcaro@cumin1003 for host cloudcephosd1006.eqiad.wmnet with OS bullseye

cloudcephosd1006 was reimaged again on 2025-07-21, but this time without keeping the data.

This made it consume all the memory even faster:

Screenshot 2025-07-22 at 10.46.51.png
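This is consistent with the OSDs coming back empty and having to backfill their placement groups from peers, which adds recovery work on top of the already-growing baseline (an inference, not something confirmed in this task). Backfill activity is visible with the standard status commands:

```
# Recovery/backfill throughput shows up in the io section, degraded PGs in the pgs section
ceph -s

# Lists the specific PGs that are backfilling or waiting to backfill
ceph health detail
```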

Cookbook cookbooks.sre.hosts.reimage started by dcaro@cumin1003 for host cloudcephosd1006.eqiad.wmnet with OS bullseye completed:

  • cloudcephosd1006 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202507220855_dcaro_2825422_cloudcephosd1006.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1178598 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] ceph codfw1dev: revert back to pacific

https://gerrit.wikimedia.org/r/1178598

Change #1178598 merged by Andrew Bogott:

[operations/puppet@production] ceph codfw1dev: revert back to pacific

https://gerrit.wikimedia.org/r/1178598
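Once the revert to Pacific was rolled out in codfw1dev, a simple way to confirm what the hosts ended up running (generic checks; the actual package pinning lives in the puppet patch above):

```
# Installed and candidate package versions after the revert
apt-cache policy ceph-osd

# Version of the locally installed ceph binaries
ceph --version
```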

I am now upgrading the cluster to Bookworm + Reef (18.x) and that seems to bypass this issue.
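For the Reef upgrade, the usual cluster-wide sanity checks would be (standard Ceph commands, not specific to this task):

```
# Every daemon should report an 18.x (Reef) version once the upgrade is complete
ceph versions

# Any remaining warnings are listed here
ceph health detail

# Typically run once, after all OSDs are on Reef
ceph osd require-osd-release reef
```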