
[ceph] 2025-08-21 ceph issues bringing new osds up
Closed, Resolved · Public

Assigned To
Authored By
dcaro
Aug 21 2025, 8:57 AM
Referenced Files
F65811418: image.png
Aug 21 2025, 9:11 AM
F65811402: image.png
Aug 21 2025, 9:11 AM
F65811385: image.png
Aug 21 2025, 9:11 AM

Description

Second iteration of this: T403043: [ceph] 2025-08-27 ceph outage when bringing in a big osd host all at once (cloudcephosd1048)

From IRC:

andrewbogott> In case there are more after-effects later, here's what just happened:
05:55:21 - I added a new ceph node, cloudcephosd1042, with the cookbook. This went haywire because (due to a race condition in puppet) 1042 was running an old version of ceph, v14 (most of the cluster is running v16)
05:55:21 - Somehow when the v14 client tried to talk to the one v17 client (on cloudcephosd1004) it caused the nodes on cloudcephosd1004 to crash
05:55:21 - Again, 'somehow' that crash didn't just cause a rebalance, but caused a bunch of pgs to go read-only. Very weird behavior for only one node going down, but it happened.
05:55:21 - That meant that for a few minutes ceph misbehaved badly enough that some VMs froze, and a lot of toolforge jobs flapped
05:55:22 - As soon as I switched off 1042 and 1004, everything got better
05:55:22 - I restarted some unhappy nfs worker nodes just in case (although I suspect they would've recovered on their own anyway)

This task is to investigate the current status of the cluster and possible next steps.

Current status:

  • cloudcephosd1004:
root@cloudcephmon1004:~# ceph osd df cloudcephosd1004
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
64    ssd  1.74657   1.00000  1.7 TiB  927 GiB  922 GiB   60 MiB  4.3 GiB  862 GiB  51.80  1.03   79      up
65    ssd  1.74657   1.00000  1.7 TiB  932 GiB  928 GiB   57 MiB  3.9 GiB  857 GiB  52.09  1.04   79      up
66    ssd  1.74657         0      0 B      0 B      0 B      0 B      0 B      0 B      0     0    0    down
67    ssd  1.74657   1.00000  1.7 TiB  961 GiB  957 GiB   56 MiB  4.1 GiB  827 GiB  53.74  1.07   82      up
68    ssd  1.74657   1.00000  1.7 TiB  757 GiB  753 GiB   53 MiB  3.7 GiB  1.0 TiB  42.32  0.84   28      up
69    ssd  1.74657   1.00000  1.7 TiB  819 GiB  815 GiB   55 MiB  3.4 GiB  970 GiB  45.77  0.91   13      up
70    ssd  1.74657   1.00000  1.7 TiB  937 GiB  933 GiB   59 MiB  4.0 GiB  852 GiB  52.38  1.04   81      up
71    ssd  1.74657   1.00000  1.7 TiB  962 GiB  958 GiB   59 MiB  3.9 GiB  826 GiB  53.79  1.07   86      up
                       TOTAL   12 TiB  6.1 TiB  6.1 TiB  398 MiB   27 GiB  6.1 TiB  50.27                   
MIN/MAX VAR: 0.84/1.07  STDDEV: 4.10
root@cloudcephosd1004:~# systemctl status ceph-osd@*  | grep -A 6 '^. *ceph-osd@' 
● ceph-osd@69.service - Ceph object storage daemon osd.69
     Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: enabled)
     Active: active (running) since Thu 2025-08-21 10:36:56 UTC; 3min 20s ago
    Process: 190780 ExecStartPre=/usr/libexec/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id 69 (code=exited, status=0/SUCCESS)
   Main PID: 190784 (ceph-osd)
      Tasks: 76
     Memory: 4.3G
--
● ceph-osd@64.service - Ceph object storage daemon osd.64
     Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: enabled)
     Active: active (running) since Thu 2025-08-21 03:42:18 UTC; 6h ago
   Main PID: 13602 (ceph-osd)
      Tasks: 76
     Memory: 7.1G
        CPU: 54min 42.002s
--
● ceph-osd@65.service - Ceph object storage daemon osd.65
     Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: enabled)
     Active: active (running) since Thu 2025-08-21 08:41:10 UTC; 1h 59min ago
   Main PID: 142238 (ceph-osd)
      Tasks: 76
     Memory: 7.5G
        CPU: 24min 42.939s
--
● ceph-osd@70.service - Ceph object storage daemon osd.70
     Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: enabled)
     Active: active (running) since Thu 2025-08-21 03:42:18 UTC; 6h ago
   Main PID: 13562 (ceph-osd)
      Tasks: 76
     Memory: 8.3G
        CPU: 54min 57.624s
--
● ceph-osd@66.service - Ceph object storage daemon osd.66
     Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: enabled)
     Active: failed (Result: signal) since Thu 2025-08-21 09:57:00 UTC; 43min ago
    Process: 174549 ExecStartPre=/usr/libexec/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id 66 (code=exited, status=0/SUCCESS)
    Process: 174555 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id 66 --setuser ceph --setgroup ceph (code=killed, signal=ABRT)
   Main PID: 174555 (code=killed, signal=ABRT)
        CPU: 1min 233ms
--
● ceph-osd@67.service - Ceph object storage daemon osd.67
     Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: enabled)
     Active: active (running) since Thu 2025-08-21 10:11:31 UTC; 28min ago
    Process: 180510 ExecStartPre=/usr/libexec/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id 67 (code=exited, status=0/SUCCESS)
   Main PID: 180516 (ceph-osd)
      Tasks: 76
     Memory: 6.9G
--
● ceph-osd@71.service - Ceph object storage daemon osd.71
     Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: enabled)
     Active: active (running) since Thu 2025-08-21 03:42:18 UTC; 6h ago
   Main PID: 13541 (ceph-osd)
      Tasks: 76
     Memory: 6.8G
        CPU: 44min 1.316s
--
● ceph-osd@68.service - Ceph object storage daemon osd.68
     Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: enabled)
     Active: active (running) since Thu 2025-08-21 10:33:43 UTC; 6min ago
    Process: 189234 ExecStartPre=/usr/libexec/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id 68 (code=exited, status=0/SUCCESS)
   Main PID: 189239 (ceph-osd)
      Tasks: 76
     Memory: 6.5G
  • cloudcephosd1042, not in the cluster
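To spot down OSDs quickly in the `ceph osd df` output above, the table can be filtered on the STATUS column (the last field of each data row). A minimal sketch, assuming the column layout shown above; the sample input is trimmed from that table:

```shell
# List the IDs of OSDs that `ceph osd df <host>` reports as down.
# Reads the table on stdin; skips the header, matches STATUS == "down".
list_down_osds() {
    awk 'NR > 1 && $NF == "down" { print $1 }'
}

# Illustrative input (columns trimmed from the table above):
printf '%s\n' \
    'ID  CLASS  WEIGHT   STATUS' \
    '64  ssd    1.74657  up' \
    '66  ssd    1.74657  down' \
    '67  ssd    1.74657  up' | list_down_osds
# prints: 66
```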

Open questions

  • Was sal down during the incident?
  • Why are osds on cloudcephosd1004 still failing to come up?
  • Why do cloudcephosd1044/47/43 have ping + jumbo-frame loss?
  • Why did the addition of the v14 node cause the whole cluster to halt?
  • <add more>

Possible improvements

  • Track the installed ceph version in hiera/cookbook so it will refuse to bring up an osd with the wrong one
  • Improve puppet so it installs the correct ceph version from the repos, instead of just making sure some version is installed (as puppet runs before the component repo is set up, it installs an older version)
  • <add more>
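The first improvement above could take the shape of a pre-flight check in the cookbook: compare the version the node would run against an expected version before pooling it. A minimal sketch using a plain `sort -V` comparison; the `EXPECTED` value and where it comes from (hiera/the cookbook) are assumptions, hardcoded here for illustration:

```shell
# Refuse to bring up an OSD whose installed ceph version is older than expected.
# usage: version_at_least ACTUAL MINIMUM
version_at_least() {
    # sort the two versions; if MINIMUM sorts first (or equal), ACTUAL is new enough
    [ "$(printf '%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

EXPECTED="16.2.0"    # hypothetical: would come from hiera/the cookbook
INSTALLED="14.2.21"  # e.g. from: dpkg-query -W -f '${Version}' ceph-osd

if version_at_least "$INSTALLED" "$EXPECTED"; then
    echo "ok: ceph $INSTALLED >= $EXPECTED"
else
    echo "refusing: ceph $INSTALLED is older than $EXPECTED" >&2
fi
```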

Event Timeline

dcaro added a subscriber: Andrew.

Looking at the Grafana dashboards, I noticed that there's a relatively high loss of jumbo frames:

image.png (1×3 px, 1 MB)

It's not new, but it's worth looking into. The nodes losing the most packets are cloudcephosd1043, cloudcephosd1044 and cloudcephosd1047; those nodes are not yet added to the cluster, so that's kind of OK (will have to check for any misconfiguration before adding them):

image.png (1×3 px, 715 KB)

Filtering those out, and focusing on cloudcephosd1004, it seems it's also having issues since the event:

image.png (1×3 px, 840 KB)

Besides the ping loss towards 1004, the current sources of lost pings are cloudcephosd1043/44/47; they are not yet in the cluster, so that should not be an issue.

As for the loss towards 1004 at ~03:00 UTC, I'm guessing it was a reboot, though there's nothing on SAL (https://sal.toolforge.org/admin), which might have been down during the incident.

Insisting a bit on starting ceph-osd@65 seemed to get it up and running; maybe there's some "start timeout"?
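That kind of "insisting" can be scripted as a bounded retry loop; a sketch, where the attempt count and delay are arbitrary choices, not values from any cookbook:

```shell
# Retry a command a few times before giving up
# (e.g. `retry_start systemctl start ceph-osd@65`).
retry_start() {
    local attempt
    for attempt in 1 2 3 4 5; do
        if "$@"; then
            echo "succeeded on attempt $attempt"
            return 0
        fi
        sleep "${RETRY_DELAY:-10}"
    done
    echo "giving up after 5 attempts" >&2
    return 1
}
```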

Mentioned in SAL (#wikimedia-cloud) [2025-08-21T09:44:17Z] <dcaro> starting ceph-osd@66 on cloudcephosd1004 (T402499)

The first try to start ceph-osd@66 failed with the same error:

Aug 21 09:44:47 cloudcephosd1004 ceph-osd[168404]: ceph-osd: ./src/osd/PeeringState.cc:1255: bool PeeringState::check_prior_readable_down_osds(const OSDMapRef&): Assertion `HAVE_FEATURE(upacting_features, SERVER_OCTOPUS)' failed.

Hmm... I wonder if it has some cached record of cloudcephosd1042 running the old v14 version, and when running check_prior_readable_down_osds it finds that old version and crashes? (and at some point that cache gets cleared and the osd then starts)

I have tried extending the systemd unit start timeout to 5 min to see if that helps, though I don't think it's even reaching the 1m30s default :fingerscrossed:
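For reference, that timeout bump corresponds to a drop-in like the following (created via `systemctl edit ceph-osd@66`, followed by `systemctl daemon-reload`); the 5min value is just what was tried here:

```ini
# /etc/systemd/system/ceph-osd@66.service.d/override.conf
[Service]
TimeoutStartSec=5min
```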

Still failing; the time it takes before crashing: Aug 21 09:54:26 cloudcephosd1004 systemd[1]: ceph-osd@66.service: Consumed 57.214s CPU time.

Hmm.... before crashing, it starts checking old peers:

Aug 21 09:55:36 cloudcephosd1004 ceph-osd[173450]: 2025-08-21T09:55:36.717+0000 7fcf28c85700  1 osd.66 pg_epoch: 72700590 pg[8.86b( v 72700589'93196338 (72700063'93192576,72700589'93196338] local-lis/les=72700588/72700589 n=4992 ec=25427536/271059 lis/c=72700588/72700588 les/c/f=72700589/72700589/0 sis=72700590) [66,161,319]/[66,319,117] r=0 lpr=72700590 pi=[72700588,72700590)/1 crt=72700589'93196338 lcod 0'0 mlcod 0'0 remapped mbc={}] start_peering_interval up [66,319] -> [66,161,319], acting [66,319,117] -> [66,319,117], acting_primary 66 -> 66, up_primary 66 -> 66, role 0 -> 0, features acting 4540138312169291775 upacting 4540138290693341183

There, osd 161 does not exist anymore:

root@cloudcephosd1004:~# ceph osd find 161
Error ENOENT: osd.161 does not exist

According to the cluster though, that pg has already moved somewhere else (osds 64, 117 and 319):

root@cloudcephmon1004:~# ceph pg dump pgs  | grep 8.86b
8.86b       5032                   0         0          0        0  19693973504          181          11  3794      3794                 active+clean  2025-08-21T09:16:52.831421+0000    72702991'93202534   72702991:414307841       [64,117,319]          64       [64,117,319]              64    72698525'93173026  2025-08-20T07:04:13.997884+0000    72691922'93105535  2025-08-17T11:34:04.114144+0000              0
dumped pgs
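As an aside, the mapping for a single PG can also be queried directly with `ceph pg map 8.86b` instead of grepping the full dump, and the bracketed up/acting OSD sets can be pulled out of a dump line with a small filter. A sketch; the sample line is abbreviated from the output above:

```shell
# Extract the [up]/[acting] OSD sets from a `ceph pg dump pgs` line,
# deduplicated (here up == acting, so one set comes out).
extract_osd_sets() {
    grep -o '\[[0-9][0-9,]*\]' | sort -u
}

echo '8.86b 5032 ... active+clean ... [64,117,319] 64 [64,117,319] 64' \
    | extract_osd_sets
# prints: [64,117,319]
```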

Mentioned in SAL (#wikimedia-cloud) [2025-08-21T10:32:55Z] <dcaro> starting ceph-osd@68 on cloudcephosd1004 (T402499)

Mentioned in SAL (#wikimedia-cloud) [2025-08-21T10:35:54Z] <dcaro> starting ceph-osd@69 on cloudcephosd1004 (T402499)

ceph-osd@69 came up OK too; only 66 is left down.

dcaro triaged this task as High priority. Aug 21 2025, 12:16 PM

Mentioned in SAL (#wikimedia-cloud-feed) [2025-08-21T12:17:00Z] <dcaro@cloudcumin1001> START - Cookbook wmcs.ceph.osd.depool_and_destroy (T402499)

Mentioned in SAL (#wikimedia-cloud) [2025-08-21T12:18:25Z] <dcaro> destroying osd 66 on cloudcephosd1004, will recreate (T402499)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-08-21T12:50:15Z] <dcaro@cloudcumin1001> END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=99) (T402499)

Manually zapping /dev/sdb on cloudcephosd1004, as the depool_and_destroy cookbook did not do it (see T402515: [cookbook,ceph] depool_and_destroy ceph cookbook failed to destroy a single osd):

root@cloudcephosd1004:~# ls -la /var/lib/ceph/osd/ceph-66/block 
lrwxrwxrwx 1 ceph ceph 93 Aug 21 03:30 /var/lib/ceph/osd/ceph-66/block -> /dev/ceph-62e49003-b3e0-4ecb-acbc-b82348164434/osd-block-06f40a8e-5b3c-4478-af57-739e819bddee

root@cloudcephosd1004:~# lsblk | grep -B 1 739e819bddee
sdb                                                                                                     8:16   0   1.7T  0 disk  
└─ceph--62e49003--b3e0--4ecb--acbc--b82348164434-osd--block--06f40a8e--5b3c--4478--af57--739e819bddee 253:5    0   1.7T  0 lvm   

root@cloudcephosd1004:~# ceph-volume lvm zap --destroy /dev/sdb
--> Zapping: /dev/sdb
--> Zapping lvm member /dev/sdb. lv_path is /dev/ceph-62e49003-b3e0-4ecb-acbc-b82348164434/osd-block-06f40a8e-5b3c-4478-af57-739e819bddee
Running command: /usr/bin/dd if=/dev/zero of=/dev/ceph-62e49003-b3e0-4ecb-acbc-b82348164434/osd-block-06f40a8e-5b3c-4478-af57-739e819bddee bs=1M count=10 conv=fsync
 stderr: 10+0 records in
10+0 records out
 stderr: 10485760 bytes (10 MB, 10 MiB) copied, 0.0315374 s, 332 MB/s
--> Only 1 LV left in VG, will proceed to destroy volume group ceph-62e49003-b3e0-4ecb-acbc-b82348164434
Running command: vgremove -v -f ceph-62e49003-b3e0-4ecb-acbc-b82348164434
 stderr: Removing ceph--62e49003--b3e0--4ecb--acbc--b82348164434-osd--block--06f40a8e--5b3c--4478--af57--739e819bddee (253:5)
 stderr: Archiving volume group "ceph-62e49003-b3e0-4ecb-acbc-b82348164434" metadata (seqno 6).
 stderr: Releasing logical volume "osd-block-06f40a8e-5b3c-4478-af57-739e819bddee"
 stderr: Creating volume group backup "/etc/lvm/backup/ceph-62e49003-b3e0-4ecb-acbc-b82348164434" (seqno 7).
 stdout: Logical volume "osd-block-06f40a8e-5b3c-4478-af57-739e819bddee" successfully removed
 stdout: Volume group "ceph-62e49003-b3e0-4ecb-acbc-b82348164434" successfully removed
 stderr: Removing physical volume "/dev/sdb" from volume group "ceph-62e49003-b3e0-4ecb-acbc-b82348164434"
Running command: pvremove -v -f -f /dev/sdb
 stdout: Labels on physical volume "/dev/sdb" successfully wiped.
Running command: /usr/bin/dd if=/dev/zero of=/dev/sdb bs=1M count=10 conv=fsync
 stderr: 10+0 records in
10+0 records out
 stderr: 10485760 bytes (10 MB, 10 MiB) copied, 0.0300213 s, 349 MB/s
--> Zapping successful for: <Raw Device: /dev/sdb>

Will add it anew.

Mentioned in SAL (#wikimedia-cloud-feed) [2025-08-21T12:57:52Z] <dcaro@cloudcumin1001> START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T402499)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-08-21T12:58:29Z] <dcaro@cloudcumin1001> END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99) (T402499)