
[ceph,codfw1dev] upgrade the hosts from pacific->quincy
Closed, ResolvedPublic

Description

In order to test the upgrade process and try to work around the bookworm+quincy issue, we want to try the process:

Event Timeline

Doing this upgrade, the mons crashed; the error they showed was about using an old mon store db format, specifically the one described in https://pve.proxmox.com/wiki/Ceph_Pacific_to_Quincy#Important_Release_Notes

The note in the upgrade docs is quite hidden (https://docs.ceph.com/en/latest/releases/quincy/#major-changes-from-pacific):

LevelDB support has been removed. WITH_LEVELDB is no longer a supported build option. Users should migrate their monitors and OSDs to RocksDB before upgrading to Quincy.
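A quick pre-upgrade check for this is reading each mon's kv_backend file, which records the store format; a sketch, assuming the default /var/lib/ceph layout:

```shell
# Print the key/value backend of every local mon store.
# Any mon still reporting "leveldb" must be migrated to rocksdb
# before upgrading to quincy.
cat /var/lib/ceph/mon/ceph-*/kv_backend
```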

I'm trying to figure out the current state; what I'm seeing is:

  • all nodes are on ceph quincy (v17)
  • all nodes are on debian bullseye (v11)

The cluster is not working, the mons are down:

  • cloudcephmon2004-dev: the mon service is up and running, though ceph commands are not working
  • cloudcephmon2005-dev: mon service down, mgr service down
  • cloudcephmon2006-dev: mon service down

From the logs, it seems that cloudcephmon2004-dev has been reset, and thinks it's the only mon:

mon.cloudcephmon2004-dev is new leader, mons cloudcephmon2004-dev in quorum (ranks 0)

So I'm guessing that @Andrew started rebuilding the cluster mon layer, I'll try to pick up from there.

Unfortunately, the CLI seems unable to connect (maybe auth?) and the ceph-mon tool fails to dump the mon store (no error, it just creates nothing).

I copied the 2004 mon store over to 2006 (just a cp of /var/lib/ceph/mon/ceph-cloudcephmon2004-dev), but it did not work (got "unable to read magic from mon data").

Looking now at making the client work on 2004 so I can do the mon dump from there.

dcaro triaged this task as High priority.Jul 24 2025, 10:42 AM

The client on 2004 keeps getting connection refused:

148677 connect(12, {sa_family=AF_INET, sin_port=htons(3300), sin_addr=inet_addr("10.192.20.19")}, 16) = -1 ECONNREFUSED (Connection refused)

And the mon process is actually not listening on port 3300 at all (only on the v1 port 6789 over tcp, and nothing over udp):

root@cloudcephmon2004-dev:~# ss -tlnp
State        Recv-Q       Send-Q             Local Address:Port             Peer Address:Port       Process                                           
LISTEN       0            512                 10.192.20.19:6789                  0.0.0.0:*           users:(("ceph-mon",pid=145688,fd=28))            
LISTEN       0            1024                10.192.20.19:9290                  0.0.0.0:*           users:(("prometheus-ipmi",pid=867,fd=3))         
LISTEN       0            5                   10.192.20.19:9710                  0.0.0.0:*           users:(("python3",pid=864,fd=3))                 
LISTEN       0            1024                10.192.20.19:9105                  0.0.0.0:*           users:(("prometheus-rsys",pid=35778,fd=5))       
LISTEN       0            128                      0.0.0.0:22                    0.0.0.0:*           users:(("sshd",pid=966,fd=3))                    
LISTEN       0            40                     127.0.0.1:25                    0.0.0.0:*           users:(("exim4",pid=1057,fd=4))                  
LISTEN       0            1024                10.192.20.19:4194                  0.0.0.0:*           users:(("cadvisor",pid=833,fd=10))               
LISTEN       0            5                   10.192.20.19:5666                  0.0.0.0:*           users:(("nrpe",pid=1045,fd=4))                   
LISTEN       0            1024                           *:9100                        *:*           users:(("prometheus-node",pid=874,fd=3))         
LISTEN       0            128                         [::]:22                       [::]:*           users:(("sshd",pid=966,fd=4))                    
LISTEN       0            40                         [::1]:25                       [::]:*           users:(("exim4",pid=1057,fd=5))                  
root@cloudcephmon2004-dev:~# ss -ulnp
State           Recv-Q            Send-Q                       Local Address:Port                       Peer Address:Port           Process

In eqiad it listens on tcp:

root@cloudcephmon1005:~# ss -tlnp | grep 3300
LISTEN 0      512     10.64.148.27:3300      0.0.0.0:*    users:(("ceph-mon",pid=1077,fd=35))

So the mon, even if up, is not starting correctly :/, looking into it.

Mentioned in SAL (#wikimedia-cloud) [2025-07-24T11:01:37Z] <dcaro> stopping all ceph osds in codfw1 to avoid spamming the mons (T400334)

You can still get some info on the running mon from the mon itself using the asok socket:

root@cloudcephmon2004-dev:~# ceph --admin-daemon /var/run/ceph/ceph-mon.cloudcephmon2004-dev.asok config show
....
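The admin socket also answers mon_status and quorum_status, which is handy when the CLI can't connect at all; a sketch against the same socket path:

```shell
# Ask the mon for its own view of its state and the quorum,
# bypassing the network and cephx entirely.
ceph --admin-daemon /var/run/ceph/ceph-mon.cloudcephmon2004-dev.asok mon_status
ceph --admin-daemon /var/run/ceph/ceph-mon.cloudcephmon2004-dev.asok quorum_status
```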

Then I was able to get the mon working by disabling cephx in the config and setting only the v1 endpoint:

[global]
  #auth cluster required = cephx
  #auth service required = cephx
  #auth client required = cephx
  auth cluster required = none
  auth service required = none
  auth client required = none

  fsid = 489c4187-17bc-44dc-9aeb-1d044c9bba9e

  public network = 10.192.20.0/24
  cluster network = 192.168.4.0/24
  log to syslog = true
  err to syslog = true

  mon initial members = cloudcephmon2004-dev
  #mon host = [v2:10.192.20.19:3300/0,v1:10.192.20.19:6789/0]
  mon host = 10.192.20.19

...

This allowed us to run ceph commands, create a new keyring for the mon, and add it to the mon itself:

root@cloudcephmon2004-dev:~# ceph-authtool --create-keyring /tmp/ceph.mon.keyring --gen-key -n mon. --cap mon 'allow *'

root@cloudcephmon2004-dev:~# cp /tmp/ceph.mon.keyring /var/lib/ceph/mon/ceph-cloudcephmon2004-dev/keyring


#restart mon service
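The restart itself would be along these lines (assuming the stock systemd unit naming, where the instance is the mon id):

```shell
# pick up the new keyring
systemctl restart ceph-mon@cloudcephmon2004-dev
```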

After that, puppet imported some of the keyrings, and I manually imported the admin keyring too:

root@cloudcephmon2004-dev:~# ceph auth import -i /etc/ceph/ceph.client.admin.keyring

Re-enabling cephx then brought the mon up with auth, which allowed me to enable msgr2:

root@cloudcephmon2004-dev:~# ceph mon enable-msgr2

And that brought v2 back, allowing the v1/v2 endpoints to be specified again in the config:

[global]
  auth cluster required = cephx
  auth service required = cephx
  auth client required = cephx
  #auth cluster required = none
  #auth service required = none
  #auth client required = none

  fsid = 489c4187-17bc-44dc-9aeb-1d044c9bba9e

  public network = 10.192.20.0/24
  cluster network = 192.168.4.0/24
  log to syslog = true
  err to syslog = true

  mon initial members = cloudcephmon2004-dev
  mon host = [v2:10.192.20.19:3300/0,v1:10.192.20.19:6789/0]
  #mon host = 10.192.20.19

...

And now the clients work from any other ceph node; I'll try to restore the config and see if the osds are able to join.
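To confirm both endpoints are advertised again, the monmap can be dumped; the mon's entry should show the same v2/v1 pair as the mon host line in the config:

```shell
# expect an addr like [v2:10.192.20.19:3300/0,v1:10.192.20.19:6789/0]
# for mon.cloudcephmon2004-dev
ceph mon dump
```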

With this, I added a few of the config values back:

root@cloudcephmon2004-dev:~# ceph config set mon auth_allow_insecure_global_id_reclaim false
root@cloudcephmon2004-dev:~# ceph config set mon mon_allow_pool_delete false
root@cloudcephmon2004-dev:~# ceph config set mon osd_heartbeat_use_min_delay_socket true
root@cloudcephmon2004-dev:~# ceph config set osd osd_heartbeat_use_min_delay_socket true
root@cloudcephmon2004-dev:~# ceph config set osd osd_snap_trim_sleep_ssd 1.0
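Each of those can be double-checked afterwards with `ceph config get`, e.g.:

```shell
ceph config get mon mon_allow_pool_delete
ceph config get osd osd_snap_trim_sleep_ssd
```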

Started the mgr on cloudcephmon2005-dev, having to import the key manually (puppet still disabled):

root@cloudcephmon2005-dev:~# ceph auth import -i /var/lib/ceph/mgr/ceph-cloudcephmon2005-dev/keyring

And it came up ok. Then I set the norebalance flag on the cluster and started all the osds:

root@cloudcephmon2004-dev:~# ceph osd set norebalance
... On each cloudcephosd2* node
root@cloudcephosd2004-dev:~# systemctl start ceph-osd.target

The cluster is now healthy and running \o/. I unset norebalance, and the cluster is shifting some data around:

root@cloudcephmon2004-dev:~# ceph osd unset norebalance

root@cloudcephmon2004-dev:~# ceph crash archive-all  # acking the crashes

root@cloudcephmon2004-dev:~# ceph status
  cluster:
    id:     489c4187-17bc-44dc-9aeb-1d044c9bba9e
    health: HEALTH_OK
 
  services:
    mon: 1 daemons, quorum cloudcephmon2004-dev (age 33m)
    mgr: cloudcephmon2005-dev(active, since 18m), standbys: cloudcephmon2004-dev
    osd: 32 osds: 32 up (since 11m), 32 in (since 11m)
 
  data:
    pools:   11 pools, 481 pgs
    objects: 158.98k objects, 1.0 TiB
    usage:   3.0 TiB used, 67 TiB / 70 TiB avail
    pgs:     481 active+clean
 
  io:
    client:   18 KiB/s rd, 62 KiB/s wr, 8 op/s rd, 8 op/s wr

I broke the cluster again, but now it's working. The main thing I did was run a version of the 'recovery using OSDs' script here: https://docs.ceph.com/en/reef/rados/troubleshooting/troubleshooting-mon/

I had to modify it quite a bit because it assumes that there are ssh keys set up connecting OSDs and mons. What I did instead was run a variant from my laptop. First:

set -x
ms=/home/andrew/mon-store-thursday
mkdir $ms

hosts="cloudcephosd2004-dev.codfw.wmnet cloudcephosd2005-dev.codfw.wmnet cloudcephosd2006-dev.codfw.wmnet cloudcephosd2007-dev.codfw.wmnet "
# collect the cluster map from stopped OSDs
for host in $hosts; do
  rsync -avz $ms/. $host:$ms.remote
  rm -rf $ms
  ssh $host <<EOF
    for osd in /var/lib/ceph/osd/ceph-*; do
      sudo ceph-objectstore-tool --data-path \$osd --no-mon-config --op update-mon-db --mon-store-path $ms.remote
      sudo chmod -R a+r $ms.remote
    done
EOF
  rsync -avz $host:$ms.remote/. $ms
done

(Before I could run that I also had to briefly "chown Andrew /var/lib/ceph/osd/" so that the enumeration would work as my user.)

That script only seemed to actually gather info from the first osd host, cloudcephosd2004-dev; I assume that's because the crushmap has things replicated across each node. Once the file was gathered up, I ran this on cloudcephmon2004-dev:

ms=/home/andrew/mon-store-thursday
set -x


ceph-monstore-tool $ms rebuild -- --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring

# make a backup of the corrupted store.db just in case!  repeat for
# all monitors.
mv /var/lib/ceph/mon/ceph-cloudcephmon2004-dev/store.db /var/lib/ceph/mon/ceph-cloudcephmon2004-dev/store.db.corrupted.thursday

# move rebuild store.db into place.  repeat for all monitors.
mv $ms/store.db /var/lib/ceph/mon/ceph-cloudcephmon2004-dev/store.db
chown -R ceph:ceph /var/lib/ceph/mon/ceph-cloudcephmon2004-dev/store.db
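One caveat with the rebuild step from those docs: the keyring handed to `ceph-monstore-tool ... rebuild` needs full caps for `mon.` and `client.admin`, which the docs suggest adding with ceph-authtool if missing; a sketch against the same keyring path as above:

```shell
# make sure the keyring used for the rebuild carries the needed caps
ceph-authtool /var/lib/ceph/bootstrap-osd/ceph.keyring -n mon. --cap mon 'allow *'
ceph-authtool /var/lib/ceph/bootstrap-osd/ceph.keyring -n client.admin \
  --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow *' --cap mgr 'allow *'
```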

That, plus David's instructions above (using only v1 and turning off cephx), got the mon up enough to follow the remaining steps.

On the other mons I used the 'Adding a Monitor (Manual)' steps from https://docs.ceph.com/en/reef/rados/operations/add-or-rm-mons/ -- that starts with a fresh database that's then populated from the existing mon.
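In shorthand, those manual steps look roughly like this for one of the other mons (hostname and temp paths are illustrative):

```shell
# fetch the current monmap and mon keyring from the working mon
ceph mon getmap -o /tmp/monmap
ceph auth get mon. -o /tmp/mon.keyring
# build a fresh, empty store for this mon from that monmap...
ceph-mon -i cloudcephmon2005-dev --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring
chown -R ceph:ceph /var/lib/ceph/mon/ceph-cloudcephmon2005-dev
# ...and start it; it syncs its database from the existing mon
systemctl start ceph-mon@cloudcephmon2005-dev
```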

Everything is now on Quincy + Reef.

...and now it's 100% bookworm/reef