
[ceph] Upgrade to v16
Closed, ResolvedPublic

Description

Quincy (v17) has been released as stable:
https://ceph.io/en/news/blog/2022/v17-2-0-quincy-released/

so we should move to Pacific first (v16, probably the latest point release by the time we move). The latest release when this task was created:
https://docs.ceph.com/en/quincy/releases/pacific/#v16-2-7-pacific

Event Timeline

dcaro triaged this task as High priority.
dcaro changed the task status from Open to In Progress. Jun 2 2022, 2:08 PM
dcaro moved this task from To refine to Doing on the User-dcaro board.
dcaro changed the task status from In Progress to Open. Aug 23 2022, 8:14 AM
dcaro moved this task from Doing to Refined on the User-dcaro board.
fnegri renamed this task from "ceph: upgrade to v16 now that v17 is stable" to "[ceph] Upgrade to v16". Jan 22 2024, 5:25 PM
fnegri added a project: Cloud-VPS.
Aklapper subscribed.

@dcaro: Removing task assignee as this open task has been assigned for more than two years - See the email sent to task assignee on October 11th.
Please assign this task to yourself again if you still realistically plan to work on this task - it would be welcome! :)
If this task has been resolved in the meantime, or should not be worked on by anybody ("declined"), please update its task status via "Add Action… 🡒 Change Status".
Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips on how to best manage your individual work in Phabricator. Thanks!

This one is still needed, yes, and we should push for it next quarter.

Change #1113489 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] ceph-pacific: add ceph-pacific packages to bullseye

https://gerrit.wikimedia.org/r/1113489

Change #1113489 merged by David Caro:

[operations/puppet@production] ceph-pacific: add ceph-pacific packages to bullseye

https://gerrit.wikimedia.org/r/1113489

Notes from a discussion today:

After everything is on Bullseye, we can upgrade to 16, then 17, and maybe 18. Then we can do OS upgrades again, to Bookworm.
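The multi-step plan above follows Ceph's rule that a cluster can skip at most one major release per upgrade (e.g. Octopus to Quincy is supported, Octopus straight to Reef is not). A minimal sketch of that constraint; the release numbers come from Ceph's release naming, the helper is illustrative:

```python
# Ceph supports upgrading across at most two major releases at a time,
# e.g. Octopus (15) -> Quincy (17) is fine; Octopus -> Reef (18) is not.
RELEASES = {15: "octopus", 16: "pacific", 17: "quincy", 18: "reef"}

def valid_upgrade_path(path):
    """Check that every hop in a planned upgrade path skips at most one release."""
    return all(0 < b - a <= 2 for a, b in zip(path, path[1:]))

# The plan from the discussion: 15 -> 16 -> 17 -> (maybe) 18.
print(valid_upgrade_path([15, 16, 17, 18]))  # True: each hop is one release
print(valid_upgrade_path([15, 18]))          # False: skips two releases
```

This is why the OS upgrade to Bookworm is interleaved with, rather than done before, the Ceph version hops.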

When it comes time to reimage for a new OS version, I need a faster reimage process. We can probably do reimages without a full drain/undrain if we fix partman to leave partitions in place, by reverting some or all of https://gerrit.wikimedia.org/r/c/operations/puppet/+/1075552

Change #1165878 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Include repo for ceph v16 'pacific' on cloudcephmon2004-dev

https://gerrit.wikimedia.org/r/1165878

Change #1165878 merged by Andrew Bogott:

[operations/puppet@production] Include repo for ceph v16 'pacific' on cloudcephmon2004-dev

https://gerrit.wikimedia.org/r/1165878

Change #1165940 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Prepare cloudcephmon nodes in codfw for ceph v16 'pacific'

https://gerrit.wikimedia.org/r/1165940

Change #1165941 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Prepare cloudcephosd nodes in codfw for ceph v16 'pacific'

https://gerrit.wikimedia.org/r/1165941

Change #1165940 merged by Andrew Bogott:

[operations/puppet@production] Prepare cloudcephmon nodes in codfw for ceph v16 'pacific'

https://gerrit.wikimedia.org/r/1165940

Change #1165941 merged by Andrew Bogott:

[operations/puppet@production] Prepare cloudcephosd nodes in codfw for ceph v16 'pacific'

https://gerrit.wikimedia.org/r/1165941

ceph codfw1 is now running 16.2.15 on all nodes. One of the mons is on Bookworm, the others are on Bullseye.

Change #1166852 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] partman_early_command: don't wipe out lvm for cloudcephosd nodes

https://gerrit.wikimedia.org/r/1166852

Change #1166852 merged by Andrew Bogott:

[operations/puppet@production] partman_early_command: don't wipe out lvm for cloudcephosd nodes

https://gerrit.wikimedia.org/r/1166852

Mentioned in SAL (#wikimedia-cloud-feed) [2025-07-09T15:04:01Z] <andrew@cloudcumin1001> START - Cookbook wmcs.ceph.upgrade_mons (T306820)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-07-09T18:01:00Z] <andrew@cloudcumin1001> START - Cookbook wmcs.ceph.upgrade_osds (T306820)
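The two cookbooks run in the order Ceph's upgrade documentation prescribes: monitors (and managers) first, then OSDs, with `noout` set so restarts don't trigger rebalancing. A rough, stubbed sketch of that sequencing only; the host names and helper are illustrative, not the actual wmcs.ceph.upgrade_* cookbook code:

```python
# Illustrative only: the real cookbooks live in operations/cookbooks.
# This just demonstrates the documented rolling-upgrade order.
commands = []  # record of what a run would execute

def run(cmd):
    commands.append(cmd)

def upgrade_cluster(mons, osds):
    run("ceph osd set noout")  # avoid rebalancing while daemons restart
    for host in mons:          # mons (and mgrs) are upgraded first
        run(f"upgrade + restart ceph-mon on {host}")
    for host in osds:          # then OSDs, one host at a time
        run(f"upgrade + restart ceph-osd on {host}")
    run("ceph osd unset noout")

upgrade_cluster(["cloudcephmon1004"], ["cloudcephosd1006", "cloudcephosd1007"])
```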

ceph eqiad11 is now running 16.2.15 on all nodes. One mon and one OSD are on bookworm (for science), all others are running Bullseye.


Looking at SAL, I think that at this time (16:45 UTC) there were actually 3 OSDs on bookworm, and one MON.

Then 3 more OSDs were reimaged later that day, bringing the total to 6 OSDs and 1 MON.

All times below are UTC.

2025-07-09

16:17 	<andrew@cumin1003> 	END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephmon1004.eqiad.wmnet with OS bookworm

2025-07-10

04:16 	<andrew@cumin1003> 	END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1006.eqiad.wmnet with OS bookworm
12:32 	<andrew@cumin1003> 	END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1007.eqiad.wmnet with OS bookworm
14:56 	<andrew@cumin1003> 	END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1008.eqiad.wmnet with OS bookworm

21:32 	<andrew@cumin2002> 	END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1035.eqiad.wmnet with OS bookworm
23:02 	<andrew@cumin2002> 	END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1036.eqiad.wmnet with OS bookworm

2025-07-11

00:47 	<andrew@cumin2002> 	END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1037.eqiad.wmnet with OS bookworm

This is done :) (thanks @Andrew!)

root@cloudcephosd1006:~# ceph versions
{
    "mon": {
        "ceph version 16.2.15 (618f440892089921c3e944a991122ddc44e60516) pacific (stable)": 3
    },
    "mgr": {
        "ceph version 16.2.15 (618f440892089921c3e944a991122ddc44e60516) pacific (stable)": 3
    },
    "osd": {
        "ceph version 16.2.15 (618f440892089921c3e944a991122ddc44e60516) pacific (stable)": 295
    },
    "mds": {},
    "rgw": {
        "ceph version 16.2.15 (618f440892089921c3e944a991122ddc44e60516) pacific (stable)": 3
    },
    "overall": {
        "ceph version 16.2.15 (618f440892089921c3e944a991122ddc44e60516) pacific (stable)": 304
    }
}
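A quick way to confirm the upgrade is complete everywhere is to check that `ceph versions` reports exactly one version string across all daemon types. A small sketch using a truncated copy of the JSON above (the build sha is elided):

```python
import json

# Truncated sample of the `ceph versions` output pasted above.
output = json.loads("""{
    "mon": {"ceph version 16.2.15 (...) pacific (stable)": 3},
    "mgr": {"ceph version 16.2.15 (...) pacific (stable)": 3},
    "osd": {"ceph version 16.2.15 (...) pacific (stable)": 295},
    "overall": {"ceph version 16.2.15 (...) pacific (stable)": 304}
}""")

# Collect every distinct version string any daemon type reports.
versions = {v for daemons in output.values() for v in daemons}
assert len(versions) == 1, f"mixed versions in cluster: {versions}"
print(versions.pop())  # the single cluster-wide version string
```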

The remaining work is upgrading the OS to Bookworm.