
Decision request - Upgrade cadence for Ceph
Closed, ResolvedPublic

Description

Problem

Currently we have no policy on Ceph upgrades, which makes it hard to find time for them.
Current Ceph releases have a lifespan of a bit more than 2 years, and there's a new release every year (see https://docs.ceph.com/en/latest/releases/index.html).

We currently get our "unofficial" packages from https://mirror.croit.io/, as there were some issues with the upstream ones (https://tracker.ceph.com/issues/53411). Lately, upstream download.ceph.com seems to have caught up on building the packages, so we should use those.

So two things have to be decided here: how frequent the upgrades should be, and which version to upgrade to each time.

Note that I'm considering only N.2.* versions, as the others are only for development or test clusters. From the docs:

x.0.z - development versions
x.1.z - release candidates (for test clusters, brave users)
x.2.z - stable/bugfix releases (for users)
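As a rough illustration of the numbering scheme above, the middle component of a version string can be used to tell the release kinds apart. This is only a sketch: the function names `classify` and `is_stable` are made up for this example and are not part of any Ceph tooling.

```python
import re

# Middle version component -> release kind, following the scheme quoted above.
KINDS = {0: "development", 1: "release candidate", 2: "stable"}


def classify(version: str) -> str:
    """Return the release kind for an 'x.y.z' version string (hypothetical helper)."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"not an x.y.z version: {version!r}")
    return KINDS.get(int(match.group(2)), "unknown")


def is_stable(version: str) -> bool:
    """True only for x.2.z versions, the ones considered in this task."""
    return classify(version) == "stable"
```

With this, `is_stable("17.2.0")` holds, while `18.1.0` and `19.0.0` are filtered out as a release candidate and a development version respectively.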

Constraints and risks

  • We risk running an unsupported version of Ceph, not getting any new bugfixes or security patches.
  • Debian packages lag far behind upstream, so we might consider using other sources for them

Decision record

Decided for option 3

https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/Decision_record_T325223_Upgrade_cadence_for_Ceph

Options

Option 1

Do nothing

Pros:

  • No changes to the current workflow

Cons:

  • This means upgrades will be done "whenever we find some time", which is usually when a security patch or blocking bug comes around.
  • Potential EOL (end of life) versions
  • 3rd party repository

Option 2

Frequency: once a year
Version to upgrade to: (N-1).2.*

For example, if we have 16.2.15, and there is a new 18.2.0, we upgrade to 17.2.*, otherwise we upgrade to the latest 16.2.*
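The (N-1).2.* rule can be sketched as a small selection function. This is only one reading of the example above, not an agreed implementation: the names `parse` and `target_n_minus_1` are hypothetical, and the list of available versions is assumed to be supplied by hand.

```python
def parse(version: str) -> tuple[int, ...]:
    """Turn '16.2.15' into (16, 2, 15) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def target_n_minus_1(current: str, available: list[str]) -> str:
    """Pick the latest stable release of the major one behind the newest stable major."""
    cur = parse(current)
    # Only x.2.z releases are considered stable.
    stable = [parse(v) for v in available if parse(v)[1] == 2]
    newest_major = max(v[0] for v in stable + [cur])
    # Stay one major behind the newest, but never go below what we already run.
    target_major = max(newest_major - 1, cur[0])
    candidates = [v for v in stable if v[0] == target_major and v >= cur]
    best = max(candidates, default=cur)
    return ".".join(str(part) for part in best)
```

For instance, with `16.2.15` installed and `17.2.0`, `17.2.7` and `18.2.0` available, this picks `17.2.7`; if `17.2.0` is the newest stable release, it stays on the latest `16.2.*`.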

Pros:

  • We get a very stable, world-tested version of Ceph

Cons:

  • We might spend some months running an EOL version
  • We don't get fixes for a whole year
  • We have to allocate time for it once a year (happy path: 1 week of work; challenging path: 1 month of work)

Option 3

Frequency: once a year
Version to upgrade to: N.2.*

For example, if we have 16.2.15, and there is a new 17.2.0, we upgrade to 17.2.*, otherwise we upgrade to the latest 16.2.*
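The N.2.* rule amounts to "take the newest stable release that is not older than what we run". A minimal sketch, assuming the available versions are listed by hand; `parse` and `target_latest_stable` are hypothetical names used only for this example.

```python
def parse(version: str) -> tuple[int, ...]:
    """Turn '16.2.15' into (16, 2, 15) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def target_latest_stable(current: str, available: list[str]) -> str:
    """Pick the newest x.2.z release at or above the current version."""
    cur = parse(current)
    # Development (x.0.z) and release-candidate (x.1.z) versions are skipped.
    stable = [parse(v) for v in available if parse(v)[1] == 2]
    best = max((v for v in stable if v >= cur), default=cur)
    return ".".join(str(part) for part in best)
```

With `16.2.15` installed and `17.2.0`, `17.2.5` and `18.1.0` available, this returns `17.2.5`; the `18.1.0` release candidate is ignored.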

Pros:

  • We get a very stable, world-tested version of Ceph
  • We don't get periods running an EOL version

Cons:

  • We don't get fixes for a whole year
  • We have to allocate time for it once a year (happy path: 1 week of work; challenging path: 1 month of work)

Option 4

Frequency: every 6 months
Version to upgrade to: (N-1).2.*

For example, if we have 16.2.15, and there is a new 18.2.0, we upgrade to 17.2.*, otherwise we upgrade to the latest 16.2.*

Pros:

  • We get a very stable, world-tested version of Ceph

Cons:

  • We might spend some months running an EOL version
  • We have to allocate time for it twice a year (happy path: 1 week of work; challenging path: 1 month of work)

Option 5

Frequency: every 6 months
Version to upgrade to: N.2.*

For example, if we have 16.2.15, and there is a new 17.2.0, we upgrade to 17.2.*, otherwise we upgrade to the latest 16.2.*

Pros:

  • We get a very stable, world-tested version of Ceph
  • We don't get periods running an EOL version

Cons:

  • We have to allocate time for it twice a year (happy path: 1 week of work; challenging path: 1 month of work)

Event Timeline

dcaro renamed this task from Decision request template - Upgrade cadence for Ceph to Decision request - Upgrade cadence for Ceph. Dec 14 2022, 6:11 PM
dcaro updated the task description.
dcaro updated the task description.
dcaro changed the task status from Open to In Progress. Dec 14 2022, 6:22 PM
dcaro added a project: User-dcaro.
dcaro moved this task from To refine to Doing on the User-dcaro board.

Is it possible to pursue a similar policy of N - .5? Or perhaps better said, to upgrade to version N, once it's N.1 or N.2? AFAIK, https://mirror.croit.io/ seems to package things fairly close to release (within a few weeks). So without thinking about changing the deployment paradigm it seems possible to be more up to date?

Can you give more clarity on option 3 of 'we upgrade with debian'? That would likely be slower than now right? You are thinking of running whatever version of ceph is packaged in debian, and not changing until a debian upgrade occurs?

Option 2 has 'once a year' listed as the timing, but I wonder about how often option 3 and option 4 would occur calendar wise? Is there a preferred calendar cadence (aka, once or twice a year?)

A related discussion took place last month in T326945, evaluating and comparing different options for installing Ceph.

Is it possible to pursue a similar policy of N - .5? Or perhaps better said, to upgrade to version N, once it's N.1 or N.2? AFAIK, https://mirror.croit.io/ seems to package things fairly close to release (within a few weeks). So without thinking about changing the deployment paradigm it seems possible to be more up to date?

Yes, I reply below.
Note that since I opened https://tracker.ceph.com/issues/53411, they have been way more diligent on building the packages on download.ceph.com, so we can use those instead (as pointed out on T326945#8523699).

Can you give more clarity on option 3 of 'we upgrade with debian'? That would likely be slower than now right? You are thinking of running whatever version of ceph is packaged in debian, and not changing until a debian upgrade occurs?

Yes, see below.

Option 2 has 'once a year' listed as the timing, but I wonder about how often option 3 and option 4 would occur calendar wise? Is there a preferred calendar cadence (aka, once or twice a year?)

Yes, this makes me think that there are two things to decide here, the frequency and the content of the upgrades:

Frequency of the upgrades:

  • Upgrading once a year
  • Upgrading every six months
  • Upgrading every three months

Content of the upgrades:

  • When there's a new N.2.0 version ready, we upgrade to that one, otherwise to the latest N.2.* (for example, if we are on 16.2.1 and 17.2.0 is out, we upgrade to that one; otherwise we upgrade to the latest 16.2.*)
  • When there's a new N.2.0 version ready, we upgrade to (N-1).2.*, otherwise to the latest (N-1).2.* (for example, if we are on 16.2.15 and 18.2.0 is out, we upgrade to 17.2.*; otherwise to the latest 16.2.*)
  • Using whatever the Debian repos package, which is considerably slower than what we do right now.

I'll update the options. By the way, I lean towards the first of them (latest stable) and towards upgrading every six months or less, and I might lean towards increasing the frequency over time for minor version upgrades.

According to https://docs.ceph.com/en/latest/releases/general/:

  1. X.2.0 is considered a stable release
  2. There is a new stable release cycle every year, targeting the month of March (note in 2022 it became mid-April however)
  3. Releases are supported for 24 months, typically EOL'd just after the new stable release
  4. Stable point release is targeted for every 4 to 6 weeks
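To make the 24-month support window concrete, an approximate EOL date can be derived from a release date. This is a sketch under assumptions: `add_months`, `approx_eol` and `is_past_eol` are hypothetical helpers, the date in the usage note is illustrative rather than an authoritative release date, and actual EOL dates are whatever upstream announces.

```python
import calendar
from datetime import date

SUPPORT_MONTHS = 24  # support window quoted in the list above


def add_months(start: date, months: int) -> date:
    """Add calendar months, clamping the day to the target month's length."""
    total = start.year * 12 + (start.month - 1) + months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def approx_eol(release_date: date) -> date:
    """Approximate end of life: release date plus the support window."""
    return add_months(release_date, SUPPORT_MONTHS)


def is_past_eol(release_date: date, today: date) -> bool:
    """True once the approximate EOL date has been reached."""
    return today >= approx_eol(release_date)
```

For example, a release dated `date(2021, 3, 31)` would be treated as EOL from `date(2023, 3, 31)` onwards, which lines up with the "just after the new stable release" timing mentioned above.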

So, looking at option 3 or option 5: April/May should be a safe time to upgrade that matches the Ceph release timeline. For option 3, we could follow that with another upgrade in October/November. 16.2.X had 6 point releases in that 6-month timeframe; 17.2.X had 5. I did notice that Ceph includes changes in the point releases, not simply bug/security fixes. For example, they deprecated/removed things in https://docs.ceph.com/en/latest/releases/pacific/#v16-2-5-pacific and added some notable things in https://docs.ceph.com/en/latest/releases/pacific/#v16-2-8-pacific.

Given everything, I think I would prefer option 3, or option 5 as a fallback. I would defer on the amount of effort needed to upgrade (maybe start with 5, and make it easier to lean into option 3 if the cadence would be too much currently?). I would also vote to use the upstream packaging. And per your comments, https://download.ceph.com/ does seem to have updated Debian packages.

@fnegri Thank you for pointing out the discussion on T326945: Decide on installation details for new ceph cluster. Let's develop solutions together to reduce our overall burden (what versions are they comfortable targeting, and what cadence for upgrades?). It seems like WMCS cookbooks + Debian packages are being used now. I too worry about the fact that upstream seems to be pushing cephadm + containers + CentOS.

I'll cast my vote then :)

I vote for option 3 option 5, so we get some upgrades in-between major releases + major stable release upgrades once a year.

If nobody has any more opinions, I'll declare it decided by the end of the week.

Option 3 has been decided; I will create the record shortly.