
Elasticsearch dependency upgrade in spicerack
Closed, Resolved · Public

Description

As part of the parent task, we're preparing to upgrade the cumin hosts to bookworm. See the parent task for the upgrade plan.

The current elasticsearch dependency in spicerack is:

"elasticsearch>=5.0.0,<7.15.0",

Debian bullseye (the current cumin hosts) has version 7.1.0.
Debian bookworm has version 7.17.6.
We need to determine whether the two versions are compatible with our current code and cookbooks, or whether we need patches to make the code compatible with both versions.
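If the two versions turn out to be compatible, the fix could be as small as widening the pin in setup.py; a hypothetical example (the exact upper bound here is an assumption, not a tested value):

"elasticsearch>=5.0.0,<7.18.0",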

It would be great if someone from Data-Engineering could have a look at the 7.x changelog and previous changelogs and determine the best course of action.

A quick CI run with version 7.17.12 reports only mypy failures and no unit test failures:

spicerack/elasticsearch_cluster.py: note: In member "_freeze_writes" of class "ElasticsearchCluster":
spicerack/elasticsearch_cluster.py:540: error: Unexpected keyword argument "body" for "index" of "Elasticsearch"  [call-arg]
.tox/py311-mypy/lib/python3.11/site-packages/elasticsearch/client/__init__.pyi:183: note: "index" of "Elasticsearch" defined here
spicerack/elasticsearch_cluster.py: note: In member "_get_unassigned_shards" of class "ElasticsearchCluster":
spicerack/elasticsearch_cluster.py:597: error: List comprehension has incompatible type List[str]; expected List[dict[Any, Any]]  [misc]
spicerack/elasticsearch_cluster.py:597: error: Invalid index type "str" for "str"; expected type "SupportsIndex | slice[Any, Any, Any]"  [index]
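For context, a hedged sketch of the kind of adaptation those two errors point at, assuming the 8.x-style signatures that the 7.17 type stubs appear to follow (the index name and payload are illustrative, not the spicerack code):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# elasticsearch-py 7.15+ deprecated the generic body= kwarg of index()
# in favour of document=:
es.index(index="freeze-index", id="freeze-everything", document={"host": "cumin1002"})

# The stubs type cat.shards() as returning a string unless JSON output is
# requested explicitly, so ask for it and filter on the parsed dicts:
shards = es.cat.shards(format="json", h="index,shard,state")
unassigned = [shard for shard in shards if shard["state"] == "UNASSIGNED"]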

Event Timeline


As a heads-up, we are migrating from Elasticsearch to OpenSearch in T388610. Once that is done, we should be able to use the libraries provided by Bookworm's python3-opensearch Debian package. See also T383811 and this CR

@bking thanks for the summary. Do you have a timeline for the completion of the project?
What strategy would you suggest to avoid coupling the cumin hosts upgrade to bookworm with the whole elasticsearch migration?
Would it be worth trying to fix the few outstanding issues reported with version 7.17.12 of the current library?

We should probably remove the dependency on the python-elasticsearch library. What we do is simple enough that we could use just python-requests and map the API responses directly to JSON (which is mostly what the python-elasticsearch library does anyway).
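As a rough illustration of that idea (the endpoint and function name are made up, not spicerack code):

import requests

session = requests.Session()

def cluster_health(endpoint: str) -> dict:
    """GET /_cluster/health and return the parsed JSON response."""
    response = session.get(f"{endpoint}/_cluster/health", timeout=10)
    response.raise_for_status()
    return response.json()

print(cluster_health("http://localhost:9200")["status"])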

@Volans answers inline:

> Do you have a timeline for the completion of the project?

Between 2 and 4 weeks. You can check the progress from cumin: sudo cumin O:elasticsearch::cirrus for the old role, sudo cumin O:cirrus::opensearch for the new role.

> What strategy would you suggest to avoid coupling the cumin hosts upgrade to bookworm with the whole elasticsearch migration?

This is a cheap suggestion since it puts the responsibility on you, but would it be possible to deploy a new cumin host without removing the existing hosts immediately? Running a mixed Elastic/OpenSearch cluster is not optimal for stability, and there's also added latency for users since we have to shut off one of the DCs while we do that work. We can tackle this with high priority once the migration is finished.

As far as our own Spicerack/cookbook code: @Gehel's suggestion above is probably the way we'll go, but there are a few other things we could do as well. I've created T383811 so we can discuss with the whole team before making a suggestion.

> Would it be worth trying to fix the few outstanding issues reported with version 7.17.12 of the current library?

It probably wouldn't work, as newer versions of the Elastic python library don't work with OpenSearch.

Let us know if you are able to keep at least one bullseye cumin host until the migration is complete, or if you have any other questions/suggestions.

As reported on the parent task, we will create a new host with bookworm and keep the old ones:

> We can create a parallel cumin1003 VM on Bookworm, test/adapt everything, then reimage cumin2002 and finally decom cumin1002.

The problem is that, in order to do that, we need to release a spicerack version that is compatible with the dependencies present in bookworm, ideally without blocking further changes on the bullseye version, as we might need to modify the old setup while the new one is tested. In brief, we need to be both forward- and backward-compatible.
The other approach would be to have separate "master" branches and two different spicerack versions for the two OS versions, but if possible I'd prefer to avoid the complexity of managing those for the brief duration of the upgrade.

As for the timeline, 2-4 weeks doesn't seem too bad, but we have to check with @MoritzMuehlenhoff, assuming that by then we have a clear path/solution for the current elasticsearch library in spicerack.

> As reported on the parent task, we will create a new host with bookworm and keep the old ones:
>
>> We can create a parallel cumin1003 VM on Bookworm, test/adapt everything, then reimage cumin2002 and finally decom cumin1002.
>
> The problem is that, in order to do that, we need to release a spicerack version that is compatible with the dependencies present in bookworm, ideally without blocking further changes on the bullseye version, as we might need to modify the old setup while the new one is tested. In brief, we need to be both forward- and backward-compatible.
> The other approach would be to have separate "master" branches and two different spicerack versions for the two OS versions, but if possible I'd prefer to avoid the complexity of managing those for the brief duration of the upgrade.

The alternative would be to have build/feature flags, allowing Spicerack to be built without a given library (along with some indication of which cookbooks need a given feature flag). This would allow us to, e.g., disable Elastic in the Bookworm build initially (so that the cookbooks needing it must be run on the old Cumin host). It's not something we can easily add now, but it might be worth looking into so that it's ready for the trixie migration; the more features and dependencies we add to Spicerack, the more the complexity will increase.
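As a sketch of what such a flag could look like with setuptools extras (package name and layout are hypothetical, not the actual spicerack setup.py):

from setuptools import find_packages, setup

setup(
    name="wikimedia-spicerack",
    packages=find_packages(),
    # Core dependencies, always installed.
    install_requires=["requests"],
    # Optional feature: `pip install wikimedia-spicerack[elasticsearch]`
    # pulls the Elastic client, while a plain install leaves it out, so
    # the Bookworm build could simply skip the extra.
    extras_require={
        "elasticsearch": ["elasticsearch>=5.0.0,<7.15.0"],
    },
)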

> As for the timeline, 2-4 weeks doesn't seem too bad, but we have to check with @MoritzMuehlenhoff, assuming that by then we have a clear path/solution for the current elasticsearch library in spicerack.

Sounds fine to me.

Not to get too far off-topic, but have y'all considered something like Ansible Execution Environments for Spicerack? That basically means you could have custom Spicerack Docker images for different use cases. For example, my team could use an older Spicerack image while you move ahead with the cumin upgrade.

As far as our future plans, we haven't decided everything yet, but we'll probably move as much as possible out of Spicerack and directly into our cookbooks. There's a lot of friction involved in splitting the orchestration between two different repos, including one (Spicerack) that is difficult to test and puts more of the deployment burden on the I/F team. We're just doing REST API calls and shell commands, so we should be able to simplify without too much effort.

> As far as our future plans, we haven't decided everything yet, but we'll probably move as much as possible out of Spicerack and directly into our cookbooks. There's a lot of friction involved in splitting the orchestration between two different repos, including one (Spicerack) that is difficult to test and puts more of the deployment burden on the I/F team. We're just doing REST API calls and shell commands, so we should be able to simplify without too much effort.

My 2c: I think we should aim to keep a nice and tidy module in spicerack, so we'll have a baseline of tested core functionality. If there are friction points in developing spicerack, let us know; we can work together to reduce them. The goal of the spicerack lib is to be easy and usable by everybody, and nowadays there is very little overhead for I/F to release a new version :)

>> As far as our future plans, we haven't decided everything yet, but we'll probably move as much as possible out of Spicerack and directly into our cookbooks. There's a lot of friction involved in splitting the orchestration between two different repos, including one (Spicerack) that is difficult to test and puts more of the deployment burden on the I/F team. We're just doing REST API calls and shell commands, so we should be able to simplify without too much effort.
>
> My 2c: I think we should aim to keep a nice and tidy module in spicerack, so we'll have a baseline of tested core functionality. If there are friction points in developing spicerack, let us know; we can work together to reduce them. The goal of the spicerack lib is to be easy and usable by everybody, and nowadays there is very little overhead for I/F to release a new version :)

Sure, let's talk about this once our OpenSearch migration is complete. Per T383811, there are other stakeholders (Observability, WMCS) we might want to engage as well. Hopefully we can find a solution that works well for all of us.

For now I've prepared a patch to exclude the elasticsearch functionality from the bookworm build while keeping it for bullseye, effectively decoupling the two upgrades and allowing spicerack upgrades to continue on both OSes.

As for the future of the elasticsearch module, I'm happy to discuss requirements, features, etc. As Luca pointed out, I would suggest having a base module in spicerack that takes care of the common functionality useful for all elastic/open-search clusters, and keeping the cluster-specific logic in the cookbooks. The module could just use requests if you don't see much gain in using the elastic/open-search python modules.
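For reference, a minimal sketch of how such a conditional exclusion could be wired up, assuming it is keyed on the Python version shipped by each distro (3.9 on bullseye, 3.11 on bookworm); this is an illustration, not the actual patch:

import sys

from setuptools import setup

# Require the Elastic client only where the module is still built
# (bullseye / Python 3.9); leave it out on bookworm (Python 3.10+).
install_requires = ["requests"]
if sys.version_info < (3, 10):
    install_requires.append("elasticsearch>=5.0.0,<7.15.0")

setup(name="wikimedia-spicerack", install_requires=install_requires)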

Change #1142557 had a related patch set uploaded (by Volans; author: Volans):

[operations/software/spicerack@master] elasticsearch: temporarily remove it from bookworm

https://gerrit.wikimedia.org/r/1142557

Change #1142557 merged by jenkins-bot:

[operations/software/spicerack@master] elasticsearch: temporarily remove it from bookworm

https://gerrit.wikimedia.org/r/1142557

Change #1143098 had a related patch set uploaded (by Volans; author: Volans):

[operations/cookbooks@master] elasticsearch: do not fail on Python 3.10+

https://gerrit.wikimedia.org/r/1143098

Change #1143098 merged by jenkins-bot:

[operations/cookbooks@master] elasticsearch: do not fail on Python 3.10+

https://gerrit.wikimedia.org/r/1143098

This is our top priority for this week. We'll be looking into replacing the elasticsearch client with direct requests calls.

Change #1167299 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/software/spicerack@master] Replace elasticsearch api with python requests

https://gerrit.wikimedia.org/r/1167299

@RKemper Hi! What is the status of the code review? Ping me if you need some help/review :)

Current state:

  • All the previous calls to the elasticsearch library have been replaced
  • Various linting errors still need to be resolved
  • Many unit tests are failing; most have similar errors, e.g. spicerack.elasticsearch_cluster.ElasticsearchClusterError: Could not connect to the cluster

cc @elukey @Stevemunene

Have pushed out various improvements to the code. Still much to do on the unit test side.

In short, I think we'll need to replace the existing mocks of the elasticsearch client library methods (e.g., indices.flush, cluster.put_settings, cat.shards, etc.) with mocks on cluster.make_api_call for the expected route, method, params, and body, returning the example response as JSON. Unless there's a more elegant way that I'm missing :)

> Have pushed out various improvements to the code. Still much to do on the unit test side.
>
> In short, I think we'll need to replace the existing mocks of the elasticsearch client library methods (e.g., indices.flush, cluster.put_settings, cat.shards, etc.) with mocks on cluster.make_api_call for the expected route, method, params, and body, returning the example response as JSON. Unless there's a more elegant way that I'm missing :)

@RKemper Hi! A quicker approach could be to use requests-mock and @pytest.fixture(autouse=True); there are multiple examples of this in spicerack. That way you'll be able to mock your requests transparently without explicitly patching cluster.make_api_call. Lemme know if you want help on it!
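A minimal sketch of that approach using the requests-mock pytest plugin (the URL and payload are invented for the example):

import pytest
import requests

@pytest.fixture(autouse=True)
def mocked_es_api(requests_mock):
    # The autouse fixture intercepts every HTTP call made via requests,
    # so tests don't need to patch cluster.make_api_call explicitly.
    requests_mock.get(
        "http://localhost:9200/_cat/shards",
        json=[{"index": "testindex", "shard": "0", "state": "UNASSIGNED"}],
    )

def test_unassigned_shards():
    shards = requests.get("http://localhost:9200/_cat/shards").json()
    assert [shard for shard in shards if shard["state"] == "UNASSIGNED"]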

Alright, we've got tests passing and it looks like we're ready to merge! It's been a while since I've merged a new spicerack version; are there still a bunch of manual steps you need to run, @elukey, or is it pretty hands-off?

(And noted regarding requests-mock; that is an approach we can look into in the future once we've got mainline SRE unblocked here.)

Change #1167299 merged by Elukey:

[operations/software/spicerack@master] Replace elasticsearch lib w/ spicerack APIClient

https://gerrit.wikimedia.org/r/1167299

Change #1195208 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/software/spicerack@master] setup.py: remove the elastic dependency

https://gerrit.wikimedia.org/r/1195208

Change #1195211 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/software/spicerack@debian] Remove the elasticsearch dependency

https://gerrit.wikimedia.org/r/1195211

Change #1195224 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/software/spicerack@debian] Don't skip elasticsearch tests anymore on older py versions.

https://gerrit.wikimedia.org/r/1195224

Change #1195224 abandoned by Elukey:

[operations/software/spicerack@debian] Don't skip elasticsearch tests anymore on older py versions.

Reason:

wrong branch, will redo it on master :)

https://gerrit.wikimedia.org/r/1195224

Change #1195211 merged by Elukey:

[operations/software/spicerack@debian] Remove the elasticsearch dependency

https://gerrit.wikimedia.org/r/1195211

Change #1195208 merged by Elukey:

[operations/software/spicerack@master] setup.py: remove the elastic dependency

https://gerrit.wikimedia.org/r/1195208

Change #1196923 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/software/spicerack@master] Revert workarounds to exclude elasticsearch_cluster.py on Bookworm+

https://gerrit.wikimedia.org/r/1196923

I was about to cut a new spicerack release, but I realized that https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/1196923 was missing. Once it's reviewed/merged, I'll create the new release.

@RKemper Hi! When the new release is available I'll install it on one cumin host first to test that everything works as expected. What kind of tests do you have in mind for the elasticsearch module? Due to the timezone difference, I can leave the new version running overnight (EU night) on a single cumin node so you'll be able to test freely. Would that work? If so, when will you be available? Thanks in advance :)

@elukey just wanted to pipe up and offer my assistance as well, since we have slightly more overlap. Feel free to ping me on IRC or Slack when you're ready; I'm typically around from 13:30 to 22:30 UTC.

As far as what tests we have in mind, we'll want to exercise the following cookbooks:

  • sre.elasticsearch.rolling-operation
  • sre.elasticsearch.ban

Let us know if you need more info!

@bking super, thanks a lot, I accept the offer! I'll ping you in this task when the release is ready for a test so we can coordinate!

Change #1196923 merged by Elukey:

[operations/software/spicerack@master] Revert workarounds to exclude elasticsearch_cluster.py on Bookworm+

https://gerrit.wikimedia.org/r/1196923

Looks like Brian already covered it, but just to reiterate, the one-cumin-test-host approach sounds good. We'll just need to exercise a few of the code paths; I'd probably start with something simple like hitting the flush synced shards method, and if that works, move on to the cookbooks Brian mentioned, which rely indirectly on the elasticsearch spicerack library.
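For that first simple check, something along these lines would exercise the new requests-based code path directly (host and endpoint are illustrative only):

import requests

# Trigger a flush on the cluster and surface any HTTP error, mirroring
# what the spicerack flush method does via its own API client.
response = requests.post("http://localhost:9200/_flush", timeout=30)
response.raise_for_status()
print(response.json())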

The spicerack release is currently blocked by the release of python3-conftool; hopefully we'll be able to complete the work this week :)

@RKemper @bking cumin1002 is upgraded with the new spicerack version; please feel free to test the cookbooks and see if anything doesn't work. Thanks!

Mentioned in SAL (#wikimedia-operations) [2025-11-05T22:17:50Z] <ryankemper@cumin1002> START - Cookbook sre.elasticsearch.ban Banning hosts: foobar1001.eqiad.wmnet for test new spicerack elasticsearch library - ryankemper@cumin1002 - T390860

Mentioned in SAL (#wikimedia-operations) [2025-11-05T22:17:53Z] <ryankemper@cumin1002> END (FAIL) - Cookbook sre.elasticsearch.ban (exit_code=99) Banning hosts: foobar1001.eqiad.wmnet for test new spicerack elasticsearch library - ryankemper@cumin1002 - T390860

Mentioned in SAL (#wikimedia-operations) [2025-11-05T22:18:14Z] <ryankemper@cumin1002> START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch1068.eqiad.wmnet for test new spicerack elasticsearch library - ryankemper@cumin1002 - T390860

Mentioned in SAL (#wikimedia-operations) [2025-11-05T22:18:17Z] <ryankemper@cumin1002> END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch1068.eqiad.wmnet for test new spicerack elasticsearch library - ryankemper@cumin1002 - T390860

Mentioned in SAL (#wikimedia-operations) [2025-11-05T22:19:20Z] <ryankemper@cumin1002> START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch1068.eqiad.wmnet for test new spicerack elasticsearch library - ryankemper@cumin1002 - T390860

Mentioned in SAL (#wikimedia-operations) [2025-11-05T22:19:22Z] <ryankemper@cumin1002> END (FAIL) - Cookbook sre.elasticsearch.ban (exit_code=99) Banning hosts: cirrussearch1068.eqiad.wmnet for test new spicerack elasticsearch library - ryankemper@cumin1002 - T390860

Mentioned in SAL (#wikimedia-operations) [2025-11-05T22:22:36Z] <ryankemper@cumin1002> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster restart (test new spicerack version) - ryankemper@cumin1002 - T390860

@elukey Brian and I just tested out a couple of operations, and everything looks good. I think we're ready for the full release.

Mentioned in SAL (#wikimedia-operations) [2025-11-05T22:48:14Z] <ryankemper@cumin1002> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster restart (test new spicerack version) - ryankemper@cumin1002 - T390860

Mentioned in SAL (#wikimedia-operations) [2025-11-05T23:00:03Z] <ryankemper@cumin1002> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot (apply updates) - ryankemper@cumin1002 - T390860

Mentioned in SAL (#wikimedia-operations) [2025-11-05T23:00:16Z] <ryankemper@cumin1002> END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.RESTART (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot (apply updates) - ryankemper@cumin1002 - T390860

Mentioned in SAL (#wikimedia-operations) [2025-11-05T23:00:54Z] <ryankemper@cumin1002> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot (apply updates) - ryankemper@cumin1002 - T390860

Mentioned in SAL (#wikimedia-operations) [2025-11-06T00:49:24Z] <ryankemper@cumin1002> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot (apply updates) - ryankemper@cumin1002 - T390860

Spicerack deployed, thanks all for the work!

Mentioned in SAL (#wikimedia-operations) [2025-11-06T22:36:06Z] <ryankemper@cumin1002> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (2 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reboot (apply updates) - ryankemper@cumin1002 - T390860

Mentioned in SAL (#wikimedia-operations) [2025-11-06T22:40:09Z] <ryankemper@cumin1002> END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.RESTART (2 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reboot (apply updates) - ryankemper@cumin1002 - T390860

Mentioned in SAL (#wikimedia-operations) [2025-11-06T22:46:29Z] <ryankemper@cumin1002> START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (2 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reboot (apply updates) - ryankemper@cumin1002 - T390860

Mentioned in SAL (#wikimedia-operations) [2025-11-07T03:06:53Z] <ryankemper@cumin1002> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.REBOOT (2 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reboot (apply updates) - ryankemper@cumin1002 - T390860

Mentioned in SAL (#wikimedia-operations) [2025-11-07T22:40:49Z] <ryankemper@cumin1002> START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot (apply updates) - ryankemper@cumin1002 - T390860

Mentioned in SAL (#wikimedia-operations) [2025-11-08T00:49:45Z] <ryankemper@cumin1002> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REBOOT (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot (apply updates) - ryankemper@cumin1002 - T390860