Elasticsearch dependency upgrade in spicerack
Closed, Resolved, Public

Description

As part of the parent task, we're preparing to upgrade the cumin hosts to bookworm. See the parent task for the upgrade plan.

The current elasticsearch dependency in spicerack is:

"elasticsearch>=5.0.0,<7.15.0",

Debian bullseye (current cumin hosts) has version 7.1.0.
Debian bookworm has version 7.17.6.
We need to determine whether both versions are compatible with our current code and cookbooks, or whether we need to patch the code so that it works with both.
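
To make the version question concrete, here is a minimal sketch that checks the two Debian versions against the current pin. It uses a hand-rolled tuple comparison rather than a packaging library, the helper names (parse_version, satisfies_pin) are made up for illustration, and it only handles plain dotted numeric versions.

```python
# Hypothetical helpers for illustration; only plain "X.Y.Z" versions are handled.
LOWER = (5, 0, 0)   # from "elasticsearch>=5.0.0"
UPPER = (7, 15, 0)  # from "<7.15.0"


def parse_version(version: str) -> tuple[int, ...]:
    """Turn "7.17.6" into (7, 17, 6) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def satisfies_pin(version: str) -> bool:
    """Return True if the version falls inside >=5.0.0,<7.15.0."""
    return LOWER <= parse_version(version) < UPPER


for name, version in (("bullseye", "7.1.0"), ("bookworm", "7.17.6")):
    print(f"{name}: {version} satisfies pin: {satisfies_pin(version)}")
```

Under these assumptions the bullseye version is inside the pin and the bookworm version is not, so the upper bound has to be raised (or the dependency dropped) before the bookworm upgrade.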

It would be great if someone from Data-Engineering could have a look at the 7.x changelog and previous changelogs and determine the best course of action.

Quickly running CI with version 7.17.12 reports only mypy failures, no unit test failures:

spicerack/elasticsearch_cluster.py: note: In member "_freeze_writes" of class "ElasticsearchCluster":
spicerack/elasticsearch_cluster.py:540: error: Unexpected keyword argument "body" for "index" of "Elasticsearch"  [call-arg]
.tox/py311-mypy/lib/python3.11/site-packages/elasticsearch/client/__init__.pyi:183: note: "index" of "Elasticsearch" defined here
spicerack/elasticsearch_cluster.py: note: In member "_get_unassigned_shards" of class "ElasticsearchCluster":
spicerack/elasticsearch_cluster.py:597: error: List comprehension has incompatible type List[str]; expected List[dict[Any, Any]]  [misc]
spicerack/elasticsearch_cluster.py:597: error: Invalid index type "str" for "str"; expected type "SupportsIndex | slice[Any, Any, Any]"  [index]

Event Timeline

This is our top priority for this week. We'll be looking into replacing the elasticsearch client with direct HTTP requests.

Change #1167299 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/software/spicerack@master] Replace elasticsearch api with python requests

https://gerrit.wikimedia.org/r/1167299

@RKemper Hi! What is the status of the code review? Ping me if you need some help/review :)

Current state:

  • All the previous calls to the elasticsearch library have been replaced
  • Various linting errors that need to be resolved
  • Many unit tests failing; most have similar errors, e.g. spicerack.elasticsearch_cluster.ElasticsearchClusterError: Could not connect to the cluster

cc @elukey @Stevemunene

Have pushed out various improvements to the code. Still much to do on the unit test side.

In short, I think we'll need to replace the existing mocks of the elasticsearch client library methods (e.g., indices.flush, cluster.put_settings, cat.shards etc) with mocks on cluster.make_api_call for the expected route, method, params, and body, returning the example response as JSON. Unless there's a more elegant way that I'm missing :)
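
A minimal sketch of that mocking approach, using only unittest.mock. FakeCluster, its make_api_call signature, the flush_failed_shards helper, and the route and params are all assumptions made for illustration; they are not spicerack's actual API. Only the response shape follows the Elasticsearch flush API.

```python
from unittest import mock


class FakeCluster:
    """Hypothetical stand-in for the cluster class; not spicerack's real API."""

    def make_api_call(self, route, method="GET", params=None, body=None):
        # The real method would perform an HTTP request; this signature is assumed.
        raise RuntimeError("no network access in unit tests")

    def flush_failed_shards(self):
        """Example caller: issue a flush and return the number of failed shards."""
        response = self.make_api_call("/_flush", method="POST", params={"force": "true"})
        return response["_shards"]["failed"]


def test_flush_failed_shards():
    cluster = FakeCluster()
    # Example JSON body as the flush endpoint would return it.
    example_response = {"_shards": {"total": 10, "successful": 10, "failed": 0}}
    with mock.patch.object(cluster, "make_api_call", return_value=example_response) as api:
        assert cluster.flush_failed_shards() == 0
    # Check the route, method, and params that the code under test sent.
    api.assert_called_once_with("/_flush", method="POST", params={"force": "true"})


test_flush_failed_shards()
```

Each former client-library mock would become one such pair: a canned JSON response plus an assertion on the expected route, method, params, and body.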

@RKemper Hi! A quicker approach could be to use requests-mock together with @pytest.fixture(autouse=True); there are multiple examples of it in spicerack. That way you'll be able to mock your requests transparently without explicitly patching cluster.make_api_call. Let me know if you want help with it!

Alright, we've got tests passing and it looks like we're ready to merge! It's been a while since I've merged a new spicerack version; are there still a bunch of manual steps that you need to run @elukey, or is it pretty hands-off?

(And noted regarding requests-mock; that is an approach we can look into in the future once we've got mainline SRE unblocked here.)

Change #1167299 merged by Elukey:

[operations/software/spicerack@master] Replace elasticsearch lib w/ spicerack APIClient

https://gerrit.wikimedia.org/r/1167299

Change #1195208 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/software/spicerack@master] setup.py: remove the elastic dependency

https://gerrit.wikimedia.org/r/1195208

Change #1195211 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/software/spicerack@debian] Remove the elasticsearch dependency

https://gerrit.wikimedia.org/r/1195211

Change #1195224 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/software/spicerack@debian] Don't skip elasticsearch tests anymore on older py versions.

https://gerrit.wikimedia.org/r/1195224

Change #1195224 abandoned by Elukey:

[operations/software/spicerack@debian] Don't skip elasticsearch tests anymore on older py versions.

Reason:

wrong branch, will redo it on master :)

https://gerrit.wikimedia.org/r/1195224

Change #1195211 merged by Elukey:

[operations/software/spicerack@debian] Remove the elasticsearch dependency

https://gerrit.wikimedia.org/r/1195211

Change #1195208 merged by Elukey:

[operations/software/spicerack@master] setup.py: remove the elastic dependency

https://gerrit.wikimedia.org/r/1195208

Change #1196923 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/software/spicerack@master] Revert workarounds to exclude elasticsearch_cluster.py on Bookworm+

https://gerrit.wikimedia.org/r/1196923

I was about to cut a new spicerack release but I realized that https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/1196923 was missing. Once reviewed/merged I'll create the new release.

@RKemper Hi! When the new release is available I'll install it on one cumin host first to test that everything works as expected. What kind of tests do you have in mind for the elasticsearch module? Given the timezone difference, I can leave the new version running overnight (EU time) on a single cumin node so you'll be able to test freely. Would that work? If so, when will you be available? Thanks in advance :)

@elukey just wanted to pipe up and offer my assistance as well, since we have slightly more overlap. Feel free to ping me in IRC or Slack when you're ready, I'm typically around from 1330 - 2230 UTC.

As far as what tests we have in mind, we'll want to exercise the following cookbooks:

  • sre.elasticsearch.rolling-operation
  • sre.elasticsearch.ban

Let us know if you need more info!

@bking super thanks a lot, I accept the offer! I'll ping you in this task when the release is ready for a test so we can coordinate!

Change #1196923 merged by Elukey:

[operations/software/spicerack@master] Revert workarounds to exclude elasticsearch_cluster.py on Bookworm+

https://gerrit.wikimedia.org/r/1196923

Looks like Brian already covered it, but just to reiterate: the single cumin test host approach sounds good. We'll just need to exercise a few of the code paths. I'd probably start with something simple, like hitting the flush synced shards method, and if that works move on to the cookbooks that Brian mentioned, which rely indirectly on the elasticsearch spicerack library.

The spicerack release is currently blocked by the release of python3-conftool; hopefully we'll be able to complete the work this week :)

@RKemper @bking cumin1002 is upgraded with the new spicerack version, please feel free to test the cookbooks and see if anything doesn't work. Thanks!

Mentioned in SAL (#wikimedia-operations) [2025-11-05T22:17:50Z] <ryankemper@cumin1002> START - Cookbook sre.elasticsearch.ban Banning hosts: foobar1001.eqiad.wmnet for test new spicerack elasticsearch library - ryankemper@cumin1002 - T390860

Mentioned in SAL (#wikimedia-operations) [2025-11-05T22:17:53Z] <ryankemper@cumin1002> END (FAIL) - Cookbook sre.elasticsearch.ban (exit_code=99) Banning hosts: foobar1001.eqiad.wmnet for test new spicerack elasticsearch library - ryankemper@cumin1002 - T390860

Mentioned in SAL (#wikimedia-operations) [2025-11-05T22:18:14Z] <ryankemper@cumin1002> START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch1068.eqiad.wmnet for test new spicerack elasticsearch library - ryankemper@cumin1002 - T390860

Mentioned in SAL (#wikimedia-operations) [2025-11-05T22:18:17Z] <ryankemper@cumin1002> END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch1068.eqiad.wmnet for test new spicerack elasticsearch library - ryankemper@cumin1002 - T390860

Mentioned in SAL (#wikimedia-operations) [2025-11-05T22:19:20Z] <ryankemper@cumin1002> START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch1068.eqiad.wmnet for test new spicerack elasticsearch library - ryankemper@cumin1002 - T390860

Mentioned in SAL (#wikimedia-operations) [2025-11-05T22:19:22Z] <ryankemper@cumin1002> END (FAIL) - Cookbook sre.elasticsearch.ban (exit_code=99) Banning hosts: cirrussearch1068.eqiad.wmnet for test new spicerack elasticsearch library - ryankemper@cumin1002 - T390860

Mentioned in SAL (#wikimedia-operations) [2025-11-05T22:22:36Z] <ryankemper@cumin1002> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster restart (test new spicerack version) - ryankemper@cumin1002 - T390860

@elukey Brian and I just tested out a couple of operations, and everything looks good. I think we're ready for the full release.

Mentioned in SAL (#wikimedia-operations) [2025-11-05T22:48:14Z] <ryankemper@cumin1002> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster restart (test new spicerack version) - ryankemper@cumin1002 - T390860

Mentioned in SAL (#wikimedia-operations) [2025-11-05T23:00:03Z] <ryankemper@cumin1002> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot (apply updates) - ryankemper@cumin1002 - T390860

Mentioned in SAL (#wikimedia-operations) [2025-11-05T23:00:16Z] <ryankemper@cumin1002> END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.RESTART (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot (apply updates) - ryankemper@cumin1002 - T390860

Mentioned in SAL (#wikimedia-operations) [2025-11-05T23:00:54Z] <ryankemper@cumin1002> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot (apply updates) - ryankemper@cumin1002 - T390860

Mentioned in SAL (#wikimedia-operations) [2025-11-06T00:49:24Z] <ryankemper@cumin1002> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot (apply updates) - ryankemper@cumin1002 - T390860

Spicerack deployed, thanks all for the work!

Mentioned in SAL (#wikimedia-operations) [2025-11-06T22:36:06Z] <ryankemper@cumin1002> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (2 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reboot (apply updates) - ryankemper@cumin1002 - T390860

Mentioned in SAL (#wikimedia-operations) [2025-11-06T22:40:09Z] <ryankemper@cumin1002> END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.RESTART (2 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reboot (apply updates) - ryankemper@cumin1002 - T390860

Mentioned in SAL (#wikimedia-operations) [2025-11-06T22:46:29Z] <ryankemper@cumin1002> START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (2 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reboot (apply updates) - ryankemper@cumin1002 - T390860

Mentioned in SAL (#wikimedia-operations) [2025-11-07T03:06:53Z] <ryankemper@cumin1002> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.REBOOT (2 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reboot (apply updates) - ryankemper@cumin1002 - T390860

Mentioned in SAL (#wikimedia-operations) [2025-11-07T22:40:49Z] <ryankemper@cumin1002> START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot (apply updates) - ryankemper@cumin1002 - T390860

Mentioned in SAL (#wikimedia-operations) [2025-11-08T00:49:45Z] <ryankemper@cumin1002> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REBOOT (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot (apply updates) - ryankemper@cumin1002 - T390860

Change #1203491 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/cookbooks@master] test_import: Drop workaround for python-elasticsearch

https://gerrit.wikimedia.org/r/1203491

Change #1203496 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/cookbooks@master] Remove leftover import of python-elastic

https://gerrit.wikimedia.org/r/1203496

Change #1203496 abandoned by Muehlenhoff:

[operations/cookbooks@master] Remove leftover import of python-elastic

Reason:

Not just a forgotten import, reopened T390860 instead

https://gerrit.wikimedia.org/r/1203496

Mentioned in SAL (#wikimedia-operations) [2025-11-10T23:39:46Z] <ryankemper@cumin1002> START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot (apply updates) - ryankemper@cumin1002 - T390860

Mentioned in SAL (#wikimedia-operations) [2025-11-11T08:46:27Z] <ryankemper@cumin1002> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REBOOT (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot (apply updates) - ryankemper@cumin1002 - T390860

bking claimed this task.
bking updated Other Assignee, added: RKemper.

Change #1205199 had a related patch set uploaded (by Bking; author: Bking):

[operations/cookbooks@master] elasticsearch: remove ban cookbook

https://gerrit.wikimedia.org/r/1205199

Why is this task closed as invalid? This has broken (and still breaks) CI for all cookbooks for a week, see e.g. https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1202150

@Muehlenhoff Apologies, I was taking the ticket for myself to add the above patch. I'm not sure why Phab decided it was time to close the ticket ;) .

bking reopened this task as In Progress. Nov 14 2025, 6:12 PM

Change #1205199 merged by Bking:

[operations/cookbooks@master] elasticsearch: remove ban cookbook

https://gerrit.wikimedia.org/r/1205199

Since the above patch was merged, cookbook CI seems back to normal. As such, I'm closing out this ticket.

Mentioned in SAL (#wikimedia-operations) [2025-11-19T22:16:13Z] <ryankemper@cumin2002> START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot (apply updates) - ryankemper@cumin2002 - T390860

Mentioned in SAL (#wikimedia-operations) [2025-11-19T22:16:17Z] <ryankemper@cumin2002> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot (apply updates) - ryankemper@cumin2002 - T390860

Mentioned in SAL (#wikimedia-operations) [2025-11-19T22:22:47Z] <ryankemper@cumin2002> START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot (apply updates) - ryankemper@cumin2002 - T390860

Mentioned in SAL (#wikimedia-operations) [2025-11-20T12:13:44Z] <ryankemper@cumin2002> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REBOOT (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot (apply updates) - ryankemper@cumin2002 - T390860

Mentioned in SAL (#wikimedia-operations) [2025-11-20T22:45:02Z] <ryankemper@cumin2002> START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot (apply updates) - ryankemper@cumin2002 - T390860

Mentioned in SAL (#wikimedia-operations) [2025-11-21T04:18:40Z] <ryankemper@cumin2002> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REBOOT (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot (apply updates) - ryankemper@cumin2002 - T390860

Mentioned in SAL (#wikimedia-operations) [2025-11-24T22:45:36Z] <ryankemper@cumin2002> START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot (apply updates) - ryankemper@cumin2002 - T390860

Mentioned in SAL (#wikimedia-operations) [2025-11-24T22:59:14Z] <ryankemper@cumin2002> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REBOOT (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot (apply updates) - ryankemper@cumin2002 - T390860

Mentioned in SAL (#wikimedia-operations) [2025-11-24T22:59:34Z] <ryankemper@cumin2002> START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot (apply updates) - ryankemper@cumin2002 - T390860

Mentioned in SAL (#wikimedia-operations) [2025-11-25T01:44:45Z] <ryankemper@cumin2002> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REBOOT (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot (apply updates) - ryankemper@cumin2002 - T390860

Change #1203491 merged by Muehlenhoff:

[operations/cookbooks@master] test_import: Drop workaround for python-elasticsearch

https://gerrit.wikimedia.org/r/1203491