
🟣 Spin up an Elasticsearch 7.10.2 cluster
Closed, Resolved (Public)

Description

We need to spin up a new Elasticsearch 7.10.2 cluster alongside our current Elasticsearch 6.8.23 cluster in both our staging and production environments.

We now build our own Elasticsearch 7.10.2 Bitnami-based image in: https://github.com/wbstack/elasticsearch

This ticket can be considered done when a new Elasticsearch 7.10.2 cluster is running, as described in:

https://phabricator.wikimedia.org/T330998#8823343

Event Timeline

Tarrow changed the task status from Open to Stalled. Jun 5 2023, 9:23 AM
Andrew-WMDE changed the task status from Stalled to Open. Jun 12 2023, 1:48 PM
Andrew-WMDE claimed this task.
Andrew-WMDE moved this task from To do to Doing on the Wikibase Cloud (Kanban board Q2 2023) board.

After updating to Kubernetes 1.25+ we realised that using the chart from https://github.com/wmde/wbaas-deploy/pull/887 wouldn't be possible: the legacy chart still templates API versions (such as policy/v1beta1 PodDisruptionBudget) that were removed in 1.25.

After talking to the WMF cloud people, it sounds like they will continue using ES 7.10 for a while (maybe even years?), so we plan to investigate using ECK to deploy ES 7 in some sort of supported way.

We discussed this internally at the time we discovered the issues with the legacy ES charts and the newer k8s; see https://docs.google.com/document/d/19wYNkgZgv-CjD-QYWBOXhHggMzXfXBSqy3rE5xuTpOg/edit for some internal figuring out.

It looks like the ECK option might be dead in the water. While the ECK helm operator is free to use, it looks like the underlying Elasticsearch helm chart requires an enterprise license.

see https://github.com/elastic/cloud-on-k8s/issues/6261
see https://github.com/elastic/cloud-on-k8s/blob/main/deploy/README.md#licensing

Patches:

Thought I'd copy some of our ad hoc discussions on mattermost to here to try and help people follow along.

> It looks like the ECK option might be dead in the water. While the ECK helm operator is free to use, it looks like the underlying Elasticsearch helm chart requires an enterprise license.

That's definitely unfortunate; I guess we could create our own Helm chart that would be openly licensed? This would look like https://github.com/elastic/cloud-on-k8s/blob/main/deploy/eck-stack/charts/eck-elasticsearch/templates/elasticsearch.yaml, I suppose, but probably in a minimal way (with more things hardcoded) and written from scratch so as not to violate the license terms.

Here's a look at using Elasticsearch charts from Bitnami instead.

This looks like it's dependent on us building a new image with a Bitnami base. We could do this independently or via the release pipeline. The pro of using the release pipeline is that the image would presumably not need to be maintained by us. However, we would need to convince the release pipeline team to accept our patch, along with the maintenance burden of a second Elasticsearch image.

Right now this seems to me like a bit of an unrealistic ask, although we should have a cross-team conversation about whether we could share images in the future.

If we were to do this independently, I would expect it to look like any of our other components: a new GitHub repo containing the Dockerfile and perhaps some tests, with the image pushed to ghcr for storage and then pulled from there in the deploy patch. To me this seems like the path of least resistance at the moment.
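As a rough sketch of that independent path, the build-and-push flow might look like the following. The owner/name/tag here are assumptions based on the repo discussed above, not final values, and the Docker commands are left commented because they need a GHCR login:

```shell
# Hypothetical image reference; OWNER/IMAGE/TAG are assumptions, not final values.
OWNER=wbstack
IMAGE=elasticsearch
TAG=7.10.2
REF="ghcr.io/${OWNER}/${IMAGE}:${TAG}"
echo "${REF}"

# Build from the Dockerfile in the new repo and push to ghcr
# (commented out; requires Docker and a GHCR token with write:packages):
# docker build -t "${REF}" .
# echo "${GHCR_TOKEN}" | docker login ghcr.io -u "${OWNER}" --password-stdin
# docker push "${REF}"
```

The deploy chart would then just reference that image by the same `ghcr.io/...` string in its values file.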

I think it would make driving T335854 slightly easier if we had an agreed config for the second cluster.

As it's the option with the least moving parts and externally managed unknowns, I would also favor using the Bitnami chart in conjunction with an image we maintain ourselves.

Looks like we've gone ahead with the self maintained image option!

There is a new repo created at https://github.com/wbstack/elasticsearch which has been loaded with an initial commit for a 7.10 image. I've gone ahead and made the "wmde contributors" group in that organisation Admins on that repo.

https://github.com/wmde/wbaas-deploy/pull/960 is up; it fixes the naming of the values files (from honey to 1) and also deploys the new release to the local dev configuration.

Is this now ready for review? If so, I suggest that whoever reviews this PR also reviews the "initial commit" in the image repo.

> Is this now ready for review?

Yes, let's merge the local cluster changes before moving on to staging and production.

Deniz_WMDE added a subscriber: Deniz_WMDE.

This attempt failed because the resources available on the staging cluster were insufficient.

Here is a new PR with tweaked resource definitions, but I think we have to increase staging node resources to make this work: https://github.com/wmde/wbaas-deploy/pull/980

Also, the old 7.10.2 release was still hanging around on staging, probably because we forgot to uninstall it previously. I removed it via $ kubectl delete statefulsets.apps elasticsearch-honey-master, because re-introducing the release and setting installed: false didn't work: https://phabricator.wikimedia.org/P49523

uninstall.go:97: [debug] uninstall: Deleting elasticsearch-honey
uninstall.go:119: [debug] uninstall: Failed to delete release: [unable to build kubernetes objects for delete: resource mapping not found for name: "elasticsearch-honey-master-pdb" namespace: "" from "": no matches for kind "PodDisruptionBudget" in version "policy/v1beta1"
ensure CRDs are installed first]
Error: failed to delete release: elasticsearch-honey
helm.go:84: [debug] failed to delete release: elasticsearch-honey
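For reference, an alternative to deleting the StatefulSet by hand would have been Helm's mapkubeapis plugin, which rewrites deprecated API versions (like the policy/v1beta1 PodDisruptionBudget in the log above) in the stored release metadata so that a normal uninstall can proceed. The release name comes from the log; the namespace is an assumption, and the commands are commented since they need cluster access:

```shell
# Sketch only: requires Helm and cluster access, so the real commands are commented out.
RELEASE=elasticsearch-honey
NAMESPACE=default   # assumption; the actual namespace isn't shown in the log
echo "would remap deprecated APIs for ${RELEASE} in ${NAMESPACE}"

# helm plugin install https://github.com/helm/helm-mapkubeapis
# helm mapkubeapis "${RELEASE}" --namespace "${NAMESPACE}"
# helm uninstall "${RELEASE}" --namespace "${NAMESPACE}"
```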

Following up on the insufficient resources on the staging cluster, here is a quick comparison (with estimated prices) of options I picked, alongside current references:

(all costs estimated with the GCP billing workload estimate feature for GCP project wikibase-cloud: link)

Reference - current machine types:

machine type   | CPU | RAM (GB) | cost per 3 nodes | note
n2-highmem-16  | 16  | 128      | 2,705.75 €       | currently in use for production cluster nodes
n2-standard-4  | 4   | 16       | 501.45 €         | currently in use for staging cluster nodes

Suggested alternatives for the staging cluster:

machine type   | CPU | RAM (GB) | cost per 3 nodes | note
e2-standard-8  | 8   | 32       | 691.93 €         | CPU selection based on availability
n2d-standard-8 | 8   | 32       | 872.48 €         | AMD EPYC CPU
n2-standard-8  | 8   | 32       | 1,002.90 €       | Intel Cascade Lake and Ice Lake CPUs
My personal conclusion

Since performance isn't (currently) an important aspect for the staging cluster, I'd suggest going with the cheapest machine type on this list, e2-standard-8, which would give us (very roughly speaking) double the resources for ~40% more cost.
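The rough numbers behind that claim, taken from the table above (prices in euro cents so the arithmetic stays in integers):

```shell
# Cost per 3 nodes, in euro cents, from the comparison table.
current=50145    # n2-standard-4 (current staging machine type)
proposed=69193   # e2-standard-8 (suggested alternative)

# Integer percent increase; actual value is ~38%, hence "~40% more cost".
pct=$(( (proposed - current) * 100 / current ))
echo "${pct}"

# Resources roughly double: 4 -> 8 CPUs and 16 -> 32 GB RAM per node.
```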

About temporary scaling: if my understanding is correct, in production we would add 3 nodes of the same type to the cluster until we can tear down ES 6, then remove those 3 nodes again. Did you consider this option as well?
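On GKE, that temporary scale-up/scale-down could be done with node pool resizes. The cluster and pool names below are placeholders (the real ones live in the deploy config), and the gcloud commands are commented since they require auth against the project:

```shell
# Placeholder names; substitute the real cluster/pool from the wbaas-deploy config.
CLUSTER=production-cluster
POOL=default-pool
echo "resize plan for ${CLUSTER}/${POOL}: 3 -> 6 -> 3 nodes"

# Scale up while ES 6 and ES 7 run side by side
# (commented out; requires gcloud auth for the GCP project):
# gcloud container clusters resize "${CLUSTER}" --node-pool "${POOL}" --num-nodes 6
# ...migrate search traffic, tear down ES 6, then scale back down...
# gcloud container clusters resize "${CLUSTER}" --node-pool "${POOL}" --num-nodes 3
```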

Tarrow changed the task status from Open to Stalled. Jul 17 2023, 9:31 AM

Stalled to prevent us spending a load of money before we're ready to start using this cluster

Deniz_WMDE renamed this task from Spin up an Elasticsearch 7.10.2 cluster to 🟠 Spin up an Elasticsearch 7.10.2 cluster. Aug 21 2023, 9:10 AM
Deniz_WMDE renamed this task from 🟠 Spin up an Elasticsearch 7.10.2 cluster to 🟣 Spin up an Elasticsearch 7.10.2 cluster. Aug 21 2023, 9:20 AM
Andrew-WMDE changed the task status from Stalled to Open.Oct 20 2023, 8:37 AM
Evelien_WMDE claimed this task.