
🟣 Spin up an Elasticsearch 7.10.2 cluster
Closed, Resolved (Public)

Description

We need to spin up a new Elasticsearch 7.10.2 cluster alongside our current Elasticsearch 6.8.23 cluster in both our staging and production environments.

We now build our own Elasticsearch 7.10.2 Bitnami-based image in: https://github.com/wbstack/elasticsearch

This ticket can be considered done when a new Elasticsearch 7.10.2 cluster is running, as described in:

https://phabricator.wikimedia.org/T330998#8823343

Event Timeline

Tarrow changed the task status from Open to Stalled. Jun 5 2023, 9:23 AM
Andrew-WMDE changed the task status from Stalled to Open. Jun 12 2023, 1:48 PM
Andrew-WMDE claimed this task.
Andrew-WMDE moved this task from To do to Doing on the Wikibase Cloud (Kanban board Q2 2023) board.

After updating to Kubernetes 1.25+ we realised that using the chart from https://github.com/wmde/wbaas-deploy/pull/887 wouldn't be possible: the legacy chart still templates API versions (such as policy/v1beta1 PodDisruptionBudget) that were removed in 1.25.

After talking to the WMF cloud people, it sounds like they will continue using ES 7.10 for a while (maybe even years?), so we plan to investigate using ECK to deploy ES 7 in some sort of supported way.

We discussed this internally at the time we discovered the issues with the legacy ES charts and the newer k8s; see https://docs.google.com/document/d/19wYNkgZgv-CjD-QYWBOXhHggMzXfXBSqy3rE5xuTpOg/edit for some internal figuring out.

It looks like the ECK option might be dead in the water. While the ECK helm operator is free to use, it looks like the underlying Elasticsearch helm chart requires an enterprise license.

see https://github.com/elastic/cloud-on-k8s/issues/6261
see https://github.com/elastic/cloud-on-k8s/blob/main/deploy/README.md#licensing

Patches:

Thought I'd copy some of our ad hoc discussions on mattermost to here to try and help people follow along.

> It looks like the ECK option might be dead in the water. While the ECK helm operator is free to use, it looks like the underlying Elasticsearch helm chart requires an enterprise license.

That's definitely unfortunate; I guess we could create our own Helm chart that would be openly licensed? This would look like https://github.com/elastic/cloud-on-k8s/blob/main/deploy/eck-stack/charts/eck-elasticsearch/templates/elasticsearch.yaml, I suppose, but probably in a minimal way (with more things hardcoded) and written from scratch so as not to violate the license terms.

Here's a look at using Elasticsearch charts from Bitnami instead.

This looks like it's dependent on us building a new image with a Bitnami base. We could do this independently or via the release pipeline. The pro of using the release pipeline is that the image would presumably not need to be maintained by us. However, we would need to convince the release pipeline team to accept our patch, along with the maintenance burden of a second Elasticsearch image.

Right now this seems to me like a bit of an unrealistic ask, although we should have a cross-team conversation about whether we could share images in the future.

If we were to do this independently, I would expect it to look like any of our other components: a new GitHub repo containing the Dockerfile and perhaps some tests, with the image pushed to ghcr for storage and then pulled from there in the deploy patch. To me this seems like the path of least resistance at the moment.
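As a rough sketch of that independent path, the build-and-push flow might look like the following. The owner/name/tag here are assumptions based on the repo discussed above, not final values, and the Docker commands are left commented because they need a GHCR login:

```shell
# Hypothetical image reference; OWNER/IMAGE/TAG are assumptions, not final values.
OWNER=wbstack
IMAGE=elasticsearch
TAG=7.10.2
REF="ghcr.io/${OWNER}/${IMAGE}:${TAG}"
echo "${REF}"

# Build from the Dockerfile in the new repo and push to ghcr
# (commented out; requires Docker and a GHCR token with write:packages):
# docker build -t "${REF}" .
# echo "${GHCR_TOKEN}" | docker login ghcr.io -u "${OWNER}" --password-stdin
# docker push "${REF}"
```

The deploy chart would then just reference that image by the same `ghcr.io/...` string in its values file.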

I think it would make driving T335854 slightly easier if we had an agreed config for the second cluster.

As it's the option with the least moving parts and externally managed unknowns, I would also favor using the Bitnami chart in conjunction with an image we maintain ourselves.

Looks like we've gone ahead with the self maintained image option!

There is a new repo created at https://github.com/wbstack/elasticsearch which has been loaded with an initial commit for a 7.10 image. I've gone ahead and made the "wmde contributors" group in that organisation Admins on that repo.

https://github.com/wmde/wbaas-deploy/pull/960 is up; it fixes the naming of the values files (from honey to 1) and also deploys the new release to the local dev configuration.

Is this now ready for review? If so, I suggest that whoever reviews this PR also reviews the "initial commit" in the image repo.

> Is this now ready for review?

Yes, let's merge the local cluster changes before moving on to staging and production.

Deniz_WMDE added a subscriber: Deniz_WMDE.

This attempt failed because the resources available on the staging cluster were insufficient.

Here is a new PR with tweaked resource definitions, but I think we have to increase staging node resources to make this work: https://github.com/wmde/wbaas-deploy/pull/980

Also, the old 7.10.2 release was still hanging around on staging, probably because we forgot to uninstall it previously. I removed it via $ kubectl delete statefulsets.apps elasticsearch-honey-master, because re-introducing the release and setting installed: false didn't work: https://phabricator.wikimedia.org/P49523

uninstall.go:97: [debug] uninstall: Deleting elasticsearch-honey
uninstall.go:119: [debug] uninstall: Failed to delete release: [unable to build kubernetes objects for delete: resource mapping not found for name: "elasticsearch-honey-master-pdb" namespace: "" from "": no matches for kind "PodDisruptionBudget" in version "policy/v1beta1"
ensure CRDs are installed first]
Error: failed to delete release: elasticsearch-honey
helm.go:84: [debug] failed to delete release: elasticsearch-honey
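For reference, an alternative to deleting the StatefulSet by hand would have been Helm's mapkubeapis plugin, which rewrites deprecated API versions (like the policy/v1beta1 PodDisruptionBudget in the log above) in the stored release metadata so that a normal uninstall can proceed. The release name comes from the log; the namespace is an assumption, and the commands are commented since they need cluster access:

```shell
# Sketch only: requires Helm and cluster access, so the real commands are commented out.
RELEASE=elasticsearch-honey
NAMESPACE=default   # assumption; the actual namespace isn't shown in the log
echo "would remap deprecated APIs for ${RELEASE} in ${NAMESPACE}"

# helm plugin install https://github.com/helm/helm-mapkubeapis
# helm mapkubeapis "${RELEASE}" --namespace "${NAMESPACE}"
# helm uninstall "${RELEASE}" --namespace "${NAMESPACE}"
```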

Following up on the insufficient resources on the staging cluster, here is a quick comparison (with estimated prices) of options I picked, alongside current references:

(all costs estimated with the GCP billing workload estimate feature for GCP project wikibase-cloud: link)

Reference - current machine types:

machine type   | CPU | RAM (GB) | cost per 3 nodes | note
n2-highmem-16  | 16  | 128      | 2,705.75 €       | currently in use for production cluster nodes
n2-standard-4  | 4   | 16       | 501.45 €         | currently in use for staging cluster nodes

Suggested alternatives for the staging cluster:

machine type   | CPU | RAM (GB) | cost per 3 nodes | note
e2-standard-8  | 8   | 32       | 691.93 €         | CPU selection based on availability
n2d-standard-8 | 8   | 32       | 872.48 €         | AMD EPYC CPU
n2-standard-8  | 8   | 32       | 1,002.90 €       | Intel Cascade Lake and Ice Lake CPUs
My personal conclusion

Since performance isn't (currently) an important aspect for the staging cluster, I'd suggest going with the cheapest machine type on this list, e2-standard-8, which would give us (very roughly speaking) double the resources for ~40% more cost.
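The rough numbers behind that claim, taken from the table above (prices in euro cents so the arithmetic stays in integers):

```shell
# Cost per 3 nodes, in euro cents, from the comparison table.
current=50145    # n2-standard-4 (current staging machine type)
proposed=69193   # e2-standard-8 (suggested alternative)

# Integer percent increase; actual value is ~38%, hence "~40% more cost".
pct=$(( (proposed - current) * 100 / current ))
echo "${pct}"

# Resources roughly double: 4 -> 8 CPUs and 16 -> 32 GB RAM per node.
```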

About temporary scaling: if my understanding is correct, in production we would add 3 nodes of the same type to the cluster until we can tear down ES 6, then remove those 3 nodes again. Did you consider this option as well?
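On GKE, that temporary scale-up/scale-down could be done with node pool resizes. The cluster and pool names below are placeholders (the real ones live in the deploy config), and the gcloud commands are commented since they require auth against the project:

```shell
# Placeholder names; substitute the real cluster/pool from the wbaas-deploy config.
CLUSTER=production-cluster
POOL=default-pool
echo "resize plan for ${CLUSTER}/${POOL}: 3 -> 6 -> 3 nodes"

# Scale up while ES 6 and ES 7 run side by side
# (commented out; requires gcloud auth for the GCP project):
# gcloud container clusters resize "${CLUSTER}" --node-pool "${POOL}" --num-nodes 6
# ...migrate search traffic, tear down ES 6, then scale back down...
# gcloud container clusters resize "${CLUSTER}" --node-pool "${POOL}" --num-nodes 3
```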

Tarrow changed the task status from Open to Stalled. Jul 17 2023, 9:31 AM

Stalled to prevent us spending a load of money before we're ready to start using this cluster

Deniz_WMDE renamed this task from Spin up an Elasticsearch 7.10.2 cluster to 🟠 Spin up an Elasticsearch 7.10.2 cluster. Aug 21 2023, 9:10 AM
Deniz_WMDE renamed this task from 🟠 Spin up an Elasticsearch 7.10.2 cluster to 🟣 Spin up an Elasticsearch 7.10.2 cluster. Aug 21 2023, 9:20 AM
Andrew-WMDE changed the task status from Stalled to Open.Oct 20 2023, 8:37 AM
Evelien_WMDE claimed this task.