eqiad: 3x VM request for new opensearch cluster
Open, MediumPublic

Assigned To

None

Authored By

	bking
	Apr 8 2024, 7:32 PM

Description

Site/Location: eqiad
Number of systems: 3
Service: Mutualized Opensearch, a net-new shared opensearch cluster for light-to-medium use cases
Networking Requirements: private vlan
Processor Requirements: 4 vCPU
Memory: 8 GB RAM
Disks: 250 GB
Other Requirements: DRBD disabled

Related Objects

		Status	Subtype	Assigned	Task
		Open		None	T362105 EPIC: Mutualized opensearch cluster
		Open		None	T362107 eqiad: 3x VM request for new opensearch cluster

Event Timeline

bking created this task. Apr 8 2024, 7:32 PM

Restricted Application added a subscriber: Aklapper. Apr 8 2024, 7:33 PM

bking added a project: Data-Platform-SRE. Apr 8 2024, 7:33 PM

bking added a parent task: T362105: EPIC: Mutualized opensearch cluster.

bking updated the task description. Apr 8 2024, 8:19 PM

Reedy renamed this task from "eqiad: 3x VM for new opensearch cluster" to "eqiad: 3x VM request for new opensearch cluster". Apr 9 2024, 8:11 AM

Gehel moved this task from Incoming to Scratch on the Data-Platform-SRE board. Apr 15 2024, 1:04 PM

Gehel triaged this task as Medium priority. Apr 15 2024, 1:05 PM

Gehel moved this task from Scratch to Infrastructure on the Data-Platform-SRE board. Apr 15 2024, 1:08 PM

Looks good to me.

But it would be better to set these up with the default DRBD settings initially and check whether the I/O performance is sufficient. The need to disable DRBD is really only about I/O latency (etcd is very sensitive to that); throughput should be fine in practically all cases. Not disabling DRBD will simplify maintenance of the VMs in general: 1. they won't go down if a virt node is rebooted, and 2. if we upgrade a virt node, we won't need to temporarily convert them back to DRBD just to evict them from the node for the reimage.

@MoritzMuehlenhoff Thanks for the feedback, you've given me some food for thought. Here are my thoughts:

Like etcd, Opensearch is a distributed database application. So the I/O needs are likely to be similar.
Again like etcd, Opensearch has fault tolerance at the application level. So concerns about rebooting hypervisors or even wiping out VMs completely are minimal (especially if we were to add 2 more VMs).
My understanding (and please correct me if I'm wrong) is that using DRBD ties up resources on the backup node as well, especially disk space.

So if the choice is between:

1. Use 2x the requested disk resources above and accept an I/O penalty 100% of the time
2. Run 3 (or 5) VMs with full I/O performance, less overall disk space used, risking reboots/data loss on a single node

I definitely prefer option 2. If we went for option 2, your team would be under no obligation to notify us ahead of time for reboots. We would like a heads-up for reimages, particularly if we only have a 3-node cluster, but in general we have no need for fault tolerance at the VM level. Let me know your thoughts on this matter.
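To put rough numbers on the disk-space side of that choice, here is a minimal sketch of the arithmetic. It assumes, per my understanding above, that DRBD keeps one full mirror of each VM disk on a secondary node; the function name and the factor of 2 are illustrative assumptions, not measured Ganeti figures.

```python
def raw_disk_gb(vm_count: int, disk_gb: int, drbd: bool) -> int:
    """Total disk consumed across the virt cluster, in GB.

    Assumption: with DRBD each VM disk is mirrored once on a
    secondary node, so it consumes twice its nominal size.
    """
    copies = 2 if drbd else 1
    return vm_count * disk_gb * copies


# Scenarios from this task: 250 GB per VM, 3 VMs requested, 5 as a possible follow-up.
scenarios = [
    ("3 VMs, DRBD enabled", 3, True),
    ("3 VMs, DRBD disabled", 3, False),
    ("5 VMs, DRBD disabled", 5, False),
]

for label, vms, drbd in scenarios:
    print(f"{label}: {raw_disk_gb(vms, 250, drbd)} GB")
# 3 VMs, DRBD enabled: 1500 GB
# 3 VMs, DRBD disabled: 750 GB
# 5 VMs, DRBD disabled: 1250 GB
```

Under that assumption, even a 5-node cluster without DRBD would use less raw disk than the 3-node cluster with it.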

bking added a subscriber: RKemper. Wed, May 1, 3:15 PM
