Support multiple datacenters in CirrusSearch
Closed, Resolved, Public

Description

Sometimes you have lots of servers and have to put them in different datacenters. Cirrus should be multi-DC aware, especially on writes.

A multi-write approach via the job queue (with the destination cluster in the job parameters) would probably work.

Details

Subject	Repo	Branch	Lines +/-
refactor out connection singleton	mediawiki/extensions/CirrusSearch	REL1_26	+365 -306
refactor out connection singleton	mediawiki/extensions/CirrusSearch	master	+365 -306
Enable communication with multiple datacenters	mediawiki/extensions/CirrusSearch	master	+332 -89
Remove connection singleton	mediawiki/extensions/Elastica	master	+16 -32

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		• Deskana	T105703 Set up a CirrusSearch cluster in codfw (Dallas, Texas)
		Resolved		EBernhardson	T86781 Support multiple datacenters in CirrusSearch

Event Timeline

• demon created this task. Jan 14 2015, 5:12 PM

• demon raised the priority of this task to Medium.

• demon updated the task description.

• demon added projects: MediaWiki-Core-Team, CirrusSearch.

• demon added subscribers: • demon, • Manybubbles.

Restricted Application added a subscriber: Aklapper. Jan 14 2015, 5:12 PM

bd808 edited projects, added Discovery-ARCHIVED; removed MediaWiki-Core-Team. Apr 7 2015, 4:56 PM

• Manybubbles moved this task from Needs triage to Search on the Discovery-ARCHIVED board. May 7 2015, 7:59 PM

I _think_ the right way to do this is to wrap all of our write operations in jobs. Then we can attach a target cluster to those jobs. Then we can either call them in process with the jobs that build the parameters to the write operations _or_ we can pitch them into the job queue and let them happen. We can do the main DC in process and the secondary through the job queue. Or both on the queue.
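A minimal sketch of that idea, in Python rather than CirrusSearch's PHP and with everything kept in memory. The names `WriteJob`, `Cluster`, `dispatch` and `run_queue` are hypothetical and are not CirrusSearch or MediaWiki APIs; the point is only that each job carries its target cluster as a parameter, so the same job can run in process or be pulled off a queue later.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class WriteJob:
    """One write operation bound to one target cluster (hypothetical)."""
    cluster: str
    doc_id: str
    body: dict


@dataclass
class Cluster:
    """In-memory stand-in for a search cluster in one datacenter."""
    name: str
    index: dict = field(default_factory=dict)

    def apply(self, job: WriteJob) -> None:
        self.index[job.doc_id] = job.body


def dispatch(doc_id, body, clusters, primary, queue):
    """Run the primary cluster's job in process and enqueue the others."""
    for name in clusters:
        job = WriteJob(cluster=name, doc_id=doc_id, body=body)
        if name == primary:
            clusters[name].apply(job)
        else:
            queue.append(job)


def run_queue(queue, clusters):
    """Drain the queue, routing each job to the cluster named in its params."""
    while queue:
        job = queue.popleft()
        clusters[job.cluster].apply(job)
```

Putting every cluster's job on the queue (the "or both on the queue" variant) would only mean dropping the `primary` check in `dispatch`.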

• Manybubbles mentioned this in T99244: CirrusSearch: Support pausing writes to Elasticsearch. May 15 2015, 2:40 PM

This seems sane. The part I'm not clear on, though, is which update jobs would be triggered through the secondary datacenter. Perhaps you're just looking further ahead than WMF's current plan, but as I understand it the secondary datacenter will not be performing any write operations, only reads. If one datacenter falls over, the secondary will become the primary, but at that point the writes are again happening only in the primary datacenter, and no jobs that trigger Elasticsearch write operations should be triggered in the secondary datacenter.

After briefly talking to manybubbles yesterday it sounds like the idea here is:

A) There will be independent elasticsearch clusters in each datacenter.
B) Whichever datacenter does the writes will push jobs into the queues of remote datacenters to update their indexes

In T86781#1296475, @EBernhardson wrote:

B) Whichever datacenter does the writes will push jobs into the queues of remote datacenters to update their indexes

If we have more than one queue. If the queue is shared across both DCs then it's fine - all that matters is that we:

Have the ability to write to one DC while the other is down.
Can catch up when the other comes back up.

My idea for that was to wrap all write operations in their own job and each job would have the target elasticsearch cluster as a parameter. We could just execute both jobs immediately in process and catch failures - if any fail we can queue them. Or we could queue the secondary DC's writes always. Something like that.
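The "execute in process, queue on failure" variant could look roughly like the following Python sketch. `FakeCluster`, `ClusterDown`, `write_everywhere` and `catch_up` are made-up names for illustration, and the retry queue is a plain in-memory deque rather than the MediaWiki job queue.

```python
from collections import deque


class ClusterDown(Exception):
    """Raised by the stand-in cluster when it is unreachable."""


class FakeCluster:
    """In-memory stand-in for the search cluster in one datacenter."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.index = {}

    def write(self, doc_id, body):
        if not self.up:
            raise ClusterDown(self.name)
        self.index[doc_id] = body


def write_everywhere(doc_id, body, clusters, retry_queue):
    """Try every cluster in process; queue the writes that fail."""
    for cluster in clusters:
        try:
            cluster.write(doc_id, body)
        except ClusterDown:
            retry_queue.append((cluster.name, doc_id, body))


def catch_up(clusters, retry_queue):
    """Replay queued writes once; anything that still fails is re-queued."""
    by_name = {c.name: c for c in clusters}
    for _ in range(len(retry_queue)):
        name, doc_id, body = retry_queue.popleft()
        try:
            by_name[name].write(doc_id, body)
        except ClusterDown:
            retry_queue.append((name, doc_id, body))
```

This covers both requirements listed above: writes to the healthy DC succeed while the other is down, and the queue holds what the other DC needs to catch up. One caveat of this naive sketch is ordering: a queued write replayed late can overwrite a newer in-process write to the same document, so a real implementation would need to handle that.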

Talked briefly to Aaron. It sounds like decisions about multi-DC are still in the ideation stage, but it's likely we will have a job queue per DC.

How chatty are the Cirrus updates? They don't look too chatty, so they should be OK for writes over the WAN, but I'm not entirely sure. I'm more concerned with the current implementation of the ElasticaConnection. For the sake of simplicity, it might be clearer and easier if the configuration for a DC only knows how to talk to the ES cluster in its own DC.

I'm working on pausing writes first, though, and then will come back to this.

dcausse mentioned this in T105184: Parallelize the theory-testing pipeline. Jul 9 2015, 9:53 AM

• chasemp merged a task: T105709: Implement multi-DC support in CirrusSearch. Jul 22 2015, 10:39 PM

• chasemp added subscribers: Joe, Matanya, dcausse, • Gage.

• chasemp added a parent task: T105703: Set up a CirrusSearch cluster in codfw (Dallas, Texas). Jul 22 2015, 10:41 PM

EBernhardson merged a task: T105709: Implement multi-DC support in CirrusSearch. Jul 22 2015, 11:54 PM

• chasemp merged a task: T109734: enable cirrussearch to talk to two clusters. Sep 15 2015, 10:07 PM

• chasemp added subscribers: • Deskana, gerritbot, • chasemp.

EBernhardson claimed this task. Sep 16 2015, 7:55 PM

EBernhardson added a project: Discovery-Search (Current work).

EBernhardson set Security to None.

EBernhardson moved this task from Incoming to Needs review on the Discovery-Search (Current work) board.

Change 235149 had a related patch set uploaded (by Deskana):
refactor out connection singleton

https://gerrit.wikimedia.org/r/235149

Change 235175 had a related patch set uploaded (by Deskana):
Remove connection singleton

https://gerrit.wikimedia.org/r/235175

EBernhardson moved this task from Needs review to not in use - please delete on the Discovery-Search (Current work) board. Sep 21 2015, 11:29 PM

Change 235175 merged by jenkins-bot:
Remove connection singleton

https://gerrit.wikimedia.org/r/235175

• Deskana mentioned this in rEELA726d1c02048b: Remove connection singleton. Sep 23 2015, 12:26 PM

EBernhardson mentioned this in rMEXT13d13ec415f5: Updated mediawiki/extensions Project: mediawiki/extensions/Elastica…. Sep 23 2015, 12:26 PM

Change 235149 merged by jenkins-bot:
refactor out connection singleton

https://gerrit.wikimedia.org/r/235149

EBernhardson mentioned this in rMEXT9feeb96a0248: Updated mediawiki/extensions Project: mediawiki/extensions/CirrusSearch…. Sep 23 2015, 12:29 PM

• chasemp mentioned this in rECIR8d21229c4368: refactor out connection singleton. Sep 23 2015, 12:30 PM

• ksmith added a project: OKR-Work. Sep 24 2015, 4:39 PM

Change 237264 had a related patch set uploaded (by EBernhardson):
Enable communication with multiple datacenters

https://gerrit.wikimedia.org/r/237264

EBernhardson moved this task from not in use - please delete to Needs review on the Discovery-Search (Current work) board. Sep 28 2015, 6:40 PM

ReleaseTaggerBot added a project: MW-1.27-release (WMF-deploy-2015-10-06_(1.27.0-wmf.2)). Sep 29 2015, 7:57 PM

Jdforrester-WMF edited projects, added MW-1.27-release (WMF-deploy-2015-09-29_(1.27.0-wmf.1)); removed MW-1.27-release (WMF-deploy-2015-10-06_(1.27.0-wmf.2)). Sep 29 2015, 7:58 PM

Change 237264 merged by jenkins-bot:
Enable communication with multiple datacenters

https://gerrit.wikimedia.org/r/237264

EBernhardson mentioned this in rMEXTe24172d7666c: Updated mediawiki/extensions Project: mediawiki/extensions/CirrusSearch…. Oct 5 2015, 3:26 PM

dcausse mentioned this in rECIRdb2ac21e7538: Enable communication with multiple datacenters. Oct 5 2015, 3:36 PM

ReleaseTaggerBot added a project: MW-1.27-release (WMF-deploy-2015-10-06_(1.27.0-wmf.2)). Oct 5 2015, 4:00 PM

Deployed a patch to mediawiki-config, testwiki is now writing to both the standard eqiad cluster and the labsearch (single node) cluster. I'll turn on codfw tomorrow.

EBernhardson moved this task from Needs review to Needs Reporting on the Discovery-Search (Current work) board. Oct 20 2015, 4:36 PM

• Deskana closed this task as Resolved. Oct 27 2015, 8:46 AM

• Deskana moved this task from Needs Reporting to Resolved on the Discovery-Search (Current work) board.

Change 255934 had a related patch set uploaded (by Reedy):
refactor out connection singleton

https://gerrit.wikimedia.org/r/255934

Change 255934 merged by jenkins-bot:
refactor out connection singleton

https://gerrit.wikimedia.org/r/255934

• Deskana moved this task from Inbox to Resolved/Invalid/Declined/Legacy on the CirrusSearch board. Dec 31 2015, 5:10 AM