Support multiple datacenters in CirrusSearch
Closed, Resolved, Public

Description

Sometimes you have lots of servers and have to put them in different datacenters. Cirrus should be multi-DC aware, especially on writes.

A multi-write approach via the job queue (with the destination cluster in the job parameters) would probably work.

Event Timeline

demon raised the priority of this task to Medium.
demon updated the task description.
demon added subscribers: demon, Manybubbles.

I _think_ the right way to do this is to wrap all of our write operations in jobs. Then we can attach a target cluster to those jobs. Then we can either call them in process with the jobs that build the parameters to the write operations _or_ we can pitch them into the job queue and let them happen. We can do the main DC in process and the secondary through the job queue. Or both on the queue.
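A minimal sketch of that idea, in Python rather than CirrusSearch's PHP. Every name here (`WriteJob`, `FakeCluster`, `dispatch`, `drain`) is hypothetical and only illustrates the shape: each write is wrapped in a job that carries its target cluster, the primary DC's job runs in process, and the secondary's job goes onto a queue.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class WriteJob:
    """One write operation bound to one target cluster (hypothetical)."""
    cluster: str  # name of the target Elasticsearch cluster
    doc_id: str
    doc: dict


class FakeCluster:
    """Stand-in for an Elasticsearch cluster: stores documents in a dict."""

    def __init__(self):
        self.docs = {}

    def index(self, doc_id, doc):
        self.docs[doc_id] = doc


def run_job(job, clusters):
    """Execute a single write job against the cluster it names."""
    clusters[job.cluster].index(job.doc_id, job.doc)


def dispatch(doc_id, doc, primary, secondaries, clusters, queue):
    """Run the primary write in process; queue one job per secondary."""
    run_job(WriteJob(primary, doc_id, doc), clusters)
    for name in secondaries:
        queue.append(WriteJob(name, doc_id, doc))


def drain(queue, clusters):
    """What a job runner would do later: execute every queued job."""
    while queue:
        run_job(queue.popleft(), clusters)


clusters = {"eqiad": FakeCluster(), "codfw": FakeCluster()}
queue = deque()
dispatch("Q1", {"title": "Example"}, "eqiad", ["codfw"], clusters, queue)
drain(queue, clusters)
```

Putting both writes on the queue instead would only change `dispatch`: it would append a job for the primary as well rather than calling `run_job` directly.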

This seems sane. The part I'm not clear on, though, is which update jobs are triggered through the secondary datacenter. Perhaps you're just looking further ahead than WMF's current plan, but as I understand it the secondary datacenter will not perform any write operations, only reads. If one datacenter falls over, the secondary becomes the primary, but at that point writes are again happening only in the primary datacenter, and no jobs that trigger Elasticsearch write operations should run in the secondary datacenter.

After briefly talking to manybubbles yesterday it sounds like the idea here is:

A) There will be independent elasticsearch clusters in each datacenter.
B) Whichever datacenter does the writes will push jobs into the queues of remote datacenters to update their indexes

> B) Whichever datacenter does the writes will push jobs into the queues of remote datacenters to update their indexes

If we have more than one queue. If the queue is shared across both DCs then it's fine; all that matters is that we:

  1. Have the ability to write to one DC while the other is down.
  2. Can catch up when the other comes back up.

My idea for that was to wrap all write operations in their own job and each job would have the target elasticsearch cluster as a parameter. We could just execute both jobs immediately in process and catch failures - if any fail we can queue them. Or we could queue the secondary DC's writes always. Something like that.
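The "execute in process, queue on failure" variant could look roughly like the sketch below. It is Python rather than PHP, and `ClusterDown`, `FakeCluster`, `write_everywhere`, and `catch_up` are hypothetical names used only for illustration. It covers both requirements above: writes still succeed in one DC while the other is down, and the queued writes are replayed once it returns.

```python
from collections import deque


class ClusterDown(Exception):
    """Raised by the stand-in cluster when it is unreachable."""


class FakeCluster:
    """Stand-in for an Elasticsearch cluster that can be switched off."""

    def __init__(self):
        self.docs = {}
        self.up = True

    def index(self, doc_id, doc):
        if not self.up:
            raise ClusterDown()
        self.docs[doc_id] = doc


def write_everywhere(doc_id, doc, clusters, retry_queue):
    """Try each cluster in process; queue the write for any that fails."""
    for name, cluster in clusters.items():
        try:
            cluster.index(doc_id, doc)
        except ClusterDown:
            retry_queue.append((name, doc_id, doc))


def catch_up(clusters, retry_queue):
    """Replay each queued write once; writes that still fail stay queued."""
    for _ in range(len(retry_queue)):
        name, doc_id, doc = retry_queue.popleft()
        try:
            clusters[name].index(doc_id, doc)
        except ClusterDown:
            retry_queue.append((name, doc_id, doc))


clusters = {"eqiad": FakeCluster(), "codfw": FakeCluster()}
retry_queue = deque()

clusters["codfw"].up = False  # simulate the secondary DC being down
write_everywhere("Q1", {"title": "Example"}, clusters, retry_queue)

clusters["codfw"].up = True   # the secondary DC comes back
catch_up(clusters, retry_queue)
```

This sketch replays writes in the order they were queued; it does not guard against a replayed write overwriting a newer in-process write to the same document, which a real implementation would need to consider.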

Talked briefly to Aaron; it sounds like decisions about multi-DC are still in the ideation stage, but it's likely we will have a job queue per DC.

How chatty are the Cirrus updates? They don't look too chatty, so writes over the WAN should be OK, but I'm not entirely sure. I'm more concerned with the current implementation of ElasticaConnection. For the sake of simplicity, it might be clearer and easier if the configuration for a DC only knows how to talk to the ES cluster in its own DC.

I'm working on pausing writes first, though, and then will come back to this.

Change 235149 had a related patch set uploaded (by Deskana):
refactor out connection singleton

https://gerrit.wikimedia.org/r/235149

Change 235175 had a related patch set uploaded (by Deskana):
Remove connection singleton

https://gerrit.wikimedia.org/r/235175

Change 235175 merged by jenkins-bot:
Remove connection singleton

https://gerrit.wikimedia.org/r/235175

Change 235149 merged by jenkins-bot:
refactor out connection singleton

https://gerrit.wikimedia.org/r/235149

Change 237264 had a related patch set uploaded (by EBernhardson):
Enable communication with multiple datacenters

https://gerrit.wikimedia.org/r/237264

Change 237264 merged by jenkins-bot:
Enable communication with multiple datacenters

https://gerrit.wikimedia.org/r/237264

Deployed a patch to mediawiki-config; testwiki is now writing to both the standard eqiad cluster and the labsearch (single node) cluster. I'll turn on codfw tomorrow.

Change 255934 had a related patch set uploaded (by Reedy):
refactor out connection singleton

https://gerrit.wikimedia.org/r/255934

Change 255934 merged by jenkins-bot:
refactor out connection singleton

https://gerrit.wikimedia.org/r/255934