Figure out a replication strategy for ElasticSearch
Closed, Resolved (Public)


It might also be worth seeing how hard it would be to route searches to the master DC as a stop-gap.

Event Timeline

aaron created this task. Mar 7 2015, 11:14 AM
aaron claimed this task.
aaron raised the priority of this task to Normal.
aaron updated the task description.
aaron added subscribers: Krenair, PleaseStand, gerritbot and 3 others.
Krenair set Security to None.
Gilles added a subscriber: Gilles. Apr 2 2015, 12:24 PM

@erik: what is the status of the work done on this so far (which seems to have been a fair amount)?

EBernhardson closed this task as Resolved. Nov 13 2015, 7:46 AM

This is basically complete. The second cluster is up and taking the full write load of all Wikipedias. The strategy is to create jobs that each represent an individual write to Elasticsearch. These jobs are run in-process from within another job; if a write fails, for example due to a network partition or maintenance, the job is written out to the job queue. Queued jobs are retried with an exponential backoff between 30 seconds and 20 minutes. If a job is still failing more than 3 hours after the original write request, it is dropped and logged to the CirrusSearchChangeFailed channel. Dropped writes can be applied manually later by running the forceSearchIndex.php maintenance script with parameters specifying the cluster and the time period to reindex.
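
For anyone reading this later, here is a rough sketch of the retry policy described above. It is not the CirrusSearch code: it is written in Python for illustration only, the function names are hypothetical, and the doubling factor is an assumption (the comment only says the backoff is exponential, ranging from 30 seconds to 20 minutes). The 3 hour cutoff is measured from the original write request, as stated above.

```python
# Illustrative sketch only; names and the doubling factor are assumptions,
# not taken from the actual CirrusSearch implementation.

MIN_DELAY_S = 30             # first retry delay: 30 seconds
MAX_DELAY_S = 20 * 60        # delay cap: 20 minutes
DROP_AFTER_S = 3 * 60 * 60   # give up 3 hours after the original write


def backoff_delay(attempt: int) -> int:
    """Delay in seconds before retry number `attempt` (0-based).

    Assumes the delay doubles on each attempt and is capped at MAX_DELAY_S.
    """
    return min(MIN_DELAY_S * 2 ** attempt, MAX_DELAY_S)


def next_action(attempt: int, elapsed_s: int):
    """Decide what happens after a failed write.

    `elapsed_s` is the time since the original write request.
    Returns ("retry", delay_seconds) or ("drop", None). A dropped write
    would be logged so it can be reindexed manually later.
    """
    if elapsed_s > DROP_AFTER_S:
        return ("drop", None)
    return ("retry", backoff_delay(attempt))


if __name__ == "__main__":
    # Print the first few retry delays under the assumed doubling policy.
    for attempt in range(8):
        print(attempt, backoff_delay(attempt))
```

Under these assumptions the delay reaches the 20 minute cap by the seventh retry, and every later retry waits 20 minutes until the 3 hour limit is reached and the write is dropped.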