It might also be worth seeing how hard it would be to route searches to the master DC as a stop-gap.
This is basically complete. The second cluster is up and taking the full write load of all Wikipedias. The strategy is to create jobs that represent individual writes to Elasticsearch. These jobs are run in-process as part of another job; if a write fails because of a network partition or maintenance, the job is written out to the job queue instead. Queued jobs are retried with an exponential backoff between 30 seconds and 20 minutes. If a job is still failing more than 3 hours after the original write request, it is dropped and logged to the CirrusSearchChangeFailed channel. Dropped writes can be applied manually later by running the forceSearchIndex.php maintenance script with parameters specifying the cluster and the time period to reindex.
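
For anyone trying to picture the retry policy, here is a rough sketch in Python. It is not the actual CirrusSearch code: the function and constant names are made up for illustration, and the doubling factor is an assumption (the only things stated above are the 30 second and 20 minute bounds and the 3 hour cutoff).

```python
MIN_DELAY = 30            # seconds, lower bound of the backoff
MAX_DELAY = 20 * 60       # seconds, upper bound of the backoff
DROP_AFTER = 3 * 60 * 60  # seconds since the original write request

def backoff_delay(attempt: int) -> int:
    """Delay before retry number `attempt` (1-based).

    Assumes the delay doubles on each attempt; the real growth
    factor may differ.
    """
    return min(MIN_DELAY * 2 ** (attempt - 1), MAX_DELAY)

def next_action(attempt: int, seconds_since_request: int):
    """Decide what to do with a write job that has just failed."""
    if seconds_since_request > DROP_AFTER:
        # Too old: give up, and log so it can be reindexed manually later.
        return ("drop", None)
    return ("retry", backoff_delay(attempt))
```

Under these assumptions the delays run 30s, 60s, 120s and so on until they hit the 20 minute cap, and any job still failing past the 3 hour mark is dropped rather than requeued.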