We want a functioning Search cluster in codfw. We assume we want an AP system (available and partition-tolerant, in CAP terms), so we'll keep the two clusters decoupled. What will happen is:
- Any Cirrus job is enqueued and writes to both DCs [1]
- If a job on one DC fails, re-enqueue just that job
[1] How to do this is debatable. If we do the parsing once and just make the jobrunners in the primary DC talk to the Elasticsearch clusters in both DCs, we spare quite a few resources, but we get higher cross-DC network traffic. If we spawn a job on the secondary DC jobqueue instead, it will be a bit more complex to manage and will use more resources, but we will save network bandwidth. T105705 is related to this.
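As a rough illustration of the write flow described above (one write per DC, and only the failed DC's write re-enqueued), here is a minimal Python sketch. It is not the Cirrus implementation: `WriteJob`, `enqueue_update`, `run_queue`, the in-memory queue and the `max_attempts` limit are all hypothetical names and assumptions made for the example, and the per-DC writers are plain callables standing in for whatever client actually talks to each cluster.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, Dict, List, Sequence


@dataclass
class WriteJob:
    """One index write aimed at a single datacenter (hypothetical type)."""
    doc_id: str
    body: dict
    dc: str
    attempts: int = 0


def enqueue_update(queue: Deque[WriteJob], doc_id: str, body: dict,
                   dcs: Sequence[str] = ("eqiad", "codfw")) -> None:
    """Fan one update out into one independent job per datacenter."""
    for dc in dcs:
        queue.append(WriteJob(doc_id, body, dc))


def run_queue(queue: Deque[WriteJob],
              writers: Dict[str, Callable[[str, dict], None]],
              max_attempts: int = 3) -> List[WriteJob]:
    """Drain the queue; a failed write re-enqueues only that DC's job.

    Returns the jobs that were given up on after max_attempts tries.
    """
    dropped: List[WriteJob] = []
    while queue:
        job = queue.popleft()
        try:
            writers[job.dc](job.doc_id, job.body)
        except Exception:
            job.attempts += 1
            if job.attempts < max_attempts:
                queue.append(job)    # retry this DC only
            else:
                dropped.append(job)  # leave it for manual follow-up
    return dropped
```

Because each DC gets its own job, a failure in codfw never causes the eqiad write to be repeated, which keeps the two clusters decoupled. The same shape applies to either option in [1]; what changes is where the queue and the writers live.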
Apart from design decisions, the steps here will be:
- Procure the hardware - 24 of the nicest servers we have in eqiad for search (?)
- Set up the hardware in multiple rows/racks
- Maybe throw in 3 small/old spares as master-only nodes?
- Puppet - check the puppet code for ''eqiadisms''
- Actually implement the job changes to write to both datacenters.
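For the Puppet step, a first pass at finding ''eqiadisms'' could be a plain text scan for hardcoded references to eqiad. The sketch below is only a starting point under stated assumptions: the function name `find_eqiadisms`, the list of file suffixes and the simple substring match are all choices made for this example, not something taken from the puppet repository, and every hit still needs a human to decide whether it is a real problem.

```python
from pathlib import Path
from typing import List, Sequence, Tuple


def find_eqiadisms(root: str,
                   suffixes: Sequence[str] = (".pp", ".erb", ".yaml"),
                   needle: str = "eqiad") -> List[Tuple[str, int, str]]:
    """Return (path, line number, line) for every line mentioning `needle`.

    Only files under `root` whose suffix is in `suffixes` are scanned.
    """
    hits: List[Tuple[str, int, str]] = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                hits.append((str(path), lineno, line.strip()))
    return hits
```

Calling `find_eqiadisms("modules/elasticsearch")` (a hypothetical path) would list candidate lines to review before the codfw hosts are puppetized.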