Problem
To serve read requests from both DCs, MediaWiki needs the replication lag between the two DCs to be kept to a minimum. Using GTID for this has proven problematic, so the current thinking is to deprecate it in favour of pt-heartbeat.
Proposal
Currently, pt-heartbeat is used only for monitoring lag. We would need to start using it in MW as the mechanism for waiting on replicas to catch up. We would also need to introduce the notion of multiple DCs and "distant replicas", so that MW code can decide whether it needs to wait on all replicas or only on the ones in the local DC. Background update jobs, such as refreshlinks, would then use this to finish execution only once all of the replicas have been updated. This keeps the differences between reads served from the two DCs to a minimum.
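To make the proposal more concrete, here is a minimal sketch of the waiting logic in Python (not MW code). It assumes each replica can report the newest heartbeat timestamp it has replicated, and that a replica counts as caught up once that timestamp reaches the one recorded on the master at write time. The names Replica, select_replicas, wait_for_replicas and last_heartbeat are hypothetical and exist only for illustration; a real implementation would read the heartbeat table maintained by pt-heartbeat.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Replica:
    name: str
    dc: str
    # Hypothetical callable returning the newest heartbeat timestamp
    # (seconds) that has been replicated to this host.
    last_heartbeat: Callable[[], float]


def select_replicas(replicas: Iterable[Replica], local_dc: str,
                    include_distant: bool) -> List[Replica]:
    """Pick local replicas only, or local plus distant ones."""
    return [r for r in replicas if include_distant or r.dc == local_dc]


def wait_for_replicas(replicas: Iterable[Replica], target_ts: float,
                      timeout: float = 10.0, poll_interval: float = 0.1,
                      clock: Callable[[], float] = time.monotonic,
                      sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Poll until every replica has replicated a heartbeat >= target_ts.

    Returns the names of replicas still behind when the timeout expires;
    an empty list means all of them caught up.
    """
    deadline = clock() + timeout
    pending = list(replicas)
    while True:
        # Keep only the replicas whose heartbeat is still older than the target.
        pending = [r for r in pending if r.last_heartbeat() < target_ts]
        if not pending or clock() >= deadline:
            return [r.name for r in pending]
        sleep(poll_interval)
```

Under these assumptions, a background job such as refreshlinks would call select_replicas with include_distant=True before finishing, while a regular web request could pass False and wait only on the local DC. What to do with the replicas returned as still lagging (retry, fail the job, or proceed) is left open here.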
Open Questions
- Would it be acceptable for Chronology Protector to have a longer wait for content in cases where GETs for the current user are served from the secondary DC?
- How should failure scenarios, such as split brain or network connectivity issues between the two DCs, be addressed?