We have a good mix of storage systems (mariadb primaries, mariadb external stores, swift, redis (main stash), cassandra).
For stores that are strongly consistent over DCs (we don't have many), there isn't much to do, though such stores would only be in two DCs, which is odd for an HA system.
For last-write-wins stores like cassandra, we need to know:
a) How much data might be lost on emergency switch-over? (With LOCAL consistency writes in cassandra, for example, can we get a sense of the lag in getting data to remote DCs?)
b) If "newer" data at the time that was lost is visible again and then replicates, does that hurt anything?
c) Are any functional data dependencies assumed between these stores and other stores that matter (e.g. mariadb rows that have corresponding restbase entries)
d) If QUORUM reads/writes are used, they might fail if 1 of 2 DCs is down...that implies a need for switch-over...
e) How do we go read only and "wait for things to catch up" with scripts?
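To make point (b) concrete, here is a minimal sketch of last-write-wins merging and what a switch-over does to it. The function and scenario are hypothetical illustrations, not cassandra code; cassandra resolves conflicts per cell by write timestamp in essentially this way.

```python
def lww_merge(replica_values):
    """Pick the value with the highest write timestamp (last-write-wins).
    Input: list of (timestamp, value) pairs from different replicas."""
    return max(replica_values, key=lambda tv: tv[0])

# Hypothetical emergency switch-over:
# DC1 takes a write at t=100 that never replicates before DC1 is lost.
dc2_state = [(90, "old")]                  # remote DC only saw the older write
assert lww_merge(dc2_state)[1] == "old"    # the t=100 write is simply gone

# If DC1 later comes back, its t=100 write replicates and "wins" again,
# silently replacing anything written in DC2 between t=90 and t=100.
dc2_wrote = (95, "written-during-failover")
recovered = lww_merge([dc2_wrote, (100, "resurrected")])
assert recovered[1] == "resurrected"
```

So "newer" data that was lost and then reappears does not merely come back: it can overwrite legitimate writes made during the failover window, which is exactly the dependency question in (c).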
For eventually consistent stores like swift + swiftrepl, we need to know:
a) How much data might be lost on switch over (can we have a sense of "lag times", hard to do since there is no replication log)
b) Are any functional data dependencies assumed between these stores and other stores that matter (e.g. mariadb rows that have corresponding restbase entries)
c) What do we need for fast master/slave cluster switchover?
d) How do we reconcile swift DELETEs safely without tombstones? Should MW use temp 404 tombstones? Would SHA1 original paths help?
e) How do we go read only and "wait for things to catch up" with scripts?
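A sketch of the 404-tombstone idea from (d): when a file is missing, a marker object (hypothetical naming here) lets us distinguish an intentional DELETE from a file that is merely absent due to switch-over or replication lag. This is an assumption about how MW might record deletions, not existing behavior.

```python
def file_status(store, path):
    """Classify a missing file using a hypothetical tombstone marker
    written alongside the DELETE (store is modeled as a dict)."""
    if path in store:
        return "present"
    if path + ".tombstone" in store:
        return "deleted"   # DELETE was intentional; safe to replicate it
    return "unknown"       # possibly lost in switch-over; do NOT re-delete

store = {"img/a.png": b"..."}
assert file_status(store, "img/a.png") == "present"

store.pop("img/a.png")
store["img/a.png.tombstone"] = b""   # MW records the deletion
assert file_status(store, "img/a.png") == "deleted"
assert file_status(store, "img/b.png") == "unknown"
```

Without the marker, swiftrepl changing direction cannot tell "deleted here" from "never arrived there", which is how DELETEs get resurrected or propagated wrongly.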
For replicated stores like mariadb slaves, we need to know:
a) How much data might be lost on switch over (how robust is "too lagged; read only" mode?)
b) Are any functional data dependencies assumed between these stores and other stores that matter (e.g. mariadb rows that have corresponding restbase/ES entries)
c) How do we go read only and "wait for things to catch up" with scripts?
For caches (memcached/CDN), we need to make sure:
a) they are not used to determine writes
b) if failover causes some recent data to be lost, then bogus entries might matter; should WAN cache support generation time range blacklisting? do we also do CDN bans based on timestamp?
c) what kind of bogus entries we are willing to tolerate
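The generation-time-range blacklisting idea in (b) could look roughly like this. The entry layout and window check are hypothetical; the point is that entries generated during the failover window may be derived from writes that were later lost, so they are treated as misses and regenerated.

```python
def cache_get(cache, key, bad_windows):
    """Return a cached value, treating it as a miss if it was generated
    inside a blacklisted (start, end) time window."""
    entry = cache.get(key)
    if entry is None:
        return None
    gen_time, value = entry
    for start, end in bad_windows:
        if start <= gen_time <= end:
            return None   # possibly derived from lost writes; regenerate
    return value

cache = {"user:5": (1000, "stale-name"), "user:6": (2000, "fresh-name")}
bad = [(900, 1100)]   # window straddling the switch-over
assert cache_get(cache, "user:5", bad) is None
assert cache_get(cache, "user:6", bad) == "fresh-name"
```

CDN bans based on timestamp would be the analogous operation on the edge caches, though bans are coarser since the CDN has no per-entry generation metadata to consult at read time.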
Since some stores have several users ("calling code"), we need to check each. For example, users of FileBackend::doQuickOperations() can tolerate sloppy LWW logic and the occasional DELETE due to swift-repl changing direction. For FileBackend::doOperations() calls, like originals, we want to be more careful (like using 404 tombstones to know a missing file was deleted and not just missing due to switch-over).
A general issue is also that of config. MediaWiki has lots of config deciding which DB/swift to talk to. If stale config is still running in prod due to network partitions (breaking scap and so on), we can still have problems. One option is to have config switches around "what DC is active". The active DC name could be pulled from a file, managed by etcd. We could have apaches go read only if that file is out of contact with etcd. I'll spin this off mostly to another task: T114273.
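A minimal sketch of the "active DC file" idea, assuming an etcd-managed state file with a freshness field (the JSON layout and staleness threshold are invented for illustration):

```python
import json

def current_mode(state_file_contents, now, max_staleness=60):
    """Decide apache behavior from the etcd-managed DC state file.
    Go read-only if the file hasn't been refreshed recently, i.e. we
    may have lost contact with etcd and the config could be stale."""
    state = json.loads(state_file_contents)
    if now - state["last_updated"] > max_staleness:
        return ("read-only", state["active_dc"])
    return ("read-write", state["active_dc"])

fresh = json.dumps({"active_dc": "eqiad", "last_updated": 1000})
assert current_mode(fresh, now=1030) == ("read-write", "eqiad")
assert current_mode(fresh, now=1200) == ("read-only", "eqiad")
```

This makes "stale config keeps writing to the wrong DC" fail safe: out-of-contact apaches degrade to read-only rather than split-brain writes.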
Some notes archived at https://etherpad.wikimedia.org/p/multi-dc-mediawiki