As discussed at T306118#8177779, as a followup to T315274, it would be nice to do a test of x2 failure modes in production in order to gain confidence with the multi-DC deployment.
After deployment of the dbctl patch which will remove the x2 replicas from MediaWiki's configuration, there should be no need to test replication failure on the leaf nodes, since MW will have no way to connect to them. But we can test stopped and delayed replication on the codfw x2 master (db2142). We could also test connection failures.
The idea would be:
- Set multi-DC mode to testwiki only
- Disable paging alerts for db2142, db2143, db2144
- Stop replication on db2142
- Try some page views on testwiki, monitor the logs
- Start replication on db2142 with MASTER_DELAY=30, repeat tests.
- Restore normal replication on db2142.
- Simulate db2142 failure with iptables -I INPUT -p tcp --syn --src 10.192.0.0/16 --dport 3306 -j DROP. Fighting ferm, but it only needs to be deployed for a few minutes. Repeat tests. Restore with iptables -D INPUT 1.
- Repeat test with -j REJECT
- Restore alerts.
This could be done in the AU/EU overlap on Monday 5 Sep, assuming the dbctl patch is deployed by then.