T160832 - It took me not much more than 20 minutes from the start of the server issue until it was mitigated at the MediaWiki level.
No problems AFAICT with the load balancer; it did its job wonderfully.
However, during those 20 minutes, 5 million log entries were registered, mostly on the DBReplication channel, complaining about db1094. The replication check is crazy: the architecture should check at most once per second (and much better if less often than that, especially after a failure). I understand that the application server architecture makes this difficult to coordinate, but I think there could be ways to mitigate it, like some kind of short-term caching of the result, or sharing it (or increasing its TTL if that is already in place).
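As a rough illustration of the short-term caching idea, here is a minimal Python sketch (not MediaWiki code). The names `CachedLagCheck`, `probe`, `ttl` and `failure_ttl` are hypothetical, and the 1 second and 5 second values are only assumptions picked to match "at most once per second, less often after a failure".

```python
import time


class CachedLagCheck:
    """Cache the result of a replication lag probe for a short TTL.

    Hypothetical sketch: `probe` is any callable that returns the lag in
    seconds, or raises an exception when the replica cannot be checked.
    """

    def __init__(self, probe, ttl=1.0, failure_ttl=5.0, clock=time.monotonic):
        self.probe = probe
        self.ttl = ttl                  # assumed: reuse a good result for 1 s
        self.failure_ttl = failure_ttl  # assumed: back off longer after a failure
        self.clock = clock
        self._expires = None            # None means nothing is cached yet
        self._lag = None                # None means the last probe failed

    def get_lag(self):
        now = self.clock()
        if self._expires is not None and now < self._expires:
            return self._lag            # served from cache, no new probe
        try:
            self._lag = self.probe()
            self._expires = now + self.ttl
        except Exception:
            # Remember the failure so a broken replica is not re-probed
            # (and re-logged) on every request.
            self._lag = None
            self._expires = now + self.failure_ttl
        return self._lag
```

This only caches per process. Sharing the value between application servers would need some shared store, which is not shown here.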
This is not the only issue: DBReplication generates an error if the lag is greater than 1 second, even if the replication check itself has more than 1 second of error in its calculation. The thresholds should be something like notice at 2 seconds, warning at 5 seconds, and error at 15 seconds (or whatever amount is configured in MediaWiki).
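To make the tiered thresholds concrete, here is a small Python sketch (again not MediaWiki code). `LAG_THRESHOLDS` and `lag_log_level` are hypothetical names; the 2/5/15 second values come from the suggestion above, and treating each threshold as inclusive is an assumption.

```python
import logging

# (threshold in seconds, level), checked from most to least severe.
# Python's logging module has no NOTICE level, so INFO stands in for it here.
LAG_THRESHOLDS = (
    (15.0, logging.ERROR),
    (5.0, logging.WARNING),
    (2.0, logging.INFO),
)


def lag_log_level(lag_seconds, thresholds=LAG_THRESHOLDS):
    """Return the log level for a lag value, or None if nothing should be logged."""
    for limit, level in thresholds:
        if lag_seconds >= limit:
            return level
    return None
```

With these values, a lag of 1.5 seconds logs nothing, so measurement error of around 1 second no longer produces errors by itself.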