The s8 incident T206743: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") was caused by a replication gap (which is still under investigation).
Such a gap is hard to detect unless replication breaks (which should have happened; so far we don't know why it didn't).
There are plans to automatically check tables for data differences as part of T104459: Automate the check and fix of object, schema and data drifts between mediawiki HEAD, production masters and slaves. If we had had a process that checks some tables for differences, we could have detected that those tables differed, and an investigation would probably have started.
The idea is to check some important (but not too big) tables that are likely to have many changes during the day, so we can detect differences.
We could start with the user table on one wiki per section.
This table sees enough concurrent writes, yet it is small enough to be compared in less than an hour (for enwiki, for example). Even if a schema change is running on that table, the table is not big enough for the change to take more than 24h to complete.
We should use compare.py with not too aggressive options, so it can run automatically and without any risk.
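To illustrate what a non-aggressive comparison could look like, here is a minimal sketch that compares a table in primary-key chunks and pauses between chunks. It is not how compare.py is implemented and does not use its options; the function names, the `step` and `pause` parameters, and the SQLite demo database are all assumptions made for illustration only.

```python
import hashlib
import sqlite3
import time


def chunk_checksum(conn, table, pk, start, end):
    """Hash all rows whose primary key lies in [start, end)."""
    cur = conn.execute(
        f"SELECT * FROM {table} WHERE {pk} >= ? AND {pk} < ? ORDER BY {pk}",
        (start, end),
    )
    digest = hashlib.sha256()
    for row in cur:
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


def compare_table(conn_a, conn_b, table, pk, max_id, step=1000, pause=0.0):
    """Return the (start, end) primary-key ranges whose contents differ."""
    differing = []
    for start in range(0, max_id + 1, step):
        end = start + step
        checksum_a = chunk_checksum(conn_a, table, pk, start, end)
        checksum_b = chunk_checksum(conn_b, table, pk, start, end)
        if checksum_a != checksum_b:
            differing.append((start, end))
        time.sleep(pause)  # throttle so the check stays gentle on both hosts
    return differing


def make_demo_db(rows):
    """Build an in-memory stand-in for a (very simplified) user table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE user (user_id INTEGER PRIMARY KEY, user_name TEXT)")
    conn.executemany("INSERT INTO user VALUES (?, ?)", rows)
    return conn
```

Smaller chunks and a longer pause make the check slower but lighter on the hosts; the differing ranges it returns tell us where to start investigating.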
A first approach could be to compare one host per DC (perhaps the candidate master of each DC) once a day and, if there are differences, send an email, the same way we do with the data checks on labs.
Right now there is no inventory from which we can automatically detect and select the candidate masters, so the hosts to check have to be hardcoded (or selected by grepping db-eqiad.php and db-codfw.php).
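The daily check with a hardcoded host list could be sketched roughly as follows. Everything here is hypothetical: the host names, the `HOSTS_TO_CHECK` mapping, and the `compare` callable, which stands in for whatever wrapper ends up calling compare.py (its command-line options are deliberately not assumed here). The callable is expected to return a true value when the two hosts differ.

```python
from email.message import EmailMessage

# Hypothetical, hand-maintained list: section -> (wiki, eqiad host, codfw host).
HOSTS_TO_CHECK = {
    "s1": ("enwiki", "db-eqiad.example", "db-codfw.example"),
}


def run_daily_check(hosts, compare, table="user"):
    """Compare `table` on each host pair and return one line per difference."""
    findings = []
    for section, (wiki, eqiad_host, codfw_host) in sorted(hosts.items()):
        # `compare` is a placeholder for the real comparison (e.g. a compare.py wrapper).
        if compare(eqiad_host, codfw_host, wiki, table):
            findings.append(
                f"{section}: {wiki}.{table} differs between {eqiad_host} and {codfw_host}"
            )
    return findings


def build_report(findings, sender, recipient):
    """Return an EmailMessage listing the differences, or None if there are none."""
    if not findings:
        return None
    msg = EmailMessage()
    msg["Subject"] = f"Data drift detected in {len(findings)} section(s)"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("\n".join(findings))
    return msg
```

The resulting message could then be handed to `smtplib.SMTP(...).send_message(msg)` from a daily cron job or timer. Keeping the comparison behind a callable means the host list can later be replaced by an inventory lookup without touching the reporting code.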