The alert `ToolsDBReplicationLagIsTooHigh` fires about once a month, and replication usually takes a few days to catch up:
*{T368250}
*{T357979}
*{T357264}
*{T355411}
*{T345450}
*{T343819}
*{T341891}
*{T338031}
Each task identifies the database that caused the issue. So far these have been:
* `s54113__spacemedia` (twice)
* `s51434__mixnmatch_p` (once)
* `s51698__yetkin` (three times)
* `s55593__PAGES` (once)
* `s55462__imagehashpublic_p` (once)
This is generally caused by large deletes executed in a single transaction. With row-based replication (RBR), such a delete is replicated as thousands of row events that the replica has to apply.
We should investigate what we can do to make this less frequent. One option is to add indexes to columns affected by the deletes. Another is to migrate the databases causing this issue to dedicated Trove instances (related: {T291782}).
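To illustrate the index option, here is a minimal sketch of how an index on the column used in a delete's `WHERE` clause changes the way rows are located. It uses Python's built-in `sqlite3` module purely as a stand-in, not ToolsDB's MariaDB, and the `images` table, its `batch` column and the index name are hypothetical. Whether a replica benefits in the same way depends on how it looks up rows when applying row events, so treat this as an analogy to verify against the actual tables rather than a measurement.

```python
import sqlite3


def delete_plan(conn, sql):
    """Return the plan detail strings SQLite reports for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # Each row is (id, parent, notused, detail); keep only the detail text.
    return [row[3] for row in rows]


# Hypothetical table standing in for a tool's table; no index, no primary key.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE images (id INTEGER, batch INTEGER, hash TEXT)")
conn.executemany(
    "INSERT INTO images VALUES (?, ?, ?)",
    [(i, i % 10, "h%d" % i) for i in range(1000)],
)

sql = "DELETE FROM images WHERE batch = 3"

# Without an index, the rows to delete are found by scanning the whole table.
plan_before = delete_plan(conn, sql)

# With an index on the filtered column, the rows are found via the index.
conn.execute("CREATE INDEX idx_images_batch ON images (batch)")
plan_after = delete_plan(conn, sql)

print("before:", plan_before)
print("after: ", plan_after)
```

The exact wording of the plan text varies between SQLite versions, but the first plan should report a table scan and the second should mention `idx_images_batch`. On MariaDB, the equivalent check would be running `EXPLAIN` on the offending delete before and after adding the index.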