How to do this is an open question. Some thoughts:
- Detect a gate-and-submit failure that occurs after a change has already received V+2.
- Report the number of "recheck" change comments. We might also compare the test results before and after each recheck and count any difference as flapping, since a recheck reruns the same code.
- Somehow track individual failing test names, and show a heat map of the most frequent failures.
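
As a rough illustration of the last two ideas, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `TestRun` record, the `"SUCCESS"`/`"FAILURE"` status strings, and the rule that a comment counts as a recheck when its first line starts with "recheck" are all hypothetical and not taken from any existing tool or API. How comments and test results would actually be collected is left open.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class TestRun:
    """One CI run on a change (hypothetical record, not a real API type)."""
    patchset: int
    results: dict[str, str]  # test name -> "SUCCESS" or "FAILURE" (assumed values)


def count_rechecks(comments: list[str]) -> int:
    """Count change comments whose first line starts with 'recheck'."""
    count = 0
    for comment in comments:
        lines = comment.strip().splitlines()
        if lines and lines[0].strip().lower().startswith("recheck"):
            count += 1
    return count


def flapping_tests(before: TestRun, after: TestRun) -> set[str]:
    """Return tests whose outcome changed between two runs of the same patchset."""
    if before.patchset != after.patchset:
        # The code changed, so a different result is not evidence of flapping.
        return set()
    shared = before.results.keys() & after.results.keys()
    return {name for name in shared if before.results[name] != after.results[name]}


def failure_counts(runs: list[TestRun]) -> Counter:
    """Tally failures per test name, the raw input a heat map would need."""
    counts = Counter()
    for run in runs:
        counts.update(
            name for name, status in run.results.items() if status == "FAILURE"
        )
    return counts


# Example data, invented for illustration.
comments = ["Patch Set 1: Verified-1", "recheck", "Recheck bug 12345", "Looks good"]
first = TestRun(1, {"test_boot": "FAILURE", "test_volume": "SUCCESS"})
second = TestRun(1, {"test_boot": "SUCCESS", "test_volume": "SUCCESS"})

print(count_rechecks(comments))                        # 2
print(flapping_tests(first, second))                   # {'test_boot'}
print(failure_counts([first, second]).most_common(1))  # [('test_boot', 1)]
```

Comparing only runs of the same patchset keeps genuine fixes from being counted as flapping. The per-test tally is deliberately just a `Counter`; how to bucket it (by day, by job, by project) for a heat map is a separate decision.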
