We seem to constantly be fighting a battle against flaky browser tests.
We go through a few prolonged cycles in this area which just make the situation worse all around:
Assuming they are random, and rechecking
- My CI fails for a seemingly flaky reason
- I type "recheck", re-running ALL jobs (not only the failed one), probably expending 30-60 minutes' worth of execution time on CI nodes
- The recheck turns it green, so the issue doesn't get flagged, investigated, and fixed
Failing to report issues & save needed details
- Report a seemingly flaky browser test in CI at a high level, linking to logs for CI runs
- The team that needs to look at the failure may not look for a week or two as the ticket goes through their process
- By the time the team looks at the ticket, the links to CI builds are dead and the investigation is pretty hard / not worth it at that point
- Wait for the process to repeat?
I propose that we experiment with looking at seemingly flaky / failing browser tests centrally in the Jenkins logs.
I would hope that:
- We can catch "trending" issues before people would normally report them
- We can look at issues with multiple runs and logs being automatically provided to us (instead of waiting for people to report more failures in Phabricator)
- We diagnose and fix the issues faster
- Everyone is happier, and we reduce the painful loops that I mention above.
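
To make the "trending" idea a bit more concrete, here is a minimal sketch of how failures could be grouped once test results have been pulled out of the logs. It assumes the results are already available as `(build_id, test_name, passed)` tuples. That input shape, the function name `trending_failures`, and the `min_builds` threshold are all hypothetical, and the sketch does not cover how the data would actually be extracted from Jenkins.

```python
from collections import defaultdict


def trending_failures(results, min_builds=2):
    """Return tests that failed in at least `min_builds` distinct builds.

    `results` is an iterable of (build_id, test_name, passed) tuples.
    This input shape is an assumption for illustration only.
    """
    failed_builds = defaultdict(set)
    for build_id, test_name, passed in results:
        if not passed:
            failed_builds[test_name].add(build_id)

    # Keep only tests that failed across enough separate builds.
    trending = {
        name: sorted(builds)
        for name, builds in failed_builds.items()
        if len(builds) >= min_builds
    }

    # Most frequently failing tests first, ties broken by test name.
    return sorted(trending.items(), key=lambda item: (-len(item[1]), item[0]))
```

Each entry in the returned list carries the builds in which the test failed, so whoever picks up the issue starts with several runs to compare rather than a single dead link.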