We seem to be constantly fighting a battle against flaky browser tests.
We go through a couple of prolonged cycles in this area, which just make the situation worse all around:
**Assuming they are random, and rechecking**
- My CI run fails for a seemingly flaky reason
- I type `recheck`, re-running ALL jobs (not only the failed one), probably expending 30-60 minutes' worth of execution time on CI nodes
- The `recheck` turns the build green, so the issue never gets flagged, investigated, or fixed
**Failing to report issues & save needed details**
- Someone reports a seemingly flaky browser test in CI at a high level, linking to the logs of the CI runs
- The team that needs to look at the failure may not do so for a week or two, as the ticket goes through their process
- By the time the team looks at the ticket, the links to the CI builds are dead and the investigation is hard / not worth the effort at that point
- Wait for the process to repeat
I propose that we experiment with looking at seemingly flaky / failing browser tests centrally, in the Jenkins logs.
I would hope that:
- We can catch "trending" issues before people would normally report them
- We can look at issues with multiple runs and logs being automatically provided to us (instead of waiting for people to report more failures in Phabricator)
- We diagnose and fix the issues faster
- Everyone is happier, and we reduce the painful loops mentioned above.
**Collection of data**
```
ssh contint2001.wikimedia.org
# Concatenate the logs of all selenium jobs (skipping apache logs), pull out
# the failure lines (marked with ✖), and count duplicates, most frequent first
ls /srv/jenkins/builds/*-selenium-*/*/log | grep -v apache | xargs cat | grep ✖ | sort | uniq -c | sort -nr
```
And then, in an editor, replace `\s+(\d+).*(✖.*)\n` with `$1, $2\n` to turn the counted lines into `count, failure` rows.
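That manual find/replace step could likely be folded into the pipeline itself; a sketch with GNU `sed` (the exact `uniq -c` spacing is an assumption on my part):

```shell
# Convert each "   <count> ✖ <test name>" line from `uniq -c` into a
# "<count>, ✖ <test name>" row, so the output can be pasted as CSV
ls /srv/jenkins/builds/*-selenium-*/*/log | grep -v apache \
  | xargs cat | grep ✖ | sort | uniq -c | sort -nr \
  | sed -E 's/^ *([0-9]+) +(✖.*)$/\1, \2/'
```

This keeps the whole collection step as a single command, at the cost of depending on GNU sed's `-E` flag being available on the host.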