We seem to be constantly fighting a battle against flaky browser tests.
We go through a few prolonged cycles in this area, which just makes the situation worse all around:
Assuming failures are random, and rechecking:
- My CI run fails for a seemingly flaky reason
- I type "recheck", re-running ALL jobs (not only the failed one), probably expending 30-60 minutes' worth of execution time on CI nodes
- The recheck turns the run green, so the issue doesn't get flagged up, investigated, and fixed
Failing to report issues & save the needed details:
- Someone reports a seemingly flaky browser test in CI at a high level, linking to the logs for the CI runs
- The team that needs to look at the failure may not do so for a week or two as the ticket goes through their process
- By the time the team looks at the ticket, the links to the CI builds are dead and the investigation is hard enough that it's often not worth it at that point
- We wait for the process to repeat
I propose that we experiment with looking at seemingly flaky / failing browser tests centrally, via the Jenkins logs.
I would hope that:
- We can catch "trending" issues before people would normally report them
- We can look at issues with multiple runs and logs automatically provided to us (instead of waiting for people to report more failures in Phabricator)
- We can diagnose and fix the issues faster
- Everyone is happier, and we reduce the painful loops mentioned above.
Collection of data:
(Attempt to do this weekly?)
ssh contint2001.wikimedia.org 'find /srv/jenkins/builds/*-selenium-*/*/log -maxdepth 1 -mtime -7 | grep -v apache | xargs cat | grep β | sort | uniq -c | sort -nr'
(Note the quoting: the whole pipeline has to run on the remote host, since the log paths only exist there.)
Then replace \s+(\d+).*(β.*)\n? with $1, $2\n to turn the counted lines into "count, failure" rows.
Then add the data, together with the current date, to https://docs.google.com/spreadsheets/d/1zllaM9T7RxOF29zu5SN0KM2eqtNdWfqfnXkOYE3XpUo/edit#gid=787369492
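The manual steps above could be wrapped in a small script. This is a rough sketch only: it assumes SSH access to contint2001.wikimedia.org and that failing test lines in the Jenkins logs carry the β marker verbatim; the to_csv and collect_week names are hypothetical.

```shell
#!/bin/bash
set -euo pipefail

# Automate the manual regex step: turn `sort | uniq -c` output such as
# "     12 ... β Some test failed" into CSV rows "12, β Some test failed".
to_csv() {
  sed -E 's/^[[:space:]]*([0-9]+).*(β.*)$/\1, \2/'
}

# Hypothetical wrapper around the one-liner above; the whole pipeline runs
# on the remote host, since the log paths only resolve there.
collect_week() {
  ssh contint2001.wikimedia.org \
    'find /srv/jenkins/builds/*-selenium-*/*/log -maxdepth 1 -mtime -7 \
       | grep -v apache | xargs cat | grep β | sort | uniq -c | sort -nr' \
    | to_csv
}

# Example of the transform on a sample count line:
echo '     12 noise β TestFoo failed' | to_csv
# prints: 12, β TestFoo failed
```

Prefixing each row with the date (e.g. piping through sed "s/^/$(date +%F), /") would also save the last manual step before pasting into the spreadsheet.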