Goal definition
Specific: What do we want to achieve?
Automated UI regression testing that alerts us when regressions occur.
Measurable: How will we know when we've reached our goal?
We will know we're done when the team is notified about our first UI regression (either an error introduced intentionally to prove that it works, or an accidental one).
Achievable: What support will we need to achieve our goal?
Should be answered by T304634.
Relevant: Is this goal worthwhile?
Yes. Automated UI regression testing will catch many more important issues than manual testing alone, which will reduce both regressions and the friction between teams caused by bugs.
Background
There is a complex interplay between MediaWiki core code, skins, gadgets, extensions, and other customizations that can make it difficult to anticipate the full effect of a change to code maintained by the Foundation. Some gadgets and CommonJS snippets live outside of our codebases but are tightly coupled to the behaviour of code in MediaWiki. As a result, seemingly innocuous changes sometimes cause visible problems once they are rolled out to all the wikis, despite due diligence and manual testing before release.
Hypothesis
At some point, given the number of people contributing code to MediaWiki and the complexity of its customization mechanisms, we need mechanisms other than due diligence and manual testing to ensure that we aren't unknowingly breaking significant functionality. The proposal in this document is predicated on this hypothesis being true. One method that might help us detect visual regressions earlier and more efficiently is automated visual regression testing.
Technical Problems to Solve
At its most basic, visual regression testing compares a specified version (the "test version", likely a release candidate or a local git checkout with a feature under development) against a baseline version (likely a previous release candidate or whatever is running in production). It runs through a set of scenarios on both and reports any visual differences spotted between the baseline and the test version. While this basic premise is simple enough, some details deserve more attention.
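As a rough illustration of that basic premise, the sketch below compares baseline and test screenshots scenario by scenario and reports which scenarios changed. It is a simplified model, not a proposed implementation: screenshots are assumed to be already decoded into grids of (r, g, b) pixels, and the names `ScenarioResult`, `count_changed_pixels` and `run_comparison` are hypothetical.

```python
from dataclasses import dataclass

# Simplified model: a screenshot is a grid of (r, g, b) pixels. A real
# harness would decode these from the PNG captures of each scenario.
Pixel = tuple[int, int, int]
Screenshot = list[list[Pixel]]


@dataclass
class ScenarioResult:
    scenario: str
    changed_pixels: int

    @property
    def passed(self) -> bool:
        return self.changed_pixels == 0


def count_changed_pixels(baseline: Screenshot, test: Screenshot) -> int:
    """Count pixels that differ; if the sizes differ, treat the whole page as changed."""
    same_size = len(baseline) == len(test) and all(
        len(b_row) == len(t_row) for b_row, t_row in zip(baseline, test)
    )
    if not same_size:
        return max(sum(map(len, baseline)), sum(map(len, test)))
    return sum(
        1
        for b_row, t_row in zip(baseline, test)
        for b_px, t_px in zip(b_row, t_row)
        if b_px != t_px
    )


def run_comparison(
    baseline_shots: dict[str, Screenshot], test_shots: dict[str, Screenshot]
) -> list[ScenarioResult]:
    """Compare the test version with the baseline, one scenario at a time.

    Raises KeyError if a baseline scenario was not captured for the test version.
    """
    return [
        ScenarioResult(name, count_changed_pixels(baseline_shots[name], test_shots[name]))
        for name in sorted(baseline_shots)
    ]


if __name__ == "__main__":
    white, black = (255, 255, 255), (0, 0, 0)
    baseline = {"Main_Page": [[white, white], [white, white]]}
    test = {"Main_Page": [[white, black], [white, white]]}
    for result in run_comparison(baseline, test):
        print(result.scenario, result.changed_pixels, "pass" if result.passed else "FAIL")
```

Exact pixel equality is deliberately naive here; it is exactly what produces the noise problem discussed under Reducing Noise.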
Running at Scale
The most important objective of a visual regression test is to assure whoever initiates it that their release is unlikely to break any features when it launches. To this end, the test needs to run at scale: potentially on hundreds of pages, depending on the desired scale and the type of test. A simple way to get started would be to use Chrome driven by Selenium WebDriver.
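Capturing hundreds of pages one after another would be slow, so the captures are worth parallelising. The sketch below fans the capture out over a thread pool using only the standard library; the browser itself is left behind a hypothetical `capture` function that takes a page URL and returns PNG bytes. It also assumes the wikis under test use short URLs of the form `/wiki/Page_title`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable
from urllib.parse import quote

# Hypothetical signature: takes a page URL, returns the screenshot as PNG bytes.
CaptureFn = Callable[[str], bytes]


def page_url(base_url: str, title: str) -> str:
    """Build a page URL, assuming short URLs of the form /wiki/Page_title."""
    return f"{base_url.rstrip('/')}/wiki/{quote(title.replace(' ', '_'))}"


def capture_all(
    base_url: str, titles: list[str], capture: CaptureFn, max_workers: int = 8
) -> dict[str, bytes]:
    """Capture every page concurrently; results are keyed by title.

    pool.map yields results in input order and re-raises a capture's
    exception when that capture's result is reached.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        shots = pool.map(lambda t: capture(page_url(base_url, t)), titles)
        return dict(zip(titles, shots))


def capture_pair(
    baseline_url: str, test_url: str, titles: list[str], capture: CaptureFn
) -> tuple[dict[str, bytes], dict[str, bytes]]:
    """Capture the same set of pages from the baseline and the test version."""
    return capture_all(baseline_url, titles, capture), capture_all(test_url, titles, capture)
```

With Selenium's Python bindings, `capture` could wrap `driver.get(url)` followed by `driver.get_screenshot_as_png()`. A WebDriver session should not be shared between threads, so each worker would need its own Chrome instance (for example, one per thread via `threading.local`).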
Reducing Noise
One problem often encountered when running visual regression tests at scale is noise. This can have many causes, the most common of which are (a) tolerances that are too low and (b) a lack of de-duplication of reported visual regressions. (a) occurs when small, almost imperceptible differences, such as in antialiasing or font kerning, end up being reported as visual regressions. Setting appropriate tolerances usually helps here, and using an image comparison library would be a first step towards doing so. (b) occurs when a single change causes a large number of regressions, sometimes producing hundreds of failures that are tiresome to sift through. A mechanism that de-duplicates these regressions and groups them by probable cause will make it significantly easier for whoever runs the test to understand the report and act on it.
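Both mitigations can be sketched in a few lines, assuming screenshots of equal size decoded into (r, g, b) grids. Two tolerances are applied: a per-channel threshold that ignores tiny colour shifts such as antialiasing, and a maximum fraction of changed pixels below which a page still passes. Failing pages are then grouped by the bounding box of the changed region, on the assumption that a broken shared component (a header or sidebar, say) changes the same region on every page. The threshold values and function names are illustrative; in practice a library such as Pillow (`ImageChops.difference`) would do the pixel work.

```python
from collections import defaultdict
from typing import Optional

Pixel = tuple[int, int, int]
Screenshot = list[list[Pixel]]
Box = tuple[int, int, int, int]  # (top, left, bottom, right), inclusive


def changed_coords(
    baseline: Screenshot, test: Screenshot, channel_tolerance: int
) -> list[tuple[int, int]]:
    """(y, x) of pixels where any colour channel moved by more than the tolerance.

    Assumes both screenshots have the same dimensions.
    """
    return [
        (y, x)
        for y, (b_row, t_row) in enumerate(zip(baseline, test))
        for x, (b_px, t_px) in enumerate(zip(b_row, t_row))
        if any(abs(b - t) > channel_tolerance for b, t in zip(b_px, t_px))
    ]


def regression_box(
    baseline: Screenshot,
    test: Screenshot,
    channel_tolerance: int = 16,
    max_changed_fraction: float = 0.001,
) -> Optional[Box]:
    """Bounding box of the changes, or None if they stay within both tolerances."""
    total = sum(len(row) for row in baseline)
    coords = changed_coords(baseline, test, channel_tolerance)
    if total == 0 or len(coords) / total <= max_changed_fraction:
        return None
    ys = [y for y, _ in coords]
    xs = [x for _, x in coords]
    return (min(ys), min(xs), max(ys), max(xs))


def group_regressions(
    pages: dict[str, tuple[Screenshot, Screenshot]],
    channel_tolerance: int = 16,
    max_changed_fraction: float = 0.001,
) -> dict[Box, list[str]]:
    """Group failing pages by the region that changed (a crude proxy for cause)."""
    groups = defaultdict(list)
    for title, (baseline, test) in sorted(pages.items()):
        box = regression_box(baseline, test, channel_tolerance, max_changed_fraction)
        if box is not None:
            groups[box].append(title)
    return dict(groups)
```

A report built on this would list one entry per group rather than one per page, so a broken header on 300 pages shows up as a single finding.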
Hermetic Tests
To keep the tests hermetic, a sufficiently complete environment, including production data and configuration, needs to be spun up for both the test and the baseline instances of MediaWiki and its extensions. Patchdemo might have ideas worth learning from.
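One way to make the hermeticity requirement checkable is to describe each instance explicitly and refuse to run if the baseline and test instances differ in anything other than code, since a visual difference could otherwise not be attributed to the code change alone. The sketch below assumes that data and configuration snapshots can be identified by some ID; `EnvironmentSpec` and its fields are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class EnvironmentSpec:
    """Hypothetical description of one MediaWiki instance spun up for a run."""
    mediawiki_ref: str                           # git commit of MediaWiki core
    extension_refs: tuple[tuple[str, str], ...]  # (extension or skin, git commit)
    data_snapshot: str                           # ID of the copied production data
    config_snapshot: str                         # ID of the copied production config


# Only code is allowed to differ between the baseline and the test instance.
CODE_FIELDS = frozenset({"mediawiki_ref", "extension_refs"})


def non_code_differences(baseline: EnvironmentSpec, test: EnvironmentSpec) -> list[str]:
    """Names of non-code fields whose values differ between the two instances."""
    return [
        f.name
        for f in fields(EnvironmentSpec)
        if f.name not in CODE_FIELDS
        and getattr(baseline, f.name) != getattr(test, f.name)
    ]


def check_hermetic_pair(baseline: EnvironmentSpec, test: EnvironmentSpec) -> None:
    """Raise if anything other than code differs between the two instances."""
    diffs = non_code_differences(baseline, test)
    if diffs:
        raise ValueError(f"baseline and test differ outside code: {diffs}")
```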
Production Data
We need to ensure that a sufficiently large corpus of production data is copied into the database used for the regression tests. This will need to include complete pages as well as the images and other media they display.
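To decide which media files have to come along with a sample of pages, one option is to scan each page's wikitext for embedded files. The sketch below is deliberately simplified: it only recognises direct `[[File:...]]` and `[[Image:...]]` embeds and ignores localized namespace names, `<gallery>` tags, and files pulled in through templates, so it would under-count on real pages. The pages and files themselves could then be loaded with MediaWiki's maintenance scripts (for example `importDump.php` and `importImages.php`).

```python
import re

# Direct embeds such as [[File:Name.jpg|thumb|Caption]] or [[Image:Name.png]].
# Links of the form [[:File:...]] only link to the file page and are not matched.
FILE_LINK = re.compile(
    r"\[\[\s*(?:File|Image)\s*:\s*([^|\]]+?)\s*(?:\||\]\])", re.IGNORECASE
)


def referenced_files(wikitext: str) -> set[str]:
    """Normalised names of files embedded directly in a page's wikitext."""
    names = set()
    for match in FILE_LINK.finditer(wikitext):
        name = match.group(1).replace("_", " ")
        # MediaWiki capitalises the first letter of titles by default ($wgCapitalLinks).
        names.add(name[:1].upper() + name[1:])
    return names


def media_to_copy(pages: dict[str, str]) -> set[str]:
    """Union of the files referenced by every page in the sampled corpus."""
    result: set[str] = set()
    for wikitext in pages.values():
        result |= referenced_files(wikitext)
    return result
```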
Production Config
To replicate the production environment as accurately as possible, the production LocalSettings.php and other configuration files will also need to be located and somehow replicated.
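Even a faithful copy of the production configuration needs a few settings changed so that the copy talks to the test database and serves from the test host. A minimal sketch, assuming the production settings have already been extracted into a name-to-value mapping (MediaWiki configuration is PHP, so that extraction step is itself an assumption): start from production, apply an explicit list of overrides, and report every deviation so it can be reviewed. `$wgServer`, `$wgDBserver` and `$wgDefaultSkin` are real MediaWiki settings; the values used here are placeholders.

```python
from typing import Any

_MISSING = object()  # sentinel: the setting is absent from production

# Settings that must point at the test environment instead of production.
LOCAL_OVERRIDES: dict[str, Any] = {
    "wgServer": "http://localhost:8080",
    "wgDBserver": "127.0.0.1",
}


def build_test_config(
    production: dict[str, Any], overrides: dict[str, Any]
) -> tuple[dict[str, Any], list[str]]:
    """Layer overrides on top of a copy of the production settings.

    Returns the merged settings and the sorted names of every setting whose
    value now deviates from production, so the deviations can be reviewed.
    """
    config = {**production, **overrides}
    deviations = sorted(
        name for name, value in overrides.items()
        if production.get(name, _MISSING) != value
    )
    return config, deviations
```

Keeping the override list short and explicit makes it easy to see how far the test environment has drifted from production.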