
EPIC: Detect and prevent UI regressions
Closed, Declined · Public

Description

Problem

Various bugs hit our users in production due to UI regressions:

  • T108641 - edit and watch icons shifted to the right
  • T108642 - toast messages lost centering
  • T159009 - a change to the UI header broke the language overlay

These are difficult to detect with unit tests or browser tests and currently rely on manual testing and a user reporting the bug.

We believe that by detecting UI regressions programmatically we can avoid giving our users bad experiences.

How

Various tools exist that allow taking screenshots at certain points in the workflow. These can be done as part of existing browser tests or as standalone tests.

Using image diff comparison tools we can detect and report when UIs change.
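For a sense of what "image diff comparison" amounts to, here is a minimal Node sketch; it assumes the pixelmatch and pngjs npm packages purely for illustration, and any comparable diff library would do:

```
const fs = require('fs');
const { PNG } = require('pngjs');
const pixelmatch = require('pixelmatch');

// Read the baseline and the freshly captured screenshot (must be the same size).
const before = PNG.sync.read(fs.readFileSync('before.png'));
const after = PNG.sync.read(fs.readFileSync('after.png'));
const { width, height } = before;
const diff = new PNG({ width, height });

// pixelmatch returns the number of mismatched pixels; threshold tunes per-pixel sensitivity.
const mismatched = pixelmatch(before.data, after.data, diff.data, width, height, { threshold: 0.1 });

// Write a visual diff image and fail loudly if anything changed.
fs.writeFileSync('diff.png', PNG.sync.write(diff));
if (mismatched > 0) {
  console.error(`UI changed: ${mismatched} pixels differ, see diff.png`);
  process.exit(1);
}
```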

Open questions

  • At what level do we test? Component level? In the middle of Selenium browser tests?
  • When do we run? Daily or per commit?
  • Where do we run? (against beta cluster/vagrant/reading web staging)

Tooling

Plan

  • Explore currently available tools that claim to help detect UI regressions (e.g. via screenshots) and pick one that looks like the best fit
  • ?

Related Objects

Event Timeline

NHarateh_WMF renamed this task from Spike: Explore using webdrivercss for css regression testing to Spike: [8h] Explore using webdrivercss for css regression testing. May 4 2017, 5:38 PM
Jdlrobson renamed this task from Spike: [8h] Explore using webdrivercss for css regression testing to Spike: [8h] Explore using tools (e.g. webdrivercss) for css regression testing. May 4 2017, 5:40 PM

@zeljkofilipin I'd certainly be interested in this session too.

@Volker_E I will be more than glad to pair on webdriverio/webdrivercss any time during the hackathon, and of course during the official session. (I am not sure when the session will be scheduled.)

https://www.mediawiki.org/wiki/Parsing/Visual_Diff_Testing (and https://www.mediawiki.org/wiki/Parsoid/Visual_Diffs_Testing) might also be relevant. We can run mass visual diff tests. https://github.com/wikimedia/integration-visualdiff is the code used for doing that; it is based on PhantomJS and uses https://github.com/wikimedia/integration-uprightdiff as the diff engine, but it can also run with node-resemble (not tested in a while), or the code can be updated to plug in other diff engines. This may be a bit heavyweight for your needs, but I thought I would put it out there.

It seems like Selenium browser tests would be the ideal place to do this. I see it already takes screenshots (which show up when there are browser test failures). I'm wondering if there's some way to store them and do a visual diff of new screenshots vs old.

Ideally, we'd want a non-voting job on Jenkins which alerts the user that certain UIs have changed since the last run.
Failing that, a daily job similar to the existing Jenkins browser test jobs that run against the beta cluster.

I'm going to start poking around at this tomorrow so if anyone has thought about this before I'd appreciate a brain dump/links.

All of these projects have been active this year:

CasperJS + PhantomCSS
Uses PhantomJS.
We'd have to write separate tests that check parts of the workflow. I worry about duplicating the existing Selenium browser tests by doing this, so it doesn't feel like the right solution.

BackstopJS
High number of forks/stars/watches. Requires CasperJS.

Wraith
High number of forks/stars/watches; also uses CasperJS.

Spectre
Ruby.
Looks like it could be integrated into Selenium.
Interesting motivations (https://medium.com/friday-people/how-we-do-visual-regression-testing-af63fa8b8eb1): "Separate tools for taking screenshots and reporting differences" really resonates with me. It would be good to be able to feed in screenshots from different places, especially given we might move from Ruby Selenium browser tests to Node.
Basically, whatever the runner is (e.g. Selenium) submits a screenshot to an API. We'd then use the Spectre tool (Ruby on Rails) to monitor changes (so it'd be another tool to check daily).
Lower number of forks/stars/watches.

Visual Review
Low number of forks/stars/watches. Similar idea to Spectre but built for Protractor so probably not the right fit.

webdrivercss
Inspired by PhantomCSS. Fits neatly into Selenium tests: basically, at any point in the test you take a screenshot.
Would need the browser tests to be written in Node.
Provided our existing Selenium jobs have the right storage capabilities, only subsequent builds would fail; given we run daily, this means we'd trap any unwanted regression a day after it lands. Can specify a misMatchTolerance, so we can allow for a certain margin of error or none at all.
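For illustration, here is roughly how a webdrivercss check reads, reconstructed from memory of its README; the URL, selector, and tolerance are made up rather than taken from any existing suite:

```
var assert = require('assert');
var webdriverio = require('webdriverio');
var webdrivercss = require('webdrivercss');

// Plug the plugin into a plain WebdriverIO client.
var client = webdriverio.remote({ desiredCapabilities: { browserName: 'firefox' } });
webdrivercss.init(client, { misMatchTolerance: 0.05 }); // percent of pixels allowed to differ

client
  .init()
  .url('https://en.m.wikipedia.org/wiki/Special:BlankPage')
  // Screenshot just the named element and diff it against the stored baseline.
  .webdrivercss('blankpage', [{
    name: 'header',   // baseline image name
    elem: '.header'   // illustrative selector, not from our suite
  }], function (err, res) {
    assert.ifError(err);
    // Fails the run if the new screenshot drifted past the mismatch tolerance.
    assert.ok(res.header[0].isWithinMisMatchTolerance);
  })
  .end();
```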

Huxley is archived so let's not consider that anymore.

At this point webdrivercss or Spectre seem like things that are worthy of exploration given our current stack.
We have extensive browser tests so taking screenshots during the test flows (maybe Then step) and uploading them somewhere seems like a good idea.

Selenium already takes screenshots when browser tests fail, so I'm also wondering whether that means it already takes screenshots for all steps, in which case we could just do some visual diffing on those. @zeljkofilipin any idea?
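If we do end up in Node/WebdriverIO, a minimal sketch of capturing a screenshot after every step could live in wdio.conf.js; the hook name is from WebdriverIO v4, and the directory and naming scheme are just illustrative:

```
const fs = require('fs');

exports.config = {
  // ...existing WebdriverIO configuration...

  // Runs after every test, not only on failure, so screenshots are always
  // available for later diffing.
  afterTest: function (test) {
    const dir = './screenshots';
    if (!fs.existsSync(dir)) {
      fs.mkdirSync(dir);
    }
    // saveScreenshot writes the current page to disk.
    browser.saveScreenshot(`${dir}/${test.parent}-${test.title}.png`);
  }
};
```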

Change 353902 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[mediawiki/extensions/RelatedArticles@master] WIP: Webdriver CSS regression test

https://gerrit.wikimedia.org/r/353902

This is really cool! FWIW, I sent an email (subject: "UI testing") about some of these options I had explored for fun a while ago. I don't have as much experience in this domain as you, so perhaps it won't be super useful. However, this quarter I'm continuing on with Gemini GUI which wraps Selenium (or SauceLabs or BrowserStack) for the wikimedia-page-library's screenshot tests. I hope Selenium will add Chrome headless support soon (right now I just use xvfb).

One other note from Android's experience in screenshot testing, I recommend versioning screenshots in a separate repo to keep the code repo size down. I also recommend adding a Git configuration for diffing images since Grrrit isn't always grrreat at it.

Not sure how I missed this but the future of webdrivercss is not too certain
At top of readme: "Note: WebdriverCSS isn't yet compatible with WebdriverIO v3.0 and is no longer maintained."
(https://github.com/webdriverio/webdrivercss/issues/178)

I'm going to take a look at https://github.com/zinserjan/wdio-visual-regression-service
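For reference, wiring the service into wdio.conf.js looks roughly like this (paraphrased from the project's README; the paths and tolerance are placeholders):

```
const path = require('path');
const VisualRegressionCompare = require('wdio-visual-regression-service/compare');

// Name screenshots after the test and browser so baselines stay stable across runs.
function getScreenshotName(basePath) {
  return function (context) {
    return path.join(basePath, `${context.test.title}_${context.browser.name}.png`);
  };
}

exports.config = {
  // ...existing configuration...
  services: ['visual-regression'],
  visualRegression: {
    compare: new VisualRegressionCompare.LocalCompare({
      referenceName: getScreenshotName(path.join(process.cwd(), 'screenshots/reference')),
      screenshotName: getScreenshotName(path.join(process.cwd(), 'screenshots/screen')),
      diffName: getScreenshotName(path.join(process.cwd(), 'screenshots/diff')),
      misMatchTolerance: 0.01
    })
  }
};

// Inside a test the service then provides commands such as:
//   browser.checkViewport();
//   browser.checkElement('.header');
```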

It looks like you're actively working on this @Jdlrobson!

Change 353993 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[mediawiki/core@master] Add support for UI regression testing

https://gerrit.wikimedia.org/r/353993

Jdlrobson removed Jdlrobson as the assignee of this task. Edited May 16 2017, 1:24 PM

I purposely didn't assign this to myself (and said so in the standup threads) as I want to encourage conversation from the entire team, and I am concerned that assigning it suggests I am going to work on this alone.

As a strawman proposal, it seems wdio-visual-regression-service could easily be integrated into the Node browser tests. I've submitted patches to demonstrate, but so far I can only get them working locally.

I'd really appreciate some conversation related to my thoughts above and reactions to the POC patches especially from @pmiazga @phuedx @Jhernandez @bmansurov @zeljkofilipin

A few thoughts off the back of this work:

  • The downside of doing any UI regression tests is that they are slow, so if we did adopt them we'd have to use them sparingly or accept slower tests.
  • It's not clear how to deal with expected failures (e.g. redesigns). If a browser test relating to a UI regression fails do we force merge anyway?

/me sits back and awaits feedback.

I second @Niedzielski's suggestion, as Gemini seems to support multiple browsers, which is important for UI testing. It also supports Selenium and BrowserStack, which is great as we don't have to deal with maintaining multiple browser environments.

Wraith claims to support multiple browsers too, but the link to its selenium fork is dead.

When looking at this before, I saw https://open.segment.com/niffy, which is a perceptual diffing suite.

It runs tests, compares screenshots, fails if they differ, and creates a PNG with transparent overlays.

In particular, this solution:

  • It is for comparing two live websites (no screenshot storing), so maybe staging or test vs prod, or localhost vs prod.
    • You can take actions when visiting a URL (clicking, hovering, waiting for elements, etc.)
  • The driven browser is based on Electron, so it is not a cross-browser verification suite like what webdriver.io could be used for.

@ssastry integration-visualdiff is for comparing two urls as loaded in the browser, right? No way to take actions before screenshotting (like clicking on things, filling inputs, etc)


I think the Vienna session will be a good sync on this topic, and maybe we can get a proof of concept working.

I think there are a few questions to clarify before tackling any specific solutions, so we are clear on what we want to accomplish. Depending on the answers, the solution may change drastically or even turn out to be several different things.

  • What do we want to avoid by adding regression testing? What will highlighting visual differences help us avoid/improve?
    • We want to catch and avoid unexpected changes to the UI
      • Generally, or as a cross-browser test suite? (cross-browser UI changes?)
  • How do we differentiate unexpected vs expected? Is this a human process?
    • Differentiating unexpected UI changes from intended ones like design changes.
  • At which level do we want to catch differences?
    • At the component level: something like screenshotting standalone UI elements
      • Most reliable and reproducible. Less realistic; can miss diffs that result from multiple components interacting.
    • At the project level: something like screenshotting pages per patch on a project's CI job
      • Less reliable, but more realistic. Will catch in-project interactions that create diffs, but not inter-project diffs.
    • At the deployment level: something like screenshotting pages on a regular schedule between deployments (local, staging, production)
      • Least reliable and reproducible. Most realistic. Will catch inter-project interactions that create diffs (imagine extensions breaking each other, or core influencing changes in extensions) and everything above, but will produce the highest number of false positives.

Then, considering the solution space:

  • Do we compare between two live versions or a live version vs past snapshots?
    • Live version vs live version:
      • Gives you more realism, comparing for example:
        • Beta cluster or test against the production wiki to catch regressions before production
        • Localhost against latest master live (beta cluster)
      • May trigger more false positives and more noise/work
        • Things like central notice banners, or different extensions/content between live wikis may create diffs that are not really relevant
    • Stored snapshots vs live:
      • Probably fewer false positives than comparing live versions
      • Faster, half the work is already stored
      • Screenshots may be out of date and highlight differences that are not relevant
      • You need to think about where to store the screenshots and how to update them
  • Should this be an automated process (machine rejects changes) or a more manual one (diffs are shown or sent to humans for evaluation)?
  • Is it necessary to reuse existing browser tests or is this a different thing?
    • During a browser test you may want to take screenshots at different intermediate points that are not important or relevant to the test itself.
    • Doing it separately may let the screenshot tests get out of sync with the browser tests, but they would fail the same way the browser tests fail when outdated.

Maybe we can use these questions, and others we think of, for the Vienna browser testing session and capture the thoughts there.


My take would be to scope the smallest project that gives us the most benefit: something like a daily job that runs some scenarios, screenshots the beta cluster against production, and emails a report to the QA alerts list if there are differences. Then evaluate where to go from there.
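A rough sketch of what such a daily job could boil down to, assuming WebdriverIO standalone plus the pixelmatch/pngjs packages; the URLs and viewport are illustrative only:

```
const fs = require('fs');
const webdriverio = require('webdriverio');
const { PNG } = require('pngjs');
const pixelmatch = require('pixelmatch');

const client = webdriverio.remote({ desiredCapabilities: { browserName: 'chrome' } });

client
  .init()
  .setViewportSize({ width: 1280, height: 900 })
  .url('https://en.wikipedia.beta.wmflabs.org/wiki/Main_Page')
  .saveScreenshot('beta.png')
  .url('https://en.wikipedia.org/wiki/Main_Page')
  .saveScreenshot('prod.png')
  .end()
  .then(() => {
    // Assumes both screenshots come out the same size; a real job would crop or pad first.
    const beta = PNG.sync.read(fs.readFileSync('beta.png'));
    const prod = PNG.sync.read(fs.readFileSync('prod.png'));
    const { width, height } = beta;
    const diff = new PNG({ width, height });
    const mismatched = pixelmatch(beta.data, prod.data, diff.data, width, height, { threshold: 0.1 });
    fs.writeFileSync('diff.png', PNG.sync.write(diff));
    // A daily Jenkins job could mail this (plus diff.png) to the QA alerts list.
    console.log(mismatched > 0 ? `${mismatched} pixels differ` : 'no visual differences');
  });
```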


@Niedzielski can you expand on the separate repo for storing screenshots method, and how you keep that repo updated?

Thanks @Jhernandez. The task, as vaguely written, was about exploring the tooling, and my conclusion is that many tools exist and many are suitable, so at this point it really is a case of deciding what we want to solve for.

I think running a standalone job or running on every commit as part of the existing jobs are both doable and require similar amounts of effort. The Gemini approach could involve setting up a Labs instance that hits the beta cluster, for instance, but it can't run within the existing Selenium tests.

I propose that we turn this card into an epic (to keep subscribers and conversation in one place) and that we talk in person to align on the following:

  • At what level do we test?
  • When do we run? Daily or per commit?
  • Where do we run? (beta cluster/vagrant/reading web staging)

Personally, I'm thinking steps within scenarios, on a per-commit basis.
My own hope is to avoid unexpected UI changes, particularly in non-obvious parts of the interface (e.g. if we are making a change in the editor it may not be obvious that we might cause a CSS regression in the talk overlay). If that happens, we should report it in some way so that we can act accordingly (revert/fix/note it down in a ticket).

I'm thus most interested in warning devs at the commit level that they might have caused a UI regression. For me, this means taking screenshots at certain steps inside browser tests. I don't see as much value in checking elements such as the chrome header, as we pick those types of regressions up promptly. The examples in the description would not have been helped by testing at the higher component level.

Having a separate tool that runs against the beta cluster concerns me as it's a shared environment, and I think there would be more false positives, as you say, from CentralNotice/QuickSurveys. I'm also wary of adding more complexity to our CI stack and providing yet another place to look. Gemini, at a glance, looks like it would require writing a new test suite separate from our Selenium test suite, for example.

@ssastry integration-visualdiff is for comparing two urls as loaded in the browser, right? No way to take actions before screenshotting (like clicking on things, filling inputs, etc)

You can inject CSS/JS before screenshotting. Ex: https://github.com/wikimedia/integration-visualdiff/blob/master/lib/parsoid.postprocess.js ... So, you can take actions as long as you can represent them in the JS code.

can you expand on the separate repo for storing screenshots method, and how you keep that repo updated?

The Android team currently _doesn't_ use a separate repo for verified test result images, which has caused our code repo to become bloated (T156887). I recommend just using a Git submodule and only pulling it down when you want to run the tests. If the repo becomes cumbersomely large or useless, you can easily blow it away or choose to only pull it down for non-gating jobs.

Versioning files also makes image diffs work like code diffs. For example, if I change MobileFormatter.php, git status reports the change and git diff shows the actual difference. I might then wonder whether my change had any visual impact, so I can run the screenshot tests. If the tests pass, I know that the screenshots are within tolerance. If you're lucky, the test screenshot results will also be byte-for-byte identical and git status will show no file changes. However, it's easy for jitter to occur when running tests on different platforms, so if you really want to see the differences you can use either Gemini GUI or Git itself. So you get to leverage Git in a way that coders are pretty comfortable with, but you have to remember that results are not necessarily byte-for-byte identical and that only failing changes require updated screenshots. Likewise, the Gerrit UI shows a percentage difference in the patch overview, which I think comes from Git. I assume it's accurate but haven't tried updating my local Git config. Unfortunately, although this percentage difference is useful, the updated Grrrit no longer shows the images themselves (T153085) and never showed their differences, so you really need to do the patch review of the committed images client side.

The screenshots in the Android repo are at the unit level and updated when they fail, either indirectly because we upgraded an app library or subcomponent or directly because we're iterating on a component. However, the Android tests require an emulator to be launched which is very slow so they only run every hour when a new commit is available. If a contributor misses a test failure, this means bad code can be committed or screenshots can become outdated and both require a separate commit to fix. Test failures are reported on our IRC channel.

In the past, we had an automation that also took screenshots of the whole app in many configurations. At the time, we didn't have great diffing tools but even if we did, these tests were super flaky, slow, and unsustainable. I recommend focusing on the unit.

For the page library, I'm planning to use screenshot tests to detect mostly CSS and layout regressions at the unit level per transform. For example, we're adding a dark mode transform and there's some really basic stuff like "the background of the header text should be dark and the text should be light", "transparent images should actually be dark on a white background", and "math formulas should be light on a dark background". Another example is our table collapse transform, which has a strong dependency on interaction. The screenshot tests are more about appearance than DOM state, so we'll probably verify that both collapsed and expanded states appear correctly but won't be making any assertions about the DOM state after an interaction, which our conventional unit tests cover. We also have some miscellaneous CSS that isn't transform specific. We're planning on breaking this up into components like "list items", "table of contents", "math stuff", etc., and targeting our screenshot tests on each unit accordingly (kind of like writing a laundry list of items to check). At least that's what I think. This is my professional development project this quarter and this past Friday is the first I've looked at it :]

Gemini GUI is particularly nice because it gives you a visual diff UI to pass/fail each screenshot with a button press when you're initially inspecting a new or updated test's screenshots. This also means that folks who are unable to configure Git for images have another super easy option for checking test results. For Android, we don't have any nice UI, so the contributor has to move all the test result screenshots to the versioned originals directory manually after they've signed off on the differences. The Gemini API is pretty focused and minimal and just kind of wraps Selenium or whatever is actually driving the tests under the covers. Here's an example from a toy project last year of taking screenshots on a variety of screen sizes in a generic way.
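To make that concrete, a Gemini suite has roughly this shape (the URL, selectors, and suite name below are made up for illustration, not taken from the page library):

```
// A .gemini.js suite file; `gemini` is provided as a global by the test runner.
gemini.suite('table-collapse-transform', function (suite) {
  suite
    .setUrl('/test-fixtures/table.html')        // illustrative fixture page
    .setCaptureElements('.collapsible-table')   // screenshot just this element
    .capture('collapsed')
    .capture('expanded', function (actions, find) {
      // Interact before capturing: expand the table, then screenshot it again.
      actions.click(find('.collapsible-table .toggle'));
    });
});
```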

I was originally thinking of just focusing on ChromeDriver, but I think I'd pretty quickly want to test other browsers. Selenium was particularly nice if only because you can do a selenium-standalone install and _download all the browsers automatically_ (kind of NPM-ish) and then start your server for all browsers with just selenium-standalone start. As I mentioned earlier, I'm hoping that Selenium adds Chrome (and eventually Firefox) headless support, but in the meantime I just run xvfb-run selenium-standalone start.

I think that picking a tool for screenshot regression testing is a tough assignment, that any tool picked from those presently available is likely to disappoint in some way, and that it will be less useful if the team doesn't really embrace it.

Jdlrobson renamed this task from Spike: [8h] Explore using tools (e.g. webdrivercss) for css regression testing to EPIC: Detect and prevent UI regressions. May 19 2017, 12:01 PM
Jdlrobson edited projects, added Epic; removed Patch-For-Review, Spike.
Jdlrobson updated the task description. (Show Details)

I see it scheduled for Sunday, Room Heuriger, 10-13.

Can this session be moved to an hour later? 11-14 h instead of 10-13 h.

10-11 h on Sunday is T165729: Building Better Software (Hack-a-thon session) and my guess is that I am not the only one that would like to go to both sessions.

Writing a bunch of new tests outside Selenium seems like a bad idea to me. I'd like to revisit this when we've ported our Selenium tests to Node (T162256).
@zeljkofilipin I'm in Heuriger and it would be good to sync on this within the next hour if you are free.

@Niedzielski Thanks for the thoughtful reply. Good info about the git repo, and the level of abstraction for the tests.

We may have some different requirements for web, given the cascading nature of CSS on a page. Units may be useful, but whole-page screenshots would prevent the most regressions for us, given how the code is structured.

I'm still unsure how we would update the screenshots. When would that happen? In Jenkins post-merge? Manually?

@Jhernandez, on Android, screenshots are updated manually by the developer at commit time when they intentionally change code that updates the presentation. So a commit contains a code submission and some updated screenshots. Here's an example Gerrit patch where a code change in NavTabView.java required screenshot updates and it was all pushed up together in one ideal patch.

A more regular case for whole-page screenshots will be the tests breaking independently of any code change made by the team. This requires a screenshot-only update commit, if the change is acceptable. Here's kind of an example of that. (Only kind of, because this was actually due to one of the team's devs' changes but was only caught asynchronously of the breaking change.)

The tests run (and break) in Jenkins CI automatically but screenshot updates are local and manual. Does that make sense? If it's helpful, we can chat in a Hangout about this.

@zeljkofilipin I'm in Heuriger and it would be good to sync on this within the next hour if you are free.

Sorry for the late reply, I am still catching up. The morning session ended up taking the entire morning and I saw this late in the afternoon. I will take a look at the commits in gerrit.

Change 353902 abandoned by Jdlrobson:
WIP: UI regression tests using wdio-visual-regression-service

Reason:
POC

https://gerrit.wikimedia.org/r/353902

Change 353993 abandoned by Jdlrobson:
Add support for UI regression testing

https://gerrit.wikimedia.org/r/353993

Aklapper removed a subscriber: Nirzar.

This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decrease task cookie-licking and to get a slightly more realistic overview of plans. Please feel free to assign this task to yourself again if you still realistically work or plan to work on this task - it would be welcome!

For tips on how to manage individual work in Phabricator (noisy notifications, lists of tasks, etc.), see https://phabricator.wikimedia.org/T228575#6237124 for available options.
(For the records, two emails were sent to assignee addresses before resetting assignees. See T228575 for more info and for potential feedback. Thanks!)

Closing out this old task for now. I think this should be fixed at the component layer (WVUI) and by the Vue.js migration team as we move towards a component-based UI.