
Kick off regular (weekly?) visual diff runs comparing Parsoid rendering and default M/W rendering
Closed, Resolved · Public

Description

We already have the visualdiff code and the testreduce_vd database installed on scandium. A couple of years back, we occasionally ran visual diffs between Parsoid/JS and the default rendering and fixed Parsoid bugs. Now that we are aiming to make Parsoid the default on Wikimedia wikis for all wikitext use cases in the mid-to-late 2021 timeframe, it is time to kick off regular visual diff runs so that we can track progress towards that goal and start identifying any wikitext patterns that need to be linted.

After some trial runs, we may need to refresh the test pages to capture pages from a wide range of projects and namespaces. As the diffs shrink, test runs will also complete faster, since uprightdiff doesn't have to work as hard to identify diffs; at that point, we will be able to increase the size of the test set.
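For reference, a single uprightdiff comparison of two rendering screenshots might be wrapped roughly as below. This is a minimal sketch: the `--format json` flag and the exact argument order are assumptions about the uprightdiff CLI, not verified against its current options.

```python
import json
import subprocess

def uprightdiff_cmd(old_png, new_png, mask_png):
    """Build an uprightdiff invocation comparing two screenshots.

    Assumed interface: two input images, an output mask image, and a
    --format json flag that emits diff metrics on stdout (unverified).
    """
    return ["uprightdiff", "--format", "json", old_png, new_png, mask_png]

def run_uprightdiff(old_png, new_png, mask_png):
    """Run the comparison and parse the JSON metrics from stdout."""
    result = subprocess.run(
        uprightdiff_cmd(old_png, new_png, mask_png),
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)
```

A per-page diff score from output like this is what lets a run rank pages by how badly the two renderings disagree.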

Alternatively, we should explore whether to run these tests on parsing-qa-01.eqiad.wmflabs instead of scandium, since we can get both Parsoid/PHP rendering and the default rendering by hitting public APIs, and we probably don't care about running these diff tests against bleeding-edge Parsoid/PHP code, which is only accessible on scandium.

Event Timeline

Updated the visual diff repo to reflect the latest status of Parsoid and MediaWiki; see https://github.com/wikimedia/integration-visualdiff/commits?author=subbuss&since=2020-07-01&until=2020-07-11

I am still debugging some minor but annoying diffs that shouldn't be there. Once those are resolved, we can kick off mass test runs sometime next week.

parsing-qa-01 is now ready again with npm and node10 installed. The visualdiff repo is also now more up to date and uses Puppeteer instead of PhantomJS, yielding better diffs as a result. Next steps are to do some test runs, tweak settings, prepare a bigger test set (10K pages or so from a small subset of wikis), and take it from there.

Earlier this week, I got this test run going, and I am now making a bunch of tweaks so that baseline metrics are not artificially low / deflated.