
Set up mass visual diff testing with a custom install of mediawiki
Closed, Resolved (Public)

Description

We want to set up visual diff testing as a useful tool to answer 'what if' questions about wikitext changes, i.e. we want to be able to change wikitext parsing and then see how the resulting rendering compares to the standard rendering of the same pages on the production cluster.

This requires the following to happen:

  1. A custom mediawiki install that has the desired patches we want to test.
  2. Generate an XML dump (wikitext and all associated templates) for a corpus of revisions from the desired wikis.
  3. Use the XML dump to initialize the database of the mediawiki install.
  4. A visual differ configured to fetch the revisions from the custom mediawiki install and from the corresponding production wiki. (For a proper comparison, the screenshots should be taken after running JS on both fetched pages to expand all auto-collapsed boxes, sections, etc.)

We could potentially use labs instances to run these tests or acquire a dedicated server. In either case, this setup will need to be puppetized.

Event Timeline

ssastry raised the priority of this task to Medium.
ssastry updated the task description.
ssastry added subscribers: ssastry, tstarling.

It might also make sense to have Parsoid installed on this same server and use it for mass visual diff testing, but it is not essential for the purposes of T89331.

One solution would be to set up a labs instance for this purpose and update code there. We don't have to worry about puppetization or anything as long as we have scripts to import the necessary pages with templates .. or maybe even start with a one-time setup without worrying about a script.

Adding some (edited) discussion from IRC:

<subbu> TimStarling, next step is to configure the m/w <-> m/w visualdiffing ... i'll explore labs vm for that .. thanks to bd.808's reworking of the portal, I now see "Tool Labs provides access to replicas of Wikimedia databases" .. at first glance, it looks like that might come in handy.
<subbu> do you know any more about that?
<TimStarling> well, I understand it is basically the same as the old toolserver system
<TimStarling> I think there were a few technical changes, like maybe private data is edited out of the replication stream now instead of relying on user permissions
<subbu> i see. mostly, i am wondering if that might eliminate the need to do dumps of the wikitext and associated templates.
<subbu> and then do a import of those.
<TimStarling> the idea is to run mediawiki on top of it? I'm not sure if that has been tried
<subbu> yes.
<subbu> i don't fully know what i am talking about yet .. mostly talking aloud since I noticed that sentence on the portal just now.
<TimStarling> you will have a read-only view of the database, which could theoretically work...
<TimStarling> maybe if you only use api.php and disable the job queue
<TimStarling> there is JobQueueMemory, which will throw away jobs
<TimStarling> that way you don't have to worry about running them
<TimStarling> $wgJobTypeConf['default']['class'] = 'JobQueueMemory';
<TimStarling> and also set $wgReadOnly = 'read only' for good measure
<TimStarling> localisation cache... would need configuration
<TimStarling> MessageBlobStore has moved out of the DB now so that should be OK
<TimStarling> as long as memcached is enabled
<TimStarling> allowing page views without writing to the database is one of Aaron's goals for the multi-DC work
<TimStarling> in his plan it will ultimately fall back to contacting the master if some module tries to write
<TimStarling> in your idea, it will get a DB error and die
<TimStarling> which maybe is OK if it rarely happens
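
For reference, Tim's suggestions above would roughly translate into LocalSettings.php overrides like the following. This is only a sketch of the idea from the IRC discussion, not a tested configuration; the memcached address is an assumption about how such an instance would be set up.

<?php
// Sketch: run MediaWiki read-only on top of a replica database.

// Throw away jobs instead of trying to write them to the (read-only) DB.
$wgJobTypeConf['default']['class'] = 'JobQueueMemory';

// Refuse writes explicitly, for good measure.
$wgReadOnly = 'read only';

// Assumed: memcached is available, so caches (message blobs, etc.) stay out of the DB.
$wgMainCacheType = CACHE_MEMCACHED;
$wgMemCachedServers = [ '127.0.0.1:11211' ];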

The other thing to do is to pin the test pages to oldids rather than page titles (in the testreduce schema and code) so that we are comparing the same revision in the replica db (either via labs vm or otherwise) and the production db.
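
To illustrate the pinning idea in the replica-db scenario above: with tests keyed to revision ids, both renderings can be fetched by oldid so the exact same revision is compared. The hostname and revision id below are made up for the example.

<?php
// Sketch: fetch the same pinned revision from production and from the test install.
$oldid = 123456789; // hypothetical revision id stored in the testreduce db
$productionUrl = "https://en.wikipedia.org/w/index.php?oldid=$oldid";
$testUrl = "http://en-mw-base.wmflabs.org/w/index.php?oldid=$oldid";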

@yuvipanda said that the way to get database credentials for the replica servers is to create a tool and then copy the ~/replica.my.cnf file that is created for it. Anything in Labs can get to the servers, but the credentials files are only generated for tools these days for various reasons.

One way to get a MediaWiki stack setup is to use https://wikitech.wikimedia.org/wiki/Help:MediaWiki-Vagrant_in_Labs. I think all of the other Puppet roles for setting up MediaWiki (and maybe even LAMP in general) have been replaced with it.

I also just realized that our replicas will be useless for you since they don't have the text table and we don't actually replicate External Store.

I think media can/should be handled with InstantCommons, ie set $wgForeignFileRepos to point to enwiki (and then also commons, if needed). This should avoid needing to have any of the media locally.

Then I suggest using one of the dump products on https://dumps.wikimedia.org/, either SQL or XML, rather than trying to come up with any subset mechanism or novel import strategy. @ssastry is considering using Special:Export and Special:Import on subsets of pages on 10s of wikis, but I think that mechanism is fragile and would require too much bespoke code we'd have to maintain.

For reference: https://www.mediawiki.org/wiki/InstantCommons -- the $wgForeignFileRepos var is an array, so I think you can stick (for example) enwiki first, before commons. More info at https://www.mediawiki.org/wiki/Manual:$wgUseInstantCommons and https://www.mediawiki.org/wiki/Manual:$wgForeignFileRepos. You might not even need to list commons in your $wgForeignFileRepos variable; enwiki might handle the fallback to commons itself.
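
A minimal sketch of what that could look like in LocalSettings.php, with enwiki listed before commons (the class and key names come from the manual page above; the cache-expiry values are illustrative, not a tested configuration):

<?php
// Sketch: pull file metadata/thumbnails from enwiki first, then commons,
// instead of hosting any media locally.
$wgForeignFileRepos[] = [
    'class' => 'ForeignAPIRepo',
    'name' => 'enwiki',
    'apibase' => 'https://en.wikipedia.org/w/api.php',
    'hashLevels' => 2,
    'fetchDescription' => true,
    'descriptionCacheExpiry' => 43200,
    'apiThumbCacheExpiry' => 86400,
];
$wgForeignFileRepos[] = [
    'class' => 'ForeignAPIRepo',
    'name' => 'commonswiki',
    'apibase' => 'https://commons.wikimedia.org/w/api.php',
    'hashLevels' => 2,
    'fetchDescription' => true,
    'descriptionCacheExpiry' => 43200,
    'apiThumbCacheExpiry' => 86400,
];

(Alternatively, $wgUseInstantCommons = true; covers the commons half on its own, which is what the sample wt_exp.pp further down ends up using.)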

> I think media can/should be handled with InstantCommons, ie set $wgForeignFileRepos to point to enwiki (and then also commons, if needed). This should avoid needing to have any of the media locally.

Thanks! :-)

> Then I suggest using one of the dump products on https://dumps.wikimedia.org/, either SQL or XML, rather than trying to come up with any subset mechanism or novel import strategy. @ssastry is considering using Special:Export and Special:Import on subsets of pages on 10s of wikis, but I think that mechanism is fragile and would require too much bespoke code we'd have to maintain.

The reason why I haven't considered dumps is because I am not convinced I can run a mediawiki install with full production dumps of 10s of wikis on a single labs vm. This will require provisioning of bare metal hardware ... which I am already working to obtain to run the actually CPU-intensive parts of running visual diffs.

https://github.com/wikimedia/mediawiki-services-parsoid-testreduce/tree/master/server/scripts are the scripts for generating and populating the rt-testing database. I think we can adapt those scripts for the export/import strategy.

But, I welcome additional input.

What we *really* want is T91162: RFC: Shadow namespaces, then we could configure every namespace to point to its "parent" wiki. I wonder if there's beta code for that we could try out?

Here is another suggestion from @ori from when we were talking about it at the dev summit.

Since the app server cluster is over-provisioned, he said we could check whether we could get one of the app servers dedicated to our tests. The good thing is that we won't have to worry about replicating test pages and associated templates, etc. We might be able to pass an X-* header, similar to the X-Mediawiki-Debug header used for debugging in production, and have the requests go to that specific app server.

However, this is tricky for a bunch of reasons, which is why I haven't pursued it.

  • What happens to parser cache?
  • How would we deploy code to this single app server?
  • This is actually on the production cluster and it feels a bit tricky ...

Another advantage of doing it off-production is that we could make this testing infrastructure available to anyone else ... imagine queueing up experiments. That said, let us consider all possibilities to get this final piece of the puzzle resolved and see what makes the most sense.

Anyway, at this time, I am still leaning towards the export/import strategy.

Ok, I had an old (2013) latest-revisions XML dump of enwiki lying around on my laptop. The bz2 is 10 GB (so just under the 12 GB for the latest one). The uncompressed version is ~45 GB, i.e. about a 4.5x expansion, so today's enwiki dump will expand to ~50 GB. If we want to run tests against 20 wikis, and even if those other 19 are only 5 GB each, we are looking at 150 GB, possibly closer to 200 GB once we populate the mysql db from the latest dumps of all these wikis. This rules out running this on a labs VM with full dumps (even just latest revisions) from all these wikis, because even the largest instance has only 160 GB of disk.

So, we need a way of sifting through the dumps. One solution that might work would be to import all templates, and only a subset of the pages. So, we just need a script that goes through the xml dump and picks this subset. See below for a count of pages by namespace.

10398900     <ns>0</ns>
  120145     <ns>100</ns>
    4187     <ns>108</ns>
  517543     <ns>10</ns>
    1323     <ns>12</ns>
 1042521     <ns>14</ns>
  788658     <ns>4</ns>
  839309     <ns>6</ns>
     127     <ns>710</ns>
     511     <ns>828</ns>
    1889     <ns>8</ns>

So, if we import all templates and a subset of pages from target wikis, we might be able to get the total db size to 10-20 GB.
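
Here is a sketch of what such a filtering script could look like. This is only an illustration of the idea, not the script that was eventually used: it assumes a titles.txt file with one full page title per line, keeps the Template (10) and Module (828) namespaces wholesale, and elides the <siteinfo> header that a real importable dump would also need to carry over.

<?php
// Hypothetical dump filter.
// Usage: php filter-dump.php < full-dump.xml > subset-dump.xml

$keepNamespaces = [ 10, 828 ]; // Template and Module
$keepTitles = array_flip( file( 'titles.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES ) );

$reader = new XMLReader();
$reader->open( 'php://stdin' );

echo "<mediawiki>\n";

// Skip ahead to the first <page> element.
while ( $reader->read() && $reader->localName !== 'page' ) {
    // keep reading
}

// Walk the <page> elements one at a time without loading the whole dump into memory.
while ( $reader->localName === 'page' ) {
    $doc = new DOMDocument();
    $page = $doc->importNode( $reader->expand(), true );
    $ns = (int)$page->getElementsByTagName( 'ns' )->item( 0 )->nodeValue;
    $title = $page->getElementsByTagName( 'title' )->item( 0 )->nodeValue;
    if ( in_array( $ns, $keepNamespaces, true ) || isset( $keepTitles[$title] ) ) {
        echo $doc->saveXML( $page ), "\n";
    }
    $reader->next( 'page' ); // jump straight to the next <page>, skipping this subtree
}

echo "</mediawiki>\n";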

Alternatively, we run the mediawiki instance as well as the visual diffing clients and server on promethium.wikitextexp.eqiad.wmflabs, which has 450 GB or so of disk space. If we run both visual diffing services on this server (50k pages each), and if we want to retain the screenshots and the diffs, then assuming 2 MB total for the 3 images, that is 100k * 2 MB = 200 GB of disk space. So, it would be doable.

Or we could just use two VMs, so we don't need to put *all* the wikis on a single labs VM.

> Or we could just use two VMs, so we don't need to put *all* the wikis on a single labs VM.

We could spin up multiple VMs yes.

Anyway, I am going to start with a single small VM and boot it up on, say, an nlwiki dump and see how it goes. We can nail down the process (vagrant, download dump, run scripts to init the db, etc.) and fine-tune it afterwards.

After a bunch of messing around with different ideas, here is the process that looks like it is going to work best and is the simplest solution. We are going to use 2 labs VMs, one to run the "base" mediawiki code and another to run "experimental" mediawiki code. Both VMs will be based on mediawiki-vagrant and will use almost identical vagrant and mediawiki configuration. The differences, if any, will be with respect to the experimental features being evaluated. In some cases, the difference will manifest in the mediawiki code, not in the configuration. This frees us from having to match production configuration exactly in order to get pixel-perfect rendering.

So these two VMs run a multiwiki mediawiki install. We are going to pick a sample of tens of thousands of pages from a subset of production wikis and initialize the mediawiki dbs with a production dump of these pages (and any necessary templates, modules, etc.). We then run visual diffs that compare the rendering on these two VMs. These visual diffs should give us a fairly reliable way of understanding how the experimental parsing / wikitext changes will affect production wikis.

Here are my mostly-finalized notes of how to go about this after experimenting and bugging @bd808 and verifying different pieces of this process.

Read https://wikitech.wikimedia.org/wiki/Help:MediaWiki-Vagrant_in_Labs
(instructions duplicated below ...)

* spin up vms via the wikitech console (via the manage instances link)
   -- assign a suitable name
* wait for it to come up and verify you can ssh into it.
* click on the configure link (and select the role::simplelamp and role::labs::mediawiki_vagrant groups).
* log in to the vm and run "sudo puppet agent -tv" (or let the puppet run happen automatically as part of the 30 min updates)
* LOG OUT AND LOG BACK IN (don't skip this step!)
* scp <wt_exp.pp> into the server, edit it suitably, and move to /srv/mediawiki-vagrant/puppet/modules/local/manifests
  -- right now this wt_exp.pp file is on my laptop
     but I could add it to a fork of mediawiki-vagrant
* run the following commands: 
------------
cd /srv/mediawiki-vagrant
echo "classes: [ 'local::wtexp_wiki' ]" > puppet/hieradata/local.yaml
vagrant up
# Not sure if other extensions are required, but
# for our purposes, these extensions are sufficient.
vagrant roles enable cite parserfunctions scribunto geshi math timedmediahandler wikihiero poem interwiki proofreadpage translate
vagrant provision
vagrant reload
------------
* scp xml dumps of the pages to import into the db
  -- generate the dumps via https://gerrit.wikimedia.org/r/#/c/274844/
* run the following commands
------------
vagrant ssh
cd /vagrant/mediawiki
# Run the following for all values of $WIKI you want to test on
mwscript maintenance/update.php --wiki $WIKI
mwscript maintenance/importDump.php --wiki $WIKI < $WIKI.dump.xml
mwscript maintenance/rebuildrecentchanges.php
------------
* Verify a bunch of pages
* Set up a proxy for the various individual wikis of the form $WIKI-$vmhost.wmflabs.org
  https://wikitech.wikimedia.org/wiki/Special:NovaProxy
  You will need N proxies in total, one for each $WIKI
  whose pages are being used in testing.
* open up the firewall for port 8080
  https://wikitech.wikimedia.org/wiki/Special:NovaSecurityGroup

On promethium.wikitextexp.eqiad.wmflabs, set up visual diff testing.
Basically, follow the puppet recipes manually, since we cannot get
hiera to work with the puppet code I created, which doesn't use classes
for everything. Marko tells me that hiera cannot be used with define_* resources.
* clone visualdiff, uprightdiff, testreduce into /srv/ repos
* set up the various config files and tweak them appropriately
* create the mysql db and initialize with the list of titles that were
  exported into the base mediawiki dumps
--------------
* Experiment away! 
--------------

Here is a sample wt_exp.pp for setting up the multiwiki mediawiki install on each of the VMs.

# == Class: local::wtexp_wiki
# Configures a multi-site wiki for wikitext experiments
class local::wtexp_wiki
{
    mediawiki::wiki { 'en': }  
    mediawiki::settings { 'en settings':
        wiki   => 'en',
        values => [
            '$wgLanguageCode = "en";',
            '$wgSitename     = "Wikipedia";',
            '$wgUseInstantCommons = true;',
            '$wgUseTidy = true;',
        ]
    }

    mediawiki::wiki { 'nl': }  
    mediawiki::settings { 'nl settings':
        wiki   => 'nl',
        values => [
            '$wgLanguageCode = "nl";',
            '$wgSitename     = "Wikipedia";',
            '$wgUseInstantCommons = true;',
            '$wgUseTidy = true;',
        ]
    }

    mediawiki::wiki { 'fr': }  
    mediawiki::settings { 'fr settings':
        wiki   => 'fr',
        values => [
            '$wgLanguageCode = "fr";',
            '$wgSitename     = "Wikipédia";',
            '$wgUseInstantCommons = true;',
            '$wgUseTidy = true;',
        ]
    }
}

I have set up mw-base.wikitextexp.eqiad.wmflabs and mw-expt.wikitextexp.eqiad.wmflabs as the two VMs already and initialized the vagrant VM and mediawiki install. I've initialized the dbs with small dumps from nlwiki and enwiki and have set up proxies, etc. Last thing left to do is set up visual diffing on promethium and run tests.

For the first test, I have turned off Tidy on mw-expt.wikitextexp.eqiad.wmflabs. I'll finish the rest of the steps and update the notes. If all goes well, we should see a whole bunch of visual diffs from the visual diff test run.

Visual diff testing is now running on promethium, which compares mediawiki renderings from the two VMs above. I verified from the first run with 21 titles (11 from nlwiki and 10 from enwiki) that the visualdiff results are as expected when Tidy is turned off.

Since we don't yet have a public DNS entry for this server (T129181 will fix that), here is a snippet of the curl result on that server.

		<li><b>100%</b> tested without errors,</li>
		<li><b>52.38%</b> showed less than 1% differences, and</li>
		<li><b>14.29%</b> rendered with pixel-perfect accuracy.</li>

So, ~33% of pages (7 of the 21) had >1% diffs from turning off Tidy, and ~14% (3 pages) had no diffs at all when Tidy was turned off (the <1% and pixel-perfect buckets above are disjoint, so 14 of the 21 pages had under 1% difference). That in itself is interesting: over 80% of pages have their rendering change because of Tidy.

Anyway, now that this first hurdle has been crossed, next steps for doing real mass scale testing are:

  1. Select a subset of wikis on which these tests should be run and pick a reliable sample of titles from those wikis. This set of scripts should be helpful here.
  2. Use this script to generate the full list of titles needed and use that new list to generate XML dumps for each of those wikis in step 1.
  3. Import these titles into the mediawiki db on the 2 vms.
  4. Import these titles into the testreduce mysql db on promethium.

Once this setup is completed, we can then update the mediawiki code and config on mw-expt.wikitextexp.eqiad.wmflabs and test. This is likely to expose new work to be done in mediawiki vagrant roles / puppet code.

ssastry renamed this task from "Set up mass visual diff testing with a custom install of mediawiki and maybe parsoid" to "Set up mass visual diff testing with a custom install of mediawiki". Mar 8 2016, 12:13 AM
ssastry claimed this task.

> Anyway, now that this first hurdle has been crossed, next steps for doing real mass scale testing are:
>
>   1. Select a subset of wikis on which these tests should be run and pick a reliable sample of titles from those wikis. This set of scripts should be helpful here.
>   2. Use this script to generate the full list of titles needed and use that new list to generate XML dumps for each of those wikis in step 1.

These two steps are now done. Dumps are now ready at http://dumps.wikimedia.org/other/testfiles/20160405/

>   3. Import these titles into the mediawiki db on the 2 vms.

I have generated the custom mwvagrant .pp config file and the importer script via http://git.wikimedia.org/commit/mediawiki%2Fservices%2Fparsoid%2Ftestreduce.git/refs%2Fheads%2Fmaster ...

TODO: Run these scripts to import the above dumps.

>   4. Import these titles into the testreduce mysql db on promethium.

TODO

> Once this setup is completed, we can then update the mediawiki code and config on mw-expt.wikitextexp.eqiad.wmflabs and test. This is likely to expose new work to be done in mediawiki vagrant roles / puppet code.

TODO.

So, here are additional notes for future reference. I'll later pull all these notes together into a coherent document and put it somewhere.

  • 20 GB of disk space turns out to be too tight because of all the images in /srv/. Bryan, with help from others, managed to add additional disk space. But, for future reference, here is how you add additional disk space upfront:
    • create new vm
    • enable role::labs::lvm::srv and force puppet run
    • enable role::labs::mediawiki_vagrant and force another puppet run
  • On this eswiki title, the import script crashed with a "PHP Fatal error: Maximum function nesting level of '200' reached, aborting! in /vagrant/mediawiki/extensions/Scribunto/engines/LuaCommon/LuaCommon.php on line 419" error. After chatting with Brad and Bryan, I figured out that the solution is to:
    • update the 'xdebug.max_nesting_level' => 200 setting in puppet/modules/php/manifests/remote_debug.pp to 1000.
    • vagrant provision
    • submit a git patch :-) since bryan wanted to increase this for everyone.
  • In some runs, the import stopped without any error message; not sure what the cause was. I either restarted the full import, or pruned the dump to remove the already-imported titles and reran the import on the rest.
  • In my very first run, I used nohup to run the import and teed the output to a file. But screen is better, so I started using screen. However, in a non-login screen shell, all the usual vagrant commands no longer work; Bryan tells me that I have to use /usr/local/bin/mwvagrant in that scenario.

Additional notes for those following along:

  • Instead of the slow import via the importDump.php process, an alternative would be to use MWDumper. When I last looked at that page many months back, I didn't look further because of the comment: "Beware that MWDumper has not been actively maintained since the mid-2000s, and may or may not work with current deployments. Some new maintenance is going on as of December 2015." I contacted @brion yesterday and he says: "It should be *much* faster to use mwdumper (or another similar tool that writes straight to the database). The resulting database will be a bit light on metadata though -- links tables are not filled out for instance, and maybe other things ... You can usually resolve that by running rebuildLinks.php (not 100% sure that's the script name offhand, it's been a while) which you can at least run in the background _after_ the import. I *think* the output with the latest patches should work correctly on latest MW as run via vagrant, but I haven't tested lately with a giant input file." Since the dumps are almost 75% done and it is a matter of another 8-10 hours before they finish (and I am not necessarily babysitting them), I don't feel like investigating this solution at this point. But I am recording these notes here for future reference.

Installed the following fonts on promethium for correct phantomjs rendering of Indic and other language wikis:

sudo apt-get install fonts-nanum-coding fonts-indic fonts-takao fonts-wqy-microhei fonts-wqy-zenhei

We enabled the simpler_miser vagrant role, except we disabled the file cache part of it, since we want to be able to immediately see the effect any code changes might have. We also disabled the parser cache ($wgParserCacheType = CACHE_NONE;) for the same reason, via a file in the settings.d directory. Confusingly, there will still be an HTML comment indicating that the page was saved in the parser cache, but it really isn't, and the timestamp will always change.
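
For reference, the settings.d override amounts to something like this (a sketch; the filename is hypothetical and just needs to follow the vagrant settings.d convention):

<?php
// e.g. settings.d/00-NoCaching.php (hypothetical filename)
// Keep rendered output out of the caches so that code/config changes
// on the experimental VM show up immediately in the visual diffs.
$wgUseFileCache = false;          // undo the file cache enabled by simpler_miser
$wgParserCacheType = CACHE_NONE;  // no parser cache at all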

We now have functional visual diff testing enabled. https://www.mediawiki.org/wiki/Parsing/Visual_Diff_Testing has documentation about how to manage this and update tests.

TODO:

  1. Cull information from this ticket into a guide on how to set up the VMs (for future repeatability).
  2. There are still a couple of rough edges w.r.t. repeatability of tests (a small proportion of tests show false failures because of crankiness of phantomjs, load on promethium, and occasional overload on the VMs). We need to iron these out or document how to account for them.

The core work of this task has been accomplished and we have been using these tests successfully in the Tidy replacement project (T89331). See Replacing Tidy for details.

I am going to cull out details from this task later on for a wiki page.