Early in Parsoid's development, we started RT-testing with 100K pages from enwiki. Sometime in 2013, we switched to 160K pages, with 10K pages from each of 16 different wikis. At this point, we are close to done fixing the most important semantic failures in this set -- all that is left are some edge cases and diffs resulting from wikitext errors that we aren't going to support.
Since we want to deploy VE to enwiki, we should do another refresh of the RT-testing pages, this time with a bigger pool of pages from enwiki and proportionately fewer pages from smaller wikis (instead of 10K from every wiki). We should likewise introduce a small set of pages (1K each?) from a few different wiktionaries and other non-Wikipedia wikis to uncover use cases specific to those wikis (as in T101599).
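One possible way to turn the proposed split into per-wiki page counts is sketched below. It is only an illustration, not an agreed plan: it assumes "proportionately" means proportional to each wiki's article count, keeps a guaranteed minimum per wiki, and hands out rounding leftovers by largest remainder. The function name `allocate_quotas`, its parameters, and the weights in the usage example are all hypothetical.

```python
def allocate_quotas(article_counts, total, min_per_wiki=0):
    """Split `total` test pages across wikis.

    Assumption (not from the task): quotas are proportional to each
    wiki's article count, on top of a guaranteed `min_per_wiki`.
    """
    reserved = min_per_wiki * len(article_counts)
    if reserved > total:
        raise ValueError("min_per_wiki is too large for the requested total")

    remaining = total - reserved
    grand_total = sum(article_counts.values())
    if grand_total <= 0:
        raise ValueError("article counts must sum to a positive number")

    quotas = {}
    remainders = {}
    for wiki, count in article_counts.items():
        # Integer arithmetic avoids floating-point rounding surprises.
        share, remainder = divmod(remaining * count, grand_total)
        quotas[wiki] = min_per_wiki + share
        remainders[wiki] = remainder

    # Give the pages lost to rounding to the wikis with the largest remainders.
    leftover = total - sum(quotas.values())
    by_remainder = sorted(remainders, key=remainders.get, reverse=True)
    for wiki in by_remainder[:leftover]:
        quotas[wiki] += 1

    return quotas


if __name__ == "__main__":
    # Hypothetical relative sizes, purely for illustration.
    example_counts = {"enwiki": 6, "dewiki": 3, "hewiki": 1}
    print(allocate_quotas(example_counts, total=160, min_per_wiki=10))
```

The small fixed sets for wiktionaries and other non-Wikipedia wikis could be subtracted from the overall budget first, with the remainder passed in as `total`.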