
Sync pages (gadgets, in CSS and/or JS) from production wikis
Closed, DeclinedPublic


MaxSem wrote a small shell script, triggered by cron and maintained in Puppet.

Given a list of wikis and articles, the script copies them each day to a beta wiki.

The point of this bug is to migrate that script to Jenkins, which would let us tweak the job easily without relying on ops to merge our changes, and would also let us trigger the sync manually by rerunning the job.
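The sync step described above can be sketched in a few lines. This is a minimal illustration, not the actual script: the hostnames and page list are assumptions, while `index.php?action=raw` is the standard MediaWiki way to fetch a page's current wikitext.

```python
# Hypothetical sketch of the per-page sync: fetch current wikitext from a
# production wiki, then push it to the matching beta wiki (push step elided).
from urllib.parse import quote
from urllib.request import urlopen

def raw_url(host: str, title: str) -> str:
    """Build the index.php?action=raw URL for a page's current wikitext."""
    return f"https://{host}/w/index.php?title={quote(title)}&action=raw"

def fetch_wikitext(host: str, title: str) -> str:
    """Download the raw wikitext of one page."""
    with urlopen(raw_url(host, title)) as resp:
        return resp.read().decode("utf-8")

# Illustrative daily loop (the edit/push half would use the beta wiki's
# action API with bot credentials, omitted here):
# for title in ["MediaWiki:Common.css", "MediaWiki:Common.js"]:
#     text = fetch_wikitext("en.wikipedia.org", title)
#     ...write `text` to the same title on the beta wiki...
```

Running this from a Jenkins job, rather than a puppetized cron, is exactly what would make manual reruns trivial.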

Event Timeline

bzimport raised the priority of this task from to Lowest.Nov 22 2014, 2:04 AM
bzimport set Reference to bz49779.
bzimport added a subscriber: Unknown Object (MLST).

The Parsoid team has a list of about 160,000 pages from various language wikis (largest portion being from English WP) that they test their round tripping on. This list would probably be a great list to have auto-pulled into beta for general purpose testing.

See for Parsoid's use of it.

Alternatively, a small script might spider two or three deep from Main_Page. That might give a good set of "likely to be high traffic" pages.
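The spidering idea above is a plain bounded breadth-first traversal. As a sketch, here is the traversal with the link source injected as a callable, so it can run without network access; a real version would have `get_links` call the MediaWiki action API (`action=query&prop=links`). The function name and structure are illustrative assumptions.

```python
# Breadth-first crawl of page links up to a fixed depth, starting from a
# seed page such as Main_Page. `get_links(title)` must return the titles
# linked from `title`; here it is injected rather than hitting the API.
from collections import deque

def crawl(start: str, depth: int, get_links) -> set:
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        title, d = queue.popleft()
        if d == depth:
            continue  # reached the depth limit; don't expand further
        for linked in get_links(title):
            if linked not in seen:
                seen.add(linked)
                queue.append((linked, d + 1))
    return seen
```

With `depth=2` or `depth=3` from Main_Page, the result set skews toward heavily linked (and thus likely high-traffic) pages, which is the property the comment is after.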

Assigning Ariel since we've been talking about how to do this recently and they're working on it.

For a list of high priority languages from Asaf, which I'd just trust blindly since I have no real domain knowledge, see this pdf:
(specifically, page 18)

He7d3r renamed this task from sync articles from production wikis (css/gadgets) to Sync pages (gadgets, in CSS and/or JS) from production wikis.May 31 2015, 5:48 PM
He7d3r updated the task description. (Show Details)
He7d3r added a project: JavaScript.
He7d3r set Security to None.

That script is not actually being run at the moment: no instance has the class beta::syncsiteresources, or the class that includes it, role::beta::bastion.

hashar removed a project: Jenkins.

After two years, there is no champion for this task. I am thus assuming it is not of any specific interest, nor is it getting in the way of anyone.

This has always been a desire of the Parsoid team: Beta doesn't have enough content to make it a worthwhile testing target for us.

Ideally we'd sync "almost all" prod content to beta on a regular basis (weekly/daily) so that if I find a parsing bug on arbitrary page [[X]], I can check that the fix running in beta on [[beta:X]] fixes the problem. The larger the subset of prod content which is sync'ed, the more likely that is to be true. The more frequently the sync is done, the more likely beta is to have the particular content which is causing problems.

This includes all language content as well; I had to create zh and sr wikis in beta because previously beta wasn't running any language converter wikis at all. Now at least I have wikis to test on which have LanguageConverter enabled, but neither has any substantial content, much less the complex infrastructure of LanguageConverter rules and glossaries currently running on zhwiki.

In a previous job, there was a nightly database-export/sanitization/import-into-test pipeline that ensured the test environment was representative. I believe that we can't use raw database backups as the basis for sync here at WMF because the "sanitization" step (to remove private user info) is basically impossible to do safely. But we do export db dumps, and those dumps presumably already omit any sensitive data, so it may be feasible to establish a nightly/weekly/monthly sync based on those.

Wikidata alone has 78 million main namespace pages. You'd be talking about a huge leap in the amount of storage required. I don't know what thoughts you have about getting the pages into the beta database, but the usual Import.php method will be prohibitively slow for anything like this, at even a 100th of the number of pages.

You might be better off importing just the specific page, along with the most recent revisions of the templates and Lua modules it uses, if you want to be able to test Parsoid changes on beta.
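The targeted alternative maps directly onto Special:Export, whose `templates` parameter pulls in the transcluded templates and Lua modules alongside the requested page. A minimal sketch (hostnames illustrative; the parameter names are the real Special:Export ones):

```python
# Build a Special:Export URL for one or more pages, including the latest
# revisions of the templates/modules they transclude. The resulting XML
# can be fed to Special:Import or importDump.php on beta.
from urllib.parse import urlencode

def export_url(host: str, titles: list) -> str:
    params = urlencode({
        "pages": "\n".join(titles),  # one title per line
        "templates": "1",            # also export transcluded templates/modules
        "curonly": "1",              # latest revision only
    })
    return f"https://{host}/wiki/Special:Export?{params}"
```

This keeps the import small enough for the normal import path while still giving Parsoid the page's full template environment to render against.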