Unconference session Jan 5 at 10 am: https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit_2016/2016-01-05
Session Notes: https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit_2016/T114019/Minutes
During the above session, we will talk about: what are the main problems with the current XML/SQL/other dumps for users? If we could redo the entire dumps system from scratch, how would we solve these issues?
Suppose we tossed all our existing dumps infrastructure. What would we want the new Dumps 2.0 to look like? All formats, all content, all means of generation/scheduling are up for discussion, including HTML format, XML, JSON, incrementals, media, EVERYTHING. Let's have a plan. If you don't come, you'll wind up with the same old Dumps 1.0!
The existing dumps have lasted a long time, but it's time to put them to bed. Why?
- Formats. We already have Wikidata JSON dumps, HTML dumps from RESTBase, XML dumps, MySQL table dumps, and for a glorious brief while we had media image tarballs. These are/were all produced separately via separate tools, on different schedules, each as their own project. What we should have is one DumpManager to rule them all. Want dumps of X project in a new offline format? Write code for the job(s) needed, taking input/output filenames and a directory location, define the dependencies, and let the Manager do the rest. (A toy sketch of such a job interface follows this list.)
- Incrementals. This has to be solved a different way than either the custom binary format (which gets harder when we add new fields to db tables) or the horrid XML hack from the "adds-changes" dumps. We need to deal with this on the MediaWiki side, generating a list of X that have changed between Y and Z (sketched after this list). How that then gets processed into nice content for the user is another discussion.
- Content. We have Wikidata, which has its own unique content; should we do something special here? Media tarballs are missing; there was discussion some time back about "tarballs on the fly" (for hundreds of GB at once). Is that reasonable? Can we pare this down to a more manageable problem? (See the streaming sketch after this list.) What about the right to fork? If any specific piece of content is downloadable by a user but not all of it by that same user, is the right to fork preserved?
- Scalability. The current code can't really be made much faster. We scale it by throwing more hardware at it, in a not very clever way. With the current approach we can't actually speed up the en wikipedia dumps; they will take longer and longer over time. The one DumpManager to rule them all should be able to farm jobs out to a little cluster of boring servers, all alike, and collect the output pieces back (see the range-splitting sketch after this list). Want en wikipedia to run in a week? Make the cluster larger and it will.
- Reliability. Because the (XML) dumps are a complicated duct-tape-and-paper-clip thing, and also because jobs take a long time, they are prone to failure for various reasons, from network hiccups to db issues to MW code pushes. If jobs were cheap, we would record the failure and rerun the job automatically; if it failed again, we would reschedule it automatically with some delay, retry up to n times, and then whine to IRC and email on final failure (see the retry sketch after this list).
- Storage. These files are all sitting on one webserver with some sweet arrays. Maybe we want something else: Swift, Ceph, or some other distributed filesystem? Could any given dump be produced on local disks and then shoved into such a filesystem once it's done? (See the publish sketch after this list.)
- Downloading. We currently cap download bandwidth rather severely. If the backend storage were different, and we had 2-3 boxes in front with good bandwidth, could we make downloads much better for community members? What about access for WMF folks generating stats, or labs users; how can we get them "nearly live" data without bandwidth limits?
- Maintainability. I've been hacking on these organically for 5 years now. And they were in use essentially unchanged years before that. Ewww. 'Nuff said.
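To make the Formats bullet concrete, here is a minimal Python sketch of what a pluggable job interface for a hypothetical DumpManager could look like. None of these names (DumpJob, OfflineFormatJob) exist anywhere; the point is only "give the Manager input/output filenames, a directory and dependencies, and it does the rest".

```
import os


class DumpJob:
    """One unit of dump work: takes input/output filenames and a directory,
    declares its dependencies, and knows how to run itself."""

    def __init__(self, name, input_files, output_files, output_dir, depends_on=None):
        self.name = name
        self.input_files = input_files
        self.output_files = output_files
        self.output_dir = output_dir
        self.depends_on = depends_on or []  # names of jobs that must finish first

    def run(self):
        """Subclasses produce self.output_files from self.input_files."""
        raise NotImplementedError


class OfflineFormatJob(DumpJob):
    """Toy example: pretend to convert an XML dump into some new offline format."""

    def run(self):
        for out in self.output_files:
            with open(os.path.join(self.output_dir, out), "w") as fh:
                fh.write("converted content would go here\n")
```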
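For the Incrementals bullet, the MediaWiki-side piece could be as small as "give me the page ids touched between two timestamps". A rough Python sketch using the standard revision table columns; deletions, moves and other log events would also need the logging table and are left out, and the connection details are placeholders:

```
import pymysql


def changed_page_ids(dbconf, start_ts, end_ts):
    """Return page ids with at least one revision in [start_ts, end_ts).
    Timestamps are MW-style strings like '20160101000000'."""
    conn = pymysql.connect(**dbconf)
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT DISTINCT rev_page FROM revision "
                "WHERE rev_timestamp >= %s AND rev_timestamp < %s",
                (start_ts, end_ts),
            )
            return [row[0] for row in cur.fetchall()]
    finally:
        conn.close()
```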
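On the "tarballs on the fly" question from the Content bullet: the streaming part itself is cheap, since a tar archive can be written to a pipe or HTTP response without ever existing on disk. Sketch only; where the list of media file paths comes from is the real problem and is not solved here.

```
import sys
import tarfile


def stream_tarball(file_paths, out=sys.stdout.buffer):
    """Stream a tar archive of the given files to a writable byte stream."""
    # mode "w|" writes a non-seekable stream, so nothing is buffered on disk
    with tarfile.open(fileobj=out, mode="w|") as tar:
        for path in file_paths:
            tar.add(path, arcname=path.lstrip("/"))
```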
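For the Scalability bullet, the farming-out idea is just: divide the page-id space into chunks, run each chunk as its own job, and collect the output pieces. In this sketch a local process pool stands in for the cluster of servers; a real DumpManager would hand ranges to remote runners instead, and all names here are made up.

```
from concurrent.futures import ProcessPoolExecutor


def page_ranges(max_page_id, chunk_size):
    """Yield (start, end) page-id ranges covering 1..max_page_id."""
    start = 1
    while start <= max_page_id:
        yield (start, min(start + chunk_size - 1, max_page_id))
        start += chunk_size


def dump_range(page_range):
    start, end = page_range
    # placeholder for "dump pages start..end into its own output piece"
    return "pages-%d-%d.xml.gz" % (start, end)


def run_parallel_dump(max_page_id, chunk_size, workers):
    with ProcessPoolExecutor(max_workers=workers) as pool:
        pieces = list(pool.map(dump_range, page_ranges(max_page_id, chunk_size)))
    return pieces  # pieces can be recombined or published as-is
```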
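The Reliability bullet describes a retry policy; here is roughly what it amounts to in code. The attempt count, the delay and the notify hook are placeholder values, and "whine to IRC and email" is reduced to a callback.

```
import time


def run_with_retries(job, max_attempts=4, delay_seconds=600, notify=print):
    """Record a failure, rerun immediately once, then retry with a delay,
    and complain loudly only after the final failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            job.run()
            return True
        except Exception as err:  # a real manager would catch narrower errors
            notify("job %s failed (attempt %d of %d): %s"
                   % (job.name, attempt, max_attempts, err))
            if attempt > 1 and attempt < max_attempts:
                time.sleep(delay_seconds)  # reschedule later attempts with a delay
    notify("job %s failed permanently; time to whine to IRC and email" % job.name)
    return False
```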
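For the Storage bullet, assuming something Swift-like and the python-swiftclient library, "produce on local disk, then shove into the distributed store" might look like the following. The endpoint, account and names are placeholders, and Ceph via its Swift-compatible gateway would look much the same.

```
from swiftclient.client import Connection


def publish_dump(local_path, container, object_name):
    """Upload one finished dump file from local disk into object storage."""
    conn = Connection(
        authurl="https://swift.example.org/auth/v1.0",  # placeholder endpoint
        user="dumps:publisher",                         # placeholder account
        key="SECRET",
    )
    conn.put_container(container)  # no-op if it already exists
    with open(local_path, "rb") as fh:
        conn.put_object(container, object_name, contents=fh,
                        content_type="application/octet-stream")
    conn.close()
```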
What we should get done at the Summit:
Hash out the architecture and building blocks, with lists of things to research. Example: we need a job scheduler that can handle dependencies and reruns, collect output, monitor progress reports from job runners and feed them to a log, keep track of resources used per job per host, etc. What is available? (A tiny dependency-ordering sketch follows this list of goals.)
Determine what changes we would need to MW core in order to get a "changed content" list.
Rough outline of what we would need for media dumps to happen regularly without breaking the bank.
Parcel out research tasks and set a time frame for agreement on an architecture.
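On the job scheduler question above: whatever we pick or build, the "handle dependencies" part boils down to running jobs in an order that respects their declared dependencies. A purely illustrative Python sketch; the job names are just stand-ins for familiar dump steps.

```
def schedule(jobs):
    """jobs: dict mapping job name -> set of names it depends on.
    Returns a run order where every job comes after its dependencies."""
    order, done = [], set()

    def visit(name, seen=()):
        if name in done:
            return
        if name in seen:
            raise ValueError("dependency cycle involving %s" % name)
        for dep in jobs.get(name, ()):
            visit(dep, seen + (name,))
        done.add(name)
        order.append(name)

    for name in jobs:
        visit(name)
    return order


# e.g. stubs before the articles dump, which feeds a recombine step
print(schedule({
    "stub-articles": set(),
    "pages-articles": {"stub-articles"},
    "recombine": {"pages-articles"},
}))
```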
A little of this discussion has happened here:
https://www.mediawiki.org/wiki/Wikimedia_MediaWiki_Core_Team/Backlog/Improve_dumps
as well as at T88728
FOR CURRENT DISCUSSION go to: https://wikitech.wikimedia.org/wiki/Dumps/Dumps_2.0:_Redesign to read the starting point, then comment there (and reference your comment here) or comment directly on this task.