
Dumps 2.0 for realz (planning/architecture session)
Closed, ResolvedPublic

Description

Unconference session Jan 5 at 10 am: https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit_2016/2016-01-05

Session Notes: https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit_2016/T114019/Minutes

During the above session, we will talk about: what are the main problems with current xml/sql/other dumps, for users? If we could redo the entire dumps system from scratch, how would we solve these issues?

Suppose we tossed all our existing dumps infrastructure. What would we want the new Dumps 2.0 to look like? All formats, all content, all means of generation/scheduling up for discussion. Includes: HTML format, XML? JSON? Incrementals, media, EVERYTHING. Let's have a plan. If you don't come, you'll wind up with the same old Dumps 1.0!

The existing dumps have lasted a long time but it's time to put them to bed. Why?

  1. Formats. We already have wikidata json dumps, HTML dumps from Restbase, XML dumps, mysql table dumps, and for a glorious brief while we had media image tarballs. These are/were all produced separately via separate tools, on different schedules, each as their own project. What we should have is one DumpManager to rule them all. Want dumps of X project in a new offline format? Write code for the job(s) needed that takes input/output filenames and a directory location, define dependencies and let the Manager do the rest.
  2. Incrementals. This has to be solved a different way than either the custom binary format (which gets harder when we add new fields to db tables) or than the horrid xml hack from the "adds-changes" dumps. We need to deal with this on the MediaWiki side, generating a list of X that have changed between Y and Z. How that then gets processed into nice content for the user is another discussion.
  3. Content. We have WikiData which has its own unique content; should we do something special here? Media tarballs are missing; there was discussion some time back about "tarballs on the fly" (for hundreds of GB at once). Is that reasonable, can we pare this down to a more manageable problem? What about the right to fork? If any specific content is downloadable by the user but not all of it by the same user, is the right to fork preserved?
  4. Scalability. The current code can't really be made much faster. We scale it by throwing more hardware at it, in a not very clever way. With the current approach we can't actually speed up the en wikipedia dumps; they will take longer and longer over time. The one DumpManager to rule them all should be able to farm jobs out to a little cluster of boring servers all alike and collect the output pieces back; want en wikipedia to run in a week? Make the cluster larger and it will.
  5. Reliability. Because the (XML) dumps are a complicated duct-tape-and-paper-clip thing, and also because jobs take a long time, they are prone to failure for various reasons, from network hiccups to db issues to MW code pushes. If jobs were cheap, we would record the failure, rerun it automatically, if it failed again we would reschedule it automatically with some delay and retry up to n times and then whine to IRC and email on final failure.
  6. Storage. These files are all sitting on one webserver with some sweet arrays. Maybe we want something else, a swift or ceph or some other distributed filesystem? Could any given dump be produced on local disks and then shoved into such a filesystem?
  7. Downloading. We currently cap downloads rather severely. If the backend storage was different, and we had 2-3 boxes in front with good bandwidth, could we make downloads much better for community members? What about access to WMF folks generating stats, or labs users, how can we get them "nearly live" data without bw limits?
  8. Maintainability. I've been hacking on these organically for 5 years now. And they were in use essentially unchanged years before that. Ewww. 'Nuff said.
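The manager described in points 1, 4 and 5 (jobs with declared dependencies, automatic reruns) can be illustrated with a minimal sketch. Everything here is hypothetical: `Job`, `run_all` and the status strings are names invented for illustration, not part of any existing dumps code. The sketch runs jobs in-process and retries immediately; a real manager would farm jobs out to a cluster, wait between retries, and report the final failure to IRC and email.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # Python 3.9+
from typing import Callable


@dataclass
class Job:
    """One unit of dump work (hypothetical structure, not an existing API)."""
    name: str
    run: Callable[[], None]  # does the work; raises an exception on failure
    depends_on: list[str] = field(default_factory=list)


def run_all(jobs: list[Job], max_tries: int = 3) -> dict[str, str]:
    """Run jobs in dependency order, trying each job up to max_tries times.

    Returns one status per job: "done", "failed", or "skipped" (skipped
    means a dependency did not finish, so the job was never started).
    Every name listed in depends_on is assumed to be present in jobs.
    """
    by_name = {job.name: job for job in jobs}
    order = TopologicalSorter({job.name: job.depends_on for job in jobs})
    status: dict[str, str] = {}
    for name in order.static_order():  # dependencies come before dependents
        job = by_name[name]
        if any(status[dep] != "done" for dep in job.depends_on):
            status[name] = "skipped"
            continue
        status[name] = "failed"
        for _attempt in range(max_tries):
            try:
                job.run()
            except Exception:
                continue  # a real system would log and delay before retrying
            status[name] = "done"
            break
    return status
```

With this shape, adding a new output format means writing one more `Job` and naming the jobs it depends on; ordering and reruns stay in one place.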

What we should get done at the Summit:
Hash out the architecture and building blocks, with lists of things to research. Example: We need a job scheduler that can handle dependencies, reruns, collect output, monitor progress reports from job runners and feed them to a log, keep track of resources used per job per host, etc. What is available?
Determine what changes we would need to MW core in order to get a "changed content" list.
Rough outline of what we would need for media dumps to happen regularly without breaking the bank.
Parcel out research tasks and set a time frame for agreement on an architecture.

A little of this discussion has happened here:
https://www.mediawiki.org/wiki/Wikimedia_MediaWiki_Core_Team/Backlog/Improve_dumps
as well as at T88728

FOR CURRENT DISCUSSION go to: https://wikitech.wikimedia.org/wiki/Dumps/Dumps_2.0:_Redesign to read the starting point, then comment there (plus reference comment here) or directly on this task.

Event Timeline

Qgil added a project: Dumps-Generation.EditedOct 12 2015, 10:46 PM

The expected fields are described at https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit_2016#Call_for_participation but yes, more or less they are in the description already. You are focusing on the Summit session, but what about the prior discussion? "Hash out the architecture and building blocks, with lists of things to research" is something that can be started now.

Also, @ArielGlenn, can this task be assigned to you?

wpmirrordev added a comment.EditedOct 13 2015, 2:23 AM

Let me kick off the prior discussion.

0) Context

I have concerns about Formats and Incrementals (points one and two in the description above).

  1. Formats - the need to rewrite dump files

It is necessary to rewrite both XML and SQL dump files before importing them. This is true for the:

  • XML files always;
  • SQL files, when the user has a DBMS other than MySQL/MariaDB; and
  • SQL files, when the user wishes to avoid loss of service caused by the DROP TABLE command.

Dump transform tool example. I use a GAWK script to: a) remove the DROP TABLE and CREATE TABLE commands; b) substitute all INSERT commands with REPLACE; and c) scrape the CREATE TABLE command to determine the column order, and then substitute VALUES with (col1, col2, ...) VALUES. The first (a) prevents loss of service, the second (b) prevents loss of good records, and the third (c) protects against the frequent changes in column order in the database schema T103583.
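For readers who want to see the three rewrites (a), (b) and (c) side by side, here is a rough Python sketch. It is not the GAWK script mentioned above; the function name `transform` and the regular expressions are assumptions made for illustration, and it assumes the usual mysqldump layout (one column definition per line, backtick-quoted names, the CREATE TABLE body closed by a line starting with ")").

```python
import re
from typing import Iterable, Iterator

COLUMN_RE = re.compile(r"^\s*`(\w+)`\s")  # a column definition line
INSERT_RE = re.compile(r"^INSERT INTO (`\w+`) VALUES ")


def transform(lines: Iterable[str]) -> Iterator[str]:
    """Rewrite a MySQL table dump so it can be loaded into a live table."""
    columns: list[str] = []
    in_create = False
    for line in lines:
        if line.startswith("DROP TABLE"):
            continue  # (a) never drop the table that is in service
        if line.startswith("CREATE TABLE"):
            in_create, columns = True, []
            continue
        if in_create:
            match = COLUMN_RE.match(line)
            if match:
                columns.append(match.group(1))  # (c) record the column order
            if line.startswith(")"):
                in_create = False
            continue  # (a) the whole CREATE TABLE statement is dropped
        match = INSERT_RE.match(line)
        if match is None:
            yield line
            continue
        # (b) REPLACE instead of INSERT, (c) with an explicit column list
        col_list = ", ".join(f"`{name}`" for name in columns)
        prefix = f"REPLACE INTO {match.group(1)} "
        if col_list:
            prefix += f"({col_list}) "
        yield prefix + "VALUES " + line[match.end():]
```

Because the column list is taken from the dump itself, the rewritten statements keep working when the column order in the dump differs from the order in the target table.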

  2. Incrementals - missing incremental SQL dumps

The daily incremental dumps feature XML dumps only. This creates an impedance mismatch for the user. The complete dumps can be imported using transform tools such as mwxml2sql followed by importing the SQL dumps. By contrast, the daily incremental dumps can only be imported using maintenance/importDump.php, which is so slow that, for the largest wikis, importing each daily incremental dump takes longer than a day.

  3. Suggested solution

Given that dump files must be rewritten, let us:

3.1) dump each wiki as three files in XML format:

  • foowiki-yyyymmdd-pages-articles.xml.bz2;
  • foowiki-yyyymmdd-stub-articles.xml.gz;
  • foowiki-yyyymmdd-links-articles.xml.bz2; and

mutatis mutandis for pages-meta-current and pages-meta-history;
3.2) provide a well supported tool to transform the trio of XML dumps into SQL dumps for each supported DBMS; and
3.3) discontinue dumping SQL format.

  4. Initial steps

4.1) Dump generation. I am enhancing WP-MIRROR so that it can generate dumps in many formats (XML, SQL, HTML/ZIM, media, thumb). That way I can experiment without disturbing WMF dump operations.

4.2) Dump transform tool. As to point 3.2) above, I have written wpmirror-mwxml2sql in GAWK. Currently, it is a drop-in replacement for mwxml2sql (the @ArielGlenn version). The disadvantage is that it runs at about half the speed of mwxml2sql (this is the difference between interpreted GAWK and compiled C). The advantage is maintainability (greatly enhanced perspicuity, a fraction of the lines of code, no buffer overflows). To me, this tool could easily be extended to transform the above suggested trio of XML dump files into a complete set of SQL files for any supported DBMS (I would need to add a --dbms command-line option).

Obviously, I back my suggestion with a willingness to write code.

Halfak added a subscriber: Halfak.Oct 14 2015, 6:40 PM

I'd like to file a request for JSON one-line-at-a-time format of XML dumps. I represent a group of people who use the XML dumps for data analysis. One of my most successful utilities (see https://pythonhosted.org/mediawiki-utilities/core/xml_dump.html#mw-xml-dump and http://pythonhosted.org/mwxml/) converts the XML dump into an enumeration of page-partitioned revisions. This reduces the complexity of analysis code substantially (usually an order of magnitude fewer lines of code).

Further, I have been working with distributed processing frameworks like Hadoop and Spark. These systems are designed to process log files one line at a time. Needless to say, they work very badly with the current XML format. We (me and analytics engineers) have devoted a large chunk of time and engineering resources towards converting the XML dumps to line-by-line JSON so that we can work with them in these frameworks. I have been developing a schema that describes the format of this conversion. See https://github.com/mediawiki-utilities/python-mwxml/blob/master/doc/revision_document-0.1.0.json
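To make the "one JSON document per line" idea concrete, here is a small standard-library sketch. It does not use mwxml and does not follow the revision_document schema linked above; the output keys (`page`, `id`, `timestamp`, `text`) and the function names are assumptions chosen for illustration. It also keeps a whole page in memory until the page's closing tag, which a real converter for full-history dumps would need to avoid.

```python
import io
import json
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator


def local(tag: str) -> str:
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def revisions_as_json_lines(xml_stream: BinaryIO) -> Iterator[str]:
    """Yield one JSON string per revision, each safe to write on one line."""
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if local(elem.tag) != "page":
            continue
        fields = {local(child.tag): child for child in elem}
        page = {"id": int(fields["id"].text), "title": fields["title"].text}
        for child in elem:
            if local(child.tag) != "revision":
                continue
            rev = {local(c.tag): c.text for c in child}
            yield json.dumps({
                "page": page,
                "id": int(rev["id"]),
                "timestamp": rev.get("timestamp"),
                "text": rev.get("text") or "",
            }, sort_keys=True)
        elem.clear()  # release the finished page's children


if __name__ == "__main__":
    sample = (b"<mediawiki><page><title>Foo</title><id>7</id>"
              b"<revision><id>100</id><text>hello</text></revision>"
              b"</page></mediawiki>")
    for line in revisions_as_json_lines(io.BytesIO(sample)):
        print(line)
```

Since `json.dumps` escapes newlines inside strings, every record occupies exactly one line, which is the property that line-oriented frameworks such as Hadoop and Spark rely on.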

Such JSON lines are the "recommended" format for WikiData's dumps too. See https://www.wikidata.org/wiki/Wikidata:Database_download#JSON_dumps_.28recommended.29

I'm not sure if such a JSON format should be primary or if we should just formalize the work that we've been doing in analytics to make the production of JSON files follow the production of XML files. I just wanted to flag that we're doing this and it would be good to move the work towards the source.

@wpmirrordev and @Halfak, these are both good suggestions (though I would likely not toss the sql dumps but keep them along with anything new we generate). But these are both secondary to what I have in mind; I want a new framework that will allow dumps in new formats to be easily added in to existing dumps; problems of scale to be (relatively) easily resolved if we are willing to add generic servers to a cluster; maintenance to be done by a number of people instead of only me; and rely in part on some sort of grid computing setup based on an existing open source project. I spent this week writing some notes with a couple of pictures to get at some of these issues and I'll get those on line this weekend. Getting a new framework with separate components deployed would allow anyone to add a JSON formatter to process the output of any particular dump, for example.

Having said that, please keep on with the suggestions, we'll use them, whether earlier or later.

You are focusing on the Summit session, but what about the prior discussion? "Hash out the architecture and building blocks, with lists of things to research" is something that can be started now.

Indeed, and doing so.

Also, @ArielGlenn, can this task be assigned to you?

Did.

https://wikitech.wikimedia.org/wiki/Dumps/Dumps_2.0:_Redesign this discussion kick-off is not complete yet, it is some of the notes I mentioned earlier. I'll comment here when I've got them all in. For now you can just follow along. Should be by tomorrow evening.

it's not done yet, sorry, making the diagram digital killed me today. rest of the notes (not much left) tomorrow.

The notes are now all online at the above location. Please remember that the diagram and the walkthroughs and generally all proposals are for getting this discussion started. Nothing is decided (well some of the goals are a must, but that's it). Please start commenting (here, preferably, or there if it's more convenient but notify here in that case).

ArielGlenn moved this task from Backlog to Active on the Dumps-Generation board.Oct 26 2015, 8:20 AM
ArielGlenn updated the task description.Oct 29 2015, 4:08 PM
LA2 added a subscriber: LA2.Nov 2 2015, 2:41 PM
LA2 added a comment.Nov 2 2015, 2:59 PM

This goes beyond dumps. Sometimes in Mediawiki, you want a structure that groups many pages together or that splits a page into sections.

  • Grouping can be achieved by creating subpages. A typical example is Wikisource that uses one page per book and one subpage for each chapter of that book, e.g. https://en.wikisource.org/wiki/Eskimo_Life
  • Splitting can be achieved by subheadings. A typical example is Wiktionary that has one page per word, and subheadings for each language that has a definition of that word, e.g. https://en.wiktionary.org/wiki/Eskimo

However, these structures are not reflected in dumps, templates, robot operations or searches. But maybe they should? It would be nice to be able to search for words only inside the current book without learning all the intitle: syntax for searches. Wiktionary has many templates with the argument lang=, which is almost always lang=fr in the ==French== section of the page and lang=de in the ==German== section of the page. If the subheading could set this as a context (in this section, assume lang=fr, like a local variable in a programming language), the parameter would not be needed in each template call. (It would only be needed when it deviates from the section default.) So, considering that the XML dump is a structured document, perhaps we should see the wiki as one large structure, rather than a flat set of pages.

@LA2 this is something I've long wanted for the Wiktionaries, which all have their own 'markup' for language, definition, translation and so on. I do think it's out of the scope of this project but something that should be raised separately.

Restricted Application added a subscriber: StudiesWorld.Nov 9 2015, 11:58 AM

So guys and gals and the rest of us, I don't see any discussion yet on the architecture. We need to do as much of this _before_ the dev summit as possible, to make the best use of our time there. If you don't feel you know enough to comment, start asking questions; if you know enough to comment, start doing so. At a minimum I want people to comment on the architecture diagram and the various components; makes sense? Too heavy? Functions split up improperly? No grid computing tool will do what we need? Or you know just the thing?

Well, here are some comments and questions. Hopefully they aren't too obtuse, and might be of some utility.

comments

  • I like the idea of a central object store that contains various "chunks" of data which can be reprocessed by transformers, notifiers, downloaders, etc. I'm not sure what work is involved to produce this, but it's definitely the best feature of the redesign.
  • I think the architecture and various component functionality look fine, though it's too high-level to comment directly upon. I don't know if you're looking for direct feedback, but the job runner, logger, status monitor, etc look like an appropriate division of labor.

questions

  • What format are the chunks in the object store? The Redesign documents indicate uncompressed XML (and probably SQL for the table dumps) but I just want to be sure. Any possibility of JSON (which is less bulky) or SQLite (which is directly queryable)?
  • I honestly don't understand the grid concept. The document states: "This is some grid computing thing that we didn't write and we don't maintain.". Is this some third-party service, like Amazon Web Services (editorial: hope not) that does all the heavy-lifting? If so, how exactly does that happen since all the data resides on Wikimedia servers and would need to be transported anywhere else for work? Wouldn't the time saved in grid-computing be offset by data-transport?
  • I never understood why history dumps need to be redumped every month, especially since they are so tremendous -- particularly the English Wikipedia history dumps. Since each revision gets assigned an incrementing integer ID, can't it just be dumped once and not redumped again? For example, assume all edits between 2015-11-01 and 2015-11-30 will be assigned a revision_id of 31,000,000 to 31,999,999. Can't Wikimedia just dump one file called "enwiki-revision-31000000-31999999" once, and never have to dump it again? From a server perspective, this would save time and space. From a user perspective, it would mean downloading a large file only once, rather than downloading the entire terabyte data-set month after month.

Yay, comments!

So first: grid computing in this context means: open source software that schedules and runs jobs keeping track of available resources on its cluster, rerunning if necessary, able to manage dependencies (I think), and so on. In our case a pretty small cluster of hosts, that are just COTS hardware. NOT AT ALL sending files somewhere else; the idea is simply that right now we have our own several levels of python code that do this poorly, and it would be nice to use an already developed open source package that was designed for this from the start.

Second: we dump the history dumps ordered by page; that's what's most useful for most users. Since pages change from one dump to the next, we rewrite them. If we decided to do this in revision id order instead, and we got everyone to go along with it, rewriting their tools and such, we would still need to regenerate the dumps from time to time to remove deleted revisions (from deleted pages) or oversighted/hidden revisions. Yes, it's true that those would still be available on archive.org but we don't want to be providing them forever as apparently current content. This thing about the deleted/oversighted revisions (which might often contain e.g. personal identifying information) is something where we might want to get the Legal team's advice.
It would be nice to have true incrementals in a format easily parsable with standard tools, and that indeed is one of the things up for discussion; what those would look like, how they could be implemented.

Third (but first for you): the object store will have intermediate files in uncompressed something. I don't really care what they are. They might be XML if we stick to XML format; they might be something else. We want a format converter that can produce multiple output formats when it comes time to recombine these into nice downloadable pieces anyways. Uncompressed is needed so that recombining them is as fast as we can have it.

I am indeed looking for direct feedback on the division of labour. and on everything else. If it would be helpful I could describe in another part of that document what each of those pieces corresponds to in the current code (ewww) but I don't know if that would be generally useful or not. If/when we get general agreement about this high level overview we can start fleshing out the details. But if you have proposals, no reason to wait til then.

Ah! Thanks for the explanations. I don't have much to add, but just some minor points

OSS grid computing gets a breath of relief for me. I don't really have any experience in this arena though. Are there certain packages that are leading candidates?

I didn't realize that revisions get deleted. I would think that these events are relatively uncommon, and that the range encompassing the deletions can just be dumped again. Over time, old revision-ranges (for example, 2002-02-25 -> 2002-03-24) should become static. Even in the worst case, if 1 page from every revision-range gets deleted, each range will just need to be dumped again. This won't be any worse than today. :)
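The revision-range proposal in this exchange can be sketched in a few lines. The range size of one million, the function names and the file-name pattern (modelled on the "enwiki-revision-31000000-31999999" example above) are all assumptions for illustration, not an agreed format. The sketch shows which range files would have to be regenerated after some revisions are deleted or oversighted.

```python
from typing import Iterable

RANGE_SIZE = 1_000_000  # revisions per file; an arbitrary value for this sketch


def range_for(rev_id: int, size: int = RANGE_SIZE) -> tuple[int, int]:
    """Return the (first, last) revision ID of the range containing rev_id."""
    start = (rev_id // size) * size
    return start, start + size - 1


def range_filename(wiki: str, rev_id: int, size: int = RANGE_SIZE) -> str:
    """Name of the range file that would hold rev_id (hypothetical pattern)."""
    start, end = range_for(rev_id, size)
    return f"{wiki}-revision-{start}-{end}"


def ranges_to_redump(deleted_rev_ids: Iterable[int],
                     size: int = RANGE_SIZE) -> list[tuple[int, int]]:
    """Ranges whose files must be regenerated because a revision in them
    was deleted or hidden; all other range files can be left untouched."""
    return sorted({range_for(rev_id, size) for rev_id in deleted_rev_ids})
```

The worst case described above corresponds to `ranges_to_redump` returning every range, which is the same amount of work as a full monthly history dump.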

As for page-order dumps, I honestly can't see a valid use case for them. For example, the 1st range is p000000010 - p000002875 . This encompasses a wide range of unrelated pages including Alchemy, A, Aristotle, International Atomic Time, Air conditioning, Arabian Sea, Parallel ATA, etc. I really can't see someone saying "I'll download p000000010 - p000002875 but not p000002877 - p000005030." If someone were looking for the complete edit history for a handful of pages, they'd use the API. If someone were looking for the complete edit history for tens of thousands of pages, then they would probably be randomly spread out over all these page-ranges and they'd have to download the whole thing. The new paradigm would be no different than the old paradigm.

I think uncompressed XML is fine for the object store formats. I think JSON would be "lighter", but most of the existing tool support is probably around XML. So no strong feelings / arguments from me there. :)

However, I would think that there would be a mixture of other formats also: SQL for the table dumps and binary for the image dumps (whenever that happens). Is that the case, or will everything be standardized to XML?

I'll think a little more on the overall division of labor, but this isn't something I work with, so my input will probably be very shallow. Are there specific issues you're concerned about, such as is 2 rounds of failures enough to require manual intervention for a failed job?

About revisions getting deleted:

There are two ways for this to happen. One is that a page gets deleted; in that case the revision is no longer visible to a normal user and so it would not be dumped either. The second way it can 'disappear' is for it to be specifically oversighted because it has damaging content; in this case, the other revisions for the page are still accessible but the one revision can no longer be viewed by a normal user, and again it would not be dumped.

People use page-order dumps all the time because their bots or editing tools or history analyzers or whatnot all expect to look at the info for a single page at once.

I'm not wedded to any particular format for things in the object store. I'd prefer to have one format that fits all internally, but I could be talked out of that. I assume we'll provide XML final output and SQL final output because people have tools written for those formats. And we'll see what else people desire and what makes sense.

As to the division of labor, I'm concerned about all aspects of it :-) So if there's anything whatsoever that jumps out at you, please comment.

gnosygnu added a comment.EditedNov 24 2015, 5:06 AM

Cool. Thanks for the explanation on revision deletion.

Regarding page-order dumps, I do understand that there are existing tools built around them. However, I still think that revision-order dumps would be better for the majority of users, as well as Wikimedia in the long-run. Especially since page-order dumps basically have a shelf-life of 1 month (they need to be regenerated / redownloaded the next month). Revision-order dumps persist longer: they only need to be regenerated / redownloaded when revisions get deleted.

That said, I'm not one to press the point, and would be interested to hear counterpoints from page-order users.

As for object store format, I really would not want one-universal-format, as it would make for awkward shoehorning. I believe separate formats would work fine: SQL for table dumps; binary for image dumps; SQL / XML / JSON for revision dumps. Personally, I think it's better to use the right tool for each task than to try to craft one tool for all tasks.

I looked again at the redesign documents, and at the risk of being an echo chamber, I really think it is logically fine. I think the dump manager does do a lot of work (it has the largest number of bi-directional interactions with the other components), so it seems like it may be the biggest bottleneck (it submits a job, is the exclusive partner for the client, processes all jobs when completed, etc.). However, I think it's fine to have one central manager, and don't see any easy ways to decompose its duties. Again, just my 2 cents.

I'm about to send a reminder email to the xmldatadumps mailing list; please forward it on anywhere useful. Guess I'll send it to wikitech too.

Notes on some possible job queue management/grid/multitask stuff. The go-to in python is celery; all exceptions from any job must be picklable (ewww), and it has limited functionality, but we might be ok with it because it's rather low level. Another possibility is the kybernetes job management piece, but this may be too high level for us. I'm collecting other suggestions, and we should start a page someplace to evaluate alternatives once there are more than two. Kybernetes is good for managing jobs that take different amounts of resources and celery is not, but with the dumps 2.0 straw man model we are going for 1 core per job so that all jobs look alike, though some may run longer than others.
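The picklability constraint mentioned above is easy to trip over, so here is a sketch using only the standard `pickle` module; it does not touch celery itself, and the class and function names are invented for illustration. An exception whose constructor takes arguments other than the ones it passes to `Exception.__init__` can be pickled but not unpickled; defining `__reduce__` is one way around that.

```python
import pickle


class FragileJobError(Exception):
    """Pickles fine, but fails on unpickling: pickle re-creates the object
    from self.args (the single message string), which does not match the
    two-argument constructor."""

    def __init__(self, job_name, exit_code):
        super().__init__(f"{job_name} exited with status {exit_code}")
        self.job_name = job_name
        self.exit_code = exit_code


class PicklableJobError(FragileJobError):
    """Same error, but tells pickle which constructor arguments to use."""

    def __reduce__(self):
        return (type(self), (self.job_name, self.exit_code))


def survives_pickle(exc: Exception) -> bool:
    """True if exc comes back intact from a pickle round trip."""
    try:
        clone = pickle.loads(pickle.dumps(exc))
    except Exception:
        return False
    return type(clone) is type(exc) and clone.args == exc.args
```

Any job-side exception classes would need a check like `survives_pickle` in their tests if the chosen job runner passes exceptions between processes this way.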

TTO added a subscriber: TTO.Dec 4 2015, 10:17 AM

This doesn't seem clear to me, and I think I'm coming to agree with @GWicke's comment T119029#1866851 that this is probably not a good fit for the summit in its current form. That said, it would be very disappointing to kick the can down the road for some more time.

@ArielGlenn, you wrote: "If you don't come, you'll wind up with the same old Dumps 1.0!" Really? Do you have a concise answer for "if I were in charge, I would solve X with Y?", for the following values of X?

  • Format
  • Incrementals
  • Content
  • Scalability
  • Reliability
  • Storage
  • Downloading
  • Maintainability

You also say "FOR CURRENT DISCUSSION" and then point to a giant wall of prose. That doesn't look like a discussion to me. It looks like it might be time to revisit @Halfak's earlier comment, because that looks like the seeds of a discussion that were killed early.

First, the same old Dumps 1.0 are what are running now. If we don't get some collaboration on 2.0 there won't be a 2.0. And that's exactly what I meant when I said that if people don't show up we'll just have the same old dumps we always had.

Second, for current discussion: the document is meant to be a straw man proposal that people can pick apart for discussion, in particular the diagram, because without giving some starting point for discussion, no discussion was taking place.

Halfak's comment is fine, it's just premature, in the sense that if people agree that there ought to be some sort of formatter that produces output, we can have that produce xml, json, etc. The idea is not dead at all.

Hope that helps.

In a conversation on IRC robla suggested that this would be a very useful question so I post it here:

"do we need a new multi-format dump architecture to replace our XML-based system, or is there a better approach?"

I think there's a risk of second system syndrome here in the current proposal. I would start with the question "what's the most important problem to solve with the current system?" My instinct is that we can come up with a very simple system to solve @Halfak's problem, possibly a (hopefully temporary) "next gen" version that doesn't replace the old system at first. We keep iteratively solving problems with the new system, and then when the pain of maintaining two systems becomes too much, implement the old system on top of the new system.

OK, I'll backtrack then. There are some problems listed in the description but maybe none of those is "the most important problem to solve" with the current system. So I'm taking bids, what do people think? And yes as the maintainer I have an opinion but it's as a maintainer, not as a user, so I'll chime in after some other folks have spoken up.

Nemo_bis added a comment.EditedDec 19 2015, 8:37 AM

The biggest problem is IMHO clearly:

Downloading. We currently cap downloads rather severely

Which combined with T45647 makes the entire exercise of producing dumps futile for most users and researchers in the world.

The second biggest problem is https://meta.wikimedia.org/wiki/Right_to_fork i.e. T27602.

Problem 1 does not need to wait for dumps 2.0; the instant workaround (though a workaround and not a solution) is to download from the your.org mirror which provides reasonable download speeds. In the meantime I will open a ticket for this issue and cc you on it, gathering the network issues ticket you have, a ticket I have open for ms1001 eth bonding, and adding one suggested by paravoid about looking at dataset1001 memory and related issues that may impact disk i/o.

Problem 2 should get a task for it also (again it does not need to wait for dumps 2.0) though I can not promise to start on it right away.

I'm talking in the SF office with @GWicke about this now. He's pointed out T93396: Decide on format options for HTML and possibly other dumps to me, which seems to be stalled out. Is getting that unstalled relevant to this conversation?

(note: I accidentally put this comment on T119029, but I suppose the comment makes sense there too)

@ArielGlenn: To me it seems that the discussion so far lacks a shared agreement on what the most pressing problems with dumps are. This makes it difficult to evaluate candidate solutions and their trade-offs relative to the top priorities.

With the right preparation, a discussion at the dev summit could perhaps help to establish a shared agreement on the top problems to solve. It would be helpful if a candidate list could be worked out before the summit, so that it can inform the discussion.

I have scheduled this session on the Unconference track for tomorrow, Jan 5 at 10 a.m. We'll be identifying the main issues for users of the dumps (I already know what the main issues are for the maintainer!), and then discussing approaches to address those issues if we were able to design the dump system from scratch.

ArielGlenn updated the task description.Jan 4 2016, 11:16 PM
NealMcB added a subscriber: NealMcB.Jan 7 2016, 9:08 PM

I agree with @Halfak that many of the big XML dumps are very difficult to use, and a single-line JSON format could be easily parallelized by users and would be much more convenient to parse in modern languages. They should also be compressed with bz2, not gz, as noted in T115222.

I would second that work as a good high-priority starting point, as suggested by @RobLa-WMF

What was the outcome of the Unconference meeting?

We have session notes here: https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit_2016/T114019

Lots to process still. I'm going to chat with Adam Wight (didn't get to do that at the dev summit) and also Gabriel Wicke (same) this week and add those notes to the etherpad so we don't lose them; I suppose I should copy those etherpad notes to a wiki page right after that, actually. See the action items, the ball's on me but I'm trying more to coordinate than to decide by fiat.

ArielGlenn updated the task description.Jan 13 2016, 8:24 PM
Qgil removed a subscriber: Qgil.Jan 13 2016, 10:56 PM

Wikimedia Developer Summit 2016 ended two weeks ago. This task is still open. If the session in this task took place, please make sure 1) that the session Etherpad notes are linked from this task, 2) that followup tasks for any actions identified have been created and linked from this task, 3) to change the status of this task to "resolved". If this session did not take place, change the task status to "declined". If this task itself has become a well-defined action which is not finished yet, drag and drop this task into the "Work continues after Summit" column on the project workboard. Thank you for your help!

notes are linked, also copied to wiki page so etherpad can go away, followup task generation is in progress. Once that is complete this task will be resolved.

The session notes have been updated with notes from a later discussion with Gwicke: https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit_2016/T114019/Minutes#Gabriel_Wicke_discussion_notes

Next up: AWight

https://phabricator.wikimedia.org/tag/dumps-rewrite/ New project has been created, I've added you all to it I think, but please check to be sure.

On this task I need to chat with Milimetric and AWight and record those notes, then cull out questions for followup.

ArielGlenn moved this task from Backlog to active on the Dumps-Rewrite board.Feb 1 2016, 11:48 AM

@ArielGlenn - looking forward to the discussion

https://etherpad.wikimedia.org/p/WikiDev16-T114019 has rough notes from talk with Milimetric's team, I will clean these up and move to the wiki page for the dev session soon. AWight and maybe one more Aaron are next. Then it will be question culling time.

Will chat with @Halfak tomorrow at around 18.00 my time i.e. EET (16.00 UTC I guess). Notes to go to etherpad first and then wiki page as usual. After that AWight and then done with info gathering for this round.

Chat done, notes on etherpad, will be cleaned up and added to wiki page notes soon. Next and last up: AWight.

AWight feedback to be gathered async, he's got a very full plate. So it might be put off til later, will find out soon.

AWight's notes are now available on the session notes; Halfak's notes are also available there. Next up: cull questions that need to be answered in order to work on implementation.

nice job organizing, Ariel, let me know if you want to bounce the questions off someone

Yes indeed, I was going to start the draft list here and ask people to please chime in/add/remove/fix.

Draft questions now at https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit_2016/T114019/Minutes/Questions

Please edit away. If you edit there please comment here so that task watchers (like me) know to go check the page for new stuff. Thanks!

Content. We have WikiData which has its own unique content; should we do something special here?

Flow now has dumps, though they're not running in production yet (open task).

Flow would need to fit into this framework, though.

Yes indeed, Flow and anything folks want to produce in the future. We want this to be as easy as adding a config section to a puppet manifest (once the script to produce the dataset is written and tested).

Following @awight 's suggestion at https://www.mediawiki.org/wiki/Talk:Wikimedia_Developer_Summit_2016/T114019/Minutes/Questions I'm going to break out the questions into several tracking tasks arranged by topic, each with blocking tasks for each question. Once that is done I'll close this task here and invite people to add/change/rearrange the new tasks so they make sense; we'll also have to prioritize them.

Samat added a subscriber: Samat.Mar 5 2016, 8:54 PM
ArielGlenn closed this task as Resolved.Mar 6 2016, 1:18 PM

I have broken out all the questions into tasks and added them as blocking tasks to one of the four groups outlined by @awight above. In the next days I'll start grabbing suggestions from the session and from notes and adding them on those tasks; don't wait for that however. Please rewrite/clarify/add/move questions as you deem appropriate. Once we have settled on the list I'll prioritize a few that I'll try to move along towards conclusions first, but comment on any of them.

As promised this task is now being closed. If you're not getting notifications about those other tasks don't forget to subscribe to the Dumps-Rewrite project.

ArielGlenn moved this task from active to Done on the Dumps-Rewrite board.Mar 6 2016, 1:18 PM

Note that if you don't get the emails for the tasks you might not be watching this project, only a member. So check that too.

Kelson added a subscriber: Kelson.Mar 31 2016, 9:32 PM