Implement a dump of one smallish wiki using the analytics "data lake" infrastructure as a proof of concept. This need not be a complete implementation, but enough to shake out some of the issues. A few different approaches (e.g. Hive, Spark) may be tried for some dump outputs; the idea is to try a few different things and see what works best.
I subscribed everyone who was in yesterday's meeting; if you'd rather not, feel free to silently remove self, I won't be offended.
Had a short follow-up meeting with Dan yesterday evening. We'll be using the "normal" queue for all jobs, starting with one table dump (output: sql.gz) and a revision metadata dump (output: xml.gz). Other steps will be added as we get the first ones done. We're keeping dump contents and formats unchanged, since changes to those are out of scope for this task (see e.g. T129022, which will be addressed later).
For these very first steps I'll:
- dump the page table, since it's one of those used in the analytics refinery
- assume the stubs data is already "clean" and put off sanitization issues until later
- ignore splitting of the output into small files for download
- not put the result anywhere downloadable
- not worry about checking that the output files look good, or about handling retries on failure
After another couple of Q&A rounds with Dan on IRC, I got the sqoop import working just fine for the page table; clearly it will be just as easy to add any other table(s) for which we do full mysqldumps. Here are notes on the setup and the run (which I did locally so that I'd be forced to do the install/setup), as well as the test script used for the import:
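For reference, the import was along these lines. The connect string, credentials, paths, and mapper count below are placeholders, not the actual setup — only the general shape of the command is meant:

```shell
# Hypothetical hostnames and paths; --as-avrodatafile writes Avro files
# to HDFS, which is the format discussed in the notes above.
sqoop import \
  --connect jdbc:mysql://db-host.example/wikidb \
  --username sqoop_user \
  --password-file /user/me/sqoop.pw \
  --table page \
  --as-avrodatafile \
  --target-dir /wmf/data/raw/page \
  --num-mappers 4 \
  --split-by page_id
```

`--split-by page_id` lets sqoop partition the table across mappers; for tables without a nice numeric primary key that choice needs more thought.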
The issue now is to get that data back out of Avro into MySQL format. Sqoop is not set up to do that; its strength is importing/exporting directly between database servers and Hadoop. I did some hunting for tools and have found nothing so far.
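Absent a ready-made tool, the missing piece is small enough to hand-roll: however we read the rows back out of Avro, we'd emit mysqldump-style INSERT statements ourselves. A minimal sketch of just the SQL-emission side; the table name, columns, and escaping here are illustrative, not the real page schema:

```java
// Sketch: turn a row back into a mysqldump-style INSERT statement.
// Column set is hypothetical; real output would need the full schema,
// NULL handling, and batched multi-row INSERTs like mysqldump produces.
public class SqlEmit {
    // Minimal escaping for string values in a single-quoted SQL literal.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("'", "\\'");
    }

    static String insertRow(String table, long id, String title) {
        return "INSERT INTO `" + table + "` VALUES (" + id + ",'" + escape(title) + "');";
    }

    public static void main(String[] args) {
        System.out.println(insertRow("page", 1, "Main_Page"));
        // prints: INSERT INTO `page` VALUES (1,'Main_Page');
    }
}
```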
Just a thought: if Hadoop adds no value for a particular flow, then it might be easier to use simpler tools. So if the workflow is:
mysql -> mysqldump -> serve stubs to public
Then this doesn't make sense:
mysql -> sqoop -> hadoop -> reconstruct history -> export to sql statement format
But if the "reconstruct history" part is a useful addition to the stub dumps, then Hadoop has a place here. There may also be an argument to be made for monitoring and resource utilization being all in one place if we standardize on Hadoop. But that has to be weighed against complicating the process too much.
Sure has been quiet on this ticket. Just to break the silence, I've added some proto-stubs-generation code. Not fully implemented, not ready to run on analytics, etc. etc. Also, guaranteed to be the worst java anyone has ever read. https://github.com/wikimedia/dump-scheduler-eval/commit/f64f96065bbe0bb98b3a24c157769fcf0c12fd44
Many, many things to do just to complete this stubs step, including checking whether I can really use JAXB this way or whether that would require assembling the entire XML document in memory.
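One way to sidestep the in-memory question entirely: instead of JAXB marshalling the whole document, write with StAX, which streams element by element and never holds the document. A minimal sketch — the element names are placeholders, not the real dump schema:

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

// Sketch: emit stub-style XML one revision at a time via StAX, so memory
// use stays constant regardless of how many revisions we write.
public class StreamingStubs {
    static String render(int nRevs) throws Exception {
        StringWriter buf = new StringWriter();
        XMLStreamWriter w = XMLOutputFactory.newFactory().createXMLStreamWriter(buf);
        w.writeStartDocument("UTF-8", "1.0");
        w.writeStartElement("mediawiki");
        for (int i = 1; i <= nRevs; i++) {
            w.writeStartElement("revision");
            w.writeStartElement("id");
            w.writeCharacters(Integer.toString(i));
            w.writeEndElement(); // id
            w.writeEndElement(); // revision
            w.flush(); // in a real job, flush so output streams to disk
        }
        w.writeEndElement(); // mediawiki
        w.writeEndDocument();
        w.close();
        return buf.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(render(3));
    }
}
```

JAXB can also marshal individual objects as fragments into an open stream (one record at a time), so the two approaches can be mixed; the point is just that nothing forces the full document into memory.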
Time for an update. It turns out that because my code was mostly Java, Joal was the right person to look at it, and he needed to chat with me about importing all revision content into Hadoop anyway, which would be needed not only for analytics work but for the next dump step. We talked a little at All Hands and more on IRC last week; both of us went away to read our respective piles of code, and we will talk again today (?) or shortly.
Good news is that hadoop has support for reading bz2 files not just as a single stream but with seeking to a given block:
https://hadoop.apache.org/docs/r2.6.0/api/org/apache/hadoop/io/compress/BZip2Codec.html for docs and
https://github.com/apache/hadoop/blob/release-2.6.0/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/compress/BZip2Codec.java for code.
The dbpedia project has implemented a reader on top of this that reads bz2 dump content files in parallel: https://github.com/dbpedia/distributed-extraction-framework/blob/master/extraction/src/main/scala/org/dbpedia/extraction/spark/io/input/SeekableInputStream.scala
This is very good for getting an initial set of revisions into hadoop; updating can be done by other means.
For dumps production keeping the current format and output, we want to write XML dumps in parallel, meaning that one file would be created by multiple workers producing bz2 content blocks that can be combined together. This is harder because a bz2 stream ends with a combined checksum built up from every block's CRC, so independently produced blocks can't simply be stitched into a single stream. Concatenating complete streams ('multistream' bz2, as used by the pages-articles-multistream.xml.bz2 dumps) sidesteps that, but not all tools support reading such files.
Time for another update. Joal and I chatted again; I'm working on using proper packages (instead of zero packages) for my little crap code, as well as Maven for building. He's going to write a tiny Scala example of how to use Spark, rather than Hive, for getting at (a subset of) the stubs data, so that I can see how to properly leverage parallelization. It appears that for the content phase the best we can hope for is one writer per XML output file; since most of the bottleneck is in the recompression/write of those files, we can't expect much improvement until we revisit output formats and delivery of downloadables later in the rewrite project. But that's a preliminary assessment; we'll see when we get there.
So @JAllemandou pinged me on Friday to let me know he had written the little sample. Except it's actually the entire job :-)
So my tasks are:
- read and understand this
- test it
- see what's needed to verify the output is the same as the regular dumps
Separately Joal reported that the filtered value of text_id as pulled from the labsdb replica is always 0, which is a blocker for this regardless.
I've looked into it and can't confirm that from the code. Specifically, modules/role/templates/labs/db/views/maintain-views.yaml has the following entry:
select rev_id, rev_page, if(rev_deleted&1,null,rev_text_id) as rev_text_id, if(rev_deleted&2,null,rev_comment)
so rev_text_id should be NULL (which may surface as 0 downstream) only for revisions flagged as deleted. I need an example of an entry where that's not true.
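In miniature, the view's test is a bitfield check: bit 1 of rev_deleted hides the text, bit 2 hides the comment, matching the `if(rev_deleted&1, ...)` and `if(rev_deleted&2, ...)` expressions above. A tiny sketch of the same logic (values made up):

```java
// Mirrors the maintain-views logic: if the "text hidden" bit is set,
// the view emits NULL for rev_text_id; otherwise it passes it through.
public class RevDeleted {
    static final int TEXT_HIDDEN = 1;    // rev_deleted & 1
    static final int COMMENT_HIDDEN = 2; // rev_deleted & 2

    static Long filteredTextId(int revDeleted, long revTextId) {
        return (revDeleted & TEXT_HIDDEN) != 0 ? null : revTextId;
    }

    public static void main(String[] args) {
        System.out.println(filteredTextId(0, 12345L)); // prints 12345
        System.out.println(filteredTextId(1, 12345L)); // prints null
    }
}
```

So if text_id is coming back 0 for revisions that are *not* flagged as deleted, the problem is somewhere downstream of this view, not in it.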
Bah: of course a week ago I committed locally my poor pom.xml plus packaging fixes for using Maven for my little test case, but never actually pushed it to the live repo. And I was wondering why I got no comments. https://github.com/wikimedia/dump-scheduler-eval/blob/master/analytics_poc/stubs/pom.xml Here's the pom; the rest of the changes are in the same commit.
This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decrease task cookie-licking and to get a slightly more realistic overview of plans. Please feel free to assign this task to yourself again if you still realistically work or plan to work on this task - it would be welcome!
For tips on how to manage individual work in Phabricator (noisy notifications, lists of tasks, etc.), see https://phabricator.wikimedia.org/T228575#6237124 for available options.
(For the record, two emails were sent to assignee addresses before resetting assignees. See T228575 for more info and for potential feedback. Thanks!)