- Generate diffs for Simple & English
- Generate and compare word persistence measures
Trello card: 3Uwlwoxk
- column: In Progress
Comments from Trello:
Enwiki diff job finished!
Last tests on Altiscale cluster suggest that we are very close.
This is blocked on Altiscale cluster issues.
Altiscale Hadoop testing
Substantial progress at the hackathon thanks to Ansgar. We worked out some methods for weighting contributions to pages that don't have enough follow-up revisions to evaluate persistence.
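We didn't write the method up in this card, but the gist of the weighting idea can be sketched like this (all names and the window size are illustrative assumptions, not the actual implementation):

```python
# Hypothetical sketch of the hackathon weighting idea: persistence is
# usually measured over a window of follow-up revisions; when a page has
# fewer follow-up revisions than the window, normalize by the number of
# revisions actually observed. Illustrative only -- not the real code.

def weighted_persistence(revisions_surviving, follow_up_revisions, window=10):
    """Return a persistence score normalized by the observable window."""
    observed = min(follow_up_revisions, window)
    if observed == 0:
        return None  # nothing to evaluate persistence against
    return revisions_surviving / observed

# A token surviving 3 of only 5 available follow-up revisions scores the
# same as one surviving 6 of a full 10-revision window.
```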
I completed the diff calculation for the 20150602 dump. I am crashing on the diffs2persistence calculation though. I'm considering running the script on stat1003.wmf so that I can track the resource usage more closely.
I found out that I can't process snappy-compressed files on a Unix machine: snzip simply doesn't support Hadoop's snappy format. So I converted all of the data files to bz2 and transferred them to stat1003. I'll be starting a test run on the data tonight.
I've been processing the data for about a week now. We should be finishing up tonight. I've learned that Hadoop is crazy and *I'm* not running out of memory: usage is sub-500MB for all mappers.
For the next run, I'll crank up the number of mappers. We should be able to process the same amount of data in half the time.
I've also been extending the mwpersistence library. It's quite powerful, so it will make it easier to repeat this process for other wikis or to update the enwiki dataset.
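For a sense of what the library does, here's a toy version of token-persistence tracking built on difflib. This illustrates the concept only; it does not reflect mwpersistence's actual API:

```python
import difflib

def track_persistence(revision_texts):
    """Toy token-persistence tracker: for each token in the latest
    revision, count how many consecutive revisions it has survived.
    This sketches the idea behind mwpersistence, not its real API."""
    tokens, ages = [], []
    for text in revision_texts:
        new_tokens = text.split()
        matcher = difflib.SequenceMatcher(None, tokens, new_tokens)
        new_ages = [0] * len(new_tokens)
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            if op == "equal":  # surviving tokens get one revision older
                for offset in range(i2 - i1):
                    new_ages[j1 + offset] = ages[i1 + offset] + 1
        tokens, ages = new_tokens, new_ages
    return dict(zip(tokens, ages))

ages = track_persistence(["foo bar", "foo bar baz", "foo baz"])
# "foo" survived both updates; "baz" survived one; "bar" is gone.
```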
I've discovered a bug deep in the difference algorithm. I've patched it and added tests, but this sets me back substantially, since the fix affects the very beginning of the pipeline and everything downstream needs to be re-run.
Rough pipeline: [Diffs] --> [Token stats] --> [Revision stats]
The diff step is, by far, the slowest part of the pipeline.
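To see why, here's roughly what the diff step must do for every pair of consecutive revisions. This is a minimal sketch using difflib as a stand-in for the streaming diff engine the pipeline actually uses:

```python
import difflib

def revision_diff(old_text, new_text):
    """Compute token-level insert/delete/replace operations between two
    consecutive revisions. difflib stands in here for the real streaming
    diff engine; the expensive part is this pairwise token matching,
    repeated for every revision of every page."""
    a, b = old_text.split(), new_text.split()
    ops = []
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(None, a, b).get_opcodes():
        if op != "equal":
            ops.append((op, a[i1:i2], b[j1:j2]))
    return ops

print(revision_diff("the quick fox", "the quick brown fox"))
# one insert operation adding "brown"
```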
I've started transferring the enwiki-20150901 dump to Altiscale's servers. I'll kick off the JSON extractor as soon as the transfer is done.
So, I didn't talk about the complete pipeline in my previous post. It looks more like this:
[XML] --> [Revdocs (unsorted)] --> [Revdocs (sorted)] --> [Diffs] --> [Token stats] --> [Revision stats]
The JSON extractor was supposed to take care of the first two steps in that pipeline:
[XML] --> [Revdocs (unsorted)] --> [Revdocs (sorted)]
In the meantime, I started a process on stat1003 to do the first step in the pipeline.
[XML] --> [Revdocs (unsorted)]
And I started a job on the Altiscale cluster to sort some old data into the new sorted format so that, at the very least, I can start processing some old data (2015-06-02) while my new data (2015-09-01) is being prepared.
[Revdocs (unsorted)] --> [Revdocs (sorted)]
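Conceptually, the sort step just orders revision documents by page and then by time, so the diff processor sees each page's history contiguously and in order. The field names below are guesses at the revdoc schema; the real job is a distributed sort on the cluster, not an in-memory one:

```python
def sort_revdocs(revdocs):
    """Order revision documents by page, then by timestamp, so each
    page's revisions arrive contiguously and chronologically.
    Field names are illustrative guesses at the revdoc schema."""
    return sorted(revdocs, key=lambda d: (d["page"]["id"], d["timestamp"]))

revdocs = [
    {"id": 3, "page": {"id": 1}, "timestamp": "2015-06-02T00:00:00Z"},
    {"id": 2, "page": {"id": 2}, "timestamp": "2015-01-01T00:00:00Z"},
    {"id": 1, "page": {"id": 1}, "timestamp": "2015-01-01T00:00:00Z"},
]
print([d["id"] for d in sort_revdocs(revdocs)])
# page 1's revisions first, in time order, then page 2's
```

ISO 8601 timestamps sort correctly as plain strings, which is why the key needs no date parsing.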
As part of my refactoring work to turn the sorting part of the pipeline into a Hadoop streaming job, I picked up the json2tsv library from PyPI and started doing some cleanup. They've made me a maintainer because they got tired of my incessant pull requests. :D
I've got the new [Revdocs (sorted)] for enwiki-20150602. I just kicked off the map-only diff processor on the Altiscale cluster. I lowered the memory allocated from 5012MB to 1024MB to see if we can push parallelism a bit now that Hadoop isn't doing its memory-buffering of doom during the reduce.
Checking in: we've completed 145 of 2000 mappers (many others are at 99%), with no failures so far. All looks pretty good. We're running 172 mappers in parallel. Given that we're not running out of memory at 1024MB, we might be able to push it even lower. This streaming Python diff algorithm is super efficient.
So, the diffs are finished. I tried to generate [Token stats] on Hadoop. Since we were switching to a mapper-only strategy and the diffs worked so well with 1GB per mapper, I tried the same strategy again, with 1GB, 4GB, and 8GB mappers. Regretfully, each run crashed quickly with out-of-memory errors. What is Hadoop doing with all that memory!?
After the 8GB failure, I transferred the diff data to stat1003 and started up my 16 duct-tape-mapper strategy again; that is chugging along at ~500MB per mapper, so I should have new results in a few days.
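The "duct-tape mapper" setup is essentially a local fan-out of worker processes over input files, something like the sketch below. `process_file` is a placeholder for the real pipeline step, not the actual script:

```python
from multiprocessing import Pool

def process_file(path):
    """Placeholder for one mapper: run the diffs-to-token-stats step
    over a single input file. The real work happens in the pipeline
    scripts; this function just marks the file as handled."""
    # ... read `path`, compute token stats, write output ...
    return path

def duct_tape_map(input_files, mappers=16):
    """Run `mappers` worker processes over the input files -- a stand-in
    for Hadoop's mapper fan-out, running on a single stat machine."""
    with Pool(processes=mappers) as pool:
        return pool.map(process_file, input_files)
```

Unlike Hadoop, this gives direct visibility into per-process memory with ordinary tools like `top`, which is exactly why it's useful for debugging the OOM behavior.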
I talked to the Altiscale folks about the issue. They were perplexed, so they took my job IDs and a description of the problem; they'll review and get back to us.
Quick status update: I've processed 780 input files out of 2000.
Also, I should note that the Altiscale engineers identified rare, periodic memory-usage spikes during the [Diffs] --> [Token stats] step. I'll dig into that to see whether there's anything I can do to limit memory usage in a predictable way.
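One defensive option worth trying (a sketch under my own assumptions, not the eventual fix) is capping each mapper's address space so that a spike fails fast with a clean MemoryError instead of blowing past the container limit unpredictably:

```python
import resource

def cap_memory(max_bytes):
    """Cap this process's address space (Unix only) so a rare memory
    spike raises MemoryError promptly rather than growing unbounded.
    A defensive sketch to make mapper memory usage predictable."""
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    if hard != resource.RLIM_INFINITY:
        max_bytes = min(max_bytes, hard)  # can't exceed the hard limit
    resource.setrlimit(resource.RLIMIT_AS, (max_bytes, hard))

# e.g. cap a mapper at 1 GB before it starts processing diffs:
# cap_memory(1024 * 1024 * 1024)
```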