Historical analysis of edit productivity for English Wikipedia
Closed, Resolved (Public)


  • Generate diffs for Simple & English
  • Generate and compare word persistence measures

Trello card: 3Uwlwoxk

  • column: In Progress

Event Timeline

Comments from Trello:

2015-04-09 Halfak:

Enwiki diff job finished!

2015-04-02 Halfak:

Last tests on the Altiscale cluster suggest that we are very close.

2015-02-06 Halfak:

This is blocked on Altiscale cluster stuff.

Altiscale hadoop testing

Substantial progress at the hackathon thanks to Ansgar. We worked out some methods for weighting contributions to pages that don't have enough follow-up revisions to evaluate persistence.
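A rough sketch of what a word persistence measure with a "not enough follow-up revisions" flag could look like. This is not the method worked out at the hackathon. It is a bag-of-words approximation with hypothetical names (`token_persistence`, `window`, `censored`), whereas the real pipeline tracks tokens through diffs.

```python
from collections import Counter


def token_persistence(revisions, window=3):
    """Approximate persistence of the tokens each revision adds.

    `revisions` is a list of token lists, oldest first. For every revision,
    count how many of the tokens it added are still present in each of the
    next `window` revisions. Revisions near the end of the history see
    fewer than `window` follow-ups and are flagged as censored.
    """
    results = []
    for i, tokens in enumerate(revisions):
        previous = Counter(revisions[i - 1]) if i > 0 else Counter()
        added = Counter(tokens) - previous  # multiset of newly added tokens
        followups = revisions[i + 1 : i + 1 + window]
        persisted = 0
        for later in followups:
            # Multiset intersection: added tokens still present later on.
            persisted += sum((added & Counter(later)).values())
        results.append({
            "tokens_added": sum(added.values()),
            "persisted_token_revisions": persisted,
            "followups_seen": len(followups),
            "censored": len(followups) < window,
        })
    return results
```

Any weighting scheme for censored revisions would sit on top of `followups_seen`. The sketch only exposes the information such a weighting would need.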

Halfak renamed this task from [Q2] Measuring quality/productivity to Historical analysis of edit productivity for English Wikipedia. Jul 2 2015, 10:22 PM
Halfak moved this task from Paused to In Progress on the Research board.
Halfak set Security to None.

I completed the diff calculation for the 20150602 dump. I am crashing on the diffs2persistence calculation though. I'm considering running the script on stat1003.wmf so that I can track the resource usage more closely.

I have the diffs on stat1003. I just need to re-write my multiprocessing script to handle the new format.

Working on a library for processing the diffs in a distributed way. See

I found out that I can't process snappy-compressed files on a Unix machine. snzip simply doesn't support Hadoop's snappy format. So I converted all of the data files to bz2 and transferred them to stat3. I'll be starting a test run on the data tonight.

I've been processing the data for about a week now. We should be finishing up tonight. I've learned that Hadoop is crazy and *I'm* not running out of memory. Usage is sub 500MB for all mappers.

For the next run, I'll crank up the number of mappers. We should be able to process the same amount of data in half the time.

I've also been extending the mwpersistence library. It's quite powerful, so it will make it easier to repeat this process for other wikis or to update the enwiki dataset.

Halfak updated the task description.

I've discovered a bug deep in the difference algorithm. I've patched it and added tests, but this puts me back really far since it affects the beginning of the pipeline.

Rough pipeline: [Diffs] --> [Token stats] --> [Revision stats]

The diff portion takes, by far, the longest as well.

So! This gives me an opportunity to use @JAllemandou's new sorted JSON extractor on new data and check to see if we get better performance with the map-only strategy we've been discussing.

I've started transferring the enwiki-20150901 dump to Altiscale's servers. I'll be kicking off the JSON extractor as soon as it is ready.

So, I didn't talk about the complete pipeline in my previous post. It looks more like this:

[XML] --> [Revdocs (unsorted)] --> [Revdocs (sorted)] --> [Diffs] --> [Token stats] --> [Revision stats]

The JSON extractor was supposed to take care of the first two steps in that pipeline:

[XML] --> [Revdocs (unsorted)] --> [Revdocs (sorted)]

But it appears to have failed. I filed a ticket on the Analytics-Backlog. See T114359.

In the meantime, I started a process on stat1003 to do the first step in the pipeline.

[XML] --> [Revdocs (unsorted)]

And I started a job on the Altiscale cluster to sort some old data into the new sorted format so that, at the very least, I can start processing some old data (2015-06-02) while my new data (2015-09-01) is being prepared.

[Revdocs (unsorted)] --> [Revdocs (sorted)]
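For illustration, the sorting step can be sketched in memory as below. The field names (`page.id`, `timestamp`, `id`) and the function names are assumptions about the revdoc JSON layout, not taken from the actual job. On Hadoop the ordering would come from the shuffle rather than from a single `sorted()` call.

```python
import json


def sort_key(revdoc):
    """Order by page, then by time within the page, then by revision id."""
    # ISO 8601 timestamps in a single timezone sort correctly as strings.
    return (revdoc["page"]["id"], revdoc["timestamp"], revdoc["id"])


def sort_revdocs(lines):
    """Parse line-delimited JSON revdocs and return them as sorted JSON lines."""
    docs = (json.loads(line) for line in lines if line.strip())
    return [json.dumps(doc, sort_keys=True) for doc in sorted(docs, key=sort_key)]
```

With all revisions of a page adjacent and in chronological order, a downstream diff step can stream through a page without buffering its whole history, which is the kind of ordering a map-only strategy needs.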

As part of my refactoring work to get the sorting part of the pipeline into a Hadoop streaming job, I picked up the json2tsv library on PyPI and started doing some cleanup. They've made me a maintainer because they got tired of my incessant pull requests. :D

I've got the new [Revdocs (sorted)] for enwiki-20150602. I just kicked off the map-only diff processor on the Altiscale cluster. I lowered the memory allocated from 5012MB to 1024MB to see if we can push parallelism a bit now that Hadoop isn't doing its memory buffering of doom during the reduce.

Checking in. We've completed 145 out of 2000 mappers (many others are 99% done). We've had none fail. All looks pretty good. We're running 172 mappers in parallel. Given that we're not running out of memory with 1024MB, we might be able to push it even lower. This streaming python diff alg. is super efficient.

So, the diffs are finished. I tried to generate [Token stats] on Hadoop. I figured that, since we were switching to a mapper-only strategy and the diffs worked so well with 1GB per mapper, I'd try the same strategy again. I tried runs with 1GB, 4GB and 8GB mappers. Regretfully, each run crashed quickly with out-of-memory errors. What is Hadoop doing with all that memory!?

After the 8GB failure, I transferred the diff data to stat1003 and started up my 16 duct-tape mapper strategy again, and that is chugging along (using ~500MB per mapper). So I should have new results in a few days.
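A minimal sketch of a fixed pool of local mapper processes working through bz2-compressed, line-delimited JSON files. The function names and the per-file work (`count_docs` just counts documents) are placeholders, and the file layout is an assumption. This is not the actual script running on stat1003.

```python
import bz2
import json
from multiprocessing import Pool


def count_docs(path):
    """Stand-in mapper: count the JSON documents in one bz2 file."""
    count = 0
    with bz2.open(path, "rt", encoding="utf-8") as f:
        for line in f:  # stream line by line to keep memory use flat
            if line.strip():
                json.loads(line)  # parse to confirm the line is valid JSON
                count += 1
    return path, count


def run_mappers(paths, processes=16):
    """Run count_docs over all input files with a fixed pool of workers."""
    with Pool(processes=processes) as pool:
        # imap_unordered hands back each result as soon as a worker finishes.
        return dict(pool.imap_unordered(count_docs, paths))
```

`run_mappers` should be called from under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. Because each worker streams its file one line at a time, per-process memory stays roughly independent of file size.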

Hey Aaron,
We should review the approach for your token stats job together.
As it stands, I can't say why it takes so much memory :)

I talked to the Altiscale folk about the issue. They were perplexed and have accepted my Job IDs and a description of the issue. They'll review and get back to us.

Quick status update: I've processed 780 input files out of 2000.

And... I've made a mistake. I accidentally re-processed the old diff data. I need to re-start the process to work with the new diff data :( Kicked that off on stat1003.

Also, I should note that the Altiscale engineers identified rare, periodic memory usage spikes that happen during the [Diffs] --> [Token stats] process. I'll be digging into that to see if there's anything I can do to limit memory usage in a predictable way.

New [Token stats] generation is complete and a sample has been extracted for analysis. Fun story: [Token stats] extraction is much faster with reasonable diffs :)

Question on the previous link "reflection" section: What about bots?

+1. I think bots and other tools may explain much of the increase in the efficiency of content production.

I've started trying to get this data loaded onto the Altiscale Research Cluster so that I can use Hive to query it. I'll be working on ways to flag bots with this pattern.