job_1442877556644_0009 on the Wikimedia altiscale cluster
Description
Related Objects
Event Timeline
Comment Actions
Looks like we were running out of memory in the reducer. The job ran for about 36.5 hours before arriving at this failed state.
Job Name: org.wikimedia.wikihadoop.job.JsonRevisionsSortedPerPage$: MediaWikiRevisionXMLToJSONInputFormat(/user/halfak/stream... ID=1 (1/1)
User Name: halfak
Queue: default
State: FAILED
Uberized: false
Submitted: Tue Sep 29 23:08:31 UTC 2015
Started: Tue Sep 29 23:08:39 UTC 2015
Finished: Thu Oct 01 11:43:25 UTC 2015
Elapsed: 36hrs, 34mins, 45sec
Diagnostics: Task failed task_1442877556644_0009_r_000274. Job failed as tasks failed. failedMaps:0 failedReduces:1
Average Map Time: 29mins, 6sec
Average Reduce Time: 41mins, 50sec
Average Shuffle Time: 17mins, 15sec
Average Merge Time: 4sec
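If reducer memory is the culprit, the usual knobs are the reducer container size and its JVM heap. A minimal sketch of setting them from a job driver, assuming the stock Hadoop 2 property names; the values are illustrative, not something tuned for this job:

import org.apache.hadoop.conf.Configuration;

public class ReduceMemoryConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Standard Hadoop 2 properties; the values here are illustrative guesses.
        conf.setInt("mapreduce.reduce.memory.mb", 4096);     // YARN container size per reducer
        conf.set("mapreduce.reduce.java.opts", "-Xmx3276m"); // reducer JVM heap, kept below the container size
    }
}

If the job's main class goes through ToolRunner, the same properties could also be passed on the command line as -Dmapreduce.reduce.memory.mb=4096 — assuming, that is, that wikihadoop parses generic options, which I haven't checked.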
Comment Actions
Here's the command I ran:
hadoop jar ~/jars/wikihadoop-0.2.jar \
    org.wikimedia.wikihadoop.job.JsonRevisionsSortedPerPage \
    -i /user/halfak/streaming/enwiki-20150901/xml-bz2 \
    -o /user/halfak/streaming/enwiki-20150901/revdocs-bz2 \
    -r 2000
Comment Actions
Looked at the logs: it seems to be an interruption exception.
If so, chances are the issue comes from a timeout.
There is a parameter in the job that can be changed (its flag name contains a typo, see below); it defaults to 1800000 ms (30 minutes) and can be raised to 3600000 ms (1 hour).
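For reference, a minimal sketch of what that change amounts to in Hadoop terms — assuming the job's flag maps onto the standard mapreduce.task.timeout property, which I haven't verified against the wikihadoop source:

import org.apache.hadoop.conf.Configuration;

public class TaskTimeoutConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Assumption: the job's --task-tiemout flag ends up in Hadoop's standard
        // mapreduce.task.timeout property, i.e. how many milliseconds a task
        // attempt may go without reporting progress before it is killed.
        conf.setLong("mapreduce.task.timeout", 3600000L); // 1 hour instead of the job's 30-minute default
    }
}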
Also, the number of reducers could be bumped up a bit (2000 is not that big).
I'd like to see if the following run works:
hadoop jar ~/jars/wikihadoop-0.2.jar \
    org.wikimedia.wikihadoop.job.JsonRevisionsSortedPerPage \
    -i /user/halfak/streaming/enwiki-20150901/xml-bz2 \
    -o /user/halfak/streaming/enwiki-20150901/revdocs-bz2 \
    -r 5000 --task-tiemout 3600000
Let's talk about that today.
Comment Actions
I tested various memory settings; each run failed.
I finally rewrote the job using the core MapReduce API instead of Scrunch.
The job is still running, but no errors so far.
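For the record, the rough shape of a plain-MapReduce version of such a job — a minimal sketch with stub mapper/reducer classes, not the actual wikihadoop rewrite:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hedged sketch of a core-MapReduce driver; NOT the actual wikihadoop
// rewrite. The mapper/reducer bodies are placeholder stubs.
public class JsonRevisionsDriver {

    // Stub mapper: a real one would parse a revision and emit (pageId, revisionJson).
    static class RevisionMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(new Text("pageId"), value); // placeholder key extraction
        }
    }

    // Stub reducer: a real one would write one page's revisions, sorted.
    static class RevisionReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text v : values) {
                context.write(key, v);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "JsonRevisionsSortedPerPage (sketch)");
        job.setJarByClass(JsonRevisionsDriver.class);
        job.setMapperClass(RevisionMapper.class);
        job.setReducerClass(RevisionReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(5000); // mirrors the -r flag from the command above
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. the xml-bz2 input dir
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. the revdocs-bz2 output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The appeal of dropping down to the core API here is mostly control: every memory- and timeout-relevant property can be set explicitly on the Job, instead of going through whatever defaults the Scrunch layer applies.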