Currently, all webrequest oozie jobs are based on the 'webrequest_source' partition: there is one coordinator for each value of this partition. This makes managing the jobs painful, as you often have to check and manage each one manually. It also makes adding or removing partition values difficult, since every change carries a large management overhead.
Also, the 'text' partition is becoming the catch-all for webrequest data as Ops slowly merges more and more cache clusters into the 'text' one. This makes the webrequest_source partition less and less useful, as most Hadoop queries now need to pass through almost all requests in the text partition.
I'd like to be able to produce webrequest data to dynamic topics, based on something in the request log. I'm not sure yet what a good choice for this is; perhaps something like project (en_wikipedia), or something else.
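To make the idea concrete, here is a rough sketch of how a topic name could be derived from a request log record if project were chosen as the key. Everything in it is an assumption for illustration: the `uri_host` field name, the `webrequest_` prefix, the rule for turning a hostname into a project, and the Kafka-style restriction on topic name characters. It is not a proposal for the actual implementation.

```python
import re


def project_from_host(uri_host: str) -> str:
    """Map a hostname such as 'en.wikipedia.org' to 'en_wikipedia'.

    Hypothetical rule: drop the top-level domain and any 'm' or 'www'
    label, then join the remaining labels with underscores.
    """
    parts = uri_host.lower().split(".")
    labels = [p for p in parts[:-1] if p and p not in ("m", "www")]
    return "_".join(labels) if labels else "unknown"


def topic_for_request(record: dict, prefix: str = "webrequest") -> str:
    """Build a topic name for one request record (a plain dict here)."""
    project = project_from_host(record.get("uri_host", ""))
    # Assumes Kafka-style topic names, which allow only [a-zA-Z0-9._-].
    safe = re.sub(r"[^a-zA-Z0-9._-]", "-", project)
    return f"{prefix}_{safe}"


if __name__ == "__main__":
    print(topic_for_request({"uri_host": "en.m.wikipedia.org"}))
```

With a rule like this, desktop and mobile requests for the same project would land in the same topic, and records without a usable hostname would fall into a single 'unknown' topic rather than being dropped.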
This ticket is to track the reorganization of webrequest_source and oozie jobs.