
Refactor webrequest_source partitions and oozie jobs
Closed, Declined, Public

Description

Currently, all webrequest oozie jobs are based on the 'webrequest_source' partition: there is a coordinator for each value in this partition. This makes managing these jobs painful, as you often have to manually check and manage each one. It also makes adding or removing partition values difficult, as every change carries a large management overhead.

Also, the 'text' partition is becoming the catch-all for webrequest data as Ops slowly merges more and more cache clusters into the 'text' one. This makes the webrequest_source partition less and less useful, as most Hadoop queries now need to pass through almost all requests in the text partition.

I'd like to be able to produce webrequest data to dynamic topics, based on something in the request log. I'm not sure what a good choice for this is, but perhaps something like project (en_wikipedia), or something else.
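To make the idea concrete, here is a rough sketch of how a topic name could be derived from a request record. Everything in it is an assumption for illustration, not an existing implementation: the `uri_host` field name, the `webrequest_` topic prefix, the `webrequest_other` fallback, and the helper names `topic_for_request` and `group_by_topic` are all hypothetical.

```python
from collections import defaultdict


def topic_for_request(record, prefix="webrequest"):
    """Derive a hypothetical topic name such as 'webrequest_en_wikipedia'.

    Assumes the record is a dict with a 'uri_host' key (an assumption).
    """
    host = (record.get("uri_host") or "").lower()
    labels = [label for label in host.split(".") if label]
    if len(labels) < 2:
        # No usable hostname: fall back to a catch-all topic.
        return f"{prefix}_other"
    # Drop the top-level domain and the mobile 'm' label, then join the rest.
    project = "_".join(label for label in labels[:-1] if label != "m")
    if not project:
        return f"{prefix}_other"
    return f"{prefix}_{project}"


def group_by_topic(records):
    """Group records by their derived topic name."""
    grouped = defaultdict(list)
    for record in records:
        grouped[topic_for_request(record)].append(record)
    return dict(grouped)
```

With this scheme, desktop and mobile requests for the same project (en.wikipedia.org and en.m.wikipedia.org) would land in the same topic, and a single job could in principle iterate over whatever topics exist instead of needing one coordinator per fixed partition value.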

This ticket is to track the reorganization of webrequest_source and oozie jobs.

Event Timeline

Ottomata raised the priority of this task from to Low.
Ottomata updated the task description.
Ottomata added subscribers: Ottomata, JAllemandou.
Nuria raised the priority of this task from Low to Medium. Apr 10 2017, 4:14 PM
Nuria removed a subscriber: Yurik.
mforns lowered the priority of this task from Medium to Low. Jul 31 2017, 3:57 PM
mforns moved this task from Dashiki to Deprioritized on the Analytics board.

Never worked on. We are making a small change to how webrequest ingestion works as part of T271232, although it probably won't do what this ticket asks for. Moving webrequest ingestion to the Refine pipeline would.