Use spark to split webrequest on tags
Closed, Declined | Public | 8 Estimated Story Points

Description

This task also covers the work done to find the right solution (Spark was one option).
The option chosen is to use Hive dynamic partitions, and to filter, group, and rename tags based on a small table.
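
As a rough illustration of the filter/group/rename step described above, here is a minimal plain-Python sketch. It is not the code from the task: the real work targets Hive or Spark, and the tag names, subset names, row fields, and function names below are all hypothetical. It assumes each webrequest row carries a list of tags and that the small table maps a tag to the name of the subset it should be written to.

```python
from collections import defaultdict

# Hypothetical stand-in for the small tag table: maps a webrequest tag
# to the name of the subset (partition) it should land in.
# Tags that are missing from the table are filtered out.
TAG_TO_SUBSET = {
    "pageview": "pageview",
    "wikidata-query": "wdqs",
    "wikidata": "wdqs",
}


def subsets_for(tags, tag_to_subset=TAG_TO_SUBSET):
    """Return the sorted, de-duplicated subset names for one row's tags."""
    return sorted({tag_to_subset[t] for t in tags if t in tag_to_subset})


def split_by_subset(rows, tag_to_subset=TAG_TO_SUBSET):
    """Group rows by subset name, mimicking one dynamic partition per subset."""
    partitions = defaultdict(list)
    for row in rows:
        # A row with several tags can land in several subsets, but at most
        # once per subset because subsets_for de-duplicates the names.
        for subset in subsets_for(row.get("tags") or [], tag_to_subset):
            partitions[subset].append(row)
    return dict(partitions)
```

In this sketch, a row whose tags all map to the same subset name is written once to that subset, and a row with no tag in the table is dropped. In a Hive or Spark job the subset name would instead become the value of a dynamic partition column.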

Event Timeline

Nuria updated the task description. May 4 2017, 4:29 PM
Nuria moved this task from Incoming to Operational Excellence Future on the Analytics board.
Milimetric triaged this task as High priority. May 16 2017, 12:49 PM
Milimetric triaged this task as High priority.
Milimetric raised the priority of this task from High to Needs Triage.
Milimetric triaged this task as High priority.
JAllemandou renamed this task from "Spike, test idea on spark job that reads tags and produces different outputs" to "Use hive dynamic partitioning to split webrequest on tags". Jun 8 2017, 1:54 PM
JAllemandou claimed this task.
JAllemandou updated the task description.
JAllemandou set the point value for this task to 8.
JAllemandou edited projects, added Analytics-Kanban; removed Analytics.
JAllemandou moved this task from Next Up to In Progress on the Analytics-Kanban board.

Change 357814 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery@master] [WIP] Split webrequest into smaller datasets

https://gerrit.wikimedia.org/r/357814

Nuria moved this task from In Progress to Paused on the Analytics-Kanban board.

Is there a place where tags used for splitting are recorded (beyond the actual webrequest_split_tag table)?

Nuria added a comment. (Edited) Jul 26 2017, 8:09 PM

Is there a place where tags used for splitting are recorded (beyond the actual webrequest_split_tag table)?

No, the splitting process is not yet in place; once it is, that table will be the one we use. Before splitting can happen, this change needs to be in effect: https://phabricator.wikimedia.org/T171760

JAllemandou renamed this task from "Use hive dynamic partitioning to split webrequest on tags" to "Use spark to split webrequest on tags". Apr 16 2018, 7:04 AM
mforns moved this task from In Progress to Paused on the Analytics-Kanban board. May 7 2018, 3:45 PM
elukey added a subscriber: elukey. Aug 1 2018, 4:22 PM

Change 465202 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery/source@master] Update DataFrameToHive for dynamic partitions

https://gerrit.wikimedia.org/r/465202

Change 465206 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery/source@master] Add webrequest_subset_tags transform function

https://gerrit.wikimedia.org/r/465206

Change 468322 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery/source@master] Add WebrequestSubsetPartitioner spark job

https://gerrit.wikimedia.org/r/468322

Change 465202 merged by jenkins-bot:
[analytics/refinery/source@master] Update DataFrameToHive for dynamic partitions

https://gerrit.wikimedia.org/r/465202

Change 465206 merged by jenkins-bot:
[analytics/refinery/source@master] Add webrequest_subset_tags transform function

https://gerrit.wikimedia.org/r/465206

Change 468322 merged by jenkins-bot:
[analytics/refinery/source@master] Add WebrequestSubsetPartitioner spark job

https://gerrit.wikimedia.org/r/468322

Change 357814 merged by Joal:
[analytics/refinery@master] Add oozie job partitioning webrequest subset

https://gerrit.wikimedia.org/r/357814

Change 471693 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery/source@master] Update DataFrameToHive dynamic partition mode

https://gerrit.wikimedia.org/r/471693

Change 471722 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery@master] Update wikitext oozie job

https://gerrit.wikimedia.org/r/471722

JAllemandou removed JAllemandou as the assignee of this task. Oct 15 2019, 9:21 PM
JAllemandou removed a project: Analytics-Kanban.
JAllemandou added a subscriber: JAllemandou.
Nuria closed this task as Declined. Apr 8 2020, 6:48 PM