
Setup pipeline for search logs to travel through kafka and camus into hadoop {hawk} [55 pts]
Closed, Resolved · Public

Description

MediaWiki will soon be writing to a new Kafka topic, mediawiki_CirrusSearchRequestSet. Messages on this topic will be encoded with Apache Avro and need to flow through Camus into Hadoop.

  • Create the new topic in Kafka (trivial)
  • Configure Camus to import Avro data (should be easy, but has not been done before)
    • Camus needs to know which field is the timestamp
  • Figure out how Camus will get the schema

The schema will only be registered in the MediaWiki repo.
The Search team will take care of creating the Hive tables.
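
As a rough illustration of the timestamp requirement above, the sketch below checks that an Avro record schema declares a numeric timestamp field before it is wired into an import. The schema contents and the field names (`ts`, `wikiId`) are hypothetical; the real schema lives in the MediaWiki repo and is not reproduced in this task.

```python
import json

# Hypothetical schema: the real CirrusSearchRequestSet schema lives in the
# MediaWiki repo; the field names below are illustrative assumptions.
EXAMPLE_SCHEMA_JSON = """
{
  "type": "record",
  "name": "CirrusSearchRequestSet",
  "fields": [
    {"name": "ts", "type": "long"},
    {"name": "wikiId", "type": "string"}
  ]
}
"""


def find_timestamp_field(schema_json, candidate="ts"):
    """Return the named field if it is declared as a long or int, else None."""
    schema = json.loads(schema_json)
    for field in schema.get("fields", []):
        if field["name"] == candidate and field["type"] in ("long", "int"):
            return field
    return None


timestamp_field = find_timestamp_field(EXAMPLE_SCHEMA_JSON)
```

A check like this could run wherever the schema is loaded, so that a schema without a usable timestamp is rejected before any data is imported.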

Event Timeline

EBernhardson raised the priority of this task from to Needs Triage.
EBernhardson updated the task description.
EBernhardson added a subscriber: EBernhardson.
Restricted Application added a subscriber: Aklapper. · Sep 23 2015, 8:11 PM
EBernhardson added a project: Analytics.
EBernhardson set Security to None.
kevinator triaged this task as High priority. · Sep 24 2015, 5:17 PM
kevinator moved this task from Incoming to Prioritized on the Analytics-Backlog board.
kevinator renamed this task from Setup pipeline for search logs to travel through kafka and camus into hadoop to Setup pipeline for search logs to travel through kafka and camus into hadoop {hawk} [21 pts]. · Sep 28 2015, 5:04 PM
kevinator updated the task description.
kevinator moved this task from Prioritized to Tasked on the Analytics-Backlog board.
madhuvishy reassigned this task from Ottomata to Nuria. · Sep 29 2015, 8:08 PM
madhuvishy moved this task from Next Up to In Progress on the Analytics-Kanban board.

Change 243845 had a related patch set uploaded (by Madhuvishy):
[WIP] Add support to camus to consume schemaID-less avro

https://gerrit.wikimedia.org/r/243845

Change 243990 had a related patch set uploaded (by Madhuvishy):
[WIP] Add refinery-camus module

https://gerrit.wikimedia.org/r/243990

Change 243845 abandoned by Madhuvishy:
[WIP] Add support to camus to consume schemaID-less avro

Reason:
This patch is superseded by https://gerrit.wikimedia.org/r/#/c/243990/

https://gerrit.wikimedia.org/r/243845

Change 244594 had a related patch set uploaded (by Madhuvishy):
[WIP] Add properties file for importing mediawiki data

https://gerrit.wikimedia.org/r/244594

Change 244601 had a related patch set uploaded (by Madhuvishy):
[WIP] analytics:Add cron that schedules camus imports for mediawiki data

https://gerrit.wikimedia.org/r/244601

Change 244594 abandoned by Ottomata:
[WIP] Add properties file for importing mediawiki data

Reason:
I've moved camus properties files to puppet, and amended https://gerrit.wikimedia.org/r/#/c/244601/2 with this mediawiki properties file.

https://gerrit.wikimedia.org/r/244594

Change 243990 merged by Joal:
Add refinery-camus module

https://gerrit.wikimedia.org/r/243990

Hey @EBernhardson, your schema needs a timestamp field so that Camus can import the data into the proper partitions.

It should be a Unix timestamp in seconds (UTC).
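
To illustrate why the timestamp matters, here is a minimal sketch of how a Unix-seconds value could be mapped to an hourly partition path. The `topic/hourly/YYYY/MM/DD/HH` layout and the helper name `hourly_partition` are assumptions made for the example, not taken from the Camus configuration used in this task.

```python
from datetime import datetime, timezone


def hourly_partition(topic, unix_seconds):
    """Map a Unix-seconds timestamp to a UTC hourly partition path.

    The path layout is an assumed example, not the actual import layout.
    """
    dt = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)
    return f"{topic}/hourly/{dt:%Y/%m/%d/%H}"


# 1443039060 is 2015-09-23 20:11:00 UTC
path = hourly_partition("mediawiki_CirrusSearchRequestSet", 1443039060)
```

A value in milliseconds, or one computed from local time, would land records in the wrong partition, which is why the unit and time zone need to be fixed in the schema.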

Change 244601 merged by Ottomata:
Add cron that schedules camus imports for mediawiki Avro Binary data

https://gerrit.wikimedia.org/r/244601

ggellerman renamed this task from Setup pipeline for search logs to travel through kafka and camus into hadoop {hawk} [21 pts] to Setup pipeline for search logs to travel through kafka and camus into hadoop {hawk} [55 pts]. · Oct 13 2015, 4:09 PM
bd808 added a subscriber: bd808. · Oct 14 2015, 5:58 PM
kevinator closed this task as Resolved. · Oct 15 2015, 4:02 PM
kevinator added a subscriber: kevinator.