
[EPIC] Learn about our databases and how to use them
Closed, Resolved, Public

Description

We have three databases/tables of interest:

  • Event logging (what you will use for Task 1)
    • Once you have SSH'd into stat1002 or stat1003, run mysql -h analytics-store.eqiad.wmnet to open the MySQL command-line interface
    • In R (on stat1002), install our internal "wmf" package via devtools::install_git('https://gerrit.wikimedia.org/r/wikimedia/discovery/wmf')
    • Then use wmf::build_query() to execute queries and get that data into R
  • Webrequests (accessed via Hive)
  • Cirrus searches (also accessed via Hive)

You need to know what they are, how they are structured, how to operate within them (e.g. aggregations and UDFs inside of Hive), and how to get data out of them for analysis with R, Python, or whatever you prefer. Roughly 99% of your job will require you to use this data, so thorough knowledge is very important. For more information on these databases, see: https://meta.wikimedia.org/wiki/Discovery/Analytics#Databases_and_Datasets
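
For the "get data out for analysis" part, here is a minimal Python sketch of one possible route for the event logging data: call the mysql client non-interactively and parse its tab-separated output. Only the host name comes from this task. The function names, the example_db database, and the example_table table are hypothetical placeholders, and this is not how the wmf R package works internally.

```python
import csv
import io
import subprocess

# Host taken from the task description; everything else is illustrative.
ANALYTICS_HOST = "analytics-store.eqiad.wmnet"


def build_mysql_command(query, database=None, host=ANALYTICS_HOST):
    """Return the argument list for a non-interactive mysql client call.

    --batch makes the client print tab-separated rows with a header line,
    and -e executes the given statement and exits.
    """
    command = ["mysql", "-h", host, "--batch", "-e", query]
    if database is not None:
        # The database name is passed as the final positional argument.
        command.append(database)
    return command


def parse_batch_output(text):
    """Turn mysql --batch output into a list of dicts keyed by column name.

    All values come back as strings; convert types as needed afterwards.
    """
    reader = csv.DictReader(
        io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return list(reader)


def run_query(query, database=None):
    """Run the query through the mysql client and return the parsed rows.

    This only works from a host that can reach the database
    (e.g. stat1002 or stat1003, per the description above).
    """
    result = subprocess.run(
        build_mysql_command(query, database),
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_batch_output(result.stdout)
```

On a stat machine, a call might look like run_query("SELECT COUNT(*) AS n FROM example_table", database="example_db"), with the placeholder names replaced by a real database and table. The returned rows can then be handed to whatever analysis tooling you prefer.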

This task will be split into three sub-tasks, one for each of those databases/tables:

  • T143137 is for learning event logging data.
  • T143762 is for learning web requests data and working with Hive/Hadoop.
  • T147216 is for learning cirrus search requests data.

Event Timeline

mpopov renamed this task from Learn about our databases and how to use them to [EPIC] Learn about our databases and how to use them. Aug 16 2016, 6:16 PM
mpopov updated the task description.
mpopov changed the point value for this task from 6 to 18.
mpopov removed the point value for this task.
debt triaged this task as Medium priority. Aug 16 2016, 8:15 PM