We have three databases/tables of interest:
- Event logging (what you will use for Task 1)
    - Once you have SSH'd into stat1002 or stat1003, run `mysql -h analytics-store.eqiad.wmnet` to open the MySQL command-line interface
    - In R (on stat1002), install our internal "wmf" package and use `wmf::build_query()` to execute queries and pull the results into R
        - Install wmf via `devtools::install_git('https://gerrit.wikimedia.org/r/wikimedia/discovery/wmf')`
- Webrequests (accessed via Hive)
- Cirrus searches (also accessed via Hive)
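The access paths above can be sketched as a shell session. Only the host name (analytics-store.eqiad.wmnet) and the install URL come from the list above; the Hive query is illustrative, and the commands are echoed rather than executed so the sketch runs anywhere — on stat1002/stat1003 you would run them directly:

```shell
# Event logging: connect to the MySQL replica (host from the list above)
MYSQL_CMD="mysql -h analytics-store.eqiad.wmnet"
echo "$MYSQL_CMD"

# Webrequests / Cirrus searches: Hive CLI (the query here is a placeholder)
HIVE_CMD='hive -e "SHOW TABLES IN wmf;"'
echo "$HIVE_CMD"

# R on stat1002: one-time install of the wmf package (URL from the list above)
WMF_INSTALL="Rscript -e \"devtools::install_git('https://gerrit.wikimedia.org/r/wikimedia/discovery/wmf')\""
echo "$WMF_INSTALL"
```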
You need to know what these databases are, how they are structured, how to operate within them (e.g. aggregations and UDFs inside of Hive), and how to get data out of them for analysis in R, Python, or whatever you prefer. Nearly everything in your job description will require working with this data, so thorough knowledge is very important. For more information on these databases, see: https://meta.wikimedia.org/wiki/Discovery/Analytics#Databases_and_Datasets
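A common pattern for getting data out for analysis is to dump a query result to a TSV file and read it into R/Python locally. A minimal sketch — the `hive -e` line is what you would run on stat1002, but here the query output is faked so the example runs anywhere, and the column names are illustrative:

```shell
# On stat1002 you would export real data, e.g.:
#   hive -e "SELECT ... FROM wmf.webrequest WHERE ..." > webrequests.tsv
# Here we fabricate a tiny TSV in the same shape so the rest runs anywhere:
printf 'date\thits\n2016-01-01\t120\n2016-01-02\t95\n' > webrequests.tsv

# Quick sanity check on the dump before pulling it into R/Python:
awk -F'\t' 'NR > 1 { total += $2 } END { print "total hits:", total }' webrequests.tsv
# prints: total hits: 215
```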
This task will be split into three sub-tasks, one for learning each of those databases/tables: