We have three databases/tables of interest:
- [[ https://wikitech.wikimedia.org/wiki/Analytics/EventLogging | Event logging ]] (what you will use for Task 1)
  - Once ssh'd into stat1002 or stat1003, run `mysql -h analytics-store.eqiad.wmnet` to open the MySQL command-line interface
  - In R (on stat1002), install our internal "[[ https://github.com/wikimedia/wikimedia-discovery-wmf/ | wmf ]]" package and use `wmf::build_query()` to execute queries and get the data into R
    - Install **wmf** via `devtools::install_git('https://gerrit.wikimedia.org/r/wikimedia/discovery/wmf')`
- [[ https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest | Webrequests ]] (accessed via [[ https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hive | Hive ]])
- [[ https://wikitech.wikimedia.org/wiki/Analytics/Data/Cirrus | Cirrus searches ]] (also accessed via Hive)
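As a concrete starting point for the event logging data, once connected with the `mysql` command above, EventLogging tables live in the `log` database and follow a `SchemaName_RevisionId` naming pattern; the table name below is a hypothetical placeholder, not a real schema:

```sql
-- Hedged sketch: exploring EventLogging tables on analytics-store.
USE log;
SHOW TABLES;
-- Tables carry standard capsule columns (uuid, timestamp, wiki, etc.)
-- alongside event_* columns specific to the schema. Placeholder name:
DESCRIBE SomeSchema_12345678;
```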
**//You need to know what these are, how they are structured, how to operate within them (e.g. aggregations and UDFs inside Hive), and how to get data out of them for analysis with R/Python/whatever, because nearly everything in your job description will require you to use this data.//** So thorough knowledge is very important. For more info on these databases, see: https://meta.wikimedia.org/wiki/Discovery/Analytics#Databases_and_Datasets
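For Hive in particular, the critical habit is restricting every query to specific partitions, since `wmf.webrequest` is partitioned by `year`/`month`/`day`/`hour` and an unrestricted query scans the entire dataset. A minimal sketch (the date values and the `'text'` source filter are illustrative assumptions):

```sql
-- Hedged sketch: counting web requests by host for a single hour.
-- Always filter on the partition columns, or Hive will scan everything.
USE wmf;
SELECT uri_host, COUNT(*) AS requests
FROM webrequest
WHERE year = 2016 AND month = 8 AND day = 1 AND hour = 0
  AND webrequest_source = 'text'  -- assumption: 'text' is the source of interest
GROUP BY uri_host
LIMIT 10;
```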
This task will be split into three sub-tasks, one for learning each of those databases/tables:
- T143137 is for learning event logging data.
- T143762 is for learning web requests data and working with Hive/Hadoop.
- T_____ will be for learning Cirrus search request data.