# Discovery Datasets
## Task 1 (6 points)
In T134301, we looked at the distribution of session lengths of Wikipedia.org Portal visitors in May 2016, using the event logging data collected through [[ https://meta.wikimedia.org/wiki/Schema:WikipediaPortal | this schema ]]. The codebase for that analysis is [[ https://github.com/wikimedia-research/Discovery-Research-Portal/tree/master/Analyses/Session%20Length | on GitHub ]]. The first draft of [[ https://github.com/wikimedia-research/Discovery-Research-Portal/blob/master/Analyses/Session%20Length/report.pdf | the report ]] was never properly finished and published to [[ https://commons.wikimedia.org/wiki/Main_Page | Commons ]] the way [[ https://commons.wikimedia.org/wiki/File:Report_on_Cirrus_Search_TextCat_AB_Test_-_Language_Detection_on_English,_French,_Spanish,_Italian,_and_German_Wikipedias.pdf | some ]] [[ https://commons.wikimedia.org/wiki/File:Wikipedia_Portal_Test_of_Language_Detection_and_Primary_Link_Resorting.pdf | other ]] [[ https://commons.wikimedia.org/wiki/File:From_Zero_to_Hero_-_Anticipating_Zero_Results_From_Query_Features,_Ignoring_Content.pdf | reports ]] were.
Your task is to reproduce the analysis for June and July data (event logging data is only retained for 90 days, so that window is still available). If you can surface additional insights that were not in the original report, great! You could even check whether the language detection deployment (T133432) on June 2nd had an effect on session lengths.
Feel free to make a folder called "Session Lengths v2" in https://github.com/wikimedia-research/Discovery-Research-Portal/tree/master/Analyses to store your code and report.
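To get you started, here is a rough sketch in R of pulling the June–July events and computing per-session lengths, where a session's length is taken to be the time between its first and last recorded event (one reasonable definition; compare with what the original codebase does). Treat it strictly as a sketch: the table revision (`WikipediaPortal_15890769`), the `build_query()` argument names, and the field names are assumptions to verify against the schema page and the original analysis.

```r
# Sketch only -- the table revision, build_query() arguments, and field names
# below are assumptions, not confirmed specifics; verify all three first.
portal_events <- wmf::build_query(
  fields = "timestamp, event_session_id AS session_id",
  table = "log.WikipediaPortal_15890769", # <SchemaName>_<revision>; check Schema:WikipediaPortal
  conditionals = "timestamp >= '20160601000000' AND timestamp < '20160801000000'"
)

# MediaWiki timestamps are 14-digit strings (YYYYMMDDHHMMSS); parse them:
portal_events$ts <- as.POSIXct(portal_events$timestamp,
                               format = "%Y%m%d%H%M%S", tz = "UTC")

# Session length = time between a session's first and last event:
session_lengths <- aggregate(
  ts ~ session_id, data = portal_events,
  FUN = function(x) as.numeric(difftime(max(x), min(x), units = "secs"))
)
names(session_lengths)[2] <- "length_secs"
summary(session_lengths$length_secs)
```

From there you can plot the distribution and, for the T133432 question, split sessions by whether they started before or after June 2nd.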
## Task 2 (6 points)
This task will be for learning the web requests data.
## Task 3 (6 points)
This task will be for learning the cirrus search requests data.
## Additional Information
We have three databases/tables of interest:
- [[ https://wikitech.wikimedia.org/wiki/Analytics/EventLogging | Event logging ]] (what you will use for Task 1)
  - Once SSH'd into stat1002 or stat1003, run `mysql -h analytics-store.eqiad.wmnet` to open the MySQL command line interface
  - In R (on stat1002), install our internal "[[ https://github.com/wikimedia/wikimedia-discovery-wmf/ | wmf ]]" package and use `wmf::build_query()` to execute queries and get that data into R (see the sketch after this list)
    - Install **wmf** via `devtools::install_git('https://gerrit.wikimedia.org/r/wikimedia/discovery/wmf')`
- [[ https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest | Webrequests ]] (accessed via [[ https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hive | Hive ]])
- [[ https://wikitech.wikimedia.org/wiki/Analytics/Data/Cirrus | Cirrus searches ]] (also accessed via Hive)
For more info see https://meta.wikimedia.org/wiki/Discovery/Analytics#Databases_and_Datasets
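For orientation, a minimal sketch of touching each source from R once **wmf** is installed. The function names come from the package, but the exact signatures, the event logging table revision, and the Hive table names are assumptions worth double-checking against the docs linked above.

```r
# Event logging: tables live in MySQL's `log` database, named <SchemaName>_<revision>.
# (Assumed revision below; build_query() arguments are also assumed -- see ?wmf::build_query.)
el_count <- wmf::build_query(
  fields = "COUNT(*) AS events",
  table = "log.WikipediaPortal_15890769",
  conditionals = "LEFT(timestamp, 8) = '20160715'"
)

# Webrequests: the Hive table wmf.webrequest is partitioned by
# webrequest_source/year/month/day/hour, so always filter on the partition columns:
wr_count <- wmf::query_hive("
  SELECT COUNT(*) AS requests
  FROM wmf.webrequest
  WHERE webrequest_source = 'text'
    AND year = 2016 AND month = 7 AND day = 15 AND hour = 12;
")

# Cirrus searches: same idea via Hive (wmf_raw.CirrusSearchRequestSet, if memory
# serves -- confirm the table name on the wikitech page linked above).
```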
This task will be split up into 3 sub-tasks, each for learning one of those databases/tables:
- T143137 is for learning event logging data.
- T_____ will be for learning web requests data.
- T_____ will be for learning cirrus search requests data.