# Discovery Datasets
## Task 1
### Background
In T134301, we looked at the distribution of session lengths of Wikipedia.org Portal visitors in May 2016, using the event logging data collected through [[ https://meta.wikimedia.org/wiki/Schema:WikipediaPortal | this schema ]]. The codebase for this analysis is [[ https://github.com/wikimedia-research/Discovery-Research-Portal/tree/master/Analyses/Session%20Length | on GitHub ]]. The first draft of [[ https://github.com/wikimedia-research/Discovery-Research-Portal/blob/master/Analyses/Session%20Length/report.pdf | the report ]] was never properly finished and published to [[ https://commons.wikimedia.org/wiki/Main_Page | Commons ]] the way [[ https://commons.wikimedia.org/wiki/File:Report_on_Cirrus_Search_TextCat_AB_Test_-_Language_Detection_on_English,_French,_Spanish,_Italian,_and_German_Wikipedias.pdf | some ]] [[ https://commons.wikimedia.org/wiki/File:Wikipedia_Portal_Test_of_Language_Detection_and_Primary_Link_Resorting.pdf | other ]] [[ https://commons.wikimedia.org/wiki/File:From_Zero_to_Hero_-_Anticipating_Zero_Results_From_Query_Features,_Ignoring_Content.pdf | reports ]] have been.
### Objective
Your task is to reproduce the analysis for June and July data (event logging data is only retained for 90 days, so those months are still available). If you can surface additional insights that were not in the original report, great! You could even check whether the language detection deployment (T133432) on June 2nd had an effect on session lengths.
Feel free to make a folder called "Session Lengths v2" in https://github.com/wikimedia-research/Discovery-Research-Portal/tree/master/Analyses to store your code and report.
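If it helps to see the core computation before getting into the data access details below, here is a minimal sketch in R of deriving per-session lengths and splitting them around the June 2nd deployment. The `session_id`/`ts` column names and the session definition (time between a session's first and last recorded event) are illustrative assumptions; the original report's exact session definition may differ, so check its codebase.

```r
library(data.table)

# Toy stand-in for the event logging data; in practice this comes from the
# WikipediaPortal_* table (column names here are illustrative, not the
# schema's exact fields).
events <- data.table(
  session_id = c("a", "a", "a", "b", "b"),
  ts = as.POSIXct(c("2016-06-01 10:00:00", "2016-06-01 10:00:30",
                    "2016-06-01 10:02:00", "2016-06-03 09:00:00",
                    "2016-06-03 09:01:15"), tz = "UTC")
)

# Session length = time between a session's first and last recorded event:
session_lengths <- events[, .(
  length_secs = as.numeric(difftime(max(ts), min(ts), units = "secs")),
  start_date  = as.Date(min(ts))
), by = session_id]

# Split sessions by the 2016-06-02 language detection deployment (T133432):
session_lengths[, period := ifelse(start_date < as.Date("2016-06-02"),
                                   "pre-deployment", "post-deployment")]
session_lengths[, .(median_secs = median(length_secs), sessions = .N), by = period]
```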
## Additional Information
We have three data sources of interest:
- [[ https://wikitech.wikimedia.org/wiki/Analytics/EventLogging | Event logging ]] (what you will use for Task 1)
  - Once you have SSHed into stat1002 or stat1003, run `mysql -h analytics-store.eqiad.wmnet` to open the MySQL command line interface
  - In R (on stat1002), install our internal "[[ https://github.com/wikimedia/wikimedia-discovery-wmf/ | wmf ]]" package and use `wmf::build_query()` to run queries and pull the results into R (see the sketch after this list)
    - Install **wmf** via `devtools::install_git('https://gerrit.wikimedia.org/r/wikimedia/discovery/wmf')`
- [[ https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest | Webrequests ]] (accessed via [[ https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hive | Hive ]]; see the Hive sketch at the end of this section)
- [[ https://wikitech.wikimedia.org/wiki/Analytics/Data/Cirrus | Cirrus searches ]] (also accessed via Hive)
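To make the event logging route concrete, here is a minimal sketch of pulling the Portal events into R. Treat the specifics as assumptions: the `WikipediaPortal_14377354` table name is hypothetical (EventLogging tables are named `<schema>_<revision>`, so list the real one first), the `event_*` column names follow the usual EventLogging convention but should be checked against the schema page, and the `(query, database)` signature used for `wmf::mysql_read()` should be confirmed with `?wmf::mysql_read` (or use `wmf::build_query()` as mentioned above).

```r
library(wmf)  # installed via the devtools::install_git() command above

# EventLogging tables live in the `log` database and are named
# <schema>_<revision>; the revision suffix below is hypothetical, so find
# the real table first with: SHOW TABLES IN log LIKE 'WikipediaPortal%';
# EL timestamps are YYYYMMDDHHMMSS strings, so range comparisons work:
query <- "
  SELECT timestamp, event_session_id, event_event_type
  FROM WikipediaPortal_14377354
  WHERE timestamp >= '20160601000000'
    AND timestamp <  '20160801000000'
"

# mysql_read()'s (query, database) signature is an assumption here; confirm
# with ?wmf::mysql_read before relying on it.
events <- mysql_read(query, "log")
```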
For more info see https://meta.wikimedia.org/wiki/Discovery/Analytics#Databases_and_Datasets
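For the Hive-backed sources, here is a similar sketch counting daily requests to the Portal from the webrequest table. The `year`/`month`/`day` partition columns and fields like `uri_host` are documented on the Webrequests page linked above; `wmf::query_hive()` is assumed to take the query as a single string (confirm with `?wmf::query_hive`, or run the same HiveQL with `hive -e` on stat1002), and `webrequest_source = 'text'` is an assumption about which cluster served the Portal at the time.

```r
library(wmf)

# Daily request counts to the wikipedia.org Portal for June 2016. Always
# filter on the year/month/day partition columns so Hive does not scan the
# entire webrequest table.
portal_requests <- query_hive("
  SELECT year, month, day, COUNT(1) AS requests
  FROM wmf.webrequest
  WHERE year = 2016 AND month = 6
    AND webrequest_source = 'text'
    AND uri_host = 'www.wikipedia.org'
    AND http_status = '200'
  GROUP BY year, month, day
")
```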