**Name**: Zareen Farooqui
**Email**: zareenf@gmail.com
**IRC or IM networks/handle(s)**: zareen
**Web Page**: http://www.zareenfarooqui.com/
**Blog**: https://medium.com/zareen-farooqui
**Github**: https://github.com/zareenfarooqui
**Location**: Somerville, MA, USA
**Typical working hours**: M-F, 8am-5pm Eastern Time
---
= Synopsis =
From my discussions with the Reading Department, the best way for me to contribute would be to work on 5 data analysis subprojects throughout the internship. This will allow me to gain a better understanding of various challenges and projects within the Wikimedia Foundation. The subprojects I would like to work on are below:
* Subproject 0: Report most frequent section titles on 5 largest Wikipedias - https://phabricator.wikimedia.org/T148260
* Subproject 1: Readership Metric Report - https://phabricator.wikimedia.org/T148261
* Subproject 2: Research Report on Wikipedia reader behavior (research question TBD, so no Phabricator task created yet)
* Subproject 3: Vet and Explore Readership Engagement Metric - https://phabricator.wikimedia.org/T148262
* Subproject 4: Vet and Explore Readership Retention Metric - https://phabricator.wikimedia.org/T148263
Subproject 0 will be an extension of my microtasks for the application process. Subprojects 1 and 2 will provide valuable information to the Reading Department about reader behavior. Subprojects 3 and 4 will include contributing to the development and ongoing maintenance of the Reading Department's engagement and retention metrics.
**Possible Mentors**: Tilman Bayer (HaeB) and Jon Katz (JonKatz)
= Proposed Schedule and Deliverables =
The timeline below is a detailed outline of the subprojects I will complete during my Outreachy Internship with the Wikimedia Foundation. The timeline is approximate and will have to be adjusted throughout the internship as I learn more.
**October 17 - December 6 - Community Bonding [subproject 0]**
* Extend the microtask I’ve completed into a report on the most frequent article section titles in the English, Spanish, German, Italian, and French Wikipedias.
* Publish these results as a [[https://meta.wikimedia.org/wiki/Research:Index|Research]] project page on Meta:
* https://meta.wikimedia.org/wiki/Research:Investigate_frequency_of_section_titles_in_5_large_Wikipedias
* Announce each result on the [[https://en.wikipedia.org/wiki/Wikipedia:Village_pump|village pump]] of the corresponding Wikipedia
* Get familiar with the existing core metrics used by the Reading Team
* Get familiar with the existing data sources used by the Reading Team
* Get basic familiarity with MySQL as installed on the Foundation’s stats servers
* Get basic familiarity with Hive as installed on the Foundation’s stats servers
* Read the [[http://www.headfirstlabs.com/books/hfsql/|Head First SQL]] book to brush up on SQL knowledge (WMF uses the MariaDB flavor of MySQL).
* Use [[https://meta.wikimedia.org/wiki/Research:Quarry|Quarry]] to write SQL queries on public Wikipedia data
* Get familiar with the existing analytics tools used by the Reading team
* Brush up on basic SSH and command-line usage to prepare for the internship (e.g. SCPing a file between a server and my laptop, using the screen command to launch queries in the background)
* Read [[https://www.amazon.com/Wikipedia-Revolution-Nobodies-Greatest-Encyclopedia/dp/1401303714|The Wikipedia Revolution]] to learn about Wikipedia’s inception, evolution, and growth since 2001
* Finish reading [[https://www.manning.com/books/the-quick-python-book-second-edition|The Quick Python Book, Second Edition]]
* Continue to fine tune project details with mentors and check in regarding current progress
* Write my first blog post: an overview of Outreachy, the WMF, my application process, and my subprojects.
* Complete community bonding period report per Outreachy guidelines
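The counting core of subproject 0 can be sketched with Python's collections.Counter; the article and section data below is invented for illustration, standing in for titles extracted from a real dump:

```python
from collections import Counter

# Hypothetical extraction output: one list of section titles per article.
# In practice these would come from parsing a Wikipedia XML dump.
articles = [
    ["History", "See also", "References"],
    ["Early life", "Career", "References", "External links"],
    ["History", "Geography", "References", "See also"],
]

# Count how many articles contain each section title.
title_counts = Counter(title for sections in articles for title in sections)

# Report the most frequent titles with the percentage of articles containing them.
total = len(articles)
for title, count in title_counts.most_common(3):
    print(f"{title}: {count} articles ({100 * count / total:.0f}%)")
```

The real report would apply the same tally to millions of articles per language edition, with the parsing step doing the heavy lifting.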
**December 6 - December 20 - Onboarding [subproject 1]**
* Use my knowledge of core metrics, data sources, and analytics tools to produce the data for at least one edition of the Readership metrics report, write its commentary in collaboration with my mentor Tilman, and publish it. Write at least 2 more editions of this report later in the internship.
* Meet with relevant team-members and gather needs for subproject 2.
* Write 1 blog post about my experience with MySQL and Hive (a review of the tutorials I used, the learning curve, and how my background SQL knowledge helped)
* Write 1 blog post about my initial experiences with WMF processes (decisions made in the open via a collaborative, consensus-based approach, and working remotely with a distributed team)
* Write 1 blog post about my project progress so far (explanation of project, how it was completed and any lessons learned)
* Daily check-in call (15 - 30 minutes): Review progress/timeline with Tilman, discuss roadblocks, review results, discuss next steps
* Weekly 1:1:1 meeting with Tilman and Jon (who will represent the “customer”) to set/adjust my short term goals
**December 20 - January 13 - [subproject 2]**
* Produce at least one report on a research question about Wikipedia reader behavior, to be selected in collaboration with the Reading product managers.
* Possible example: How is the left sidebar on desktop used?
* Write 1 blog post on subproject 2 (explanation of project, how it was completed and any lessons learned)
* Daily check-in call (15 - 30 minutes): Review progress/timeline with Tilman, discuss roadblocks, review results, discuss next steps
* Weekly 1:1:1 meeting with Tilman and Jon (who will represent the “customer”) to set/adjust my short term goals
**January 3 - February 3 - [subproject 3]**
(I will have overlap between subprojects 2 and 3)
* Vet and explore a new privacy-friendly desktop readership engagement metric (planned to be based on https://phabricator.wikimedia.org/T147314 ), and build a reporting mechanism for it.
* Deliverables:
1. (~ 2 weeks) A report examining various possible specifications of this metric (e.g. choice of percentile), their possible data quality issues with suggestions for how to fix or mitigate them, and an assessment of their sensitivity and robustness
2. (~ 1 week) An exploratory analysis showing how the chosen metric differs across various dimensions, e.g. project language or geographical region
3. (~ 1 week) A workflow or an automated tool to regularly inform the Reading team and the Wikimedia movement on how this metric is developing
* January 9 -11: Attend the Wikimedia Developer Summit in San Francisco, CA
* While I am in SF, possibly give a meetup talk to the PyLadies SF chapter - “Exploring Wikipedia with Pandas”. I will need to reach out to PyLadies to discuss event logistics. If this is not possible, I will at least attend 1 technical meetup while in San Francisco.
* Complete mid-term evaluation with mentors per Outreachy guidelines
* Write blog post about my time at Wikimedia Developer Summit
* Write blog post about subproject 3 (explanation of project, how it was completed and any lessons learned)
* Daily check-in call (15 - 30 minutes): Review progress/timeline with Tilman, discuss roadblocks, review results, discuss next steps
* Weekly 1:1:1 meeting with Tilman and Jon (who will represent the “customer”) to set/adjust my short term goals
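For deliverable 1, one starting point for assessing sensitivity and robustness is to compare how candidate percentile choices react to outliers. A minimal sketch using only the standard library; the session values are invented for illustration, not real WMF data:

```python
import statistics

# Synthetic per-session engagement values (e.g. seconds of reading time);
# purely illustrative.
sessions = [12, 15, 18, 20, 22, 25, 30, 35, 40, 45]
with_outlier = sessions + [5000]  # one bot-like extreme session

def percentile(data, p):
    """p-th percentile (1-99) using the 'inclusive' interpolation method."""
    return statistics.quantiles(data, n=100, method="inclusive")[p - 1]

# The median barely moves when an extreme value is added,
# while a high percentile shifts dramatically.
for p in (50, 95):
    print(f"p{p}: {percentile(sessions, p):.1f} -> {percentile(with_outlier, p):.1f}")
```

A specification report would run this kind of comparison across real candidate specifications and realistic contamination scenarios, rather than a single synthetic outlier.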
**February 3 - March 6 [subproject 4]**
* Vet and explore a new privacy-friendly web readership retention metric (based on a data source selected by the Reading team), and build a reporting mechanism for it.
* Deliverables:
1. (~ 2 weeks) A report examining various possible specifications of this metric (e.g. choice of percentile), their possible data quality issues with suggestions for how to fix or mitigate them, and an assessment of their sensitivity and robustness
2. (~ 1 week) An exploratory analysis showing how the chosen metric differs across various dimensions, e.g. project language or geographical region
3. (~ 1 week) A workflow or an automated tool to regularly inform the Reading team and the Wikimedia movement on how this metric is developing
* Write blog post about subproject 4 (explanation of project, how it was completed and any lessons learned)
* Write blog post about wrap up of internship, my overall experience of the program, and any advice for future interns.
* Daily check-in call (15 - 30 minutes): Review progress/timeline with Tilman, discuss roadblocks, review results, discuss next steps
* Weekly 1:1:1 meeting with Tilman and Jon (who will represent the “customer”) to set/adjust my short term goals
* Make sure all my work is documented for easy access for all contributors
* Discuss possible extensions of projects or additional work I can contribute to WMF after internship period ends
* Complete final evaluation report per Outreachy guidelines
= Participation =
I plan to start working on my projects for this internship immediately, as I am available now. I have been communicating daily with my mentors via IRC and email. Once the internship starts, I plan to spend 15-30 minutes daily checking in with Tilman to review my progress and timeline, review results, and discuss next steps. I plan to have a weekly 1:1:1 with Tilman and Jon to set and adjust my short-term goals. We will mainly communicate through IRC/email/Google Hangouts. I plan to release weekly reports with my progress and will write ~9 blog posts during my internship.
Although I plan to work mostly remotely, I would like to attend the Wikimedia Developer Summit in January to meet the team in person. If possible, I would also like to work from the WMF headquarters for 1-2 weeks during the internship.
= About me =
**Education and Work Experience**
University of Florida - B.S. in Industrial & Systems Engineering, December 2012
University of Florida - Minor in Statistics, December 2012
Lean Manufacturing Intern: 787 Program - The Boeing Company, May 2011 to August 2011
Operational Excellence Intern - Eaton Corporation, May 2012 to August 2012
Leadership Development Program for Technical Sales - Eaton Corporation, January 2013 to November 2013
Sales Engineer - Eaton Corporation, December 2013 to July 2015
Account Executive - Experian Data Quality, July 2015 to February 2016
**How did I learn about Outreachy?**
I quit my technical sales job in February 2016 to start learning how to code on my own. Although I had a “good job”, I was not interested in what I was doing and felt I was not learning enough, which was very frustrating. Since then, I’ve been learning Python, SQL, basic web development, machine learning algorithms, statistics, data visualization, and other technical topics on my own via books, tutorials, and hacking on my own projects. This has been an extremely exciting period, as I’ve been completely self-motivated and able to learn any topic I was interested in.
My brother sent me a link to the GNOME Outreachy program and I was immediately interested. Since I started coding, I’ve released all my projects under a Creative Commons license (open source). I believe that I am an ideal candidate for the Outreachy Internship Program.
**Time commitment during internship period**
I have no other commitments during this time and can contribute at least 40 hours per week to this internship.
= Past experience =
**Open source community**
I use FOSS projects every day: Python (and many of its libraries), Git, Jupyter Notebook, Linux, and SQLite.
This year, I started contributing to my first open-source project, [[http://openhouseproject.co/|Open House]], a project with the objective of gathering, processing, and exploring home purchasing data. I’ve researched the [[http://openhouseproject.co/docs/|current state of open-source housing data]] and worked on front-end data visualizations in D3. I came across the Open House project because I needed housing data for an analysis of the Boston housing market. When I couldn’t find any data sources at the granularity I needed (down to each individual home), I decided to start working with Open House to help make it happen.
**Relevant Project Work**
Since I started coding, I’ve been very interested in natural language processing, text analytics and reading pattern analysis. I believe my skills will align with the Reading Department at WMF.
__Harry Potter Text Analysis__
[[https://github.com/zareenfarooqui/harry-potter-text-analysis/blob/master/HP_text_analysis.ipynb|Code]]
[[https://medium.com/zareen-farooqui/harry-potter-text-analysis-4d89ffe59d5b#.s7isf1sh4|Blog post]]
I wrote natural language processing code from scratch for sentiment analysis, punctuation analysis, N-gram extraction, and character relationship analysis of the Harry Potter series.
__Word2Vec on Harry Potter__
[[https://github.com/zareenfarooqui/w2v-on-harry-potter|Code]]
[[https://medium.com/zareen-farooqui/word2vec-on-harry-potter-e3748ab27e9#.nm3o5agpk|Blog Post 1]]
[[https://medium.com/zareen-farooqui/behind-the-scenes-of-word2vec-on-harry-potter-148770977e25#.ahl5nv3ox|Blog Post 2]]
[[http://www.zareenfarooqui.com/w2v/|Try it here!]]
I built a web application where users enter a word from the Harry Potter series and get back the 7 most similar words from the series.
__Can Wikipedia Predict the 2016 US Presidential Election?__
[[https://github.com/zareenfarooqui/2016-presidential-election-analysis/blob/master/2016_Presidential_Candidate_Clickstream_Analysis.ipynb|Code]]
[[https://medium.com/zareen-farooqui/can-wikipedia-predict-the-2016-us-presidential-election-af0714eb3d81#.ewrymubmc|Blog Post]]
I used the Wikipedia clickstream datasets to predict the success of each candidate and to analyze which social media platforms drive the most traffic to Wikipedia articles.
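The referrer side of that analysis boils down to a group-by-and-sum. A minimal sketch over invented rows shaped like the public clickstream dataset (referrer, target article, count); the names and numbers are illustrative only:

```python
from collections import defaultdict

# Invented rows in the shape of the public clickstream dataset:
# (referrer, target article, number of occurrences).
rows = [
    ("other-twitter", "Hillary_Clinton", 1200),
    ("other-facebook", "Donald_Trump", 900),
    ("other-twitter", "Donald_Trump", 1500),
    ("other-google", "Hillary_Clinton", 8000),
]

# Total traffic each external source drives to the sampled articles.
traffic_by_source = defaultdict(int)
for referrer, _article, count in rows:
    traffic_by_source[referrer] += count

top_source = max(traffic_by_source, key=traffic_by_source.get)
print(top_source, traffic_by_source[top_source])
```

The published analysis worked over the full monthly clickstream dumps, where the same aggregation is typically done in Pandas or Spark rather than plain dictionaries.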
__Internet in Palau__
[[https://medium.com/zareen-farooqui/internet-in-palau-7151e4893e52#.rcslz0u8r|Blog post]]
This was a mini research project of mine to learn about internet speed and internet access in the country of Palau. It is very relevant to WMF because the New Readers team there is studying how users across the globe, especially in places with scarce internet access, consume Wikipedia.
= Any other info =
I have been consuming Wikipedia content for over 10 years now. Initially, I used Wikipedia for school research projects. Now, I use Wikipedia almost daily to learn about new topics and news stories.
My first contribution to the Wikimedia Foundation was in March 2016, when I was working on a [[https://medium.com/zareen-farooqui/harry-potter-text-analysis-4d89ffe59d5b#.k26rigzcf|text analysis of Harry Potter]] and couldn’t find a list of Harry Potter characters in Wikidata. I decided to add the list of characters myself and immediately got a [[https://www.wikidata.org/wiki/User_talk:Zareenf#Stop|warning message]] to stop what I was doing. Although I was contributing in good faith, this information already existed and I was not accessing it correctly. This was a huge learning experience, as I discovered how to interact with the Wikipedian community and started learning about their processes. I was then able to successfully use the Wikidata API to programmatically populate a list of all the characters within the Harry Potter universe using a SPARQL query.
Throughout 2016, I have been a teaching assistant for 4 [[https://www.youtube.com/watch?v=vlVnSpJ6TDE|“Exploring Wikipedia with Spark”]] courses as a contractor for Databricks. My responsibilities included tier 1 tech support and answering student questions. By TAing this class, I’ve explored public datasets the Wikimedia Foundation releases such as hourly pageviews and monthly clickstreams. I’ve been able to analyze mobile vs. desktop traffic, determine the most popular Wikipedia articles in a given time period, and find out what external sources drive the most traffic to Wikipedia.
In July 2016, I used the clickstream datasets to complete an analysis of my own and determine if [[https://medium.com/zareen-farooqui/can-wikipedia-predict-the-2016-us-presidential-election-af0714eb3d81#.abvi9zao3|Wikipedia could predict the 2016 Presidential Election.]]
In September 2016, I gave my first [[https://www.meetup.com/PyLadies-Boston/events/229183546/|Meetup talk]] to the Boston PyLadies chapter. This was a 2 hour [[https://github.com/zareenfarooqui/pandas_tutorial_pyladies|tutorial]] on the Pandas data analysis library using the Wikipedia pageviews and clickstream data.
In October 2016, I completed [[https://docs.google.com/document/d/1-YKnycqVsodwiOTS9dgcqX4c3V5qhA3LjdolXKomeE0/edit?usp=sharing|2 microtasks]] for the Reading Department at WMF. The Reading team has been developing a feature to display a "Related Articles" section below articles. They wanted to know how many Wikipedia articles already contain a "See Also" section, since it serves a similar function. The microtask assignments I completed serve as exercises toward the first steps of developing this feature. Here is my [[https://github.com/zareenfarooqui/wmf-outreachy-microtask/blob/master/WMF%20-%20Outreachy%20Microtask.ipynb|code]].
* Task 1 involved generating a list of the 100 most frequent section titles found in English Wikipedia articles, along with their frequency. This shows how often a “See Also” section already occurs.
* Task 2 was an exercise to filter all non-articles (talk pages, help pages, etc.) out of this dataset and generate a list of the 100 most frequent section titles found in English Wikipedia articles, along with the percentage of articles which contain each section.
* For the purpose of making this a feasible task for my application, we made the simplifying assumption that all non-article pages contain “/” or “.”. This was informed by the guess that the bulk of these pages are short article talk pages with no sections, administrative pages discussing individual articles or accounts (organized as /subpages), or media pages whose names end in an extension such as “.JPG”; filtering out any title containing “/” or “.” therefore removes most non-articles.
* However, since this was only an assumption, the method wasn’t completely accurate. I started with ~6.3 million unique page titles; filtering out all titles containing “/” or “.” left ~4.6 million. The dataset is from April 2016, when English Wikipedia had ~5.1 million unique articles, so this method was discarding 400-500K actual articles.
* I proposed an [[https://github.com/zareenfarooqui/wmf-outreachy-microtask/blob/master/WMF%20-%20Outreachy%20Bonus%20Task%20-%20Approach%202.ipynb|alternative approach]]: generate a Python list of all 6.3 million unique page IDs, loop over them making an API call to the Wikipedia API via urllib3, and read back the namespace field. If namespace = 0, the page is an article; otherwise it can be discarded for this analysis. I started this approach and timed it. Wikipedia API etiquette asks users to make requests in series (one at a time) rather than in parallel (multiple requests at once), and at ~0.25 seconds per request it would take approximately 18 days to complete. That is not a feasible time period, so I completed the task using the simplified approach.
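Both microtask calculations above fit in a few lines of Python. The page titles below are invented for illustration; the 6.3 million pages and ~0.25 seconds per request come from the timings described above:

```python
# Part 1: the "/" or "." heuristic on illustrative page titles.
titles = [
    "Harry_Potter",
    "User:Example/sandbox",   # subpage -> filtered out
    "File:Castle.JPG",        # media page -> filtered out
    "Boston",
    "Dr._Strangelove",        # false positive: a real article is lost
]
articles = [t for t in titles if "/" not in t and "." not in t]
print(articles)

# Part 2: why the API-based alternative was ruled out.
pages = 6_300_000           # unique page IDs to check
seconds_per_request = 0.25  # observed time per serial namespace lookup
days = pages * seconds_per_request / (60 * 60 * 24)
print(f"{days:.1f} days")   # roughly 18 days of continuous querying
```

The "Dr._Strangelove" entry shows exactly how the heuristic loses real articles: any article title containing a period is discarded along with the non-articles, which accounts for the 400-500K gap.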