
Outreachy Project Proposal - 5 subprojects with the Reading Department
Closed, ResolvedPublic

Description

Outreachy Application: https://outreachy.gnome.org/?q=view_projects&prg=7&p=1348

Name: Zareen Farooqui
Email: zareenf@gmail.com
IRC or IM networks/handle(s): zareen
Web Page: http://www.zareenfarooqui.com/
Blog: https://medium.com/zareen-farooqui
Github: https://github.com/zareenfarooqui
Location: Somerville, MA, USA
Typical working hours: M-F from 8am-5pm Eastern Standard Time


Synopsis

From my discussions with the Reading Department, the best way for me to contribute would be to work on 5 data analysis subprojects throughout the internship. This will allow me to gain a better understanding of various challenges and projects within the Wikimedia Foundation. The subprojects I would like to work on are below:

subproject 0: Report most frequent section titles on 5 largest Wikipedias - https://phabricator.wikimedia.org/T148260
subproject 1: Readership Metric Report - https://phabricator.wikimedia.org/T148261
subproject 2: Research Report on Wikipedia reader behavior (research question TBD so no Phabricator task created yet)
subproject 3: Vet and Explore Readership Engagement Metric - https://phabricator.wikimedia.org/T148262
subproject 4: Vet and Explore Readership Retention Metric - https://phabricator.wikimedia.org/T148263

Subproject 0 will be an extension of my microtasks from the application process. Subprojects 1 and 2 will provide valuable information to the Reading Department about reader behavior. Subprojects 3 and 4 will involve contributing to the development and ongoing maintenance of the Reading Department's engagement and retention metrics.

Possible Mentors: Tilman Bayer (@Tbayer )
Possible Mentors: Jon Katz ( @JKatzWMF )

Proposed Schedule and Deliverables

The timeline below is a detailed outline of the subprojects I will complete during my Outreachy Internship with the Wikimedia Foundation. The timeline is approximate and will have to be adjusted throughout the internship as I learn more.

October 17 - December 6 - Community Bonding [subproject 0]

  • Extend the microtask I’ve completed into a report on the most frequent article section titles in the English, Spanish, German, Italian, and French Wikipedias
  • Publish these results in the form of a Research: project page on Meta
  • Announce each result on the village pump of the corresponding Wikipedia
  • Get familiar with the existing core metrics used by the Reading Team
  • Get familiar with the existing data sources used by the Reading Team
  • Get basic familiarity with MySQL as installed on the Foundation’s stats servers
  • Get basic familiarity with Hive as installed on the Foundation’s stats servers
  • Read the Head First SQL book to brush up on my SQL knowledge (WMF uses the MariaDB flavor of MySQL)
  • Use Quarry to write SQL queries on public Wikipedia data (see the example query after this list)
  • Get familiar with the existing analytics tools used by the Reading team
  • Brush up on basic SSH command-line usage to prepare for the internship (e.g. SCPing a file between a server and my laptop, using the screen command to launch queries in the background)
  • Read The Wikipedia Revolution to learn about Wikipedia’s inception, its evolution and its growth since 2001
  • Finish reading The Quick Python Book, Second Edition
  • Continue to fine tune project details with mentors and check in regarding current progress
  • Write my first blog post about Outreachy, with an overview of the WMF, my application process, and an explanation of my subprojects
  • Complete community bonding period report per Outreachy guidelines
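
To make the SQL preparation concrete, below is a minimal sketch of the kind of query I could run in Quarry or directly against the labs replicas. The connection details (replica host, credentials file) are assumptions based on the Toolforge setup and are only illustrative; in Quarry itself, only the SQL string is needed. The table and column names follow the standard MediaWiki schema.

```
import pymysql

# A minimal sketch, not code I have run: the replica host and credentials
# file below are assumptions based on the Toolforge setup. Quarry itself
# only needs the SQL string.
conn = pymysql.connect(host="enwiki.labsdb", db="enwiki_p",
                       read_default_file="~/replica.my.cnf")

sql = """
SELECT page_title
FROM page
WHERE page_namespace = 0      -- 0 = main (article) namespace
  AND page_is_redirect = 0    -- skip redirects
LIMIT 10
"""

with conn.cursor() as cur:
    cur.execute(sql)
    for (title,) in cur.fetchall():
        print(title.decode("utf-8"))  # page_title is stored as binary
```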

December 6 - December 20 - Onboarding [subproject 1]

  • Use my knowledge of core metrics, data sources and analytics tools to produce the data for, write the commentary of (in collaboration with my mentor, Tilman), and publish at least one edition of the Readership Metrics Report. Write at least 2 more editions of this report later in the internship.
  • Meet with relevant team members and gather needs for subproject 2.
  • Write 1 blog post about my experience with MySQL + Hive (a review of the tutorials I used, the learning curve, and how my background SQL knowledge helped)
  • Write 1 blog post about my initial experiences with WMF processes (decisions made in the open via a collaborative, consensus-based approach, and working remotely with a distributed team)
  • Write 1 blog post about my project progress so far (explanation of project, how it was completed and any lessons learned)
  • Daily check-in (15 - 30 minutes): Review progress/timeline with Tilman, discuss roadblocks, review results, discuss next steps
  • Weekly 1:1:1 meeting with Tilman and Jon (who will represent the “customer”) to set/adjust my short term goals

December 20 - January 13 - [subproject 2]

  • Produce at least one report on a research question about Wikipedia reader behavior, to be selected in collaboration with the Reading product managers.
    • Possible example: How is the left sidebar on desktop used?
  • Write 1 blog post on subproject 2 (explanation of project, how it was completed and any lessons learned)
  • Daily check-in (15 - 30 minutes): Review progress/timeline with Tilman, discuss roadblocks, review results, discuss next steps
  • Weekly 1:1:1 meeting with Tilman and Jon (who will represent the “customer”) to set/adjust my short term goals

January 3 - February 3 - [subproject 3]
(Subprojects 2 and 3 will overlap from January 3 to January 13.)

  • Vet and explore a new privacy-friendly desktop readership engagement metric (planned to be based on https://phabricator.wikimedia.org/T147314 ), and build a reporting mechanism for it.
  • Deliverables:
    1. (~ 2 weeks) A report examining various possible specifications of this metric (e.g. choice of percentile), their possible data quality issues and suggestions for how to fix or mitigate them, and an assessment of their sensitivity and robustness (see the illustrative sketch after this list)
    2. (~ 1 week) An exploratory analysis showing how the chosen metric differs across various dimensions, e.g. project language or geographical region
    3. (~ 1 week) A workflow or an automated tool to regularly inform the Reading team and the Wikimedia movement on how this metric is developing
  • January 9 -11: Attend the Wikimedia Developer Summit in San Francisco, CA
  • While I am in SF, possibly give a meetup talk to the PyLadies SF chapter - “Exploring Wikipedia with Pandas”. I will need to reach out to PyLadies to discuss event logistics. If this is not possible, I will at least attend 1 technical meetup while I am in San Francisco.
  • Complete mid-term evaluation with mentors per Outreachy guidelines
  • Write blog post about my time at Wikimedia Developer Summit
  • Write blog post about subproject 3 (explanation of project, how it was completed and any lessons learned)
  • Daily check-in (15 - 30 minutes): Review progress/timeline with Tilman, discuss roadblocks, review results, discuss next steps
  • Weekly 1:1:1 meeting with Tilman and Jon (who will represent the “customer”) to set/adjust my short term goals
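
To illustrate what deliverable 1 of this subproject might look like in code, here is a purely hypothetical pandas sketch on synthetic data: it assumes the metric is some per-session engagement quantity summarized by a percentile per project, and compares how the summary shifts with the percentile chosen. The real data source and metric definition are exactly what the deliverable is meant to pin down.

```
import numpy as np
import pandas as pd

# Purely illustrative, on synthetic data: pretend the metric is pages
# viewed per session, summarized by a percentile per project, and compare
# how sensitive the summary is to the percentile chosen.
rng = np.random.default_rng(42)
sessions = pd.DataFrame({
    "project": rng.choice(["enwiki", "dewiki", "eswiki"], size=10_000),
    "pages_per_session": rng.geometric(p=0.4, size=10_000),
})

for pct in (0.50, 0.75, 0.90):
    summary = sessions.groupby("project")["pages_per_session"].quantile(pct)
    print(f"p{int(pct * 100)} pages per session by project:")
    print(summary, "\n")
```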

February 3 - March 6 - [subproject 4]

  • Vet and explore a new privacy-friendly web readership retention metric (based on a data source selected by the Reading team), and build a reporting mechanism for it.
  • Deliverables:
    1. (~ 2 weeks) A report examining various possible specifications of this metric (e.g. choice of percentile), their possible data quality issues and suggestions for how to fix or mitigate them, and an assessment of their sensitivity and robustness (a toy retention example follows this list)
    2. (~ 1 week) An exploratory analysis showing how the chosen metric differs across various dimensions, e.g. project language or geographical region
    3. (~ 1 week) A workflow or an automated tool to regularly inform the Reading team and the Wikimedia movement on how this metric is developing
  • Write blog post about subproject 4 (explanation of project, how it was completed and any lessons learned)
  • Write blog post wrapping up the internship, covering my overall experience of the program and any advice for future interns.
  • Daily check-in (15 - 30 minutes): Review progress/timeline with Tilman, discuss roadblocks, review results, discuss next steps
  • Weekly 1:1:1 meeting with Tilman and Jon (who will represent the “customer”) to set/adjust my short term goals
  • Make sure all my work is documented for easy access for all contributors
  • Discuss possible extensions of projects or additional work I can contribute to WMF after internship period ends
  • Complete final evaluation report per Outreachy guidelines
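
As with subproject 3, here is a toy pandas sketch of what a retention computation could look like, on made-up data: the share of one day's readers who are seen again within 7 days. The anonymized-token data source and the retention definition itself are assumptions; both depend on what the Reading team selects.

```
import pandas as pd

# Toy retention sketch on made-up data: assume a privacy-friendly source
# that records, per anonymized reader token, the dates on which they read.
# "Retention" here is the share of day-0 readers seen again within 7 days.
visits = pd.DataFrame({
    "token": ["a", "a", "b", "c", "c", "d"],
    "date": pd.to_datetime(["2017-02-01", "2017-02-05", "2017-02-01",
                            "2017-02-01", "2017-02-20", "2017-02-02"]),
})

cohort_day = pd.Timestamp("2017-02-01")
cohort = visits.loc[visits["date"] == cohort_day, "token"].unique()
window = visits[(visits["date"] > cohort_day) &
                (visits["date"] <= cohort_day + pd.Timedelta(days=7))]
retained = window.loc[window["token"].isin(cohort), "token"].nunique()
print(f"7-day retention: {retained / len(cohort):.0%}")  # only "a" returns
```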

Participation

I plan to start working on my projects for this internship immediately, as I am available now. I have been communicating daily with my mentors via IRC and email. Once the internship starts, I plan to spend 15-30 minutes each day checking in with Tilman to review my progress and timeline, review results, and discuss next steps. I plan to have a weekly 1:1:1 with Tilman and Jon to set and adjust my short-term goals. We will mainly communicate through IRC/email/Google Hangouts. I plan to release weekly progress reports and will write ~9 blog posts during my internship.

Although I plan to work mostly remotely, I would like to attend the Wikimedia Developer Summit in January to meet the team in person. If possible, I would also like to work from the WMF headquarters for 1-2 weeks during the internship.

About me

Education and Work Experience
University of Florida - B.S. in Industrial & Systems Engineering, December 2012
University of Florida - Minor in Statistics, December 2012

Lean Manufacturing Intern: 787 Program - The Boeing Company, May 2011 to August 2011
Operational Excellence Intern - Eaton Corporation, May 2012 to August 2012
Leadership Development Program for Technical Sales - Eaton Corporation, January 2013 to November 2013
Sales Engineer - Eaton Corporation, December 2013 to July 2015
Account Executive - Experian Data Quality, July 2015 to February 2016

How did I learn about Outreachy?
I quit my technical sales job in February 2016 to teach myself to code. Although I had a “good job”, I was not interested in the work and felt I was not learning enough, which was very frustrating. Since then, I’ve been learning Python, SQL, basic web development, machine learning algorithms, statistics, data visualization and other technical topics on my own via books, tutorials and hacking on personal projects. This has been an extremely exciting period, as I’ve been completely self-motivated and able to learn any topic that interests me.

My brother sent me a link to the GNOME Outreachy program and I was immediately interested. Since I started coding, I’ve released all my projects under a Creative Commons license (open source). I believe that I am an ideal candidate for the Outreachy Internship Program.

Time commitment during internship period
I have no other commitments during this time and can contribute at least 40 hours per week to this internship.

Past experience

Open source community
I use FOSS projects every day: Python (and many of its libraries), Git, Jupyter Notebook, Linux and SQLite.

This year, I started contributing to my first open-source project, Open House, a project with the objective of gathering, processing and exploring home purchasing data. I’ve researched the current state of open-source housing data and worked on front-end data visualizations in D3. I came across the Open House project because I needed housing data for an analysis of the Boston housing market. When I couldn’t find any data sources at the granularity I needed (down to individual homes), I decided to start working with Open House to help make that data available.

Relevant Project Work
Since I started coding, I’ve been very interested in natural language processing, text analytics and reading-pattern analysis. I believe my skills align well with the work of the Reading Department at WMF.

Harry Potter Text Analysis
Code
Blog post
I wrote natural language processing code from scratch for sentiment analysis, punctuation analysis, n-gram extraction, and character relationship analysis on the Harry Potter series.
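
As a small illustration (not the project's exact code), the core of the n-gram counting can be sketched like this:

```
from collections import Counter

def ngrams(tokens, n):
    """Yield consecutive n-token tuples from a token list."""
    return zip(*(tokens[i:] for i in range(n)))

text = "the boy who lived had survived the boy who lived".split()
print(Counter(ngrams(text, 3)).most_common(2))
# [(('the', 'boy', 'who'), 2), (('boy', 'who', 'lived'), 2)]
```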

Word2Vec on Harry Potter
Code
Blog Post 1
Blog Post 2
Try it here!
I built a web application that lets users enter a word from the Harry Potter series and find the 7 most similar words from the series.
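
A minimal sketch of the underlying similarity lookup, assuming a gensim-style Word2Vec model (this is illustrative, not the app's actual code):

```
from gensim.models import Word2Vec

# Illustrative only: train a tiny model on tokenized sentences, then ask
# for the 7 nearest neighbours of a query word. The parameter is
# vector_size in gensim >= 4 (older releases call it size).
sentences = [["harry", "raised", "his", "wand"],
             ["hermione", "opened", "the", "book"]]  # stand-in corpus
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, seed=1)
print(model.wv.most_similar("wand", topn=7))
```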

Can Wikipedia Predict the 2016 US Presidential Election?
Code
Blog Post
I used the Wikipedia clickstream datasets to predict the success of each candidate and analyze which social media platforms drive the most traffic to Wikipedia articles.

Internet in Palau
Blog post
This was a mini research project of mine to learn about internet speed and internet access in the country of Palau. It is very relevant to WMF because the New Readers team there is studying how users across the globe, specifically in places with scarce internet access, consume Wikipedia.

Any other info

I have been consuming Wikipedia content for over 10 years now. Initially, I used Wikipedia for school research projects. Now, I use Wikipedia almost daily to learn about new topics and news stories.

My first contribution to the Wikimedia Foundation was in March 2016, when I was working on a text analysis of Harry Potter and couldn’t find a list of Harry Potter characters in Wikidata. I decided to add the list of characters myself and immediately got a warning message to stop what I was doing. Although I was contributing in good faith, this information already existed and I was not accessing it correctly. This was a huge learning experience: I discovered how to interact with the Wikipedian community and started learning about its processes. I was eventually able to use the Wikidata API to programmatically populate a list of all the characters in the Harry Potter universe with a SPARQL query.
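
For illustration, such a query can be run against the Wikidata Query Service from Python roughly as follows. This is a sketch rather than my original query, and the property/item IDs are from memory and should be verified on wikidata.org.

```
import requests

# Sketch of the SPARQL approach (not the original query). IDs assumed:
# P1441 = "present in work", Q8337 = the Harry Potter book series --
# verify both on wikidata.org before relying on them.
query = """
SELECT ?character ?characterLabel WHERE {
  ?character wdt:P1441 wd:Q8337 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""
r = requests.get("https://query.wikidata.org/sparql",
                 params={"query": query, "format": "json"})
for row in r.json()["results"]["bindings"]:
    print(row["characterLabel"]["value"])
```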

Throughout 2016, I have been a teaching assistant for 4 “Exploring Wikipedia with Spark” courses as a contractor for Databricks. My responsibilities included tier-1 tech support and answering student questions. By TAing this class, I’ve explored public datasets the Wikimedia Foundation releases, such as hourly pageviews and monthly clickstreams. I’ve been able to analyze mobile vs. desktop traffic, determine the most popular Wikipedia articles in a given time period, and find out which external sources drive the most traffic to Wikipedia.
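
As an example of the kind of analysis this enables, here is a hedged pandas sketch for ranking external traffic sources in a clickstream dump. The file name is hypothetical, and the exact column layout varies by release (recent files use prev, curr, type, n).

```
import pandas as pd

# Illustrative sketch: file name is hypothetical -- substitute a real
# clickstream dump. Column layout varies by release; recent files use
# prev, curr, type, n.
df = pd.read_csv("clickstream-enwiki.tsv.gz", sep="\t",
                 names=["prev", "curr", "type", "n"])

# External referers are encoded as pseudo-titles such as other-google or
# other-twitter, so summing n per "other-*" source ranks the external
# traffic drivers.
sources = df[df["prev"].str.startswith("other-")]
print(sources.groupby("prev")["n"].sum().sort_values(ascending=False))
```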

In July 2016, I used the clickstream datasets to complete an analysis of my own and determine if Wikipedia could predict the 2016 Presidential Election.

In September 2016, I gave my first Meetup talk to the Boston PyLadies chapter. This was a 2 hour tutorial on the Pandas data analysis library using the Wikipedia pageviews and clickstream data.

In October 2016, I completed 2 microtasks for the Reading Department at WMF. (The microtasks were motivated by a feature the Reading team has been developing to display a "Related Articles" section below articles. The team wanted to know how many Wikipedia articles already contain a "See Also" section, since it serves a somewhat similar function.) Here is my code.

  • Task 1 involved generating a list of the 100 most frequent section titles found in English Wikipedia pages, along with their frequencies. This shows how often a “See Also” section already occurs.
  • Task 2 was an exercise in filtering all non-articles (e.g. talk pages and help pages) out of this dataset and generating a list of the 100 most frequent section titles found in English Wikipedia articles, along with the percentage of articles which contain that section.
    • To keep this a feasible microtask for my application, it was posed with the simplifying assumption that all non-article pages contain “/” or “.”. The assumption was informed by the guess that the bulk of these pages are short article talk pages with no sections; administrative pages which discuss individual articles or accounts, organized as /subpages; or media pages whose names contain a file ending such as “.JPG”. Filtering out any title containing “/” or “.” therefore removes most non-articles.
    • However, since this was only a heuristic, the method wasn’t completely accurate. I started with ~6.3 million unique pages. When I filtered out all page titles containing “/” or “.”, I ended up with ~4.6 million. The dataset is from April 2016, and at that time English Wikipedia had ~5.1 million unique articles, so this method was losing 400-500K actual articles.
    • I proposed an alternative approach: generate a Python list of all 6.3 million unique page IDs, then loop over them, making an API call to the Wikipedia API with urllib3 to retrieve each page’s namespace field. If namespace = 0, the page is an article; otherwise it can be discarded for this analysis. I started this approach and timed how long it would take for 6.3 million pages. The Wikipedia API etiquette asks users to make requests in series (one at a time) rather than in parallel (multiple requests at once), and at ~0.25 seconds per request it would take approximately 18 days to complete. That is not a feasible time period, so I completed the task using the simplified approach. A minimal sketch of the namespace check appears below.
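
For reference, here is a minimal sketch of that namespace check, using requests rather than urllib3 for brevity; the page IDs are arbitrary samples.

```
import time
import requests

API = "https://en.wikipedia.org/w/api.php"

def is_article(page_id):
    """True if the page sits in the main (article) namespace, i.e. ns == 0."""
    r = requests.get(API, params={"action": "query", "format": "json",
                                  "pageids": page_id})
    page = r.json()["query"]["pages"][str(page_id)]
    return page.get("ns") == 0

# Requests are made in series per API etiquette; at ~0.25 s each, this is
# why 6.3 million look-ups would take ~18 days. (The API also accepts up
# to 50 pageids per request, which would shrink that substantially.)
for pid in [12, 25, 39]:  # arbitrary sample page IDs
    print(pid, is_article(pid))
    time.sleep(0.25)
```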

Event Timeline

Zareenf created this task. Oct 15 2016, 3:37 PM
Zareenf updated the task description several times. Oct 15-16, 2016
Sumit added a subscriber: Sumit. Oct 16 2016, 9:02 PM

Hi Zareen,
Thanks for the application! One small note: you can also add the mode of communication/meetings with mentors.
The timeline in general looks good. Make sure that you've submitted this version of your application at https://outreachy.gnome.org/?q=program_home&prg=7 before Oct 17, 7PM UTC. This is important.

Sumit added a subscriber: JKatzWMF. Oct 16 2016, 9:03 PM
Zareenf updated the task description several times. Oct 17-20, 2016

Thank you for your proposal. Please add a link to your Outreachy proposal on https://outreachy.gnome.org at the top of your project proposal. This can be something like "Outreachy Link: https://outreachy.gnome.org/?q=view_projects&prg=7&p=xxxx" or similar. Good luck!

Zareenf updated the task description. Oct 20 2016, 5:48 PM
Tbayer updated the task description. Oct 25 2016, 1:50 AM

This looks great (on second read I have made some edits, mostly to link to Jon's and my Phabricator accounts and for precision in the description of the context regarding the Reading team). We are going to follow up on the Outreachy site too as requested.

Tbayer moved this task from Triage to Backlog on the Product-Analytics board. Jun 21 2018, 8:23 PM
Tbayer closed this task as Resolved. Jul 12 2018, 11:16 PM

This was a successful internship during which @Zareenf did very valuable work in various areas. We should have updated this ticket long ago with details about that work, listing the various tasks that got done, but we haven't gotten around to it, so I'm closing this ticket for now to reflect the conclusion of the internship.