Outreachy Round 22: Use PAWS to create a series of notebook based tutorials that help users access and work with data on Wikimedia projects
Open, Needs Triage | Public


IMPORTANT: Make sure to read the Outreachy participant instructions and communication guidelines thoroughly before commenting on this task. This space is for project-specific questions, so avoid asking questions about getting started, setting up Gerrit, etc. When in doubt, ask your question on Zulip first!

Brief summary

Wikimedia projects produce a lot of interesting data! The purpose of this project is to create a set of notebook-based tutorials and assets that will make it easier for individuals to access and use that data.

The primary focus of this project is improving technical documentation. The participant will engage technical writing, research, and programming skills while working on the following outcomes:

  • A library for working with SQL dumps
  • A notebook tutorial that helps users decide between PAWS and Toolforge for their work with datasets, and explains why (What works on PAWS? What works on Toolforge?)
  • A notebook tutorial on dumps that shows how to access dumps in XML and SQL formats
  • Proposals and drafts for additional notebook tutorials focused on improving the experience of users working with Wikimedia data

Skills required

  • Python 3, SQL, JSON
  • Jupyter notebooks
  • Technical documentation
  • Research

Possible mentor(s)

@srodlund @Isaac



Event Timeline

srishakatux changed the visibility from "Public (No Login Required)" to "Outreachy Mentors (Project)".
srishakatux changed the visibility from "Outreachy Mentors (Project)" to "Public (No Login Required)". Mar 29 2021, 4:17 PM

Hello, I would love to work on this task. Can I please work on it?

@Esther.Osayande: Hi and welcome. This is an Outreach program task. Please see the links in the task description first. Thanks.

Hi, I am an Outreachy applicant. I have gone through the description, and I would love to research the task and work on it. Thanks

Hey all -- I've gotten a few questions about the "Write a library that could work with SQL dumps" part of the outcomes so I wanted to give a few more details:

You all have gotten some experience extracting information from the .sql.gz dumps by now -- i.e. extracting the edit tag ID for mobile edit and all the revision IDs that were associated with that tag. You have also gotten some experience working with the mwxml library for the history dumps. You likely noticed that working with the history dumps was far easier than working with the .sql.gz dumps -- namely, you had a library that allowed you to simply iterate through those dumps and extract / filter information based on a few set attributes like namespace or title. The goal of the library mentioned in the outcomes would be to write a Python library similar in function to the mwxml library but for these SQL dumps. That way, whenever researchers want to extract data from these dumps, they don't have to write custom regexes / extraction pipelines but can simply open the dataset with a library and filter/extract what they need without worrying about format. You probably already wrote some custom code for this, so a library would just seek to formalize it and extend it to work not just for one specific .sql.gz dump but for any .sql.gz dump.
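To make the idea a bit more concrete, here is a minimal sketch of what "iterating over a .sql.gz dump" could look like. This is only an illustration, not a proposed design: the names `parse_values` and `iter_rows` are hypothetical, and the code assumes the dump stores each batch of rows on a single line of the form ``INSERT INTO `table` VALUES (...),(...);`` with backslash-escaped strings. It does not read column names from the `CREATE TABLE` statement and does not handle doubled quotes (`''`) inside strings.

```python
import gzip
from typing import Iterator, List, Optional, Tuple

Row = Tuple[Optional[str], ...]

# Backslash escapes that stand for a different character; any other
# escaped character (e.g. \' or \\) is kept as the character itself.
ESCAPES = {"n": "\n", "r": "\r", "t": "\t", "0": "\0"}


def parse_values(values: str) -> Iterator[Row]:
    """Yield one tuple per parenthesised row in the VALUES part of an INSERT.

    Every field is returned as a string, except an unquoted NULL,
    which becomes None.
    """
    row: List[Optional[str]] = []
    field: List[str] = []
    in_row = False       # inside "( ... )"
    in_string = False    # inside "' ... '"
    was_quoted = False   # current field contained a quoted string
    i = 0
    while i < len(values):
        ch = values[i]
        if in_string:
            if ch == "\\" and i + 1 < len(values):
                nxt = values[i + 1]
                field.append(ESCAPES.get(nxt, nxt))
                i += 2
                continue
            if ch == "'":
                in_string = False
            else:
                field.append(ch)
        elif ch == "'":
            in_string = True
            was_quoted = True
        elif ch == "(" and not in_row:
            in_row = True
            row, field, was_quoted = [], [], False
        elif ch in ",)" and in_row:
            text = "".join(field)
            row.append(None if text == "NULL" and not was_quoted else text)
            field, was_quoted = [], False
            if ch == ")":
                in_row = False
                yield tuple(row)
        elif in_row:
            field.append(ch)
        i += 1


def iter_rows(path: str, table: str) -> Iterator[Row]:
    """Stream rows of `table` from a gzip-compressed SQL dump at `path`."""
    prefix = f"INSERT INTO `{table}` VALUES "
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if line.startswith(prefix):
                yield from parse_values(line[len(prefix):])
```

With something like this, filtering a dump reduces to an ordinary loop, e.g. `for row in iter_rows("some_dump.sql.gz", "some_table"): ...` (the file and table names here are placeholders). A real library would also need to map positions to column names and convert types.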

As far as your application, you do not need to indicate any specifics about what functions would be in this library etc. You don't even need to know how to develop libraries like this -- that's what we mentors are here for :) But you can let us know your interest level and be aware that this could be part of the work you do in the internship.

Hey @Isaac and @srodlund,
On the Outreachy page, I am trying to add links to my past contributions, but they are getting pasted as normal text.
Could you please tell me how to format them as hyperlinks?
Thanks :)

@Palak199 Not sure but plaintext is completely fine. Thanks

Okay! Thank you so much

Just a heads up in case you're unaware:

The final deadline has been shifted to a later date due to the potential impact of Covid-19 on applicants (especially but not restricted to India). The final application deadline has been extended to Monday, May 3 at 4pm UTC.

Please do not hesitate to reach out to Sarah and/or me directly via email if you have further questions or are personally impacted by Covid-19. Most importantly, please stay safe and healthy.