Outreachy Project (Round 24): Build Python library to work with html-dumps
Closed, ResolvedPublic

Description

Brief summary

When analyzing Wikipedia’s content for a research project or for training large language models, researchers typically use the publicly available Wikimedia database dumps. These contain, among other things, the content of every Wikipedia article in each of the more than 300 language versions. For example, the February 2022 snapshot of the English Wikipedia is contained in enwiki-20220201-pages-articles-multistream.xml.bz2. The content of Wikipedia articles is written in a markup language called wikitext, which the MediaWiki software translates into HTML to be displayed to readers. Researchers can work with either the raw wikitext markup or the parsed HTML of an article, but most work with the wikitext because it has long been accessible via the dumps. However, working with the wikitext has several drawbacks:

  • Parsing the wikitext is not trivial. There are some great parsers, such as mwparserfromhell, which make this task a lot easier. But there are still known issues in correctly parsing the wikitext, for example the handling of lists, images, and interwiki links. The alternative of obtaining the HTML via the MediaWiki APIs or by scraping Wikipedia directly is computationally expensive at scale and discouraged for large projects.
  • Some elements contained in the HTML version of an article are not readily available in the wikitext due to the use of, e.g., templates. Researchers who parse only the wikitext might therefore miss important content that is displayed to readers. For example, Mitrevski et al. found for English Wikipedia that of the 475M internal links in the HTML versions of the articles, only 171M (36%) were present in the wikitext (see the paper for more details on the important differences between the wikitext and HTML versions of Wikipedia articles).

Thus, in general, it is often desirable to work with an HTML-version of the dumps instead of using the wikitext. Fortunately, very recently the Wikimedia Enterprise HTML dumps have been introduced and made publicly available with regular monthly updates so that researchers may use them in their work.
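As a rough illustration of what working with such a dump could look like, here is a minimal sketch that streams article records from a compressed archive using only the Python standard library. It assumes that the dump is a gzip-compressed tar archive of newline-delimited JSON files and that each line is one article record with fields such as `name` and `article_body.html`; that layout, the field names, and the file path in the usage note are assumptions made for illustration and are not specified in this task.

```python
import json
import tarfile


def iter_articles(fileobj):
    """Yield one dict per article from a gzip-compressed tar archive of NDJSON files.

    Assumed layout (not confirmed by this task): each archive member is a text
    file with one JSON object per line.
    """
    # Mode "r|gz" reads the archive as a forward-only stream, so the whole
    # dump never has to fit into memory or be unpacked to disk.
    with tarfile.open(fileobj=fileobj, mode="r|gz") as archive:
        for member in archive:
            if not member.isfile():
                continue
            chunk = archive.extractfile(member)
            for line in chunk:
                if line.strip():
                    yield json.loads(line)


def first_titles(fileobj, limit=5):
    """Return the assumed `name` field of the first `limit` articles."""
    titles = []
    for article in iter_articles(fileobj):
        titles.append(article.get("name"))
        if len(titles) >= limit:
            break
    return titles
```

A call might look like `with open("path/to/html-dump.json.tar.gz", "rb") as f: print(first_titles(f))`, where the path is a placeholder. Streaming keeps memory use flat regardless of the size of the dump, at the cost of not being able to seek back to earlier articles.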

Therefore, the aim of this project is to write a Python library that efficiently parses the HTML code of an article from the Wikimedia Enterprise dumps to extract relevant elements such as text, links, templates, etc. This will lower the technical barriers to working with the HTML dumps and empower researchers and others to take advantage of this resource. In addition, the tool might solve some of the long-standing issues in parsing wikitext thanks to the additional structure contained in the HTML code. The library will be integrated into the existing set of tools for working with Wikimedia resources as part of the mediawiki-utilities (such as mwsql, developed as part of a previous Outreachy project).
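To make the idea of "additional structure" concrete, the following sketch pulls template names out of an article's HTML with the standard-library `html.parser`. It is not the library this project set out to build. It assumes the Parsoid convention that template output is marked with `typeof="mw:Transclusion"` and carries a JSON `data-mw` attribute whose `parts` list names the template under `template.target.wt`. The class name, function name, and sample HTML are hypothetical.

```python
import json
from html.parser import HTMLParser


class TemplateCollector(HTMLParser):
    """Collect template names from elements marked as transclusions."""

    def __init__(self):
        super().__init__()
        self.templates = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        typeof = (attr.get("typeof") or "").split()
        if "mw:Transclusion" not in typeof or not attr.get("data-mw"):
            return
        try:
            data = json.loads(attr["data-mw"])
        except json.JSONDecodeError:
            return  # skip malformed metadata instead of failing the whole page
        if not isinstance(data, dict):
            return
        # Assumed structure: "parts" mixes plain strings and template objects.
        for part in data.get("parts", []):
            if isinstance(part, dict) and "template" in part:
                target = part["template"].get("target", {})
                self.templates.append(target.get("wt", "").strip())


def extract_templates(html):
    """Return the names of all templates found in an HTML string."""
    collector = TemplateCollector()
    collector.feed(html)
    collector.close()
    return collector.templates


# Made-up sample in the assumed Parsoid style.
SAMPLE_HTML = (
    '<table typeof="mw:Transclusion" about="#mwt1" '
    "data-mw='{\"parts\":[{\"template\":{\"target\":{\"wt\":\"Infobox city\","
    "\"href\":\"./Template:Infobox_city\"},\"params\":{},\"i\":0}}]}'>"
    "<tr><td>Berlin</td></tr></table><p>Berlin is a city.</p>"
)

print(extract_templates(SAMPLE_HTML))
```

In wikitext the same information has to be recovered by parsing nested `{{...}}` syntax, whereas here it is read from an attribute that is already structured as JSON.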

Specifically, the work will consist of the following (rough) phases:

  • Become familiar with html-dumps and common research tasks for the wikitext dumps
  • Write a library that provides an interface to work with html dumps and extract the most relevant features from an article
  • Write documentation for the library’s functionality, provide example notebooks as tutorials
  • Perform analysis on differences to output from wikitext-dumps
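The last phase, comparing HTML output with wikitext output, could start from something like the sketch below, which counts internal links in a wikitext snippet and in its rendered HTML. It uses only the Python standard library. The sample strings are made up, the regular expression is a rough approximation of wikitext link syntax (a real analysis would use a proper parser such as mwparserfromhell), and the assumption that internal links in the HTML carry `rel="mw:WikiLink"` follows the Parsoid convention rather than anything stated in this task.

```python
import re
from html.parser import HTMLParser

# Rough approximation of [[Target]], [[Target#Section]] and [[Target|label]].
WIKILINK_RE = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def wikitext_link_targets(wikitext):
    """Return targets of links written directly in the wikitext."""
    targets = []
    for match in WIKILINK_RE.finditer(wikitext):
        target = match.group(1).strip()
        # Skip namespaced links such as File: or Category: (a simplification).
        if ":" not in target:
            targets.append(target)
    return targets


class WikiLinkCollector(HTMLParser):
    """Collect href targets of anchors marked as internal wiki links."""

    def __init__(self):
        super().__init__()
        self.targets = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        rel = (attr.get("rel") or "").split()
        if tag == "a" and "mw:WikiLink" in rel:
            href = attr.get("href") or ""
            # Assumed convention: internal links are relative, e.g. "./Germany".
            self.targets.append(href[2:] if href.startswith("./") else href)


def html_link_targets(html):
    """Return targets of internal links found in the HTML."""
    collector = WikiLinkCollector()
    collector.feed(html)
    collector.close()
    return collector.targets


# Made-up example: the template expands to links that exist only in the HTML.
SAMPLE_WIKITEXT = "'''Berlin''' is the capital of [[Germany]]. {{Major German cities}}"
SAMPLE_HTML = (
    '<p><b>Berlin</b> is the capital of '
    '<a rel="mw:WikiLink" href="./Germany" title="Germany">Germany</a>.</p>'
    '<div><a rel="mw:WikiLink" href="./Hamburg" title="Hamburg">Hamburg</a> '
    '<a rel="mw:WikiLink" href="./Munich" title="Munich">Munich</a></div>'
)

in_wikitext = set(wikitext_link_targets(SAMPLE_WIKITEXT))
in_html = set(html_link_targets(SAMPLE_HTML))
print(len(in_wikitext), len(in_html))  # 1 3
print(sorted(in_html - in_wikitext))   # ['Hamburg', 'Munich']
```

The links that appear only in the HTML set are the ones introduced by template expansion, which is the kind of difference the description attributes to templates.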

Skills required

  • Familiarity with Python3, HTML, JSON
  • Jupyter notebooks
  • Technical documentation
  • Some curiosity for data-science/research questions

Possible mentor(s)

@MGerlach , @Isaac

Microtasks

See application task T302242

Event Timeline

MGerlach created this object with visibility "acl*outreachy-mentors (Project)".

@MGerlach @Isaac Also, upload the project on the Outreachy site whenever you feel ready, and I will then approve. Thank you!

Thanks for the reminder. I uploaded the project on the Outreachy site. Please let me know if anything is missing or you need additional information. Thanks!

srishakatux changed the visibility from "acl*outreachy-mentors (Project)" to "Public (No Login Required)". Mar 25 2022, 5:33 PM

Hello @MGerlach and @Isaac. Trust you both are doing great.

I am Olawale Ahmed, fresh CS Grad with some experience using Python.

Please, how do we contribute to this project?

Hi all. If you would like to contribute to this project during the application period, please see the application task T302242.
Don't hesitate to ask questions there. I will try to answer open questions, but feel free to help each other out too.

Hey @Isaac and @MGerlach!

Hope you are doing well.

This is Shivani Sangwan, an Outreachy applicant. I have good experience with Python and believe I would be able to give my best to the project. I wanted to start by working on the microtasks that you have suggested so I could gain knowledge of the codebase.

Hey @Isaac and @MGerlach!

Hope you are doing well.

This is Radhika Saini, an Outreachy applicant. I have good experience with Python, Jupyter notebooks, and HTML, and I am really looking forward to making meaningful contributions to this project. Can you please guide me on where we can begin or what to look into first?

Hi. If you would like to contribute to this project during the application period, please see the application task T302242.

Hey everyone!
I am Dhruvee Birla, an undergrad studying at IIIT, Hyderabad, India. I am pursuing an integrated degree: a Bachelor of Technology in Computer Science and a Master of Science by Research in Computing and Human Sciences. I am proficient in Python programming and have been working with libraries including Matplotlib, Pandas, NumPy, Scrapy, Scikit-learn, and PyGame.
I am looking forward to contributing to this project with you all and increasing my knowledge throughout this journey.

Welcome all and thanks for introducing yourselves! Good luck with T302242 and keep the questions coming there!

Hi everyone!
I am Sejal Singh, an Outreachy applicant. I am really excited to contribute and am looking forward to the opportunity.

Hey everyone! Hope everyone is keeping safe.
I am Shubhs, an Outreachy applicant. I am a math graduate with good experience in data analysis and Python. I am excited to step into open source projects and am looking forward to contributing to this one!

Hello everyone!

I hope that all of you are doing great.
This is Saumya, from IIT Roorkee, India.
Due to some unavoidable circumstances, I had to start a bit later compared to my other fellow applicants - I am sorry for that.
I really hope that in the remaining contribution period, I get to learn as much as possible from this project.

Warm Regards
Thank you

Welcome newer applicants -- still plenty of time and glad to see your interest!

Hi Everyone,

My name is Fatima Arshad. I completed my Software Engineering degree from NUST and Masters in CS from LUMS, Pakistan. I am currently working as a Data Scientist.
I hope I prove to be useful for this project.

Hello everyone! I am Diya Ahuja, a computer science undergraduate from IIIT Delhi, India. I am sorry I am starting pretty late because of some unforeseen circumstances. I am proficient in Python and HTML and have worked with Jupyter Notebooks. I hope to learn and make the most of the remaining contribution period. Thank You.

Hello Everyone,

I'm Abdelrahman Nawar, a senior Computer Engineering student, and I would love to work on this project.

@Isaac am I too late, or should I hop on the microtask and start working immediately?

Welcome to all the newer applicants. There is still some time left for the microtask (i.e. the application task T302242). The deadline for the final application is April 22. See T302242#7840521 for some additional comments.

Hello everyone! My name is Rachel Xie, and I am a second year undergraduate Computer Science student with good knowledge of Python. I know I'm quite late to this application, and I sincerely apologize for that. However, I'm still very interested in contributing. I hope I get to learn a lot from this alongside everyone!

Hello everyone! My name is Jeffrey Tang. My background is in Computational Biology. I worked with Python and R for the last 3 years and am quite familiar with Jupyter Notebooks. I am terribly late to this application as I only found out I was eligible for the internship this weekend. Apologies about this... I am looking forward to learning and contributing to this project.

Hi all,
just a reminder: if you have not done so already, don't forget to submit your final application on the Outreachy website before the deadline on Friday, April 22 at 4pm UTC (a little more than 3 days from when I am posting this).
Even if you sent your notebook to Isaac or me for feedback during the past weeks (thanks to everyone who shared their progress), you still need to submit the application on the Outreachy site. Please also make sure to include the public link to your notebook (see the documentation for how to get the public link).

Thanks for all the great contributions and discussions.

The Content-Transform-Team maintains the HTML format specifications (informally "Parsoid HTML", as opposed to the HTML currently displayed on the website), and may be a useful resource for questions about, e.g., how templates are represented in the HTML dump. Without distracting too much, the following projects might be an inspiration for how an "easy to use" API might look:

The Kiwix project also uses "Parsoid HTML" format dumps: https://www.kiwix.org/en/

MGerlach claimed this task.

Closing the task as the internship finished.

As part of the internship we built mwparserfromhtml, a Python library to parse the Wikipedia HTML dumps. You can find the code, more details, and usage instructions on GitLab: https://gitlab.wikimedia.org/repos/research/html-dumps