When analyzing Wikipedia’s content for a research project or for training large language models, researchers typically use the publicly available Wikimedia database dumps. These contain, for example, the content of every Wikipedia article in each of the more than 300 language versions. For example, the February 2022 snapshot of the English Wikipedia is contained in: enwiki-20220201-pages-articles-multistream.xml.bz2. The content of Wikipedia articles is written in a markup language called wikitext, which the MediaWiki software translates into HTML to be displayed to readers. Researchers can work with either the raw wikitext markup or the parsed HTML of an article, but most work with the wikitext because it has long been accessible via the dumps. However, working with the wikitext has several drawbacks:
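Because the compressed dumps are tens of gigabytes, they are usually streamed rather than loaded whole. A minimal sketch of that pattern, using only the standard library and a tiny in-memory stand-in for a real dump (note that real dump files declare an XML namespace, which would have to be handled when matching tags):

```python
import bz2
import io
import xml.etree.ElementTree as ET

# a tiny stand-in for a pages-articles dump; real dumps are far too large
# to parse into a single in-memory tree
sample = b"""<mediawiki>
  <page><title>Foo</title><revision><text>'''Foo''' is ...</text></revision></page>
  <page><title>Bar</title><revision><text>[[Foo|bar]]</text></revision></page>
</mediawiki>"""
raw = bz2.compress(sample)

titles = []
# BZ2File decompresses on the fly; iterparse yields one element at a time,
# so each <page> can be processed and freed before the next is read
for _, elem in ET.iterparse(bz2.BZ2File(io.BytesIO(raw))):
    if elem.tag == "page":
        titles.append(elem.findtext("title"))
        elem.clear()  # release the subtree to keep memory flat

print(titles)  # ['Foo', 'Bar']
```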
- Parsing the wikitext is not trivial. There are some great parsers, such as mwparserfromhell, that make this task a lot easier, but there are still known issues in correctly parsing the wikitext, for example the handling of lists, or of images and interwiki links. Using the MediaWiki APIs or scraping Wikipedia directly for the HTML is computationally expensive at scale and discouraged for large projects.
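To see why dedicated parsers exist, consider a naive regex approach to extracting wikilinks. The sketch below works for simple cases but quietly ignores template parameters and breaks on nested constructs, which is exactly the class of problem that makes wikitext parsing non-trivial:

```python
import re

wikitext = "See [[Python (programming language)|Python]] and {{cite web|url=...}}."

# naive pattern for [[target|label]] links; it cannot handle nested
# brackets or links produced inside templates, which is why libraries
# like mwparserfromhell implement a real parser instead
links = re.findall(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]", wikitext)

print(links)  # ['Python (programming language)']
```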
- Some elements contained in the HTML version of an article are not readily available in the wikitext due to the use of, e.g., templates. As a result, researchers who parse only the wikitext might miss important content that is displayed to readers. For example, Mitrevski et al. found for English Wikipedia that of the 475M internal links in the HTML versions of articles, only 171M (36%) were present in the wikitext (see the paper for more details on the important differences between the wikitext and HTML versions of Wikipedia articles).
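A small illustration of this gap, using a made-up article fragment: the {{Main}} template produces a visible link in the rendered page, yet the wikitext itself contains no [[...]] wikilink at all, so a wikitext-only analysis would count zero links here while the HTML contains one:

```python
import re
from html.parser import HTMLParser

# the {{Main}} template renders as a link, but the wikitext has no [[...]]
wikitext = "{{Main|History of Python}}\nPython is a programming language."
html = ('<div role="note">Main article: '
        '<a rel="mw:WikiLink" href="./History_of_Python">History of Python</a></div>'
        '<p>Python is a programming language.</p>')

wikitext_links = re.findall(r"\[\[([^\]|]+)", wikitext)  # finds nothing

class Links(HTMLParser):
    """Collect hrefs of all <a> tags."""
    def __init__(self):
        super().__init__()
        self.hrefs = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.append(dict(attrs).get("href"))

p = Links()
p.feed(html)
print(len(wikitext_links), len(p.hrefs))  # 0 1
```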
Thus, in general, it is often desirable to work with an HTML version of the dumps instead of the wikitext. Fortunately, the Wikimedia Enterprise HTML dumps have recently been introduced and made publicly available with regular monthly updates, so researchers may use them in their work.
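The Enterprise dumps package articles as NDJSON, one JSON object per line, with the rendered HTML nested inside each record. A minimal reader might look like the sketch below; the field names (`name`, `article_body`) reflect the published schema but should be verified against the actual dump files, and the two inline records here are fabricated for illustration:

```python
import json

# two fake records mimicking the one-JSON-object-per-line dump layout
lines = [
    json.dumps({"name": "Foo", "article_body": {"html": "<p>Foo.</p>"}}),
    json.dumps({"name": "Bar", "article_body": {"html": "<p>Bar.</p>"}}),
]

def iter_articles(lines):
    """Yield (title, html) pairs one article at a time, so the whole
    dump never needs to be held in memory."""
    for line in lines:
        record = json.loads(line)
        yield record["name"], record["article_body"]["html"]

for title, html in iter_articles(lines):
    print(title)
```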
Therefore, the aim of this project is to write a Python library to efficiently parse the HTML code of an article from the Wikimedia Enterprise dumps and extract relevant elements such as text, links, templates, etc. This will lower the technical barriers to working with the HTML dumps and empower researchers and others to take advantage of this beneficial resource. In addition, the tool might solve some of the long-standing issues in parsing wikitext, thanks to the additional structure contained in the HTML code. The library will be integrated into the existing set of tools for working with Wikimedia resources as part of mediawiki-utilities (such as mwsql, developed as part of a previous Outreachy project).
Specifically, the work will consist of the following (rough) phases:
- Become familiar with the HTML dumps and with common research tasks performed on the wikitext dumps
- Write a library that provides an interface to work with the HTML dumps and extract the most relevant features from an article
- Write documentation for the library’s functionality and provide example notebooks as tutorials
- Perform an analysis of how the output differs from that of the wikitext dumps
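To make the second phase concrete, here is a rough sketch of what an extraction interface might look like, built on the standard library's html.parser. All names are hypothetical, not the actual API of the planned library, and the `./Title` href convention assumed for internal links follows Parsoid-style HTML output:

```python
from html.parser import HTMLParser

class ArticleExtractor(HTMLParser):
    """Hypothetical sketch: pull plain text and internal links
    out of one article's HTML."""
    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        # assume internal links use Parsoid-style './Title' hrefs
        if tag == "a":
            href = dict(attrs).get("href", "")
            if href.startswith("./"):
                self.links.append(href[2:])

    def handle_data(self, data):
        self.text_parts.append(data)

    @property
    def text(self):
        return "".join(self.text_parts)

ex = ArticleExtractor()
ex.feed('<p>An <a href="./Example">example</a> sentence.</p>')
print(ex.text)   # 'An example sentence.'
print(ex.links)  # ['Example']
```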
Desired skills:
- Familiarity with Python 3, HTML, and JSON
- Jupyter notebooks
- Technical documentation
- Some curiosity for data-science/research questions
See application task T302242