When analyzing Wikipedia’s content for a research project or for training large language models, researchers typically use the publicly available Wikimedia database dumps. These contain, for example, the content of every Wikipedia article in each of the more than 300 language versions. For example, the February 2022 snapshot of the English Wikipedia is contained in: enwiki-20220201-pages-articles-multistream.xml.bz2. The content of Wikipedia articles is written in a markup language called wikitext, which the MediaWiki software translates into HTML to be displayed to readers. Researchers can work with either the raw wikitext markup or the parsed HTML of an article, but most work with the wikitext because it has long been accessible via the dumps. However, working with the wikitext has several drawbacks:
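Because the compressed dumps are tens of gigabytes, they are usually streamed rather than loaded whole. A minimal sketch of that pattern, using only the standard library and a tiny in-memory stand-in for a real dump (note that real dump files declare an XML namespace, which would have to be handled when matching tags):

```python
import bz2
import io
import xml.etree.ElementTree as ET

# a tiny stand-in for a pages-articles dump; real dumps are far too large
# to parse into a single in-memory tree
sample = b"""<mediawiki>
  <page><title>Foo</title><revision><text>'''Foo''' is ...</text></revision></page>
  <page><title>Bar</title><revision><text>[[Foo|bar]]</text></revision></page>
</mediawiki>"""
raw = bz2.compress(sample)

titles = []
# BZ2File decompresses on the fly; iterparse yields one element at a time,
# so each <page> can be processed and freed before the next is read
for _, elem in ET.iterparse(bz2.BZ2File(io.BytesIO(raw))):
    if elem.tag == "page":
        titles.append(elem.findtext("title"))
        elem.clear()  # release the subtree to keep memory flat

print(titles)  # ['Foo', 'Bar']
```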
- Parsing the wikitext is not trivial. There are some great parsers, such as mwparserfromhell, that make this task a lot easier, but there are still known issues in correctly parsing the wikitext, for example the handling of lists, or of images and interwiki links. Using the MediaWiki APIs or scraping Wikipedia directly for the HTML is computationally expensive at scale and discouraged for large projects.
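To see why dedicated parsers exist, consider a naive regex approach to extracting wikilinks. The sketch below works for simple cases but quietly ignores template parameters and breaks on nested constructs, which is exactly the class of problem that makes wikitext parsing non-trivial:

```python
import re

wikitext = "See [[Python (programming language)|Python]] and {{cite web|url=...}}."

# naive pattern for [[target|label]] links; it cannot handle nested
# brackets or links produced inside templates, which is why libraries
# like mwparserfromhell implement a real parser instead
links = re.findall(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]", wikitext)

print(links)  # ['Python (programming language)']
```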
- Some elements contained in the HTML version of an article are not readily available in the wikitext due to the use of, e.g., templates. As a result, researchers who parse only the wikitext might miss important content that is displayed to readers. For example, Mitrevski et al. found for English Wikipedia that of the 475M internal links in the HTML versions of articles, only 171M (36%) were present in the wikitext (see the paper for more details on the important differences between the wikitext and HTML versions of Wikipedia articles).
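A small illustration of this gap, using a made-up article fragment: the {{Main}} template produces a visible link in the rendered page, yet the wikitext itself contains no [[...]] wikilink at all, so a wikitext-only analysis would count zero links here while the HTML contains one:

```python
import re
from html.parser import HTMLParser

# the {{Main}} template renders as a link, but the wikitext has no [[...]]
wikitext = "{{Main|History of Python}}\nPython is a programming language."
html = ('<div role="note">Main article: '
        '<a rel="mw:WikiLink" href="./History_of_Python">History of Python</a></div>'
        '<p>Python is a programming language.</p>')

wikitext_links = re.findall(r"\[\[([^\]|]+)", wikitext)  # finds nothing

class Links(HTMLParser):
    """Collect hrefs of all <a> tags."""
    def __init__(self):
        super().__init__()
        self.hrefs = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.append(dict(attrs).get("href"))

p = Links()
p.feed(html)
print(len(wikitext_links), len(p.hrefs))  # 0 1
```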
Thus, in general, it is often desirable to work with an HTML version of the dumps instead of the wikitext. Fortunately, the Wikimedia Enterprise HTML dumps have recently been introduced and made publicly available with regular monthly updates, so researchers may use them in their work.
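The Enterprise dumps package articles as NDJSON, one JSON object per line, with the rendered HTML nested inside each record. A minimal reader might look like the sketch below; the field names (`name`, `article_body`) reflect the published schema but should be verified against the actual dump files, and the two inline records here are fabricated for illustration:

```python
import json

# two fake records mimicking the one-JSON-object-per-line dump layout
lines = [
    json.dumps({"name": "Foo", "article_body": {"html": "<p>Foo.</p>"}}),
    json.dumps({"name": "Bar", "article_body": {"html": "<p>Bar.</p>"}}),
]

def iter_articles(lines):
    """Yield (title, html) pairs one article at a time, so the whole
    dump never needs to be held in memory."""
    for line in lines:
        record = json.loads(line)
        yield record["name"], record["article_body"]["html"]

for title, html in iter_articles(lines):
    print(title)
```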
Therefore, the aim of this project is to write a Python library to efficiently parse the HTML code of an article from the Wikimedia Enterprise dumps and extract relevant elements such as text, links, templates, etc. This will lower the technical barriers to working with the HTML dumps and empower researchers and others to take advantage of this beneficial resource. In addition, the tool might solve some of the long-standing issues in parsing wikitext, thanks to the additional structure contained in the HTML code. The library will be integrated into the existing set of tools for working with Wikimedia resources as part of mediawiki-utilities (such as mwsql, developed as part of a previous Outreachy project).
Specifically, the work will consist of the following (rough) phases:
- Become familiar with the HTML dumps and with common research tasks performed on the wikitext dumps
- Write a library that provides an interface to work with the HTML dumps and extract the most relevant features from an article
- Write documentation for the library’s functionality and provide example notebooks as tutorials
- Perform an analysis of how the output differs from that of the wikitext dumps
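To make the second phase concrete, here is a rough sketch of what an extraction interface might look like, built on the standard library's html.parser. All names are hypothetical, not the actual API of the planned library, and the `./Title` href convention assumed for internal links follows Parsoid-style HTML output:

```python
from html.parser import HTMLParser

class ArticleExtractor(HTMLParser):
    """Hypothetical sketch: pull plain text and internal links
    out of one article's HTML."""
    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        # assume internal links use Parsoid-style './Title' hrefs
        if tag == "a":
            href = dict(attrs).get("href", "")
            if href.startswith("./"):
                self.links.append(href[2:])

    def handle_data(self, data):
        self.text_parts.append(data)

    @property
    def text(self):
        return "".join(self.text_parts)

ex = ArticleExtractor()
ex.feed('<p>An <a href="./Example">example</a> sentence.</p>')
print(ex.text)   # 'An example sentence.'
print(ex.links)  # ['Example']
```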
Desired skills:
- Familiarity with Python 3, HTML, and JSON
- Jupyter notebooks
- Technical documentation
- Some curiosity for data-science/research questions
See application task T302242