
[medium] Develop good practices for Wikisource processing
Open, Needs Triage, Public

Description

Overview

Wikisource is a Wikimedia project with over 70 language editions. It functions as an online digital library of free-content textual sources and supports collaboratively edited transcriptions of scanned documents, books, etc. This makes it a rich source of text and, in particular, one that can digitize previously inaccessible texts (e.g., Balinese palm leaf manuscripts) and expand the diversity and quantity of sources available to Wikipedians.

Researchers have paid less attention to Wikisource, however, even as they make extensive use of data and content from Wikipedia, Wikimedia Commons, and Wikidata. Although the Wikisource dumps (e.g., English) are in the same format as those used by Wikipedia, this neglect is likely due in part to a lack of Wikisource-specific tooling and of examples of using Wikisource in research.
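Because the dumps share the Wikipedia export format, a reader can be sketched with the Python standard library alone. The following is a minimal illustration, not a recommended implementation: the function name `iter_pages`, the sample XML, and the page titles are all hypothetical, and the namespace number 102 for author pages is an assumption based on English Wikisource.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, ns, wikitext) for each <page> in a MediaWiki XML export.

    The stream is parsed incrementally so that a full dump never has to be
    held in memory at once.
    """
    title, ns, text = None, None, ""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "ns":
            ns = int(elem.text)
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, ns, text
            title, ns, text = None, None, ""
            elem.clear()  # free the children of the page we just finished


# Tiny hand-written stand-in for a dump (hypothetical titles and content).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Hypothetical Poem</title><ns>0</ns><id>1</id>
    <revision><id>10</id><text>Some ''text''.</text></revision></page>
  <page><title>Author:Jane Example</title><ns>102</ns><id>2</id>
    <revision><id>11</id><text>{{author}}</text></revision></page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
for title, ns, text in pages:
    print(ns, title, len(text))
```

For a real compressed dump, the same function should work on a file object opened with `bz2.open(path, "rb")`, where `path` points to a downloaded pages-articles file.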

Task

You will develop a Python notebook tutorial showing how to work with Wikisource content dumps and extract and preprocess content for use in natural language processing research (or other domains). This notebook will be hosted on PAWS and has the following goals:

  • Showcase options for filtering to different types of content in Wikisource -- i.e. which namespaces and pages contain content?
  • Demonstrate how to extract natural language text from the Wikisource wikitext -- i.e. removing various types of markup.
  • Stretch goal: show how to link documents with their index and author pages
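The three goals above can be sketched roughly as follows. This is only an illustration under stated assumptions: the namespace numbers are those believed to apply to English Wikisource (other editions may differ, so they should be read from the dump's siteinfo block), the regular expressions are a crude approximation of wikitext parsing, and all helper names, titles, and sample text are hypothetical. A dedicated parser such as mwparserfromhell would likely be more robust for the markup-removal step.

```python
import re

# Assumption: namespace numbers as on English Wikisource; verify per edition.
NS_MAIN, NS_AUTHOR, NS_PAGE, NS_INDEX = 0, 102, 104, 106


def strip_wikitext(text):
    """Very rough markup removal; drops templates, refs, tags, and link syntax."""
    text = re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)             # comments
    text = re.sub(r"<ref[^>/]*>.*?</ref>", "", text, flags=re.DOTALL)   # footnotes
    previous = None
    while previous != text:  # remove templates innermost-first
        previous = text
        text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    text = re.sub(r"<[^>]+>", "", text)                                 # other tags
    text = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]", r"\1", text)  # links
    text = re.sub(r"'{2,}", "", text)                                   # bold/italic
    return re.sub(r"[ \t]+", " ", text).strip()


# ProofreadPage transclusion tag, e.g. <pages index="Foo.djvu" from=1 to=3 />
PAGES_TAG = re.compile(
    r"<pages\b[^>]*\bindex\s*=\s*(\"[^\"]*\"|'[^']*'|[^\s/>]+)", re.IGNORECASE
)


def transcluded_indexes(wikitext):
    """Index pages that a main-namespace page pulls its text from."""
    return ["Index:" + m.group(1).strip("\"'") for m in PAGES_TAG.finditer(wikitext)]


def index_for_page_title(title):
    """Map 'Page:Foo.djvu/12' to 'Index:Foo.djvu' (assumes the usual naming)."""
    prefix, _, rest = title.partition(":")
    if prefix != "Page":
        return None
    return "Index:" + rest.rsplit("/", 1)[0]


def author_links(wikitext):
    """Targets of explicit [[Author:...]] links (English prefix assumed)."""
    return re.findall(r"\[\[(Author:[^\[\]|#]+)", wikitext)


# Hypothetical sample pages as (title, ns, wikitext) tuples.
pages = [
    ("Hypothetical Work", NS_MAIN,
     '{{header|title=Hypothetical Work}}<pages index="Hypothetical.djvu" from=1 to=2 />'),
    ("Page:Hypothetical.djvu/1", NS_PAGE,
     "''Hello'' [[Author:Jane Example|Jane]], see [[Paris]].<ref>note</ref>"),
    ("Author:Jane Example", NS_AUTHOR, "{{author}}"),
]

# Goal 1: keep only the namespace that holds transcribed text.
# Goal 2: strip markup from what remains.
text_pages = [(t, strip_wikitext(w)) for t, ns, w in pages if ns == NS_PAGE]
print(text_pages)

# Stretch goal: connect works and transcribed pages to index and author pages.
print(transcluded_indexes(pages[0][2]))
print(index_for_page_title(pages[1][0]), author_links(pages[1][2]))
```

Note that removing every template is a blunt choice: on Wikisource, templates can carry visible text (headers, formatting wrappers), so a real tutorial would probably need to decide template by template what to keep.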

This task is considered [medium]. In general, it is expected to take a month or two of consistent work and is a good fit for someone with some research experience or an interest in being involved in research. The actual time needed, however, will depend greatly on your level of experience.

Rationale

For researchers who are interested in (or would benefit from) working with Wikisource, there are few examples to build on. This makes for a high barrier to entry, and a simple tutorial would greatly improve the accessibility of Wikisource as a resource for research. The hope is that this would also bring further attention to Wikisource.

Recommended Skills

  • This task will require some basic understanding of Python.
  • No prior experience with Wikisource is necessary, but in that case the first stage of this work will be familiarizing yourself with the project.
  • Additional experience with the following is helpful but not necessary:
    • Jupyter Notebooks
    • Tutorial design and good coding / documentation style
    • Wikisource

Acceptance Criteria

  • The output of this task will be a public Python notebook hosted on PAWS (similar example).
  • The notebook should meet the task goals / rationale laid out above. You likely won't know everything that is needed to complete the tutorial, but @Isaac will be able to help you with completing those sections.
  • The notebook should be open-licensed (see example).

Process

  • This task is currently reserved -- please do not assign it to yourself without asking @Isaac first.
  • When you would like feedback on the notebook, include a public copy of your current notebook and let @Isaac know so that he can take a look.
  • In the future, if changes are needed, a copy of your notebook may be created and edited, but a link back to the original and acknowledgment of your work will always be kept with the current version.
  • Generally, @Isaac will be able to answer any questions about the task and will try to respond quickly when clarification is necessary, but response times may be slow if help is needed for more general debugging, etc.

Resources

Event Timeline

Isaac renamed this task from Develop Python tutorial for Article Topic Dataset to [short] Develop Python tutorial for Article Topic Dataset. (Jul 27 2021, 3:22 PM)
Isaac updated the task description.
Isaac renamed this task from [short] Develop Python tutorial for Article Topic Dataset to [medium] Develop good practices for Wikisource processing. (May 12 2022, 3:02 PM)
Isaac updated the task description.
Isaac removed a subscriber: Pablo.