Improve documentation for mwparserfromhtml
Open, Needs Triage, Public

Description

The new quality model relies heavily on the mwparserfromhtml library, but the library is currently under-documented, which makes it difficult to know what is and is not possible with that code. We will focus on the tutorial notebook, but can optionally also update the docstrings as described below.

Tutorial Notebook

The primary task will be updating the tutorial notebook for the library. A few things to keep in mind for this:

  • Tutorials do not have to be comprehensive (i.e. the goal is not to show every function), but they should have a clear focus that shows a realistic way in which someone might use the library. We are going to use the article quality modeling work that Destinie has already been leading as the use case.
  • The code should focus as much on the mwparserfromhtml aspect as possible with as little extra code as required.
  • I'd suggest using a specific Wikipedia article as an example (and providing a screenshot of it in the tutorial). You'll want something that isn't too long, so that someone could easily count by hand e.g., how many references they see in the article and check that against what the library captures. But you also want something with enough content that it's a good example of e.g., images vs. icons and the difference between citations and sources. You can click through the random article generator until you find something suitable, or look for topics that interest you. Probably something similar to this, if not a bit shorter: https://en.wikipedia.org/wiki/Caladenia_xantha.
  • The work you did for redlinks (where you printed out each link name so we could inspect) is a nice way to show what the library is or is not capturing for the different features we use in the model.
  • Include lots of explanatory text about what the code is doing and why. Refer back to the screenshot as well, so that someone who is less familiar with the features can follow along with what is being extracted.
  • It doesn't have to be perfect (there is no perfect tutorial) -- just a good introduction for someone new to the library and we can always iterate.
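
As a rough illustration of the "print each link so we can inspect it" approach mentioned above, here is a minimal sketch that uses only the Python standard library, not mwparserfromhtml itself. It assumes the article HTML follows the Parsoid conventions of marking internal links with rel="mw:WikiLink" and redlinks with class="new"; the names WikiLinkLister, list_wikilinks, and SAMPLE_HTML are hypothetical and exist only for this example. In the tutorial, a listing like this could serve as an independent cross-check against whatever the library reports for the chosen article.

```python
from html.parser import HTMLParser


class WikiLinkLister(HTMLParser):
    """Collect (href, is_redlink) for every <a rel="mw:WikiLink"> anchor.

    Assumption: Parsoid-style HTML, where internal links carry
    rel="mw:WikiLink" and links to missing pages carry class="new".
    """

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        # Attribute values can be None (bare attributes), hence the "or".
        rel_tokens = (attr.get("rel") or "").split()
        if "mw:WikiLink" not in rel_tokens:
            return  # skip external links and other anchors
        class_tokens = (attr.get("class") or "").split()
        self.links.append((attr.get("href"), "new" in class_tokens))


def list_wikilinks(html):
    """Return a list of (href, is_redlink) tuples found in the HTML string."""
    parser = WikiLinkLister()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical snippet standing in for a real article's HTML.
SAMPLE_HTML = (
    '<p><a rel="mw:WikiLink" href="./Orchid" title="Orchid">orchid</a> and '
    '<a rel="mw:WikiLink" href="./Missing_page?action=edit&amp;redlink=1" '
    'class="new">missing</a> and '
    '<a rel="mw:ExtLink" href="https://example.org">external</a></p>'
)

# Print one line per internal link so it can be compared with the screenshot.
for href, is_redlink in list_wikilinks(SAMPLE_HTML):
    print("redlink " if is_redlink else "bluelink", href)
```

Printing one line per extracted element keeps the output short enough to compare by eye with the article screenshot, which is why a short example article matters.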

Docstrings (optional)

An optional extension is to improve the library's docstrings. I'd focus on the get_<element> functions in article.py, as described in the linked issue, but feel free to make improvements elsewhere if you see problems.