
Swap out wikitext for HTML in training quality model
Open, Needs Triage, Public


The training of the quality model was replicated in T360815. This task takes the next step: swapping the wikitext-based features for HTML-based features when training the model. It builds on the final notebook (with Chinese data included) from that previous task.

1. Getting started

  • Get familiar with the mwparserfromhtml library. It does a lot of things (many of which are irrelevant to this task), but there is a direct example of how we can replace wikitext-based feature extraction with HTML-based feature extraction for references. You will notice that the wikitext_to_refs function is very similar to the existing code for extracting references in get_article_features in your notebook. The html_to_refs replacement is simpler still, and that's what you'll be switching to.
  • Duplicate your notebook so you can update it while retaining a copy of your previous results/code.

2. Switch wikitext to HTML features

For a given article, the notebook currently fetches its wikitext and extracts the features from it. We want to instead fetch its HTML and extract the same features from that.

  • Replace get_article_wikitext with a function called get_article_html that takes the same parameters (lang and revid). I've already started this elsewhere, so you can reuse the code called get_article_parsoid in this notebook, but note that it fetches the current version of an article, not a specific revision ID. To get a specific revid, you'll have to switch to the revision-oriented API endpoint in the code. The function should return a string containing the HTML for an article revision.
  • Rewrite get_article_features to take the article HTML instead of wikitext. Each feature count that we return at the end ([page_length, refs, wikilinks, categories, media, headings]) will now need to be calculated using functions in mwparserfromhtml from the HTML.
  • Clean-up: a bunch of global variables that were used for wikitext processing can now also be removed (variables related to category/media/reference wikitext extraction). And presumably the re import can also be removed.
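A minimal sketch of what get_article_html could look like, assuming the MediaWiki core REST API route /w/rest.php/v1/revision/{id}/html is the revision-oriented endpoint to use. The helper build_revision_html_url, the opener parameter, and the User-Agent placeholder are illustrative assumptions and are not taken from get_article_parsoid.

```python
import urllib.request

# Placeholder User-Agent; replace it with something that identifies your notebook.
USER_AGENT = "quality-model-notebook (placeholder contact)"


def build_revision_html_url(lang, revid):
    """Return the revision-oriented URL for one article revision.

    Assumption: the MediaWiki core REST API route /v1/revision/{id}/html.
    """
    return f"https://{lang}.wikipedia.org/w/rest.php/v1/revision/{int(revid)}/html"


def get_article_html(lang, revid, opener=urllib.request.urlopen):
    """Fetch the HTML for a specific revision and return it as a string.

    `opener` is injectable only so the function can be exercised offline.
    """
    request = urllib.request.Request(
        build_revision_html_url(lang, revid),
        headers={"User-Agent": USER_AGENT},
    )
    with opener(request, timeout=30) as response:
        return response.read().decode("utf-8")
```

If the notebook already uses requests for get_article_wikitext, the same URL can be passed to requests.get instead; only the endpoint changes, not the function's parameters or return type.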

3. Replicate and compare!

  • Run the new code from start-to-finish!
  • In markdown at the top of the notebook, write a summary that includes your new model coefficients and how similar they are to the feature weights based on wikitext features.
  • Also incorporate a comparison of normalized feature distributions between the wikitext-based model and the HTML-based model. For example, for each feature, do its values range fully from 0 to 1 with many data points in the middle (good), or is most of the data either 0s or 1s (bad)?
  • Identify bugs, fix, repeat!
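One possible way to put numbers on the "good vs. bad" distribution check above is to report, per feature, the share of normalized values sitting at 0, at 1, and in between. This is only a sketch: the function names and the dictionary layout (feature name mapped to a list of normalized values) are assumptions, not something the notebook already defines.

```python
def summarize_distribution(values, eps=1e-9):
    """Return the share of normalized values at 0, at 1, and strictly in between."""
    values = list(values)
    if not values:
        raise ValueError("need at least one value")
    n = len(values)
    at_zero = sum(1 for v in values if v <= eps)
    at_one = sum(1 for v in values if v >= 1 - eps)
    return {
        "share_zero": at_zero / n,
        "share_one": at_one / n,
        "share_middle": (n - at_zero - at_one) / n,
    }


def compare_models(wikitext_features, html_features):
    """Summarize each shared feature side by side for both models.

    Both arguments map feature name -> list of normalized values.
    """
    return {
        name: {
            "wikitext": summarize_distribution(wikitext_features[name]),
            "html": summarize_distribution(html_features[name]),
        }
        for name in wikitext_features
        if name in html_features
    }
```

A feature whose share_middle is close to 0 in the HTML-based model but not in the wikitext-based one is a good candidate for the bug-hunting step.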

4. Optional explorations

  • Complete this issue on updating the mwparserfromhtml documentation. This will be very helpful for future users and is a good way to get accustomed to our GitLab infrastructure / code review process.

Event Timeline

@DJames-WMF can claim and start this task when T360815 is complete.

Added another step for the bug-fixing we're working on right now with 0-values for some of the features. I also unchecked the optional exploration -- that actually is separate from the notebook (it involves updating a README file in a code repository) so we can talk about it in a future meeting and decide whether to pick it up or not.

Here are the next steps for this notebook, based on Destinie's assessment (notebook) of how well-distributed each model feature is after switching to HTML. We have three features that are poorly distributed (values all lumped together), so the model cannot learn much from them. They are:

  • Page length: the values are all lumped around 1 because Parsoid HTML (with all of its syntax) is far more verbose than wikitext and by definition a superset of the wikitext. We don't have any perfect way of getting back to the wikitext length, but a more reasonable measure of article length is probably how much text it contains. So instead of len(article_html), let's use the get_plaintext() function and take the length of that. That function needs a number of settings to work appropriately, so let's use the approach used by html_to_plaintext() in this notebook with a few small tweaks:
    • Don't exclude List elements (they often have valid content from an article quality standpoint)
    • Take out the if len(paragraph.strip()) > 15 clause for each paragraph (we're just counting up things so I'm okay with the occasional "weird" paragraph)
    • Rather than doing the final if paragraphs: check, just use '\n'.join(paragraphs) for computing length -- this will just be an empty string (length 0) if no paragraphs.
  • Media: the reason they're lumping to 1s is probably because many articles have lots of little icons that aren't defined in the wikitext (transcluded via templates). These are inflating our counts of images in the article. I put together one heuristic to filter these out in the test cases and I think we can re-use that pixel-size logic here too (code). This should reduce our media counts back to where they're more evenly distributed.
  • Categories: Here the lumping towards 1 values likely is the result of hidden categories (usually transcluded via templates again and not in the wikitext). One way around this is to check each category returned by get_categories() to see if it was transcluded. There's an existing function in the library (example import statement) and then we can do something like len([1 for c in article.wikistew.get_categories() if not is_transcluded(c)]).
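The three fixes above can be sketched as small counting helpers that stay independent of the library calls. Everything here is an assumption for illustration: the helper names are hypothetical, get_plaintext() is assumed to yield (heading, paragraph) pairs as in html_to_plaintext(), and the media filter (both dimensions at or above a caller-chosen threshold) is a stand-in for the pixel-size heuristic from the test cases, not a copy of it.

```python
def plaintext_length(paragraph_pairs):
    """Length of the article's plaintext.

    `paragraph_pairs` is assumed to be an iterable of (heading, paragraph)
    tuples. No minimum-length filter is applied, and an article with no
    paragraphs gives an empty string, so the length is 0.
    """
    paragraphs = [paragraph.strip() for _heading, paragraph in paragraph_pairs]
    return len("\n".join(paragraphs))


def count_not_transcluded(elements, is_transcluded):
    """Count elements (e.g. categories) that the predicate does not flag.

    `is_transcluded` is passed in so this helper makes no claim about the
    library function's import path or signature.
    """
    return sum(1 for element in elements if not is_transcluded(element))


def count_media_at_least(dimensions, min_pixels):
    """Count media whose width and height both reach `min_pixels`.

    `dimensions` is an iterable of (width, height) pairs in pixels.
    `min_pixels` is a hypothetical threshold; pick it to match the
    heuristic actually used in the test cases.
    """
    return sum(
        1
        for width, height in dimensions
        if width >= min_pixels and height >= min_pixels
    )
```

In the notebook, the first argument to each helper would come from the corresponding article.wikistew call, and the predicate for categories would be the library's is_transcluded function.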

Additionally, I have another fix that I'd like to implement to bring the HTML model closer in line with the old wikitext-based one:

  • Switch article.wikistew.get_references() to article.wikistew.get_citations(). The former is the number of unique sources, while the latter is the number of in-line citations (documentation). The citation definition is what matches the old wikitext-based approach.
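To make the distinction concrete, here is a toy illustration in plain Python. It is not the library's implementation: it assumes each in-line citation can be represented by the identifier of the source it points to, and both function names are hypothetical.

```python
def count_citations(citation_targets):
    """Number of in-line citation markers; repeats of a source count each time."""
    return len(list(citation_targets))


def count_unique_sources(citation_targets):
    """Number of distinct sources; a source cited several times counts once."""
    return len(set(citation_targets))
```

An article that cites the same source in many places has a much higher citation count than source count, which is why the two definitions can produce different feature distributions.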

@DJames-WMF -- go ahead and make a copy of your current notebook (so we preserve this first iteration) and work on these changes. Then rerun the training and reassess the feature distributions and coefficients and we can decide next steps! We can talk about some of them in more depth too (what is is_transcluded() doing?) when we meet.