
Extend Article Quality Model to use HTML
Open, Needs Triage, Public

Description

This task covers the internship project to convert the wikitext-based article quality model to use Parsoid HTML as its source instead. This will have several benefits.

This task is an umbrella task for the individual steps in this project.

Event Timeline

@DJames-WMF has made progress on converting the wikitext features over to HTML features. We're finding that the old normalization values -- e.g., how many references are expected in a top-quality article for a given wiki -- are no longer well-aligned for a few features. This seems to be most relevant for page length, which then affects wikilinks and references as well. I'll need to look into re-generating these normalization values. A few options:

  • Use the APIs to fetch HTML for a random sample of articles to re-calibrate the values (see the sketch after this list). Sample size could be a challenge, though, because we're looking at the 95th percentile, so we need a large enough sample for that tail to be stable.
  • Slowly loop through the whole Enterprise HTML dump -- this would take a very long time, and in my experience the article ordering is not random, so unfortunately we can't stop early without biasing the result.
  • Load a snapshot of the HTML dumps for the relevant languages into HDFS and process in parallel -- this is probably the most sensible solution because then we can re-use the data if we ever need to come back and recompute a value.
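
To make the first option concrete, here is a minimal sketch of re-calibrating one normalization value (reference count) from a random article sample. It assumes the public MediaWiki random-page API and the REST page/html endpoint; counting references relies on Parsoid tagging <ref> output with typeof="mw:Extension/ref", and the sample size and helper names are illustrative, not the project's actual pipeline.

```
import re
from urllib.parse import quote

import numpy as np
import requests

def random_titles(lang: str, n: int) -> list[str]:
    """Draw n random main-namespace titles via the MediaWiki API."""
    titles = []
    while len(titles) < n:
        r = requests.get(
            f"https://{lang}.wikipedia.org/w/api.php",
            params={
                "action": "query", "format": "json", "list": "random",
                "rnnamespace": 0, "rnlimit": min(500, n - len(titles)),
            },
        )
        titles += [p["title"] for p in r.json()["query"]["random"]]
    return titles

def count_refs(parsoid_html: str) -> int:
    # Parsoid marks <ref> output with typeof="mw:Extension/ref"
    return len(re.findall(r"mw:Extension/ref", parsoid_html))

def p95_refs(lang: str = "ar", sample_size: int = 5000) -> float:
    """95th-percentile reference count over a random article sample."""
    counts = []
    for title in random_titles(lang, sample_size):
        url = (f"https://{lang}.wikipedia.org/api/rest_v1/page/html/"
               f"{quote(title.replace(' ', '_'), safe='')}")
        resp = requests.get(url)  # NB: rate-limit politely in practice
        if resp.ok:
            counts.append(count_refs(resp.text))
    # The normalization value is the 95th percentile; its stability is
    # exactly the sample-size concern raised above.
    return float(np.percentile(counts, 95))
```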

Weekly updates:

  • I started on loading the HTML dumps into HDFS (code courtesy of Fabian) -- this is working well: I tested with Arabic and was quite happy with how quickly it processed, though loading English is taking some time. A sketch of the parallel processing this enables follows this list.
  • Destinie is working out some kinks in our HTML features.
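
A minimal sketch of why the HDFS route pays off: once a dump snapshot sits on HDFS as JSON lines, Spark can recompute a percentile for any feature in one parallel pass, and the data can be re-used later. The path and the article_body.html field follow my reading of the Enterprise HTML dump schema; treat both, and the app name, as assumptions.

```
import re

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.appName("quality-normalization").getOrCreate()

@F.udf(IntegerType())
def n_refs(html):
    # Parsoid tags <ref> output with typeof="mw:Extension/ref"
    return len(re.findall(r"mw:Extension/ref", html or ""))

# hypothetical HDFS location for one wiki's dump snapshot
articles = spark.read.json("hdfs:///user/.../arwiki-html-dump.ndjson")

(articles
 .withColumn("refs", n_refs(F.col("article_body.html")))
 .agg(F.expr("percentile_approx(refs, 0.95)").alias("p95_refs"))
 .show())
```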

Weekly updates:

  • New normalization values generated! code and values.
  • Destinie has made good progress on incorporating these into the notebook, and I think we're essentially in a place where we can begin to add new features to the model; I put together a task for that (T364014). She has also begun work on improving some of the mwparserfromhtml documentation (MRs) so she'll have a better sense of what the library can do re: potential new features. A sketch of pulling features directly from the Parsoid HTML follows this list.
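
For illustration, here is a hedged sketch of extracting quality features straight out of Parsoid markup. It uses BeautifulSoup rather than the project's actual notebook code or mwparserfromhtml; the selectors come from the Parsoid DOM spec (rel="mw:WikiLink" for wikilinks, typeof="mw:Extension/ref" for references, rel="mw:PageProp/Category" for categories), and capping at the 95th percentile mirrors the normalization values generated above.

```
from bs4 import BeautifulSoup

def html_features(parsoid_html: str) -> dict:
    """Count raw quality features from Parsoid HTML."""
    soup = BeautifulSoup(parsoid_html, "html.parser")
    return {
        "page_length": len(soup.get_text()),
        "wikilinks": len(soup.select('a[rel~="mw:WikiLink"]')),
        "references": len(soup.select('[typeof~="mw:Extension/ref"]')),
        "categories": len(soup.select('link[rel~="mw:PageProp/Category"]')),
        "sections": len(soup.find_all("section")),
    }

def normalize(features: dict, p95: dict) -> dict:
    """Divide each raw count by the wiki's 95th-percentile value and
    cap at 1 so every feature lands in [0, 1]."""
    return {k: min(1.0, v / p95[k]) for k, v in features.items()}
```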

Weekly updates:

  • Sources is proving tricky as a feature, so I left some ideas for how to "debug" what's going on there in T364014#9810129
  • Documentation work is continuing! Details: T365269

Weekly updates:

  • Provided feedback on Destinie's tutorial notebook for mwparserfromhtml (MR) and was excited that her work surfaced a bug in the library (though it's my code that is buggy). Many eyes!

Weekly updates:

  • Merged Destinie's tutorial notebook MR (!) and assigned the bug as a next step.
  • Put together an exploration of a different type of model (ordinal logistic regression instead of linear regression); see the sketch after this list. This came out of Destinie's results and pair-plot analyses, which were showing that as the number of features grew, the "effect while holding other variables constant" aspect of linear regression was leading to coefficients whose interpretation didn't match reality -- e.g., fewer sources -> higher quality. At first the thought was to go for something like Naive Bayes, where the coefficients are learned independently, but I could not find a model of that type that did linear regression, so I went back to the drawing board to reconsider what an ordinal logistic regression model would look like. I had originally abandoned that approach because the coefficients were less interpretable, it wasn't clear to me that I could effectively convert the class probabilities to a single point prediction between 0 and 1 (the desired output), and I was hoping to avoid the semi-complicated statsmodels dependency on LiftWing. In revisiting it, I figured out how to reproduce the model outputs without the statsmodels dependency and to convert the logits generated by the model into a reasonable 0-1 range in a reproducible and not purely arbitrary way. The coefficients are still harder to interpret than the linear model's, but they're pretty straightforward (positive = good; negative = bad) and they match expectations. An initial eval of model performance suggests that the ordinal approach matches or beats the linear one. Notebook: https://public-paws.wmcloud.org/User:Isaac%20(WMF)/Quality/html-qual-exploration.ipynb#Ordinal-Logistic-Regression
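
To make the statsmodels-free prediction step concrete, here is a minimal numpy sketch of a fitted proportional-odds (ordinal logistic regression) model: P(y <= j) = sigmoid(theta_j - x.beta), with class probabilities as differences of adjacent cumulative probabilities. The coefficients, cutpoints, and feature vector below are hypothetical, and collapsing the class probabilities to [0, 1] via the expected class index is one reasonable reading of the "reproducible and not purely arbitrary" mapping, not necessarily the notebook's exact choice.

```
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def class_probs(x, coefs, cutpoints):
    """P(quality class = j) under P(y <= j) = sigmoid(theta_j - x.beta)."""
    eta = x @ coefs                    # single linear predictor
    cdf = sigmoid(cutpoints - eta)     # one cumulative prob per cutpoint
    cdf = np.concatenate(([0.0], cdf, [1.0]))
    return np.diff(cdf)                # adjacent differences = class probs

def quality_score(x, coefs, cutpoints):
    """Collapse class probabilities to a single score in [0, 1] via
    the expected class index."""
    p = class_probs(x, coefs, cutpoints)
    k = len(p)
    return float(p @ (np.arange(k) / (k - 1)))

# e.g., 6 ordered classes (Stub ... FA) need 5 increasing cutpoints;
# all values below are hypothetical, not fitted coefficients
coefs = np.array([1.2, 0.8, 0.5])
cutpoints = np.array([-2.0, -0.5, 0.8, 2.0, 3.5])
x = np.array([0.4, 0.7, 0.2])          # normalized features in [0, 1]
print(quality_score(x, coefs, cutpoints))
```

The only runtime dependency here is numpy, which sidesteps hosting the heavier statsmodels stack on LiftWing.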

Weekly updates:

  • I updated the quality API so we could get predictions from all three models under consideration (wikitext-linear-regression; HTML-linear-regression; HTML-ordinal-logistic-regression). Example: https://misalignment.wmcloud.org/api/v1/quality-revid-compare?lang=en&revid=1228403723. Destinie will now be able to use that endpoint to collect model predictions and do a final comparison to choose which is best for uploading to LiftWing! A minimal client sketch follows this list.
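
As an illustration of collecting predictions from that endpoint, here is a small client sketch. The query parameters match the example URL above, but the shape of the JSON response is an assumption, so inspect the actual payload before relying on specific field names.

```
import requests

URL = "https://misalignment.wmcloud.org/api/v1/quality-revid-compare"

def compare(lang: str, revid: int) -> dict:
    """Fetch all three models' quality predictions for one revision."""
    r = requests.get(URL, params={"lang": lang, "revid": revid})
    r.raise_for_status()
    return r.json()

# e.g., the example revision above; assumed response: model name -> score
for model, prediction in compare("en", 1228403723).items():
    print(model, prediction)
```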

Weekly updates:

  • None -- waiting on final model comparison before deciding what should be hosted on LiftWing. ML Platform did get an initial version on staging, which is an exciting step towards deployment!