
Extend Article Quality Model to use HTML
Open, Needs Triage, Public

Description

This task covers the internship project to convert the wikitext-based article quality model to use Parsoid HTML as its source instead. This will have several benefits.

This task is an umbrella task for the individual steps in this project.

Event Timeline

@DJames-WMF has made progress on converting the wikitext features over to HTML features. We're finding that the old normalization values -- e.g., how many references are expected in a top-quality article for a given wiki -- are no longer well-aligned for a few features. This seems to be most relevant for page length, which then affects wikilinks and references as well. I'll need to look into re-generating these normalization values. A few options:

  • Use the APIs to fetch HTML for a random sample of articles to re-calibrate the values (see the sketch after this list). Sample size could be a challenge, though, because we're looking at the 95th percentile, so we need a large enough sample for that tail to be stable.
  • Slowly loop through the whole Enterprise HTML dump -- this would take a very long time, and in my experience the article ordering is not random, so unfortunately we can't stop early without biasing the result.
  • Load a snapshot of the HTML dumps for the relevant languages into HDFS and process in parallel -- this is probably the most sensible solution because then we can re-use the data if we ever need to come back and recompute a value.
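
To make the first option concrete, here is a minimal sketch of re-calibrating one normalization value (reference count) from a random article sample. It assumes the public MediaWiki random-page API and the REST page/html endpoint; counting references relies on Parsoid tagging <ref> output with typeof="mw:Extension/ref", and the sample size and helper names are illustrative, not the project's actual pipeline.

```
import re
from urllib.parse import quote

import numpy as np
import requests

def random_titles(lang: str, n: int) -> list[str]:
    """Draw n random main-namespace titles via the MediaWiki API."""
    titles = []
    while len(titles) < n:
        r = requests.get(
            f"https://{lang}.wikipedia.org/w/api.php",
            params={
                "action": "query", "format": "json", "list": "random",
                "rnnamespace": 0, "rnlimit": min(500, n - len(titles)),
            },
        )
        titles += [p["title"] for p in r.json()["query"]["random"]]
    return titles

def count_refs(parsoid_html: str) -> int:
    # Parsoid marks <ref> output with typeof="mw:Extension/ref"
    return len(re.findall(r"mw:Extension/ref", parsoid_html))

def p95_refs(lang: str = "ar", sample_size: int = 5000) -> float:
    """95th-percentile reference count over a random article sample."""
    counts = []
    for title in random_titles(lang, sample_size):
        url = (f"https://{lang}.wikipedia.org/api/rest_v1/page/html/"
               f"{quote(title.replace(' ', '_'), safe='')}")
        resp = requests.get(url)  # NB: rate-limit politely in practice
        if resp.ok:
            counts.append(count_refs(resp.text))
    # The normalization value is the 95th percentile; its stability is
    # exactly the sample-size concern raised above.
    return float(np.percentile(counts, 95))
```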

Weekly updates:

  • I started on loading the HTML dumps into HDFS (code courtesy of Fabian) -- this is working well: I tested with Arabic and was quite happy with how quickly it processed, though loading English is taking some time. A sketch of the parallel processing this enables follows this list.
  • Destinie is working out some kinks in our HTML features.
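
A minimal sketch of why the HDFS route pays off: once a dump snapshot sits on HDFS as JSON lines, Spark can recompute a percentile for any feature in one parallel pass, and the data can be re-used later. The path and the article_body.html field follow my reading of the Enterprise HTML dump schema; treat both, and the app name, as assumptions.

```
import re

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.appName("quality-normalization").getOrCreate()

@F.udf(IntegerType())
def n_refs(html):
    # Parsoid tags <ref> output with typeof="mw:Extension/ref"
    return len(re.findall(r"mw:Extension/ref", html or ""))

# hypothetical HDFS location for one wiki's dump snapshot
articles = spark.read.json("hdfs:///user/.../arwiki-html-dump.ndjson")

(articles
 .withColumn("refs", n_refs(F.col("article_body.html")))
 .agg(F.expr("percentile_approx(refs, 0.95)").alias("p95_refs"))
 .show())
```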

Weekly updates:

  • New normalization values generated! code and values.
  • Destinie has made good progress on incorporating these into the notebook, and I think we're essentially in a place where we can begin to add new features to the model; I put together a task for that (T364014). She has also begun work on improving some of the mwparserfromhtml documentation (MRs) so she'll have a better sense of what the library can do re: potential new features. A sketch of pulling features directly from the Parsoid HTML follows this list.
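
For illustration, here is a hedged sketch of extracting quality features straight out of Parsoid markup. It uses BeautifulSoup rather than the project's actual notebook code or mwparserfromhtml; the selectors come from the Parsoid DOM spec (rel="mw:WikiLink" for wikilinks, typeof="mw:Extension/ref" for references, rel="mw:PageProp/Category" for categories), and capping at the 95th percentile mirrors the normalization values generated above.

```
from bs4 import BeautifulSoup

def html_features(parsoid_html: str) -> dict:
    """Count raw quality features from Parsoid HTML."""
    soup = BeautifulSoup(parsoid_html, "html.parser")
    return {
        "page_length": len(soup.get_text()),
        "wikilinks": len(soup.select('a[rel~="mw:WikiLink"]')),
        "references": len(soup.select('[typeof~="mw:Extension/ref"]')),
        "categories": len(soup.select('link[rel~="mw:PageProp/Category"]')),
        "sections": len(soup.find_all("section")),
    }

def normalize(features: dict, p95: dict) -> dict:
    """Divide each raw count by the wiki's 95th-percentile value and
    cap at 1 so every feature lands in [0, 1]."""
    return {k: min(1.0, v / p95[k]) for k, v in features.items()}
```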

Weekly updates:

  • Sources is proving tricky as a feature, so I left some ideas for how to "debug" what's going on there in T364014#9810129
  • Documentation work is continuing! Details: T365269

Weekly updates:

  • Provided feedback on Destinie's tutorial notebook for mwparserfromhtml (MR) and was excited that her work surfaced a bug in the library (though it's my code that is buggy). Many eyes!

Weekly updates:

  • Merged Destinie's tutorial notebook MR (!) and assigned the bug as a next step.
  • Put together an exploration of a different type of model (ordinal logistic regression instead of linear regression); see the sketch after this list. This came out of Destinie's results and pair-plot analyses, which were showing that as the number of features grew, the "effect while holding other variables constant" aspect of linear regression was leading to coefficients whose interpretation didn't match reality -- e.g., fewer sources -> higher quality. At first the thought was to go for something like Naive Bayes, where the coefficients are learned independently, but I could not find a model of that type that did linear regression, so I went back to the drawing board to reconsider what an ordinal logistic regression model would look like. I had originally abandoned that approach because the coefficients were less interpretable, it wasn't clear to me that I could effectively convert the class probabilities to a single point prediction between 0 and 1 (the desired output), and I was hoping to avoid the semi-complicated statsmodels dependency on LiftWing. In revisiting it, I figured out how to reproduce the model outputs without the statsmodels dependency and to convert the logits generated by the model into a reasonable 0-1 range in a reproducible and not purely arbitrary way. The coefficients are still harder to interpret than the linear model's, but they're pretty straightforward (positive = good; negative = bad) and they match expectations. An initial eval of model performance suggests that the ordinal approach matches or beats the linear one. Notebook: https://public-paws.wmcloud.org/User:Isaac%20(WMF)/Quality/html-qual-exploration.ipynb#Ordinal-Logistic-Regression
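
To make the statsmodels-free prediction step concrete, here is a minimal numpy sketch of a fitted proportional-odds (ordinal logistic regression) model: P(y <= j) = sigmoid(theta_j - x.beta), with class probabilities as differences of adjacent cumulative probabilities. The coefficients, cutpoints, and feature vector below are hypothetical, and collapsing the class probabilities to [0, 1] via the expected class index is one reasonable reading of the "reproducible and not purely arbitrary" mapping, not necessarily the notebook's exact choice.

```
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def class_probs(x, coefs, cutpoints):
    """P(quality class = j) under P(y <= j) = sigmoid(theta_j - x.beta)."""
    eta = x @ coefs                    # single linear predictor
    cdf = sigmoid(cutpoints - eta)     # one cumulative prob per cutpoint
    cdf = np.concatenate(([0.0], cdf, [1.0]))
    return np.diff(cdf)                # adjacent differences = class probs

def quality_score(x, coefs, cutpoints):
    """Collapse class probabilities to a single score in [0, 1] via
    the expected class index."""
    p = class_probs(x, coefs, cutpoints)
    k = len(p)
    return float(p @ (np.arange(k) / (k - 1)))

# e.g., 6 ordered classes (Stub ... FA) need 5 increasing cutpoints;
# all values below are hypothetical, not fitted coefficients
coefs = np.array([1.2, 0.8, 0.5])
cutpoints = np.array([-2.0, -0.5, 0.8, 2.0, 3.5])
x = np.array([0.4, 0.7, 0.2])          # normalized features in [0, 1]
print(quality_score(x, coefs, cutpoints))
```

The only runtime dependency here is numpy, which sidesteps hosting the heavier statsmodels stack on LiftWing.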

Weekly updates:

  • I updated the quality API so we could get predictions from all three models under consideration (wikitext-linear-regression; HTML-linear-regression; HTML-ordinal-logistic-regression). Example: https://misalignment.wmcloud.org/api/v1/quality-revid-compare?lang=en&revid=1228403723. Destinie will now be able to use that endpoint to collect model predictions and do a final comparison to choose which is best for uploading to LiftWing! A minimal client sketch follows this list.
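
As an illustration of collecting predictions from that endpoint, here is a small client sketch. The query parameters match the example URL above, but the shape of the JSON response is an assumption, so inspect the actual payload before relying on specific field names.

```
import requests

URL = "https://misalignment.wmcloud.org/api/v1/quality-revid-compare"

def compare(lang: str, revid: int) -> dict:
    """Fetch all three models' quality predictions for one revision."""
    r = requests.get(URL, params={"lang": lang, "revid": revid})
    r.raise_for_status()
    return r.json()

# e.g., the example revision above; assumed response: model name -> score
for model, prediction in compare("en", 1228403723).items():
    print(model, prediction)
```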

Weekly updates:

  • None -- waiting on final model comparison before deciding what should be hosted on LiftWing. ML Platform did get an initial version on staging, which is an exciting step towards deployment!