Replicate Article Quality Training Notebook
Closed, Resolved (Public)

Assigned To

Authored By

	Isaac
	Mar 22 2024, 7:32 PM

Description

The evaluation of the quality model was replicated and extended in T360576. This task takes a step backward and replicates the training of the model, starting from this notebook: https://public-paws.wmcloud.org/User:Isaac%20(WMF)/Quality/Quality_Model_Training.ipynb

Final notebook: https://public-paws.wmcloud.org/User:DJames-WMF/Quality_Model_Training.ipynb

1. Getting started

Fork the notebook (directions) and add it to your PAWS directory.
Try to understand the existing code -- make sure you know what each function does and ask questions where it's unclear. Some of the code will be familiar from the model evaluation, but some will be new.

2. Add in Chinese support

Incorporate the function for including Chinese Wikipedia data that was created in T360576:

The translation map code can be added alongside the other language functions in the fetching data section.
zhwiki should be added to the data-loading configuration so that it's included in the model training.
You'll also have to add the category/media prefixes used in Chinese Wikipedia, which can be found here and here. They are used in the Feature extraction section of the notebook.
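
A minimal sketch of what these additions could look like, offered as an illustration rather than the notebook's actual code. The names `ZH_LABEL_MAP`, `CATEGORY_PREFIXES`, `MEDIA_PREFIXES`, `normalize_label`, and `count_prefixed_links` are hypothetical, and the Chinese labels and prefixes listed are assumptions that should be checked against the PageAssessments data and the linked prefix pages:

```python
import re

# Hypothetical translation map from zhwiki quality labels to English classes.
# Assumed values: verify against the labels actually stored for zhwiki.
ZH_LABEL_MAP = {
    "典范": "FA", "典範": "FA",
    "优良": "GA", "優良": "GA",
    "甲": "A",
    "乙": "B",
    "丙": "C",
    "初": "Start",
    "小作品": "Stub",
}

# Assumed namespace prefixes per wiki (simplified and traditional variants).
CATEGORY_PREFIXES = {
    "enwiki": ["Category"],
    "zhwiki": ["Category", "分类", "分類"],
}
MEDIA_PREFIXES = {
    "enwiki": ["File", "Image"],
    "zhwiki": ["File", "Image", "文件", "档案", "檔案", "图像", "圖像"],
}


def normalize_label(wiki, label):
    """Map a wiki-specific quality label onto the shared English classes."""
    label = label.strip()
    if wiki == "zhwiki":
        # Unknown labels pass through unchanged so they can be inspected later.
        return ZH_LABEL_MAP.get(label, label)
    return label


def count_prefixed_links(wikitext, prefixes):
    """Count wikilinks such as [[Category:...]] that start with any prefix."""
    alternatives = "|".join(re.escape(p) for p in prefixes)
    pattern = r"\[\[\s*(?:" + alternatives + r")\s*:"
    return len(re.findall(pattern, wikitext, flags=re.IGNORECASE))


sample = "[[分类:中国]] [[Category:Test]] [[文件:a.jpg|thumb]] [[链接]]"
print(count_prefixed_links(sample, CATEGORY_PREFIXES["zhwiki"]))
print(count_prefixed_links(sample, MEDIA_PREFIXES["zhwiki"]))
```

Keeping the prefixes in per-wiki dictionaries means adding another language edition later only requires new entries, not new parsing code.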

3. Replication

Run the code from start-to-finish!
In markdown at the top of the notebook, write a summary that includes your new model coefficients and how similar they are to both the original feature weights and the coefficients I got when I created the notebook (I never wrote a summary myself, but you can see them in the cell outputs towards the end of the notebook). NOTE: the negative coefficient that I got for categories is undesirable but not super surprising. We can discuss when I get back.
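
One possible way to build the comparison for the summary is sketched below. The helper names `compare_coefficients` and `format_markdown_table` are hypothetical, and the feature names and numbers in the example are placeholders, not results from either notebook:

```python
def compare_coefficients(original, replicated):
    """Return (feature, original, replicated, difference) rows, one per feature.

    A feature missing from either dictionary gets None for that value and
    for the difference.
    """
    rows = []
    for feature in sorted(set(original) | set(replicated)):
        old = original.get(feature)
        new = replicated.get(feature)
        diff = None if old is None or new is None else new - old
        rows.append((feature, old, new, diff))
    return rows


def format_markdown_table(rows):
    """Render the comparison rows as a markdown table for a notebook cell."""
    def cell(value):
        return "n/a" if value is None else f"{value:+.3f}"

    lines = [
        "| feature | original | replicated | difference |",
        "| --- | --- | --- | --- |",
    ]
    for feature, old, new, diff in rows:
        lines.append(f"| {feature} | {cell(old)} | {cell(new)} | {cell(diff)} |")
    return "\n".join(lines)


# PLACEHOLDER values for illustration only; substitute the real coefficients
# from the notebook's cell outputs.
original_weights = {"categories": 0.10, "media": 0.20, "references": 0.30}
replicated_weights = {"categories": -0.05, "media": 0.25, "references": 0.30}

print(format_markdown_table(compare_coefficients(original_weights, replicated_weights)))
```

Printing signed values makes a sign flip, such as a negative categories coefficient, easy to spot in the table.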

4. Optional explorations

I made basic scatter plots of predicted scores and true scores (similar to the evaluation). The color-coding (blue - yellow) makes it unclear which color maps to which Wikipedia language edition. If you have time, you could see how to give each language edition its own distinct color that is labeled in a legend. Probably something like this function -- seaborn is a non-standard library but you can install it like I did for wmpaws at the top of the notebook.
We identified that Nepalese Wikipedia (newiki) also uses PageAssessments. We could also incorporate that into training/evaluation by following the steps for Chinese Wikipedia (T360576). One thing to be aware of: Nepalese Wikipedia does not seem to have a content assessment rubric, so this might be a bit trickier. Hopefully Google Translate + the labels in the database might be sufficient to get a reasonable dataset. It's also a smaller language edition, so I'd consider extending the sample dates to be more than 28 days (possibly do a full year).
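
For the legend exploration, a rough sketch with seaborn might look like the following. It assumes a pandas DataFrame with columns named `true_score`, `pred_score`, and `wiki`; those column names and both function names are hypothetical, and the fixed color list is just one reasonable choice:

```python
# Okabe-Ito colorblind-safe colors; any list of distinct colors would work.
DISTINCT_COLORS = [
    "#E69F00", "#56B4E9", "#009E73", "#F0E442",
    "#0072B2", "#D55E00", "#CC79A7", "#000000",
]


def language_palette(wikis):
    """Assign each language edition one fixed color, in alphabetical order."""
    unique = sorted(set(wikis))
    return {
        wiki: DISTINCT_COLORS[i % len(DISTINCT_COLORS)]
        for i, wiki in enumerate(unique)
    }


def plot_scores_by_language(df, x="true_score", y="pred_score", hue="wiki"):
    """Scatter plot of predicted vs. true scores with a labeled legend."""
    # Imported here so language_palette works without plotting libraries.
    import matplotlib.pyplot as plt
    import seaborn as sns

    palette = language_palette(df[hue])
    fig, ax = plt.subplots(figsize=(7, 7))
    sns.scatterplot(
        data=df, x=x, y=y,
        hue=hue, hue_order=list(palette), palette=palette,
        alpha=0.4, s=15, ax=ax,
    )
    # Place the legend outside the axes so it does not cover any points.
    ax.legend(title="Wikipedia edition", loc="upper left", bbox_to_anchor=(1.02, 1))
    ax.set_xlabel("True quality score")
    ax.set_ylabel("Predicted quality score")
    fig.tight_layout()
    return ax


# Usage in the notebook (df is the hypothetical DataFrame described above):
# plot_scores_by_language(df)
```

Passing a dictionary as `palette` treats the language edition as a categorical variable, so each edition keeps the same color across plots instead of falling on a continuous blue-to-yellow scale.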

Related Objects

		Status	Subtype	Assigned	Task
		Open		Isaac	T360572 Extend Article Quality Model to use HTML
		Resolved		DJames-WMF	T360815 Replicate Article Quality Training Notebook

Event Timeline

Isaac created this task. Mar 22 2024, 7:32 PM

Restricted Application added a subscriber: Stang. Mar 22 2024, 7:32 PM

DJames-WMF updated the task description. Mar 26 2024, 3:19 PM

DJames-WMF updated the task description. Mar 26 2024, 3:28 PM

DJames-WMF updated the task description. Apr 2 2024, 12:07 PM

DJames-WMF updated the task description. Apr 2 2024, 12:09 PM

DJames-WMF updated the task description.

DJames-WMF updated the task description. Apr 2 2024, 12:26 PM

Isaac mentioned this in T361623: Swap out wikitext for HTML in training quality model. Apr 2 2024, 4:32 PM

Isaac moved this task from Backlog to FY2023-24-Research-April-June on the Research board. Apr 2 2024, 6:20 PM

Isaac edited projects, added Research (FY2023-24-Research-April-June); removed Research.

DJames-WMF updated the task description. Apr 3 2024, 4:18 AM

Excellent work, @DJames-WMF! I read through your notebook and everything looked good. Closing this as resolved. We didn't pursue the Nepalese Wikipedia extension but that's okay -- we can always come back to it later. For now, I'd like to progress to the HTML work that you've started in T361623.

Isaac updated the task description. Apr 3 2024, 4:30 PM
