Incorporate in new HTML features to quality model
Open, Needs Triage, Public

Assigned To

Authored By

	Isaac
	May 2 2024, 3:03 PM

Description

Once we have made the switch from wikitext-based features to HTML-based features, we can also start thinking about what additional features we might incorporate into the model! There are three easy ones:

Number of unique sources
Presence of infobox
Presence of messagebox

Others?
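
A minimal sketch of how these three features might be pulled out of article HTML, using only the Python standard library. The markers used here are assumptions, not something this task specifies: an infobox is taken to be any element with an `infobox` class, a message box any element with an `ambox` class, and "unique sources" is approximated as the number of distinct hostnames among links with an `external` class. The names `HtmlFeatureParser` and `extract_features` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class HtmlFeatureParser(HTMLParser):
    """Collects the three candidate features while scanning start tags."""

    def __init__(self):
        super().__init__()
        self.has_infobox = False
        self.has_messagebox = False
        self.source_hosts = set()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        # Assumption: infoboxes carry an "infobox" class token.
        if "infobox" in classes:
            self.has_infobox = True
        # Assumption: message boxes carry an "ambox" class token.
        if "ambox" in classes:
            self.has_messagebox = True
        # Assumption: a "source" is the hostname of an external link.
        if tag == "a" and "external" in classes:
            host = urlparse(attr.get("href") or "").netloc.lower()
            if host:
                self.source_hosts.add(host)


def extract_features(html):
    """Return the three candidate features for one article's HTML."""
    parser = HtmlFeatureParser()
    parser.feed(html)
    parser.close()
    return {
        "has_infobox": parser.has_infobox,
        "has_messagebox": parser.has_messagebox,
        "num_unique_sources": len(parser.source_hosts),
    }


sample_html = """
<table class="infobox vcard"><tr><td>Example</td></tr></table>
<p>Text<a class="external text" href="https://example.org/a">1</a>
<a class="external text" href="https://example.org/b">2</a>
<a class="external text" href="https://example.com/c">3</a></p>
"""
features = extract_features(sample_html)
```

Counting hostnames rather than raw links is one way to keep this feature distinct from a plain reference count; a different definition of "source" would only change the `a` branch.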

Related Objects

		Status	Subtype	Assigned	Task
		Open		Isaac	T360572 Extend Article Quality Model to use HTML
		Open		DJames-WMF	T364014 Incorporate in new HTML features to quality model

Event Timeline

Isaac created this task. (May 2 2024, 3:03 PM)

Restricted Application added a subscriber: Aklapper. (May 2 2024, 3:03 PM)

Isaac mentioned this in T360572: Extend Article Quality Model to use HTML. (May 3 2024, 1:34 PM)

Isaac reassigned this task from Isaac to DJames-WMF. (May 17 2024, 3:26 PM)

DJames-WMF updated the task description. (May 17 2024, 7:04 PM)

Thanks @DJames-WMF for your first pass (notebook)! Next steps now that we have some initial findings:

You're extracting message-boxes/infoboxes correctly but you'll update the code to make sure they're retained in the final model as well.
You can trim out some of the old summary at the top of the notebook to focus on just this expanded model and its coefficients (with a link back to Copy-2 where folks can find more details about the earlier iterations).
What to do about this mysterious negative sources coefficient?
- I've been wondering how consistent the different languages are. One way to test this would be to train a separate model for each language and see how similar the coefficients are. This might help identify features that are less stable and worth investigating, and perhaps give some insight into the sources feature or even some of the others. I expect the magnitudes might vary a bit, but if any coefficient switches from positive to negative (or vice versa), that would be the interesting sign to pay attention to.
- In these sorts of models, each feature coefficient is the impact of that feature when all other features are held constant. Because of this dependence on other features, the coefficient for a feature like sources doesn't just depend on its relationship to quality but also on its relationship to all the other features. I think this is what is going on. This (quite long) tutorial on linear model coefficients/interpretation/fine-tuning has some useful examples of this and ways to chart things out. In particular, I would love to see similar charts for our data/model to:
  - The pairplots in this section, which will help us see which coefficients are correlated with each other.
  - The coefficient variability plot in this section, which will help us see which model coefficients are unstable.
  - If it turns out that sources is highly correlated with the other features, we might have to take it out. It might also be possible to, e.g., switch it to a simpler boolean (>= 5 sources), which could reduce the correlation while still retaining most of the benefits of counting unique sources separately from references.
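
To make the "held constant" point concrete, here is a small sketch on made-up numbers (not our data or model). It shows how a feature can be positively related to quality on its own and still get a negative coefficient once a correlated feature is in the model. The values for `refs`, `sources`, and `quality` are synthetic, with quality constructed as exactly 2 * refs - 1 * sources, and the helper names are hypothetical.

```python
def centered(values):
    """Subtract the mean so the intercept drops out of the algebra."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def simple_slope(x, y):
    """Least-squares slope of y on x alone (one feature plus intercept)."""
    xc, yc = centered(x), centered(y)
    return dot(xc, yc) / dot(xc, xc)


def fit_two_features(x1, x2, y):
    """Least-squares coefficients of y on x1 and x2 (plus intercept),
    solved from the 2x2 normal equations with Cramer's rule."""
    a, b, t = centered(x1), centered(x2), centered(y)
    s11, s22, s12 = dot(a, a), dot(b, b), dot(a, b)
    s1y, s2y = dot(a, t), dot(b, t)
    det = s11 * s22 - s12 * s12  # near zero means near-perfect collinearity
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    return b1, b2


# Synthetic data: sources closely tracks refs, so the two are highly correlated.
refs = [2, 4, 6, 8, 10, 12]
sources = [1, 4, 5, 8, 9, 12]
quality = [2 * r - s for r, s in zip(refs, sources)]

slope_sources_alone = simple_slope(sources, quality)
b_refs, b_sources = fit_two_features(refs, sources, quality)

print("sources alone:", slope_sources_alone)
print("joint model: refs =", b_refs, "sources =", b_sources)
```

On its own, sources has a positive slope against quality here, but in the joint fit its coefficient is negative because refs already carries most of the shared signal. The same kind of check on the real features (fit each feature alone, then jointly, and compare signs) could be a quick complement to the pairplots and the coefficient variability plot.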
