⚓ T361623 Swap out wikitext for HTML in training quality model

		Status	Subtype	Assigned	Task
		Open		Isaac	T360572 Extend Article Quality Model to use HTML
		Resolved		DJames-WMF	T361623 Swap out wikitext for HTML in training quality model

Event Timeline

@DJames-WMF can claim and start this task when T360815 is complete.

Isaac moved this task from Backlog to FY2023-24-Research-April-June on the Research board.Apr 2 2024, 6:18 PM

Isaac edited projects, added Research (FY2023-24-Research-April-June); removed Research.

DJames-WMF updated the task description. (Show Details)Apr 3 2024, 4:28 AM

DJames-WMF updated the task description. (Show Details)

Isaac reassigned this task from Isaac to DJames-WMF.Apr 3 2024, 4:30 PM

Isaac mentioned this in T360815: Replicate Article Quality Training Notebook.

DJames-WMF updated the task description. (Show Details)Apr 9 2024, 10:26 AM

DJames-WMF updated the task description. (Show Details)Apr 10 2024, 4:33 PM

Added another step for the bug-fixing we're working on right now with 0-values for some of the features. I also unchecked the optional exploration -- that actually is separate from the notebook (it involves updating a README file in a code repository) so we can talk about it in a future meeting and decide whether to pick it up or not.

DJames-WMF updated the task description. (Show Details)Apr 11 2024, 3:59 PM

Isaac updated the task description. (Show Details)Apr 12 2024, 6:48 PM

Next steps for this notebook based on Destinie's assessment (notebook) of how well-distributed each model feature is after switching to HTML. We have three features that are poorly distributed (values all lumped together) so the model cannot learn much from them. They are:

Page length: the values are all lumped around 1 because Parsoid HTML (with all of its syntax) is far more verbose that wikitext and by definition a superset of the wikitext. We don't have any perfect way of getting back to the wikitext length but probably a more reasonable assessment of article length is how much text is in it. So instead of len(article_html), let's use the get_plaintext() function and take the length of that. That function has a bunch of settings for it to work appropriately so let's use the approach used by html_to_plaintext() in this notebook with a few small tweaks:
- Don't exclude List elements (they often have valid content from an article quality standpoint)
- Take out the if len(paragraph.strip()) > 15 clause for each paragraph (we're just counting up things so I'm okay with the occasional "weird" paragraph)
- Rather than doing the final if paragraphs: check, just use '\n'.join(paragraphs) for computing length -- this will just be an empty string (length 0) if no paragraphs.
Media: the reason they're lumping to 1s is probably because many articles have lots of little icons that aren't defined in the wikitext (transcluded via templates). These are inflating our counts of images in the article. I put together one heuristic to filter these out in the test cases and I think we can re-use that pixel-size logic here too (code). This should reduce our media counts back to where they're more evenly distributed.
Categories: Here the lumping towards 1 values likely is the result of hidden categories (usually transcluded via templates again and not in the wikitext). One way around this is to check each category returned by get_categories() to see if it was transcluded. There's an existing function in the library (example import statement) and then we can do something like len([1 for c in article.wikistew.get_categories() if not is_transcluded(c)]).

Additionally, I have another fix that I'd like to implement to bring the HTML model closer in-line with the old wikitext-based one:

Switch article.wikistew.get_references() to article.wikistew.get_citations(). The former is the number of unique sources while the latter is number of in-line citations (documentation). The citation definition is what matches the old wikitext-based approach.

@DJames-WMF -- go ahead and make a copy of your current notebook (so we preserve this first iteration) and work on these changes. Then rerun the training and reassess the feature distributions and coefficients and we can decide next steps! We can talk about some of them in more depth too (what is is_transcluded() doing?) when we meet.

DJames-WMF closed this task as Resolved.May 14 2024, 1:26 PM

DJames-WMF updated the task description. (Show Details)

Swap out wikitext for HTML in training quality model
Closed, ResolvedPublic
Actions

Description

1. Getting started

2. Switch wikitext to HTML features

3. Replicate and compare!

4. Optional explorations

Related Objects
Search...

Event Timeline

	Isaac
	Apr 2 2024, 4:32 PM

Swap out wikitext for HTML in training quality modelClosed, ResolvedPublicActions

Description

1. Getting started

2. Switch wikitext to HTML features

3. Replicate and compare!

4. Optional explorations

Related ObjectsSearch...

Event Timeline

Swap out wikitext for HTML in training quality model
Closed, ResolvedPublic
Actions

Related Objects
Search...