Improve ORES articlequality feature extraction for images
Open, Medium, Public

Description

Currently, ORES wp10 uses an 'image_links' feature that counts images in an article only if they use the normal [[File:Something.jpg]] markup format. Images within gallery tags are not counted, and lead images in infoboxes are not counted.

In both cases, counting those images would be a better reflection of the image count, in terms of article quality.

See for example this article: https://en.wikipedia.org/wiki/The_Appearance_of_Christ_Before_the_People

ORES thinks it has one image, instead of the six it actually has: https://ores.wmflabs.org/v3/scores/enwiki/?models=wp10&revids=782180234&features

Event Timeline

Ragesoss created this task. Nov 17 2017, 5:59 PM
Restricted Application added a subscriber: Aklapper.
awight renamed this task from Improve ORES wp10 feature extraction for images to Improve ORES articlequality feature extraction for images. Sep 26 2018, 6:41 PM
Harej triaged this task as Medium priority. Apr 9 2019, 9:11 PM
Harej moved this task from Research & analysis to New development on the Scoring-platform-team board.
Harej moved this task from New development to Ready to go on the Scoring-platform-team board.

Hello @Harej, this is something I'd like to work on.

Halfak removed a subscriber: Harej. Dec 10 2019, 10:01 PM

Awesome! Welcome! Are you familiar with all of the interesting variants of wiki markup that might imply an image? I think listing out all of the variants is a good place to start.

I can do some research on that. Thanks for the swift reply, @Halfak. I'll get back to you.

If you get on IRC, join us in #wikimedia-ai on freenode. My team hangs out there while we work.

It looks like we define image links here: https://github.com/wikimedia/articlequality/blob/master/articlequality/feature_lists/enwiki.py#L64

I think we'll want to develop some strategy for processing <gallery> tags and anything else you find. That will probably need to go in our base framework somewhere around here: https://github.com/wikimedia/revscoring/blob/master/revscoring/features/wikitext/datasources/parsed.py#L14

We use mwparserfromhell (docs). I couldn't find an obvious way it handles <gallery> at a glance.

I found this page: https://en.wikipedia.org/wiki/Wikipedia:Extended_image_syntax. I think it covers all of the extended wiki markup for images.

From a quick scan of the page, it looks like there are two syntaxes:

  • [[File:Something.jpg|...]]
  • <gallery> ... </gallery>
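As a rough illustration of counting both syntaxes, here is a minimal sketch that uses plain regular expressions. It is not how revscoring does it (that goes through mwparserfromhell), and the names `count_file_links` and `count_gallery_images` are hypothetical. It assumes each non-empty line inside a <gallery> block names one file, which is only an approximation.

```python
import re

# Hypothetical helpers for illustration; not part of revscoring or articlequality.
# Matches the opening of [[File:...]] or [[Image:...]] links, case-insensitively.
FILE_LINK_RE = re.compile(r"\[\[\s*(?:File|Image)\s*:[^\]|]+", re.IGNORECASE)
# Captures the body of each <gallery> ... </gallery> block.
GALLERY_RE = re.compile(
    r"<gallery\b[^>]*>(.*?)</gallery>", re.IGNORECASE | re.DOTALL
)


def count_file_links(wikitext):
    """Count [[File:...]] / [[Image:...]] style links in the wikitext."""
    return len(FILE_LINK_RE.findall(wikitext))


def count_gallery_images(wikitext):
    """Count non-empty lines inside <gallery> blocks.

    Assumption: every non-empty line in a gallery body is one image entry.
    """
    count = 0
    for body in GALLERY_RE.findall(wikitext):
        for line in body.splitlines():
            if line.strip():
                count += 1
    return count
```

Summing the two counts would give a total that includes gallery images, which the current 'image_links' feature misses.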

From the notes that @Ragesoss gave us, we'll also want to capture images that appear in infoboxes. This can get interesting. Here's a chunk of markup from Accession of Albania to the European Union:

{{Infobox EU accession bid
| logo                =  
| status              = Candidate
| nation              = Albania
| national_denonym    = Albanian
| map                 = European Union Albania Locator.svg
...

We probably want to look for specific parameters of infoboxes and accept a set of file extensions.

Parameters:

  • "image", "map", "file", etc.

Extensions:

  • "jpg", "png", "svg", etc.
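The parameter-plus-extension idea could look something like the sketch below. It assumes one `| name = value` pair per line, as in the infobox markup above, and it uses only the parameter names and extensions listed here; a real list would need to be longer and probably per-wiki. The function name `count_infobox_images` is hypothetical.

```python
import re

# Assumed starting lists, copied from the examples in this comment.
IMAGE_PARAMS = {"image", "map", "file"}
IMAGE_EXTENSIONS = {"jpg", "png", "svg"}

# One "| name = value" template parameter per line. Only spaces and tabs are
# skipped around the name and value, so an empty value cannot swallow the
# following line.
PARAM_RE = re.compile(
    r"^[ \t]*\|[ \t]*([^=|\n]+?)[ \t]*=[ \t]*(.*?)[ \t]*$", re.MULTILINE
)


def count_infobox_images(wikitext, params=IMAGE_PARAMS,
                         extensions=IMAGE_EXTENSIONS):
    """Count template parameters that look like they hold an image filename."""
    suffixes = tuple("." + ext for ext in extensions)
    count = 0
    for name, value in PARAM_RE.findall(wikitext):
        if name.lower() in params and value.lower().endswith(suffixes):
            count += 1
    return count
```

This sketch will miss numbered parameters such as "image1", parameters packed onto one line, and values wrapped in [[File:...]] markup, so it is a starting point rather than a complete rule.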

Great. It's all coming together in my head now. How do I test my code changes, though, to make sure they work? I'm about to assign myself this task and get started.

I'd add tests here: https://github.com/wikimedia/articlequality/blob/master/articlequality/feature_lists/tests/test_enwiki.py

I think you'll be able to get gallery tags by scanning and processing: revscoring.features.wikitext.revision.datasources.tags

Oh. I think "test" is the wrong word here. I meant "run". So how do I run the code and see my changes in action?

@Halfak, I'd like to assign myself this task. Can I go ahead?

HAKSOAT claimed this task. Feb 11 2020, 8:01 PM

Hello @Halfak, I think this task requires PRs on both the revscoring and articlequality repos.

I just opened one for the revscoring repo: https://github.com/wikimedia/revscoring/pull/472

Halfak added a comment. Mar 5 2020, 3:25 PM

I rebuilt the models. See the result in this PR: https://github.com/wikimedia/articlequality/pull/104

It doesn't look like we're seeing any meaningful performance improvement, BUT I think this is still valuable because people use the features that ORES extracts in their own analysis.

I think it's valuable too. I hope the performance didn't drop either?

No drop. I think it's good to merge.

Halfak added a comment. Mar 5 2020, 7:16 PM

@Ragesoss, any thoughts as the original filer?